Aspects of probabilistic modelling for data analysis

Computer technologies have revolutionised the processing of information and the search for knowledge. With ever-increasing computational power, it is becoming possible to tackle new data analysis applications as diverse as mining Internet resources, analysing the effects of drugs on the organism or...


Bibliographic Details
Main Author: Delannay, Nicolas
Format: Others
Language: en
Published: Universite catholique de Louvain 2007
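Subjects: Data mining; Machine learning; Bayesian statistics; Robust statistics; Regression; NIR spectroscopy; Collaborative filtering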
Online Access: http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/
id ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-10122007-121326
record_format oai_dc
spelling ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-10122007-121326 2013-01-07T15:42:05Z Aspects of probabilistic modelling for data analysis Delannay, Nicolas Data mining Machine learning Bayesian statistics Robust statistics Regression NIR spectroscopy Collaborative filtering Universite catholique de Louvain 2007-10-23 text application/pdf http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/ en unrestricted I agree that the text of the thesis (hereinafter the work), subject to the parts covered by confidentiality, be published in the UCL electronic theses collection. To this end, I grant UCL a licence covering the right to fix and reproduce the work on an electronic medium (ETD/db software) and the right to communicate the work to the public. This licence, free of charge and non-exclusive, is valid for the entire duration of literary and artistic property rights, including any extensions, and worldwide. I retain all other rights to reproduce and communicate the thesis, as well as the right to use it in future work. I certify that I have obtained, in accordance with copyright legislation and the requirements of image rights, all the authorisations necessary to reproduce in my thesis images, texts and/or any work protected by copyright, and that I have obtained the authorisations necessary to communicate them to third parties. Where a third party holds an intellectual property right over all or part of my thesis, I certify that I have obtained that party's written authorisation to exercise the rights mentioned above.
collection NDLTD
language en
format Others
sources NDLTD
topic Data mining
Machine learning
Bayesian statistics
Robust statistics
Regression
NIR spectroscopy
Collaborative filtering
description Computer technologies have revolutionised the processing of information and the search for knowledge. With ever-increasing computational power, it is becoming possible to tackle new data analysis applications as diverse as mining Internet resources, analysing the effects of drugs on the organism, or assisting wardens with autonomous video detection techniques. Fundamentally, the principle of any data analysis task is to fit a model that encodes well the dependencies (or patterns) present in the data. The difficulty, however, lies precisely in defining a suitable model when the data are noisy, the dependencies are highly stochastic and no simple physical rule represents them. The aim of this work is to discuss the principles, advantages and weaknesses of the probabilistic modelling framework for data analysis. The main idea of the framework is to model the dispersion of the data, as well as the uncertainty about the model itself, by probability distributions. Three data analysis tasks are presented, and for each of them the discussion is based on experimental results from real datasets. The first task considers the problem of linear subspace identification. We show how one can replace a Gaussian noise model with a Student-t noise model to make the identification more robust to atypical samples while keeping the learning procedure simple. The second task is regression, applied more specifically to near-infrared spectroscopy datasets. We show how spectra should be pre-processed before entering the regression model. We then analyse the validity of the Bayesian model selection principle for this application (in particular within the Gaussian Process formulation) and compare this principle to the resampling selection scheme. The final task considered is Collaborative Filtering, which is related to applications such as recommendation for e-commerce and text mining. This task illustrates how intuitive considerations can guide the design of the model and the choice of the probability distributions appearing in it. We compare the intuitive approach with a simpler matrix factorisation approach.
author Delannay, Nicolas
title Aspects of probabilistic modelling for data analysis
publisher Universite catholique de Louvain
publishDate 2007
url http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/
work_keys_str_mv AT delannaynicolas aspectsofprobabilisticmodellingfordataanalysis
_version_ 1716393738680401920