Aspects of probabilistic modelling for data analysis
Computer technologies have revolutionised the processing of information and the search for knowledge. With ever-increasing computational power, it is becoming possible to tackle new data analysis applications as diverse as mining Internet resources, analysing the effects of drugs on the organism or...
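Main Author: | Delannay, Nicolas |
---|---|
Format: | Others |
Language: | en |
Published: | Universite catholique de Louvain, 2007 |
Subjects: | Data mining, Machine learning, Bayesian statistics, Robust statistics, Regression, NIR spectroscopy, Collaborative filtering |
Online Access: | http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/ |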
id |
ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-10122007-121326 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-10122007-121326 2013-01-07T15:42:05Z Aspects of probabilistic modelling for data analysis Delannay, Nicolas Data mining Machine learning Bayesian statistics Robust statistics Regression NIR spectroscopy Collaborative filtering Universite catholique de Louvain 2007-10-23 text application/pdf http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/ en unrestricted
I agree that the text of the thesis (hereinafter the work), subject to the parts covered by confidentiality, be published in the UCL electronic thesis collection. To this end, I grant UCL a licence comprising: - the right to fix and reproduce the work on an electronic medium: ETD/db software - the right to communicate the work to the public. This licence, free of charge and non-exclusive, is valid for the entire duration of the literary and artistic property rights, including any extensions, and for the whole world. I retain all other rights to reproduce and communicate the thesis, as well as the right to use it in future work. I certify that I have obtained, in accordance with copyright legislation and the requirements of image rights, all authorisations necessary for the reproduction in my thesis of images, texts and/or any work protected by copyright, and that I have obtained the authorisations necessary for their communication to third parties. Where a third party holds an intellectual property right over all or part of my thesis, I certify that I have obtained their written authorisation for the exercise of the rights mentioned above. |
collection |
NDLTD |
language |
en |
format |
Others |
sources |
NDLTD |
topic |
Data mining Machine learning Bayesian statistics Robust statistics Regression NIR spectroscopy Collaborative filtering |
spellingShingle |
Data mining Machine learning Bayesian statistics Robust statistics Regression NIR spectroscopy Collaborative filtering Delannay, Nicolas Aspects of probabilistic modelling for data analysis |
description |
Computer technologies have revolutionised the processing of information and the search for knowledge. With ever-increasing computational power, it is becoming possible to tackle new data analysis applications as diverse as mining Internet resources, analysing the effects of drugs on the organism or assisting wardens with autonomous video detection techniques.
Fundamentally, any data analysis task consists in fitting a model that encodes the dependencies (or patterns) present in the data well. The difficulty, however, lies precisely in defining such a model when the data are noisy, the dependencies are highly stochastic and no simple physical rule represents them.
The aim of this work is to discuss the principles, advantages and weaknesses of the probabilistic modelling framework for data analysis. The main idea of the framework is to use probability distributions to model both the dispersion of the data and the uncertainty about the model itself. Three data analysis tasks are presented, and for each the discussion is based on experimental results from real datasets.
The first task is linear subspace identification. We show how a Gaussian noise model can be replaced by a Student-t noise model to make the identification more robust to atypical samples while keeping the learning procedure simple. The second task is regression, applied specifically to near-infrared spectroscopy datasets. We show how spectra should be pre-processed before entering the regression model. We then analyse the validity of the Bayesian model selection principle for this application (in particular within the Gaussian Process formulation) and compare it to the resampling selection scheme. The final task is Collaborative Filtering, which is related to applications such as recommendation for e-commerce and text mining. This task illustrates how intuitive considerations can guide the design of the model and the choice of the probability distributions appearing in it. We compare the intuitive approach with a simpler matrix factorisation approach. |
author |
Delannay, Nicolas |
author_facet |
Delannay, Nicolas |
author_sort |
Delannay, Nicolas |
title |
Aspects of probabilistic modelling for data analysis |
title_short |
Aspects of probabilistic modelling for data analysis |
title_full |
Aspects of probabilistic modelling for data analysis |
title_fullStr |
Aspects of probabilistic modelling for data analysis |
title_full_unstemmed |
Aspects of probabilistic modelling for data analysis |
title_sort |
aspects of probabilistic modelling for data analysis |
publisher |
Universite catholique de Louvain |
publishDate |
2007 |
url |
http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/ |
work_keys_str_mv |
AT delannaynicolas aspectsofprobabilisticmodellingfordataanalysis |
_version_ |
1716393738680401920 |
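
The abstract notes that replacing a Gaussian noise model by a Student-t noise model makes estimation more robust to atypical samples while keeping the learning procedure simple. The sketch below illustrates that idea on a much simpler problem than the thesis treats: estimating a single location parameter rather than a linear subspace. It is not the author's method. The function name `student_t_location`, the fixed degrees of freedom `nu=3.0`, the iteration count and the sample values are all assumptions made for illustration.

```python
def student_t_location(samples, nu=3.0, n_iter=100):
    """Estimate location and squared scale under Student-t noise by EM.

    nu (degrees of freedom) is held fixed; this is an illustrative assumption.
    """
    n = len(samples)
    # Start from the Gaussian maximum-likelihood estimates.
    mu = sum(samples) / n
    sigma2 = max(sum((x - mu) ** 2 for x in samples) / n, 1e-12)
    for _ in range(n_iter):
        # E-step: samples far from the current location get small weights.
        weights = [(nu + 1.0) / (nu + (x - mu) ** 2 / sigma2) for x in samples]
        # M-step: weighted mean, then weighted squared scale.
        mu = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
        sigma2 = max(
            sum(w * (x - mu) ** 2 for w, x in zip(weights, samples)) / n,
            1e-12,  # guard against a zero scale when all samples coincide
        )
    return mu, sigma2


# Made-up data: six typical readings and one atypical sample.
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 50.0]
gaussian_mu = sum(readings) / len(readings)  # pulled towards the outlier
robust_mu, robust_sigma2 = student_t_location(readings)
print(f"Gaussian mean: {gaussian_mu:.2f}, Student-t location: {robust_mu:.2f}")
```

Each iteration is a weighted least-squares update, which is why the procedure stays about as simple as the Gaussian case: the only change is that atypical samples are down-weighted automatically.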