Aspects of probabilistic modelling for data analysis

Computer technologies have revolutionised the processing of information and the search for knowledge. With ever-increasing computational power, it is becoming possible to tackle new data analysis applications as diverse as mining Internet resources, analysing the effects of drugs on the organism or...

Bibliographic Details
Main Author:	Delannay, Nicolas
Format:	Others
Language:	en
Published:	Universite catholique de Louvain 2007
Subjects:	Data mining; Machine learning; Bayesian statistics; Robust statistics; Regression; NIR spectroscopy; Collaborative filtering
Online Access:	http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/

id	ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-10122007-121326
record_format	oai_dc
spelling	ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-10122007-121326 2013-01-07T15:42:05Z Universite catholique de Louvain 2007-10-23 text application/pdf http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/ en unrestricted I agree that the text of the thesis (hereafter "the work"), except for the parts covered by confidentiality, may be published in the UCL electronic thesis collection. To this end, I grant UCL a licence covering: - the right to fix and reproduce the work on electronic media: ETD/db software - the right to communicate the work to the public. This licence, free of charge and non-exclusive, is valid for the entire duration of literary and artistic property rights, including any extensions, and for the whole world. I retain all other rights to reproduce and communicate the thesis, as well as the right to use it in future work. I certify that I have obtained, in accordance with copyright law and the requirements of image rights, all authorisations needed to reproduce in my thesis any images, texts and/or other works protected by copyright, and that I have obtained the authorisations needed to communicate them to third parties. Where a third party holds an intellectual property right over all or part of my thesis, I certify that I have obtained that party's written authorisation to exercise the rights mentioned above.
collection	NDLTD
language	en
format	Others
sources	NDLTD
topic	Data mining; Machine learning; Bayesian statistics; Robust statistics; Regression; NIR spectroscopy; Collaborative filtering
description	Computer technologies have revolutionised the processing of information and the search for knowledge. With ever-increasing computational power, it is becoming possible to tackle new data analysis applications as diverse as mining Internet resources, analysing the effects of drugs on the organism or assisting wardens with autonomous video detection techniques. Fundamentally, the principle of any data analysis task is to fit a model that encodes the dependencies (or patterns) present in the data well. The difficulty, however, is precisely to define such a model when data are noisy, dependencies are highly stochastic and there is no simple physical rule to represent them. The aim of this work is to discuss the principles, advantages and weaknesses of the probabilistic modelling framework for data analysis. The main idea of the framework is to model the dispersion of the data, as well as the uncertainty about the model itself, by probability distributions. Three data analysis tasks are presented, and for each of them the discussion is based on experimental results from real datasets. The first task considers the problem of linear subspace identification. We show how one can replace a Gaussian noise model with a Student-t noise model to make the identification more robust to atypical samples while keeping the learning procedure simple. The second task is regression, applied more specifically to near-infrared spectroscopy datasets. We show how spectra should be pre-processed before entering the regression model. We then analyse the validity of the Bayesian model selection principle for this application (in particular within the Gaussian Process formulation) and compare this principle to the resampling selection scheme. The final task considered is Collaborative Filtering, which is related to applications such as recommendation for e-commerce and text mining. This task illustrates how intuitive considerations can guide the design of the model and the choice of the probability distributions appearing in it. We compare the intuitive approach with a simpler matrix factorisation approach.
author	Delannay, Nicolas
title	Aspects of probabilistic modelling for data analysis
publisher	Universite catholique de Louvain
publishDate	2007
url	http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-10122007-121326/

Similar Items