High-dimensional data analysis : optimal metrics and feature selection

High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many of the data analysis tools at our disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumpti...

Bibliographic Details
Main Author:	François, Damien
Format:	Others
Language:	en
Published:	Universite catholique de Louvain 2007
Subjects:	High-dimensional data Machine learning Data analysis Data mining Artificial intelligence
Online Access:	http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/

id	ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-01152007-162739
record_format	oai_dc
spelling	ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-01152007-162739 2013-01-07T15:42:00Z Universite catholique de Louvain 2007-01-10 text application/pdf http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/ en unrestricted I agree that the text of the thesis (hereafter the work), subject to the parts covered by confidentiality, be published in the UCL electronic theses collection. To this end, I grant UCL a licence comprising: - the right to fix and reproduce the work on an electronic medium: ETD/db software - the right to communicate the work to the public. This licence, free of charge and non-exclusive, is valid for the full duration of the literary and artistic property rights, including any extensions, and worldwide. I retain all other rights to reproduce and communicate the thesis, as well as the right to use it in future work. I certify that I have obtained, in accordance with copyright legislation and the requirements of image rights, all the authorisations necessary for the reproduction in my thesis of images, texts, and/or any work protected by copyright, and that I have obtained the authorisations necessary for communicating them to third parties. Where a third party holds an intellectual property right over all or part of my thesis, I certify that I have obtained their written authorisation to exercise the rights mentioned above.
collection	NDLTD
language	en
format	Others
sources	NDLTD
topic	High-dimensional data Machine learning Data analysis Data mining Artificial intelligence
description	High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many of the data analysis tools at our disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumptions made while developing the classical data analysis tools do not carry over to high-dimensional data. For instance, many tools rely on the Euclidean distance to compare data elements. But the Euclidean distance concentrates in high-dimensional spaces: all distances between data elements seem identical. The Euclidean distance is furthermore incapable of distinguishing important attributes from irrelevant ones. This thesis therefore focuses on the choice of a relevant distance function to compare high-dimensional data and on the selection of the relevant attributes. In Part One of the thesis, the phenomenon of the concentration of distances is considered, and its consequences for data analysis tools are studied. It is shown that for nearest-neighbour search, the Euclidean distance and the Gaussian kernel, both heavily used, may not be appropriate; it is thus proposed to use fractional metrics and generalised Gaussian kernels. Part Two of this thesis focuses on the problem of feature selection in the case of a large number of initial features. Two methods are proposed to (1) reduce the computational burden of the feature selection process and (2) cope with the instability induced by the high correlation between features that often appears with high-dimensional data. Most of the concepts studied and presented in this thesis are illustrated on chemometric data, and more particularly on spectral data, with the objective of inferring a physical or chemical property of a material by analysing the spectrum of the light it reflects.
author	François, Damien
title	High-dimensional data analysis : optimal metrics and feature selection
publisher	Universite catholique de Louvain
publishDate	2007
url	http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/
work_keys_str_mv	AT francoisdamien highdimensionaldataanalysisoptimalmetricsandfeatureselection
_version_	1716393619567411200
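
The distance concentration mentioned in the description can be illustrated with a small numerical experiment. The sketch below is not taken from the thesis: it assumes uniformly distributed random points, and the helper names `minkowski` and `relative_contrast` are hypothetical. It measures how the relative gap between the farthest and the nearest neighbour of a query point shrinks as the dimension grows, for the Euclidean distance (p = 2) and for one fractional exponent (p = 0.5).

```python
import random


def minkowski(x, y, p):
    """Minkowski-type dissimilarity (sum_i |x_i - y_i|^p)^(1/p).

    p = 2 gives the Euclidean distance; 0 < p < 1 gives what the
    literature calls a fractional metric (illustrative choice here).
    """
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def relative_contrast(dim, p, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    The query and the n_points data points are drawn uniformly from
    the unit hypercube; this data model is an assumption of the sketch.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = [
        minkowski(query, [rng.random() for _ in range(dim)], p)
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)


if __name__ == "__main__":
    # Print the contrast for growing dimension, for p = 2 and p = 0.5.
    for dim in (2, 10, 100, 1000):
        euclid = relative_contrast(dim, 2.0)
        fractional = relative_contrast(dim, 0.5)
        print(f"dim={dim:5d}  p=2: {euclid:8.3f}  p=0.5: {fractional:8.3f}")
```

A relative contrast close to zero means that the nearest and farthest points are almost equally distant from the query, which is the situation the description refers to when it says that all distances seem identical. How the two exponents compare depends on the data distribution, so the printed values should be read as an illustration only.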


Similar Items