High-dimensional data analysis : optimal metrics and feature selection
High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many data analysis tools at disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumpti...
Main Author: | |
---|---|
Format: | Others |
Language: | en |
Published: |
Universite catholique de Louvain
2007
|
Subjects: | |
Online Access: | http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/ |
id |
ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-01152007-162739 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-01152007-1627392013-01-07T15:42:00Z High-dimensional data analysis : optimal metrics and feature selection François, Damien High-dimensional data Machine learning Data analysis Data mining Artificial intelligence High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many data analysis tools at disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumptions made while developing the classical data analysis tools are not transposable to high-dimensional data. For instance, many tools rely on the Euclidean distance, to compare data elements. But the Euclidean distance concentrates in high-dimensional spaces: all distances between data elements seem identical. The Euclidean distance is furthermore incapable of identifying important attributes from irrelevant ones. This thesis therefore focuses the choice of a relevant distance function to compare high-dimensional data and the selection of the relevant attributes. In Part One of the thesis, the phenomenon of the concentration of the distances is considered, and its consequences on data analysis tools are studied. It is shown that for nearest neighbours search, the Euclidean distance and the Gaussian kernel, both heavily used, may not be appropriate; it is thus proposed to use Fractional metrics and Generalised Gaussian kernels. Part Two of this thesis focuses on the problem of feature selection in the case of a large number of initial features. Two methods are proposed to (1) reduce the computational burden of feature selection process and (2) cope with the instability induced by high correlation between features that often appear with high-dimensional data. Most of the concepts studied and presented in this thesis are illustrated on chemometric data, and more particularly on spectral data, with the objective of inferring a physical or chemical property of a material by analysis the spectrum of the light it reflects. Universite catholique de Louvain 2007-01-10 text application/pdf http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/ http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/ en unrestricted J'accepte que le texte de la thèse (ci-après l'oeuvre), sous réserve des parties couvertes par la confidentialité, soit publié dans le recueil électronique des thèses UCL. A cette fin, je donne licence à l'UCL : - le droit de fixer et de reproduire l'oeuvre sur support électronique : logiciel ETD/db - le droit de communiquer l'oeuvre au public Cette licence, gratuite et non exclusive, est valable pour toute la durée de la propriété littéraire et artistique, y compris ses éventuelles prolongations, et pour le monde entier. Je conserve tous les autres droits pour la reproduction et la communication de la thèse, ainsi que le droit de l'utiliser dans de futurs travaux. Je certifie avoir obtenu, conformément à la législation sur le droit d'auteur et aux exigences du droit à l'image, toutes les autorisations nécessaires à la reproduction dans ma thèse d'images, de textes, et/ou de toute oeuvre protégés par le droit d'auteur, et avoir obtenu les autorisations nécessaires à leur communication à des tiers. Au cas où un tiers est titulaire d'un droit de propriété intellectuelle sur tout ou partie de ma thèse, je certifie avoir obtenu son autorisation écrite pour l'exercice des droits mentionnés ci-dessus. |
collection |
NDLTD |
language |
en |
format |
Others
|
sources |
NDLTD |
topic |
High-dimensional data Machine learning Data analysis Data mining Artificial intelligence |
spellingShingle |
High-dimensional data Machine learning Data analysis Data mining Artificial intelligence François, Damien High-dimensional data analysis : optimal metrics and feature selection |
description |
High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many data analysis tools at disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumptions made while developing the classical data analysis tools are not transposable to high-dimensional data. For instance, many tools rely on the Euclidean distance, to compare data elements. But the Euclidean distance concentrates in high-dimensional spaces: all distances between data elements seem identical. The Euclidean distance is furthermore incapable of identifying important attributes from irrelevant ones. This thesis therefore focuses the choice of a relevant distance function to compare high-dimensional data and the selection of the relevant attributes. In Part One of the thesis, the phenomenon of the concentration of the distances is considered, and its consequences on data analysis tools are studied. It is shown that for nearest neighbours search, the Euclidean distance and the Gaussian kernel, both heavily used, may not be appropriate; it is thus proposed to use Fractional metrics and Generalised Gaussian kernels. Part Two of this thesis focuses on the problem of feature selection in the case of a large number of initial features. Two methods are proposed to (1) reduce the computational burden of feature selection process and (2) cope with the instability induced by high correlation between features that often appear with high-dimensional data. Most of the concepts studied and presented in this thesis are illustrated on chemometric data, and more particularly on spectral data, with the objective of inferring a physical or chemical property of a material by analysis the spectrum of the light it reflects. |
author |
François, Damien |
author_facet |
François, Damien |
author_sort |
François, Damien |
title |
High-dimensional data analysis : optimal metrics and feature selection |
title_short |
High-dimensional data analysis : optimal metrics and feature selection |
title_full |
High-dimensional data analysis : optimal metrics and feature selection |
title_fullStr |
High-dimensional data analysis : optimal metrics and feature selection |
title_full_unstemmed |
High-dimensional data analysis : optimal metrics and feature selection |
title_sort |
high-dimensional data analysis : optimal metrics and feature selection |
publisher |
Universite catholique de Louvain |
publishDate |
2007 |
url |
http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/ |
work_keys_str_mv |
AT francoisdamien highdimensionaldataanalysisoptimalmetricsandfeatureselection |
_version_ |
1716393619567411200 |