High-dimensional data analysis : optimal metrics and feature selection

High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many data analysis tools at disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumpti...

Full description

Bibliographic Details
Main Author: François, Damien
Format: Others
Language:en
Published: Universite catholique de Louvain 2007
Subjects:
Online Access:http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/
id ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-01152007-162739
record_format oai_dc
spelling ndltd-BICfB-oai-ucl.ac.be-ETDUCL-BelnUcetd-01152007-1627392013-01-07T15:42:00Z High-dimensional data analysis : optimal metrics and feature selection François, Damien High-dimensional data Machine learning Data analysis Data mining Artificial intelligence High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many data analysis tools at disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumptions made while developing the classical data analysis tools are not transposable to high-dimensional data. For instance, many tools rely on the Euclidean distance, to compare data elements. But the Euclidean distance concentrates in high-dimensional spaces: all distances between data elements seem identical. The Euclidean distance is furthermore incapable of identifying important attributes from irrelevant ones. This thesis therefore focuses the choice of a relevant distance function to compare high-dimensional data and the selection of the relevant attributes. In Part One of the thesis, the phenomenon of the concentration of the distances is considered, and its consequences on data analysis tools are studied. It is shown that for nearest neighbours search, the Euclidean distance and the Gaussian kernel, both heavily used, may not be appropriate; it is thus proposed to use Fractional metrics and Generalised Gaussian kernels. Part Two of this thesis focuses on the problem of feature selection in the case of a large number of initial features. Two methods are proposed to (1) reduce the computational burden of feature selection process and (2) cope with the instability induced by high correlation between features that often appear with high-dimensional data. Most of the concepts studied and presented in this thesis are illustrated on chemometric data, and more particularly on spectral data, with the objective of inferring a physical or chemical property of a material by analysis the spectrum of the light it reflects. Universite catholique de Louvain 2007-01-10 text application/pdf http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/ http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/ en unrestricted J'accepte que le texte de la thèse (ci-après l'oeuvre), sous réserve des parties couvertes par la confidentialité, soit publié dans le recueil électronique des thèses UCL. A cette fin, je donne licence à l'UCL : - le droit de fixer et de reproduire l'oeuvre sur support électronique : logiciel ETD/db - le droit de communiquer l'oeuvre au public Cette licence, gratuite et non exclusive, est valable pour toute la durée de la propriété littéraire et artistique, y compris ses éventuelles prolongations, et pour le monde entier. Je conserve tous les autres droits pour la reproduction et la communication de la thèse, ainsi que le droit de l'utiliser dans de futurs travaux. Je certifie avoir obtenu, conformément à la législation sur le droit d'auteur et aux exigences du droit à l'image, toutes les autorisations nécessaires à la reproduction dans ma thèse d'images, de textes, et/ou de toute oeuvre protégés par le droit d'auteur, et avoir obtenu les autorisations nécessaires à leur communication à des tiers. Au cas où un tiers est titulaire d'un droit de propriété intellectuelle sur tout ou partie de ma thèse, je certifie avoir obtenu son autorisation écrite pour l'exercice des droits mentionnés ci-dessus.
collection NDLTD
language en
format Others
sources NDLTD
topic High-dimensional data
Machine learning
Data analysis
Data mining
Artificial intelligence
spellingShingle High-dimensional data
Machine learning
Data analysis
Data mining
Artificial intelligence
François, Damien
High-dimensional data analysis : optimal metrics and feature selection
description High-dimensional data are everywhere: texts, sounds, spectra, images, etc. are described by thousands of attributes. However, many data analysis tools at disposal (coming from statistics, artificial intelligence, etc.) were designed for low-dimensional data. Many of the explicit or implicit assumptions made while developing the classical data analysis tools are not transposable to high-dimensional data. For instance, many tools rely on the Euclidean distance, to compare data elements. But the Euclidean distance concentrates in high-dimensional spaces: all distances between data elements seem identical. The Euclidean distance is furthermore incapable of identifying important attributes from irrelevant ones. This thesis therefore focuses the choice of a relevant distance function to compare high-dimensional data and the selection of the relevant attributes. In Part One of the thesis, the phenomenon of the concentration of the distances is considered, and its consequences on data analysis tools are studied. It is shown that for nearest neighbours search, the Euclidean distance and the Gaussian kernel, both heavily used, may not be appropriate; it is thus proposed to use Fractional metrics and Generalised Gaussian kernels. Part Two of this thesis focuses on the problem of feature selection in the case of a large number of initial features. Two methods are proposed to (1) reduce the computational burden of feature selection process and (2) cope with the instability induced by high correlation between features that often appear with high-dimensional data. Most of the concepts studied and presented in this thesis are illustrated on chemometric data, and more particularly on spectral data, with the objective of inferring a physical or chemical property of a material by analysis the spectrum of the light it reflects.
author François, Damien
author_facet François, Damien
author_sort François, Damien
title High-dimensional data analysis : optimal metrics and feature selection
title_short High-dimensional data analysis : optimal metrics and feature selection
title_full High-dimensional data analysis : optimal metrics and feature selection
title_fullStr High-dimensional data analysis : optimal metrics and feature selection
title_full_unstemmed High-dimensional data analysis : optimal metrics and feature selection
title_sort high-dimensional data analysis : optimal metrics and feature selection
publisher Universite catholique de Louvain
publishDate 2007
url http://edoc.bib.ucl.ac.be:81/ETD-db/collection/available/BelnUcetd-01152007-162739/
work_keys_str_mv AT francoisdamien highdimensionaldataanalysisoptimalmetricsandfeatureselection
_version_ 1716393619567411200