CHICKN: extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis

Background: The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest as a way to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected numb...

Bibliographic Details
Main Authors: Burger, T. (Author), Fortin, T. (Author), Guibert, R. (Author), Hesse, A.-M. (Author), Kraut, A. (Author), Permiakova, O. (Author)
Format: Article
Language: English
Published: BioMed Central Ltd 2021
Online Access: View Fulltext in Publisher (https://doi.org/10.1186/s12859-021-03969-0)
LEADER 03584nam a2200661Ia 4500
001 10.1186-s12859-021-03969-0
008 220427s2021 CNT 000 0 und d
020 |a 14712105 (ISSN) 
245 1 0 |a CHICKN: extraction of peptide chromatographic elution profiles from large scale mass spectrometry data by means of Wasserstein compressive hierarchical cluster analysis 
260 0 |b BioMed Central Ltd  |c 2021 
856 |z View Fulltext in Publisher  |u https://doi.org/10.1186/s12859-021-03969-0 
520 3 |a Background: The clustering of data produced by liquid chromatography coupled to mass spectrometry analyses (LC-MS data) has recently gained interest as a way to extract meaningful chemical or biological patterns. However, recent instrumental pipelines deliver data whose size, dimensionality and expected number of clusters are too large to be processed by classical machine learning algorithms, so that most state-of-the-art methods rely on single-pass linkage-based algorithms. Results: We propose a clustering algorithm that solves the powerful but computationally demanding kernel k-means objective function in a scalable way. As a result, it can process LC-MS data in an acceptable time on a multicore machine. To do so, we combine three essential features: a compressive data representation, Nyström approximation and a hierarchical strategy. In addition, we propose new kernels based on optimal transport, which can be interpreted as intuitive similarity measures between chromatographic elution profiles. Conclusions: Our method, referred to as CHICKN, is evaluated on proteomics data produced in our lab, as well as on benchmark data from the literature. From a computational viewpoint, it is particularly efficient on raw LC-MS data. From a data analysis viewpoint, it provides clusters that differ from those resulting from state-of-the-art methods, while achieving similar performance. This highlights the complementarity of differently principled algorithms for extracting the most from complex LC-MS data. © 2021, The Author(s). 
650 0 4 |a algorithm 
650 0 4 |a Algorithms 
650 0 4 |a chemistry 
650 0 4 |a Chromatography, Liquid 
650 0 4 |a cluster analysis 
650 0 4 |a Cluster Analysis 
650 0 4 |a Data Compression 
650 0 4 |a Data mining 
650 0 4 |a Data representations 
650 0 4 |a Essential features 
650 0 4 |a Hierarchical clustering 
650 0 4 |a Hierarchical strategies 
650 0 4 |a Hierarchical systems 
650 0 4 |a information processing 
650 0 4 |a K-means clustering 
650 0 4 |a Large-scale cluster analysis 
650 0 4 |a Learning algorithms 
650 0 4 |a liquid chromatography 
650 0 4 |a Liquid chromatography 
650 0 4 |a Machine learning 
650 0 4 |a mass spectrometry 
650 0 4 |a Mass spectrometry 
650 0 4 |a Mass Spectrometry 
650 0 4 |a Mass spectrometry analysis 
650 0 4 |a Mass spectrometry data 
650 0 4 |a Multi-core machines 
650 0 4 |a Objective functions 
650 0 4 |a Optimal transport 
650 0 4 |a peptide 
650 0 4 |a Peptides 
650 0 4 |a procedures 
650 0 4 |a proteomics 
650 0 4 |a Proteomics 
650 0 4 |a State-of-the-art methods 
650 0 4 |a Wasserstein kernel 
700 1 |a Burger, T.  |e author 
700 1 |a Fortin, T.  |e author 
700 1 |a Guibert, R.  |e author 
700 1 |a Hesse, A.-M.  |e author 
700 1 |a Kraut, A.  |e author 
700 1 |a Permiakova, O.  |e author 
773 |t BMC Bioinformatics