Application of chemometrics for the robust analysis of chemical and biochemical data

In the last two decades chemometrics has become an essential tool for the experimental biologist and chemist. The level of contribution varies strongly depending on the type of research performed. Therefore, chemometrics may be used to interpret and explain results, to compare experimental data with...

Bibliographic Details
Main Author: Gromski, Piotr Sebastian
Published: University of Manchester 2015
Subjects:
543
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.654801
id ndltd-bl.uk-oai-ethos.bl.uk-654801
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-654801 2017-07-25T03:25:45Z Application of chemometrics for the robust analysis of chemical and biochemical data Gromski, Piotr Sebastian 2015 543 Chemometrics University of Manchester http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.654801 https://www.research.manchester.ac.uk/portal/en/theses/application-of-chemometrics-for-the-robust-analysis-of-chemical-and-biochemical-data(3049006f-e218-4286-83a8-e1fd85004366).html Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 543
Chemometrics
description In the last two decades chemometrics has become an essential tool for the experimental biologist and chemist. The level of contribution varies strongly depending on the type of research performed. Chemometrics may be used to interpret and explain results, to compare experimental data with real-world ‘unseen’ data, to accurately detect certain chemical vapours, to identify cancer-related metabolites, to identify and rank potentially relevant or important variables, or simply to provide a pictorial interpretation and understanding of the results. Whilst many chemometric methods are well established in chemistry and metabolomics, many scientists still use them with what is often referred to as a ‘black-box’ approach, that is, without prior knowledge of the methods and their well-recognised statistical properties. This lack of knowledge is attributable to the wide availability of powerful computers and, perhaps more notably, of up-to-date, easy-to-use and reliable software. The main aim of this study is to reduce this gap by providing an extensive demonstration of several approaches applied at different stages of the data analysis pipeline, highlighting the importance of appropriate method selection. The comparisons are based on both chemical and biochemical (metabolomics) data and provide a firm basis for researchers to understand chemometric methods and the influence of parameter selection. Consequently, this thesis explores and compares different approaches employed at various statistical steps. These include pre-treatment steps such as dealing with missing data and scaling. First, different methods for substituting missing values and their influence on unsupervised and supervised learning were compared; it was shown that metabolites that display skewness in their distribution can have a significant impact on the replacement approach. The scaling approaches were compared in terms of their effect on classification accuracy for a variety of metabolomics data sets. It was shown that the most standard option, autoscaling, is not always the best. Next, a comparison of various variable selection methods commonly used for the analysis of chemical data was carried out. The results revealed that random forests, with their variable selection techniques, and support vector machines, combined with recursive feature elimination as a variable selection method, gave the best results in comparison to other approaches. Moreover, in this study a double cross-validation procedure was applied to minimize the consequences of over-fitting. Finally, seven different algorithms and two model validation procedures, based on either 10-fold cross-validation or bootstrapping, were investigated in order to allow direct comparison between different classification approaches.
author Gromski, Piotr Sebastian
title Application of chemometrics for the robust analysis of chemical and biochemical data
publisher University of Manchester
publishDate 2015
url http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.654801