Multivariate linear QSPR/QSAR models: Rigorous evaluation of variable selection for PLS

Basic chemometric methods for making empirical regression models for QSPR/QSAR are briefly described from a user's point of view. Emphasis is given to PLS regression, simple variable selection and a careful and cautious evaluation of the performance of PLS models by repeated double cross valida...


Bibliographic Details
Main Authors: Kurt Varmuza, Peter Filzmoser, Matthias Dehmer
Format: Article
Language: English
Published: Elsevier 2013-02-01
Series: Computational and Structural Biotechnology Journal
Subjects:
PLS
Online Access: http://journals.sfu.ca/rncsb/index.php/csbj/article/view/csbj.201302007
collection DOAJ
sources DOAJ
issn 2001-0370
description Basic chemometric methods for making empirical regression models for QSPR/QSAR are briefly described from a user's point of view. Emphasis is given to PLS regression, simple variable selection and a careful and cautious evaluation of the performance of PLS models by repeated double cross validation (rdCV). A demonstration example is worked out for QSPR models that predict gas chromatographic retention indices (values between 197 and 504 units) of 209 polycyclic aromatic compounds (PAC) from molecular descriptors generated by Dragon software. Most favorable models were obtained from data sets containing also descriptors from 3D structures with all H-atoms (computed by Corina software), using stepwise variable selection (reducing 2688 descriptors to a subset of 22). The final QSPR model has typical prediction errors for the retention index of ±12 units (95% tolerance interval, for test set objects). Programs and data are provided as supplementary material for the open source R software environment.
topic molecular descriptors
PLS
variable selection
cross validation
software R