Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning

Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and reduce computation time and resources. In genomics, FS allows identifying relevant markers and designing low-density SNP...

Bibliographic Details
Main Authors:	Miriam Piles, Rob Bergsma, Daniel Gianola, Hélène Gilbert, Llibertat Tusell
Format:	Article
Language:	English
Published:	Frontiers Media S.A. 2021-02-01
Series:	Frontiers in Genetics
Subjects:	feature selection; stability; machine learning; genomic prediction; SNP; pigs
Online Access:	https://www.frontiersin.org/articles/10.3389/fgene.2021.611506/full

id	doaj-3ec0ddb3325d42559c95749ab4ca9cf6
record_format	Article
spelling	doaj-3ec0ddb3325d42559c95749ab4ca9cf6; 2021-02-22T15:24:23Z; eng; Frontiers Media S.A.; Frontiers in Genetics; 1664-8021; 2021-02-01; 12; 10.3389/fgene.2021.611506; 611506; Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning; Miriam Piles [0]; Rob Bergsma [1]; Daniel Gianola [2, 3]; Hélène Gilbert [4]; Llibertat Tusell [5, 6]; [0] Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Barcelona, Spain; [1] Topigs Norsvin Research Center, Beuningen, Netherlands; [2] Department of Animal Sciences, University of Wisconsin-Madison, Madison, WI, United States; [3] Department of Dairy Science, University of Wisconsin-Madison, Madison, WI, United States; [4] GenPhySE, INRAE, Université de Toulouse, Castanet-Tolosan, France; [5] Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Barcelona, Spain; [6] GenPhySE, INRAE, Université de Toulouse, Castanet-Tolosan, France; https://www.frontiersin.org/articles/10.3389/fgene.2021.611506/full; feature selection; stability; machine learning; genomic prediction; SNP; pigs
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Miriam Piles, Rob Bergsma, Daniel Gianola, Hélène Gilbert, Llibertat Tusell
author_sort	Miriam Piles
title	Feature Selection Stability and Accuracy of Prediction Models for Genomic Prediction of Residual Feed Intake in Pigs Using Machine Learning
publisher	Frontiers Media S.A.
series	Frontiers in Genetics
issn	1664-8021
publishDate	2021-02-01
description	Feature selection (FS, i.e., selection of a subset of predictor variables) is essential in high-dimensional datasets to prevent overfitting of prediction/classification models and reduce computation time and resources. In genomics, FS allows identifying relevant markers and designing low-density SNP chips to evaluate selection candidates. In this research, several univariate and multivariate FS algorithms combined with various parametric and non-parametric learners were applied to the prediction of feed efficiency in growing pigs from high-dimensional genomic data. The objective was to find the best combination of feature selector, SNP subset size, and learner leading to accurate and stable (i.e., less sensitive to changes in the training data) prediction models. Genomic best linear unbiased prediction (GBLUP) without SNP pre-selection was the benchmark. Three types of FS methods were implemented: (i) filter methods: univariate (univ.dtree, spearcor) or multivariate (cforest, mrmr), with random selection as benchmark; (ii) embedded methods: elastic net and least absolute shrinkage and selection operator (LASSO) regression; (iii) combination of filter and embedded methods. Ridge regression, support vector machine (SVM), and gradient boosting (GB) were applied after pre-selection performed with the filter methods. Data represented 5,708 individual records of residual feed intake to be predicted from the animal’s own genotype. Accuracy (stability of results) was measured as the median (interquartile range) of the Spearman correlation between observed and predicted data in a 10-fold cross-validation. The best prediction in terms of accuracy and stability was obtained with SVM and GB using 500 or more SNPs [0.28 (0.02) and 0.27 (0.04) for SVM and GB with 1,000 SNPs, respectively]. With larger subset sizes (1,000–1,500 SNPs), the filter method had no influence on prediction quality, which was similar to that attained with a random selection. With 50–250 SNPs, the FS method had a huge impact on prediction quality: it was very poor for tree-based methods combined with any learner, but good and similar to what was obtained with larger SNP subsets when spearcor or mrmr were implemented with or without embedded methods. Those filters also led to very stable results, suggesting their potential use for designing low-density SNP chips for genome-based evaluation of feed efficiency.
topic	feature selection; stability; machine learning; genomic prediction; SNP; pigs
url	https://www.frontiersin.org/articles/10.3389/fgene.2021.611506/full
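
The evaluation protocol summarized in the description (a Spearman-correlation filter applied before a learner, with accuracy and stability reported as the median and interquartile range of the Spearman correlation over 10 cross-validation folds) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: all function names (ranks, spearman, spearcor_filter, cv_accuracy_stability) and the fit_predict callback are hypothetical, the choice to run the filter inside each training fold is an assumption, and any learner can be plugged in.

```python
import random
import statistics


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def spearcor_filter(X, y, k):
    """Indices of the k columns with the largest |Spearman| with y."""
    scores = [abs(spearman([row[j] for row in X], y)) for j in range(len(X[0]))]
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def cv_accuracy_stability(X, y, fit_predict, k_snps, n_folds=10, seed=0):
    """Return (median, IQR) of the per-fold Spearman correlation.

    fit_predict(X_train, y_train, X_test) -> predictions is a hypothetical
    callback standing in for the learner (ridge, SVM, GB, ...).
    """
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    cors = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        # Assumption: features are selected on the training fold only,
        # so the held-out records never influence the selection.
        cols = spearcor_filter(X_tr, y_tr, k_snps)
        X_tr_sel = [[row[j] for j in cols] for row in X_tr]
        X_te_sel = [[X[i][j] for j in cols] for i in test]
        pred = fit_predict(X_tr_sel, y_tr, X_te_sel)
        cors.append(spearman([y[i] for i in test], pred))
    q1, _, q3 = statistics.quantiles(cors, n=4)
    return statistics.median(cors), q3 - q1
```

A call such as cv_accuracy_stability(X, y, my_learner, k_snps=250) would return one (median, IQR) pair per combination of filter, subset size, and learner, which is the form in which the description reports its results, e.g. 0.28 (0.02).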

Similar Items