An exploration of overfitting in feature-rich bioinformatics learning methods

Master's === National Yang-Ming University === Institute of Biomedical Informatics === 106 === Computational learning methods have been applied to bioinformatics problems for decades. Many such attempts achieve satisfactory prediction performance by introducing a large number of structural or sequence attributes. However, overfitting problems are hard...

Bibliographic Details
Main Authors:	Hao-Hsuan Shih, 施皓軒
Other Authors:	Kuo-Bin Li
Format:	Others
Language:	en_US
Published:	2018
Online Access:	http://ndltd.ncl.edu.tw/handle/8a2cwb

id	ndltd-TW-106YM005114050
record_format	oai_dc
spelling	ndltd-TW-106YM005114050 2019-09-19T03:30:14Z http://ndltd.ncl.edu.tw/handle/8a2cwb An exploration of overfitting in feature-rich bioinformatics learning methods 探討在高維度生物資訊學習方法的過適問題 Hao-Hsuan Shih 施皓軒 Master's, National Yang-Ming University, Institute of Biomedical Informatics, 106 Kuo-Bin Li 李國彬 2018 thesis 27 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Master's === National Yang-Ming University === Institute of Biomedical Informatics === 106 === Computational learning methods have been applied to bioinformatics problems for decades. Many such attempts achieve satisfactory prediction performance by introducing a large number of structural or sequence attributes. However, overfitting is hard to observe immediately during model building, and various factors can cause it in machine learning. Cross-validation, a common method for evaluating model performance, helps users monitor whether overfitting occurs, and models can also be examined with independent data. This study addresses whether overfitting is a factor behind the sometimes exceptionally high classification performance and seeks to identify the causes of overfitting. Three protein sequence datasets and three bioinformatics sequence descriptors are used in the study. The descriptors are (1) the Scoring Card Method (SCM), (2) the Multi-scale Local Descriptor method (MLD), and (3) the Distance Frequency method (DF). SCM calculates dipeptide weights optimized with a genetic algorithm. MLD builds different combinations from divided sequences and involves the composition, transition, and distribution of residues. DF considers the frequency of distances within the same property. The datasets are (1) a protein interaction dataset (hub vs. end proteins) collected from the Human Protein Reference Database (HPRD), (2) a protein localization dataset (chloroplast vs. mitochondrial proteins) assembled from the MultiLoc datasets, the MultiP dataset, and the Mammalian Protein Localization Database, and (3) the DNA- and RNA-binding protein dataset provided in the study of Peled. We study the problem using the traditional strategy involving cross-validation as well as an independent testing stage. In the results, SCM shows the greatest tendency to overfit, possibly because its fitness function is not properly designed to match the final prediction performance. MLD also exhibits apparent overfitting, likely attributable to its large number of sequence features, some of which carry useless information. Finally, DF is slightly less prone to overfitting because it applies PCA, which retains the meaningful information. In addition, we found that bias in the data also affects prediction performance. In summary, overfitting is suspected to occur in all three feature-rich methods, to different extents. This work also describes a couple of issues that lead to overfitting.
author2	Kuo-Bin Li
author_facet	Kuo-Bin Li Hao-Hsuan Shih 施皓軒
author	Hao-Hsuan Shih 施皓軒
author_sort	Hao-Hsuan Shih
title	An exploration of overfitting in feature-rich bioinformatics learning methods
publishDate	2018
url	http://ndltd.ncl.edu.tw/handle/8a2cwb
_version_	1719252686806712320
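
The evaluation strategy named in the description (cross-validation plus an independent testing stage) can be illustrated with a minimal Python sketch. This is not the thesis's code: the function names `k_fold_indices`, `cross_validate`, and `generalization_gap`, as well as the `train_fn` and `predict_fn` callbacks, are hypothetical and chosen only for illustration. It assumes accuracy is the metric and that samples were shuffled beforehand, since the folds are contiguous.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, near-equal test folds."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(samples, labels, k, train_fn, predict_fn):
    """Return the accuracy measured on each held-out fold.

    train_fn(train_x, train_y) -> model and predict_fn(model, sample) -> label
    are placeholders for whatever learner is being examined.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(samples), k):
        held_out = set(test_idx)
        train_x = [s for i, s in enumerate(samples) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = train_fn(train_x, train_y)
        hits = sum(predict_fn(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies


def generalization_gap(cv_accuracies, independent_accuracy):
    """Mean cross-validation accuracy minus independent-test accuracy."""
    return mean(cv_accuracies) - independent_accuracy
```

A clearly positive gap would suggest that the cross-validation estimate was optimistic, which is the kind of signal the description associates with overfitting. How large a gap is meaningful depends on the dataset size and is not specified in this record.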
