An exploration of overfitting in feature-rich bioinformatics learning methods

Master's === National Yang-Ming University === Institute of Biomedical Informatics === 106 === Computational learning methods have been applied to bioinformatics problems for decades. Many such attempts achieve satisfactory prediction performance by introducing a large number of structural or sequence attributes. However, overfitting problems are hard...

Bibliographic Details
Main Authors: Hao-Hsuan Shih, 施皓軒
Other Authors: Kuo-Bin Li
Format: Others
Language: en_US
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/8a2cwb
id ndltd-TW-106YM005114050
record_format oai_dc
spelling ndltd-TW-106YM005114050 2019-09-19T03:30:14Z http://ndltd.ncl.edu.tw/handle/8a2cwb An exploration of overfitting in feature-rich bioinformatics learning methods 探討在高維度生物資訊學習方法的過適問題 Hao-Hsuan Shih 施皓軒 Master's National Yang-Ming University Institute of Biomedical Informatics 106 Kuo-Bin Li 李國彬 2018 thesis 27 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Yang-Ming University === Institute of Biomedical Informatics === 106 === Computational learning methods have been applied to bioinformatics problems for decades. Many such attempts achieve satisfactory prediction performance by introducing a large number of structural or sequence attributes. However, overfitting is hard to observe immediately during model building, and various factors can cause it in machine learning. Cross-validation, a common method for evaluating model performance, can help users monitor whether overfitting occurs; models can also be examined with independent data. This study asks whether overfitting is a factor behind the sometimes exceptionally high classification performance, and seeks to identify the causes of overfitting. Three protein sequence datasets and three bioinformatics sequence descriptors are used in the study. The descriptors are (1) the Scoring Card Method (SCM), (2) the Multi-scale Local Descriptor method (MLD), and (3) the Distance Frequency method (DF). SCM calculates dipeptide weights that are optimized with a genetic algorithm. MLD builds different combinations of segments from the divided sequence and involves the composition, transition, and distribution of residues. DF considers the frequency of distances within the same property. The datasets are (1) the protein interaction dataset (hub vs. end proteins) collected from the Human Protein Reference Database (HPRD), (2) the protein localization dataset (chloroplast vs. mitochondrial proteins) assembled from the MultiLoc datasets, the MultiP dataset, and the Mammalian Protein Localization Database, and (3) the DNA- and RNA-binding protein dataset provided in the study of Peled. We study the problem using the traditional strategy of cross-validation followed by an independent testing stage.
In the results, SCM shows the greatest tendency to overfit, possibly because its fitness function is not properly designed to match the final prediction performance. MLD also exhibits apparent overfitting, which is likely attributable to the large number of sequence features, some of which carry useless information. Finally, DF is slightly less prone to overfitting because it applies PCA, which retains the meaningful information. In addition, we found that bias in the data also affects prediction performance. In summary, overfitting is suspected to occur in all three feature-rich methods, to different extents. This work also describes a couple of issues that lead to overfitting.
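The evaluation strategy named in the abstract (cross-validation followed by an independent test) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the thesis: the data are random sequences with arbitrary labels, the features are plain dipeptide frequencies (400 attributes), and a nearest-centroid classifier stands in for SCM, MLD, and DF. All function names are hypothetical.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 features
INDEX = {d: i for i, d in enumerate(DIPEPTIDES)}


def dipeptide_composition(seq):
    """Map a sequence to its 400-dimensional dipeptide frequency vector."""
    vec = [0.0] * len(DIPEPTIDES)
    pairs = len(seq) - 1
    for i in range(pairs):
        idx = INDEX.get(seq[i:i + 2])  # pairs with non-standard residues are skipped
        if idx is not None:
            vec[idx] += 1.0 / pairs
    return vec


def fit_centroids(X, y):
    """Nearest-centroid model: the mean feature vector of each class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class with the closest centroid (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist(centroids[label]))


def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)


def cross_val_accuracy(X, y, k=5):
    """Plain k-fold cross-validation; fold f holds out samples with index % k == f."""
    correct = 0
    for f in range(k):
        train = [i for i in range(len(y)) if i % k != f]
        test = [i for i in range(len(y)) if i % k == f]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(model, X[i]) == y[i] for i in test)
    return correct / len(y)


def random_dataset(n, length, rng):
    """Synthetic data with no real signal: random sequences, alternating labels."""
    X = [dipeptide_composition("".join(rng.choice(AMINO_ACIDS) for _ in range(length)))
         for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


rng = random.Random(0)
X_train, y_train = random_dataset(40, 60, rng)
X_indep, y_indep = random_dataset(40, 60, rng)

model = fit_centroids(X_train, y_train)
train_acc = accuracy(model, X_train, y_train)  # resubstitution accuracy
cv_acc = cross_val_accuracy(X_train, y_train)  # 5-fold cross-validation estimate
indep_acc = accuracy(model, X_indep, y_indep)  # independent test set

print(f"train={train_acc:.2f}  cv={cv_acc:.2f}  independent={indep_acc:.2f}")
```

Because the synthetic labels carry no signal, any accuracy well above chance on the training set can only come from fitting noise in the 400 attributes. The cross-validated and independent estimates are the figures to compare it against.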
author2 Kuo-Bin Li
author_facet Kuo-Bin Li
Hao-Hsuan Shih
施皓軒
author Hao-Hsuan Shih
施皓軒
spellingShingle Hao-Hsuan Shih
施皓軒
An exploration of overfitting in feature-rich bioinformatics learning methods
author_sort Hao-Hsuan Shih
title An exploration of overfitting in feature-rich bioinformatics learning methods
title_sort exploration of overfitting in feature-rich bioinformatics learning methods
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/8a2cwb