An exploration of overfitting in feature-rich bioinformatics learning methods
Master's thesis === National Yang-Ming University === Institute of Biomedical Informatics === 106 === Computational learning methods have been applied to bioinformatics problems for decades. Many such attempts come with satisfactory prediction performance by introducing a great number of structural or sequence attributes. However, overfitting problems are hard...
Main Authors: | Hao-Hsuan Shih, 施皓軒 |
---|---|
Other Authors: | Kuo-Bin Li |
Format: | Others |
Language: | en_US |
Published: | 2018 |
Online Access: | http://ndltd.ncl.edu.tw/handle/8a2cwb |
id |
ndltd-TW-106YM005114050 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-106YM005114050 2019-09-19T03:30:14Z http://ndltd.ncl.edu.tw/handle/8a2cwb An exploration of overfitting in feature-rich bioinformatics learning methods 探討在高維度生物資訊學習方法的過適問題 Hao-Hsuan Shih 施皓軒 Master's thesis National Yang-Ming University Institute of Biomedical Informatics 106 Computational learning methods have been applied to bioinformatics problems for decades. Many such attempts achieve satisfactory prediction performance by introducing a great number of structural or sequence attributes. However, overfitting is hard to detect immediately during model building. Various factors can cause overfitting in machine learning. Cross-validation, a common method for evaluating model performance, can help users monitor whether overfitting occurs, and models can also be examined with independent data. This study addresses the question of whether overfitting is a factor behind the sometimes exceptionally high classification performance, and seeks to identify causes of overfitting. Three protein sequence datasets and three bioinformatics sequence descriptors are used in the study. The descriptors are (1) the Scoring Card Method (SCM), (2) the Multi-scale Local Descriptor method (MLD), and (3) the Distance Frequency method (DF). SCM calculates dipeptide weights optimized with a genetic algorithm. MLD builds different combinations of segments from a divided sequence and uses the composition, transition, and distribution of residues. DF considers the frequency of distances between residues with the same property. The datasets are (1) the protein interaction dataset (hub vs end proteins) collected from the Human Protein Reference Database (HPRD), (2) the protein localization dataset (chloroplast vs mitochondrial proteins) assembled from datasets of MultiLoc, dataset of MultiP and the Mammalian Protein Localization Database, and (3) the DNA and RNA binding protein dataset provided in the study of Peled. We study the problem using the traditional strategy of cross-validation followed by an independent testing stage. In the results, SCM shows the greatest tendency to overfit, possibly because its fitness function is not well designed to match the final prediction performance. MLD also exhibits apparent overfitting, likely because of its large number of sequence features, some of which carry no useful information. Finally, DF is slightly less prone to overfitting because it applies PCA, which retains the meaningful information. In addition, we discovered that bias in the data also affects prediction performance. In summary, overfitting is suspected to occur in all three feature-rich methods, to different extents. This work also describes a couple of issues that lead to overfitting. Kuo-Bin Li 李國彬 2018 學位論文 ; thesis 27 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's thesis === National Yang-Ming University === Institute of Biomedical Informatics === 106 === Computational learning methods have been applied to bioinformatics problems for decades. Many such attempts achieve satisfactory prediction performance by introducing a great number of structural or sequence attributes. However, overfitting is hard to detect immediately during model building. Various factors can cause overfitting in machine learning. Cross-validation, a common method for evaluating model performance, can help users monitor whether overfitting occurs, and models can also be examined with independent data. This study addresses the question of whether overfitting is a factor behind the sometimes exceptionally high classification performance, and seeks to identify causes of overfitting. Three protein sequence datasets and three bioinformatics sequence descriptors are used in the study. The descriptors are (1) the Scoring Card Method (SCM), (2) the Multi-scale Local Descriptor method (MLD), and (3) the Distance Frequency method (DF). SCM calculates dipeptide weights optimized with a genetic algorithm. MLD builds different combinations of segments from a divided sequence and uses the composition, transition, and distribution of residues. DF considers the frequency of distances between residues with the same property. The datasets are (1) the protein interaction dataset (hub vs end proteins) collected from the Human Protein Reference Database (HPRD), (2) the protein localization dataset (chloroplast vs mitochondrial proteins) assembled from datasets of MultiLoc, dataset of MultiP and the Mammalian Protein Localization Database, and (3) the DNA and RNA binding protein dataset provided in the study of Peled. We study the problem using the traditional strategy of cross-validation followed by an independent testing stage. In the results, SCM shows the greatest tendency to overfit, possibly because its fitness function is not well designed to match the final prediction performance. MLD also exhibits apparent overfitting, likely because of its large number of sequence features, some of which carry no useful information. Finally, DF is slightly less prone to overfitting because it applies PCA, which retains the meaningful information. In addition, we discovered that bias in the data also affects prediction performance. In summary, overfitting is suspected to occur in all three feature-rich methods, to different extents. This work also describes a couple of issues that lead to overfitting.
|
author2 |
Kuo-Bin Li |
author_facet |
Kuo-Bin Li Hao-Hsuan Shih 施皓軒 |
author |
Hao-Hsuan Shih 施皓軒 |
spellingShingle |
Hao-Hsuan Shih 施皓軒 An exploration of overfitting in feature-rich bioinformatics learning methods |
author_sort |
Hao-Hsuan Shih |
title |
An exploration of overfitting in feature-rich bioinformatics learning methods |
publishDate |
2018 |
url |
http://ndltd.ncl.edu.tw/handle/8a2cwb |
_version_ |
1719252686806712320 |
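The abstract's central idea, that a feature-rich sequence descriptor can look accurate on the data it was fitted to while cross-validation reveals little real predictive power, can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the thesis: the sequences are random, the labels are arbitrary, the descriptor is plain dipeptide composition (400 features), and the classifier is a simple nearest-centroid rule rather than SCM, MLD, or DF.

```python
import random
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered amino-acid pairs; one feature per dipeptide.
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
INDEX = {dp: i for i, dp in enumerate(DIPEPTIDES)}


def dipeptide_composition(seq):
    """Return the 400-dimensional dipeptide frequency vector of a sequence."""
    counts = [0.0] * len(DIPEPTIDES)
    for i in range(len(seq) - 1):
        counts[INDEX[seq[i:i + 2]]] += 1.0
    total = max(len(seq) - 1, 1)
    return [c / total for c in counts]


def fit_centroids(X, y):
    """'Train' a nearest-centroid model: one mean feature vector per class."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, X, y):
    return sum(predict(centroids, x) == t for x, t in zip(X, y)) / len(y)


def cross_val_accuracy(X, y, k=5):
    """Mean accuracy over k folds; each fold is held out of training once."""
    scores = []
    for fold in range(k):
        held_out = list(range(fold, len(y), k))
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        scores.append(
            accuracy(model, [X[i] for i in held_out], [y[i] for i in held_out])
        )
    return sum(scores) / k


# Hypothetical data: 40 random sequences with labels unrelated to the sequences,
# so any accuracy above chance on the training data reflects memorization.
rng = random.Random(0)
sequences = ["".join(rng.choice(AMINO_ACIDS) for _ in range(50)) for _ in range(40)]
labels = [i % 2 for i in range(len(sequences))]
features = [dipeptide_composition(s) for s in sequences]

train_acc = accuracy(fit_centroids(features, labels), features, labels)
cv_acc = cross_val_accuracy(features, labels, k=5)
print(f"training accuracy:        {train_acc:.2f}")
print(f"cross-validated accuracy: {cv_acc:.2f}")
```

Because the labels carry no signal, a large gap between the two printed numbers points to overfitting rather than learning. The same comparison extends to the independent testing stage mentioned in the abstract by scoring the fitted model on sequences that were never used during model building.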