Comparative analyses between retained introns and constitutively spliced introns in Arabidopsis thaliana using random forest and support vector machine.

One of the important modes of pre-mRNA post-transcriptional modification is alternative splicing. Alternative splicing allows creation of many distinct mature mRNA transcripts from a single gene by utilizing different splice sites. In plants like Arabidopsis thaliana, the most common type of alterna...

Bibliographic Details
Main Authors: Rui Mao, Praveen Kumar Raj Kumar, Cheng Guo, Yang Zhang, Chun Liang
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2014-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4128822?pdf=render
id doaj-82b098f64d7d4c1c8526513d699783aa
record_format Article
spelling doaj-82b098f64d7d4c1c8526513d699783aa | 2020-11-24T21:39:32Z | eng | Public Library of Science (PLoS) | PLoS ONE | ISSN 1932-6203 | 2014-01-01 | volume 9, issue 8, e104049 | doi 10.1371/journal.pone.0104049 | http://europepmc.org/articles/PMC4128822?pdf=render
collection DOAJ
language English
format Article
sources DOAJ
author Rui Mao
Praveen Kumar Raj Kumar
Cheng Guo
Yang Zhang
Chun Liang
author_sort Rui Mao
title Comparative analyses between retained introns and constitutively spliced introns in Arabidopsis thaliana using random forest and support vector machine.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2014-01-01
description Alternative splicing is one of the important modes of post-transcriptional pre-mRNA modification. It allows many distinct mature mRNA transcripts to be created from a single gene by using different splice sites. In plants such as Arabidopsis thaliana, the most common type of alternative splicing is intron retention. Many past studies have focused on the positional distribution of retained introns (RIs) among different genic regions and on the regulation of their expression, while little systematic classification of RIs versus constitutively spliced introns (CSIs) has been conducted using machine learning approaches. We used random forest and a support vector machine (SVM) with a radial basis function (RBF) kernel to differentiate these two types of introns in Arabidopsis. By comparing the coordinates of introns of all annotated mRNAs from TAIR10, we obtained our high-quality experimental data. To distinguish RIs from CSIs, we investigated the unique characteristics of RIs in comparison with CSIs and extracted 37 quantitative features: local and global nucleotide sequence features of introns, frequent motifs, the signal strength of splice sites, and the similarity between sequences of introns and their flanking regions. We demonstrated that our proposed feature extraction approach classified RIs and CSIs more accurately than four other approaches. The optimal penalty parameter C and the RBF kernel parameter γ in the SVM were set using a particle swarm optimization algorithm (PSOSVM). Our classification achieved an F-measure of 80.8% (random forest) and 77.4% (PSOSVM). We not only obtained the basic sequence features and positional distribution characteristics of RIs, but also predicted putative regulatory motifs in intron splicing based on our feature extraction approach. Our study will facilitate a better understanding of the underlying mechanisms involved in intron retention.
url http://europepmc.org/articles/PMC4128822?pdf=render
_version_ 1725930736011182080
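
The description in this record reports classifier quality as an F-measure and names an SVM with an RBF kernel whose parameter is tuned alongside the penalty C. The record does not give formulas for either, so the following is only a minimal sketch under the assumption that the standard definitions apply: the balanced F-measure (harmonic mean of precision and recall) and the RBF kernel K(x, y) = exp(-gamma * ||x - y||^2). The function names `f_measure` and `rbf_kernel` and the example counts are hypothetical illustrations, not values or code from the article.

```python
import math


def f_measure(tp, fp, fn, beta=1.0):
    """F-measure from counts of true positives, false positives, false negatives.

    beta=1.0 gives the balanced F1 score (assumed here; the record does not
    state which variant the authors used).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


if __name__ == "__main__":
    # Illustrative counts only: 8 introns correctly called retained,
    # 2 false alarms, 2 missed.
    print(f_measure(tp=8, fp=2, fn=2))
    # Two toy feature vectors; a larger gamma makes similarity decay faster.
    print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

A larger gamma narrows the kernel so that only very close feature vectors count as similar, which is why gamma and C are usually tuned together, as the description says was done with particle swarm optimization.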