Summary: | Doctoral dissertation === National Sun Yat-sen University === Department of Computer Science and Engineering (graduate program) === 102 === Machine learning approaches have been used in bioinformatics for several decades. Given a sequence composed of nucleotides or amino acids, the goal is to have a trained model predict a property of the sequence without performing experiments. In this dissertation, we focus on two problems of recent interest: the prediction of RNA secondary structure and the prediction of protein essentiality.
An RNA secondary structure is the fold of a nucleotide sequence. Conventional methods usually address the structure prediction problem from a thermodynamic or a comparative perspective. Instead of developing a prediction tool from scratch, we take advantage of state-of-the-art software tools. We adopt a tool preference choice approach that selects a suitable software tool for each prediction, in the hope that the resulting performance is better than that of any single base prediction tool. Our tool selector is built from various RNA sequence features and several SVM classifiers. To facilitate classifier combination and the identification of important features, we propose an incremental feature selection method for classifier ensemble construction. The experimental results show that the achieved prediction accuracy is significantly better than that of any base predictor.
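The idea of incremental feature selection can be sketched as follows. This is a minimal illustration of the general technique, not the dissertation's actual procedure, which the abstract does not spell out. It assumes the features have already been ranked and that some scoring function is available (for instance, the cross-validated accuracy of an SVM-based tool selector). The function `incremental_feature_selection`, the feature names, and the scores in `toy_accuracy` are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def incremental_feature_selection(
    ranked_features: Sequence[str],
    evaluate: Callable[[List[str]], float],
) -> Tuple[List[str], float]:
    """Add ranked features one at a time and keep the best-scoring prefix."""
    best_subset: List[str] = []
    best_score = float("-inf")
    current: List[str] = []
    for feature in ranked_features:
        current.append(feature)
        score = evaluate(current)
        # Strict comparison: on a tie, the earlier (smaller) subset is kept.
        if score > best_score:
            best_score = score
            best_subset = list(current)
    return best_subset, best_score


def toy_accuracy(subset: List[str]) -> float:
    """Stand-in for a real evaluation such as cross-validated accuracy.

    The feature names and gains below are made up for illustration.
    """
    gains = {
        "gc_content": 0.30,
        "length": 0.20,
        "dinucleotide_freq": 0.10,
        "noise": -0.05,
    }
    return 0.30 + sum(gains[f] for f in subset)


ranked = ["gc_content", "length", "dinucleotide_freq", "noise"]
selected, score = incremental_feature_selection(ranked, toy_accuracy)
print(selected, round(score, 2))
```

In this toy run the last feature lowers the score, so the selected subset stops at the first three features. In practice, `evaluate` would train and validate the classifier on the given subset, which is the expensive step.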
For the essential protein prediction problem, we also adopt various features, which include sequence, protein, topology, and other properties. To identify the features relevant to protein essentiality, we propose a modified sequential backward feature selection method. The method takes both the size of the feature set and the prediction performance into consideration. The experimental results show that the achieved performance is significantly better than that of previous work.
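A backward selection that weighs feature-set size against performance might look like the sketch below. The abstract does not say how the two criteria are combined, so the linear penalty used here (`size_penalty` times the number of features) is purely an assumption, as are the function name `backward_selection`, the feature names, and the toy scores.

```python
from typing import Callable, FrozenSet, Iterable, Tuple


def backward_selection(
    features: Iterable[str],
    evaluate: Callable[[FrozenSet[str]], float],
    size_penalty: float = 0.01,
) -> Tuple[FrozenSet[str], float]:
    """Sequential backward selection with an assumed per-feature penalty."""
    current = frozenset(features)

    def objective(subset: FrozenSet[str]) -> float:
        # Assumed trade-off: performance minus a cost for each feature kept.
        return evaluate(subset) - size_penalty * len(subset)

    best_subset, best_value = current, objective(current)
    while len(current) > 1:
        # Remove the feature whose removal leaves the highest objective.
        candidates = [current - {f} for f in sorted(current)]
        current = max(candidates, key=objective)
        value = objective(current)
        # On a tie, prefer the smaller subset found later in the search.
        if value >= best_value:
            best_subset, best_value = current, value
    return best_subset, best_value


def toy_performance(subset: FrozenSet[str]) -> float:
    """Stand-in for a real performance measure; all values are made up."""
    useful = {"degree_centrality": 0.25, "sequence_length": 0.15}
    return 0.50 + sum(useful.get(f, 0.0) for f in sorted(subset))


all_features = [
    "degree_centrality",
    "sequence_length",
    "codon_bias",
    "random_noise",
]
kept, value = backward_selection(all_features, toy_performance)
print(sorted(kept), round(value, 2))
```

Here the two uninformative features are removed first because dropping them costs no performance and reduces the penalty. The search continues down to a single feature, but the two-feature subset is returned because it has the best penalized score.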
|