Scalable Index Techniques for Biological String Databases

碩士 === 國立清華大學 === 資訊工程學系 === 94 === With the advances of Bioinformatics and computer technologies, the knowledge development in this joint field comes to a new era. More often than not, the laboratory discoveries of biologists substantially help refine the computing models used by computer scientist...

Full description

Bibliographic Details
Main Authors: Chien-Wen Lin, 林健紋
Other Authors: Chaur-Chin Chen
Format: Others
Language:en_US
Published: 2006
Online Access:http://ndltd.ncl.edu.tw/handle/26897867296726728308
Description
Summary:碩士 === 國立清華大學 === 資訊工程學系 === 94 === With the advances of Bioinformatics and computer technologies, the knowledge development in this joint field comes to a new era. More often than not, the laboratory discoveries of biologists substantially help refine the computing models used by computer scientists. In turn, the predictive results from computing facilitate confining the scope of the subsequent verification. As the amount of biological strings grows at an increasing speed, the efficiency of computing model is vital to this knowledge discovery process. Human Genome project and its ongoing work places a high demand on computing. This thesis aims at this issue. Our research focuses on improving the search efficiency of semi-global alignment problem. In particular, we propose a scalable solution to leverage the state-of-the-art index designs to date. Despite the wide acceptance in prediction accuracy, Smith-Waterman (SW) –based alignment algorithms are too expensive to scale. “An Online and Accurate Technique for Local-alignment Searches on Biological Sequences”(OASIS) scheme embeds SW to better search scalability using a suffix tree index. However, its space overhead is considerable. To address these drawbacks, we develop a space-efficient, easy-to-update index design to rapidly identify the potential matches to a query string before subsequent costly SW verification. As a result, most of futile computations can be avoided in our approach using only about 10% index space overhead. Such preliminary filtration framework is generic in nature. It serves as a safe lower bound to the distance functions for a variety of SW cost matrices. Yet, the usage of suffix trees in its design offers favorable scalability for the problem. Intensive experimental results substantiate these advantages of our technique.