Efficient Exact and Approximate String Matching Algorithms

Ph.D. === National Tsing Hua University === Department of Computer Science === 102 === In this thesis, we first propose two algorithms for the exact string matching problem, which asks for all positions i in a given text at which a given pattern occurs. Our algorithms find an optimal selective comparing order of the characters of the pat...

Bibliographic Details
Main Authors: Lu, Chia Wei, 呂嘉維
Other Authors: Lee, R. C. T.
Format: Others
Language: en_US
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/49922113552090356808
id ndltd-TW-102NTHU5392049
record_format oai_dc
spelling ndltd-TW-102NTHU5392049 2015-10-13T23:37:12Z http://ndltd.ncl.edu.tw/handle/49922113552090356808 Efficient Exact and Approximate String Matching Algorithms 有效率的字串比對和近似字串比對演算法 Lu, Chia Wei 呂嘉維 Ph.D. National Tsing Hua University Department of Computer Science 102 Lee, R. C. T. Tang, Chuan Yi 李家同 唐傳義 2014 thesis 83 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Ph.D. === National Tsing Hua University === Department of Computer Science === 102 === In this thesis, we first propose two algorithms for the exact string matching problem, which asks for all positions i in a given text at which a given pattern occurs. Our algorithms find an optimal selective comparing order for the characters of the pattern, which improves performance in the searching phase. To find the optimal comparing order, we adopt a branch-and-bound approach. Moreover, our proposed algorithm can be combined with other existing exact string matching algorithms to improve searching efficiency. The experimental results show that our algorithms perform the fewest character comparisons and are also time-efficient compared with other existing exact string matching algorithms. Second, we propose a new filtration algorithm, as well as a hybrid filtration strategy, to efficiently solve the approximate string matching problem (also called the k-difference problem), which asks for all positions i in a given text such that some substring of the text ending at position i has edit distance at most k from a given pattern, where k is a given error bound. Our experimental results on simulated DNA sequence datasets show that, compared with other filtration algorithms, our filtration algorithm is more efficient at filtering out the positions of the text at which the pattern does not occur approximately. Moreover, our hybrid filtration strategy further improves the effectiveness of our filtration algorithm. Third, we propose a progressive approach to the DNA resequencing problem, defined as follows: given an unknown DNA sequence X and a known reference sequence R, determine whether X and R are similar. The currently popular approach is to break X into subsequences, called reads, using next-generation sequencing (NGS) technologies, and then map the reads of X onto R with a suitable error bound. However, if the similarity between X and R is not very high (below 95%), many reads remain unmapped, and the mutations inside the unmapped regions cannot be obtained. One can use a large error bound to increase the number of reads mapped, but this is not a good solution because increasing the error bound also increases the probability of false-positive mappings. Our approach uses a small error bound and, to increase the number of reads mapped, modifies R each time after the reads are mapped; in this sense it is progressive. Compared with other available tools, our approach maps more reads to the reference sequence. In our simulated experiments, our mapping algorithm also shows high correctness.
author2 Lee, R. C. T.
author_facet Lee, R. C. T.
Lu, Chia Wei
呂嘉維
author Lu, Chia Wei
呂嘉維
spellingShingle Lu, Chia Wei
呂嘉維
Efficient Exact and Approximate String Matching Algorithms
author_sort Lu, Chia Wei
title Efficient Exact and Approximate String Matching Algorithms
title_short Efficient Exact and Approximate String Matching Algorithms
title_full Efficient Exact and Approximate String Matching Algorithms
title_fullStr Efficient Exact and Approximate String Matching Algorithms
title_full_unstemmed Efficient Exact and Approximate String Matching Algorithms
title_sort efficient exact and approximate string matching algorithms
publishDate 2014
url http://ndltd.ncl.edu.tw/handle/49922113552090356808
work_keys_str_mv AT luchiawei efficientexactandapproximatestringmatchingalgorithms
AT lǚjiāwéi efficientexactandapproximatestringmatchingalgorithms
AT luchiawei yǒuxiàolǜdezìchuànbǐduìhéjìnshìzìchuànbǐduìyǎnsuànfǎ
AT lǚjiāwéi yǒuxiàolǜdezìchuànbǐduìhéjìnshìzìchuànbǐduìyǎnsuànfǎ
_version_ 1718086670206631936
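
Illustrative note on the k-difference problem named in the description: the sketch below only makes the problem statement concrete. It is not the filtration algorithm or hybrid strategy proposed in the thesis; it is the standard column-wise dynamic programming formulation usually attributed to Sellers, which a filtration method would typically use to verify surviving candidate positions. The function name k_difference_end_positions and its parameter names are assumptions chosen for this example.

```python
from typing import List


def k_difference_end_positions(pattern: str, text: str, k: int) -> List[int]:
    """Return the 1-based positions i such that some substring of `text`
    ending at i has edit distance <= k from `pattern`.

    Baseline O(len(pattern) * len(text)) dynamic programming, not the
    thesis's filtration algorithm.
    """
    m = len(pattern)
    # col[j] = smallest edit distance between pattern[:j] and any
    # substring of the text ending at the current text position.
    col = list(range(m + 1))  # before any text is read: j deletions
    hits = []
    for i, c in enumerate(text, start=1):
        prev_diag = col[0]  # value for (j-1, i-1)
        col[0] = 0          # an occurrence may start at any text position
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == c else 1
            new = min(prev_diag + cost,  # match or substitution
                      col[j] + 1,        # extra character in the text
                      col[j - 1] + 1)    # pattern character left unmatched
            prev_diag = col[j]
            col[j] = new
        if col[m] <= k:
            hits.append(i)
    return hits


if __name__ == "__main__":
    print(k_difference_end_positions("abc", "xabcx", 0))
    print(k_difference_end_positions("abc", "xabcx", 1))
```

With k = 0 the function reduces to exact string matching and reports the end position of each occurrence. Raising k admits more end positions, which is why a large error bound also admits more false-positive mappings in the resequencing setting described above.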