Efficient Exact and Approximate String Matching Algorithms

Ph.D. === National Tsing Hua University === Department of Computer Science === 102 === In this thesis, we first propose two algorithms for the exact string matching problem, which aims to find all positions i in a given text at which a given pattern occurs. Our algorithms find an optimal selective comparison order for the characters of the pat...

Bibliographic Details
Main Authors:	Lu, Chia Wei, 呂嘉維
Other Authors:	Lee, R. C. T.
Format:	Others
Language:	en_US
Published:	2014
Online Access:	http://ndltd.ncl.edu.tw/handle/49922113552090356808

id	ndltd-TW-102NTHU5392049
record_format	oai_dc
spelling	ndltd-TW-102NTHU5392049 2015-10-13T23:37:12Z http://ndltd.ncl.edu.tw/handle/49922113552090356808 Efficient Exact and Approximate String Matching Algorithms 有效率的字串比對和近似字串比對演算法 Lu, Chia Wei 呂嘉維 Ph.D. National Tsing Hua University Department of Computer Science 102 Lee, R. C. T. Tang, Chuan Yi 李家同 唐傳義 2014 thesis 83 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Ph.D. === National Tsing Hua University === Department of Computer Science === 102 === In this thesis, we first propose two algorithms for the exact string matching problem, which aims to find all positions i in a given text at which a given pattern occurs. Our algorithms find an optimal selective comparison order for the characters of the pattern, so that the searching phase performs better. To find the optimal comparison order, we adopt a branch-and-bound approach. Moreover, our algorithms can be combined with other existing exact string matching algorithms to improve search efficiency. The experimental results show that our algorithms indeed use the smallest number of character comparisons and are also time-efficient compared with other existing exact string matching algorithms. Second, we propose a new filtration algorithm, as well as a hybrid filtration strategy, to efficiently solve the approximate string matching problem (also called the k-difference problem), which aims to find all positions i in a given text such that some substring of the text ending at position i has an edit distance from a given pattern of at most a given error bound k. Our experimental results on simulated DNA sequence datasets show that, compared with other filtration algorithms, our filtration algorithm is more efficient at filtering out the positions of the text at which the pattern does not occur approximately. Moreover, our hybrid filtration strategy further improves the effectiveness of our filtration algorithm.
Third, we propose a progressive approach to the DNA resequencing problem, which is defined as follows: given an unknown DNA sequence X and a known reference sequence R, determine whether X and R are similar. The currently popular approach is to break X into subsequences, called reads, using next-generation sequencing (NGS) technologies, and then to map the reads of X onto R with a suitable error bound. However, if the similarity between X and R is not very high (<95%), many reads remain unmapped, and the mutations inside the unmapped regions cannot be obtained. One can use a larger error bound to increase the number of mapped reads, but this is not a good solution because increasing the error bound also increases the probability of false-positive mappings. Our approach uses a small error bound and, to increase the number of mapped reads, modifies R each time after the reads are mapped; it is thus a progressive approach. Compared with other available tools, our approach maps more reads to the reference sequence. In our simulated experiments, we also show the high correctness of our mapping algorithm.
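The k-difference problem defined in the abstract can be made concrete with a short sketch. The code below is not the thesis's filtration algorithm; it is the standard column-by-column dynamic programming baseline (often attributed to Sellers) that reports every end position i directly from the definition. The function name `k_difference_end_positions` and the 1-based position convention are assumptions made for illustration.

```python
def k_difference_end_positions(text, pattern, k):
    """Return all 1-based positions i such that some substring of `text`
    ending at i has edit distance <= k from `pattern`.

    Illustrative baseline only (plain dynamic programming), assumed here
    to make the problem definition concrete; it is not a filtration method.
    """
    m = len(pattern)
    # prev[j]: smallest edit distance between pattern[:j] and any substring
    # of the text ending at the previous text position.
    # Before any text is read, matching pattern[:j] costs j edits.
    prev = list(range(m + 1))
    hits = []
    for i, ch in enumerate(text, start=1):
        # cur[0] = 0 because an occurrence may start at any text position.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == ch else 1
            cur[j] = min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # extra character in the text
                cur[j - 1] + 1,      # pattern character left unmatched
            )
        if cur[m] <= k:
            hits.append(i)
        prev = cur
    return hits


if __name__ == "__main__":
    print(k_difference_end_positions("abcdefg", "cde", 0))
    print(k_difference_end_positions("abcdefg", "cde", 1))
```

This baseline takes time proportional to the product of the text and pattern lengths. Filtration methods such as the one the abstract describes aim to discard text positions that cannot contain an approximate occurrence before a verification step of this kind is run.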
author2	Lee, R. C. T.
author_facet	Lee, R. C. T. Lu, Chia Wei 呂嘉維
author	Lu, Chia Wei 呂嘉維
title	Efficient Exact and Approximate String Matching Algorithms
publishDate	2014
url	http://ndltd.ncl.edu.tw/handle/49922113552090356808

Similar Items