Efficient Exact and Approximate String Matching Algorithms
PhD === National Tsing Hua University === Department of Computer Science === 102 === In this thesis, we first propose two algorithms for the exact string matching problem, which aims to find all positions i in a given text where a given pattern occurs. Our algorithms find an optimal selective comparing order of the characters of the pat...
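As a rough illustration of why the comparing order matters in exact string matching, the sketch below counts character comparisons for a brute-force matcher under two different orders. It is not the thesis's branch-and-bound method: `find_occurrences` and `rarest_first_order` are hypothetical names, the rarest-character-first ordering is a simple stand-in heuristic, and positions are assumed to be 0-based start indices.

```python
from collections import Counter


def find_occurrences(text, pattern, order=None):
    """Return (positions, comparisons) for all exact occurrences of pattern.

    `order` lists the pattern indices to compare at each alignment
    (default: left to right). Positions are 0-based start indices.
    """
    m, n = len(pattern), len(text)
    order = list(range(m)) if order is None else list(order)
    positions, comparisons = [], 0
    for start in range(n - m + 1):
        for j in order:
            comparisons += 1
            if text[start + j] != pattern[j]:
                break  # mismatch: abandon this alignment
        else:
            positions.append(start)  # every compared character matched
    return positions, comparisons


def rarest_first_order(text, pattern):
    """Hypothetical heuristic: compare the pattern characters rarest in the text first."""
    freq = Counter(text)
    return sorted(range(len(pattern)), key=lambda j: freq[pattern[j]])


text, pattern = "aaaaabaaaab", "aab"
default_hits, default_cost = find_occurrences(text, pattern)
tuned_hits, tuned_cost = find_occurrences(
    text, pattern, rarest_first_order(text, pattern)
)
print(default_hits, default_cost)
print(tuned_hits, tuned_cost)
```

Both calls report the same occurrences; only the number of character comparisons differs, which is the quantity the abstract says the proposed algorithms minimize.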
Main Authors: | Lu, Chia Wei, 呂嘉維 |
---|---|
Other Authors: | Lee, R. C. T. |
Format: | Others |
Language: | en_US |
Published: | 2014 |
Online Access: | http://ndltd.ncl.edu.tw/handle/49922113552090356808 |
id |
ndltd-TW-102NTHU5392049 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-102NTHU5392049 2015-10-13T23:37:12Z http://ndltd.ncl.edu.tw/handle/49922113552090356808 Efficient Exact and Approximate String Matching Algorithms 有效率的字串比對和近似字串比對演算法 Lu, Chia Wei 呂嘉維 PhD National Tsing Hua University Department of Computer Science 102 Lee, R. C. T. Tang, Chuan Yi 李家同 唐傳義 2014 thesis 83 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
PhD === National Tsing Hua University === Department of Computer Science === 102 === In this thesis, we first propose two algorithms for the exact string matching problem, which aims to find all positions i in a given text where a given pattern occurs. Our algorithms find an optimal selective comparing order of the characters of the pattern so that the searching phase performs better. To find the optimal comparing order, we adopt a branch-and-bound approach. Moreover, our proposed algorithm can be combined with other existing exact string matching algorithms to improve searching efficiency. The experimental results show that our algorithms require the fewest character comparisons and are also time-efficient compared with other existing exact string matching algorithms.
Second, we propose a new filtration algorithm, as well as a hybrid filtration strategy, to efficiently solve the approximate string matching problem (also called the k-difference problem), which aims to find all positions i in a given text such that there exists a substring of the text ending at position i whose edit distance from a given pattern is less than or equal to a given error bound k. Our experimental results on simulated datasets of DNA sequences show that, compared with other filtration algorithms, our filtration algorithm is more efficient at filtering out those positions of the text at which the pattern does not occur approximately. Moreover, our hybrid filtration strategy further improves the effectiveness of our filtration algorithm.
Third, we propose a progressive approach to the DNA resequencing problem, which is defined as follows: we are given an unknown DNA sequence X and a known reference sequence R, and our task is to determine whether X and R are similar. The currently popular approach is to break X into subsequences, called reads, using next-generation sequencing (NGS) technologies, and then to map the reads of X onto R with a suitable error bound. However, if the similarity between X and R is not very high (<95%), many reads remain unmapped, and the mutations inside the unmapped regions cannot be obtained. One can use a larger error bound to increase the number of mapped reads, but this is not a good solution because increasing the error bound also increases the probability of false-positive mappings. Our approach instead uses a small error bound and, to increase the number of mapped reads, modifies R each time after the reads are mapped; it is therefore a progressive approach. Compared with other available tools, our approach maps more reads to the reference sequence. In our simulated experiments, we also show the high correctness of our mapping algorithm.
|
author2 |
Lee, R. C. T. |
author_facet |
Lee, R. C. T. Lu, Chia Wei 呂嘉維 |
author |
Lu, Chia Wei 呂嘉維 |
spellingShingle |
Lu, Chia Wei 呂嘉維 Efficient Exact and Approximate String Matching Algorithms |
author_sort |
Lu, Chia Wei |
title |
Efficient Exact and Approximate String Matching Algorithms |
title_sort |
efficient exact and approximate string matching algorithms |
publishDate |
2014 |
url |
http://ndltd.ncl.edu.tw/handle/49922113552090356808 |
work_keys_str_mv |
AT luchiawei efficientexactandapproximatestringmatchingalgorithms AT lǚjiāwéi efficientexactandapproximatestringmatchingalgorithms AT luchiawei yǒuxiàolǜdezìchuànbǐduìhéjìnshìzìchuànbǐduìyǎnsuànfǎ AT lǚjiāwéi yǒuxiàolǜdezìchuànbǐduìhéjìnshìzìchuànbǐduìyǎnsuànfǎ |
_version_ |
1718086670206631936 |
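
The k-difference problem defined in the description can be made concrete with the classical dynamic-programming solution (Sellers' algorithm). This is a minimal baseline sketch, not the filtration algorithm or hybrid strategy the thesis proposes; the function name `k_difference_ends` is hypothetical, and end positions are assumed to be 1-based.

```python
def k_difference_ends(text, pattern, k):
    """Return the 1-based positions i such that some substring of text
    ending at i has edit distance <= k from pattern."""
    m = len(pattern)
    # col[j] holds the minimum edit distance between pattern[:j] and any
    # substring of text ending at the current text position.
    col = list(range(m + 1))
    ends = []
    for i, ch in enumerate(text, start=1):
        prev_diag = col[0]  # value for (j - 1, i - 1)
        col[0] = 0          # an approximate match may start anywhere
        for j in range(1, m + 1):
            current = min(
                prev_diag + (pattern[j - 1] != ch),  # match or substitution
                col[j] + 1,      # extra text character
                col[j - 1] + 1,  # missing pattern character
            )
            prev_diag = col[j]
            col[j] = current
        if col[m] <= k:
            ends.append(i)
    return ends


print(k_difference_ends("abcdefg", "cde", 0))
print(k_difference_ends("abcdefg", "cde", 1))
```

This baseline runs in time proportional to the product of the text and pattern lengths. Filtration algorithms, as the description explains, aim to discard text positions where no approximate occurrence is possible, so that a verification step like this one only needs to run on the remaining regions.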