Efficient and scalable indexing techniques for sequence data management
Sequence data is one of the rapidly growing types of data. New efficient and scalable techniques are needed to support fast access to this type of data. We study indexing techniques for sequence data, especially biological sequence data. The existing solutions for this type of data either support ef...
Main Author: | Thamildurai, Anand |
---|---|
Format: | Others |
Published: | 2007 |
Online Access: | http://spectrum.library.concordia.ca/975347/1/MR34725.pdf Thamildurai, Anand <http://spectrum.library.concordia.ca/view/creators/Thamildurai=3AAnand=3A=3A.html> (2007) Efficient and scalable indexing techniques for sequence data management. Masters thesis, Concordia University. |
id |
ndltd-LACETR-oai-collectionscanada.gc.ca-QMG.975347 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-LACETR-oai-collectionscanada.gc.ca-QMG.975347 2013-10-22T03:47:25Z Efficient and scalable indexing techniques for sequence data management Thamildurai, Anand Sequence data is one of the rapidly growing types of data. New efficient and scalable techniques are needed to support fast access to this type of data. We study indexing techniques for sequence data, especially biological sequence data. The existing solutions for this type of data either support efficient index construction for long sequences or support fast search, but not both. We propose two new indexing techniques, Suffix Tree Top-Down 64 bit and Suffix Tree Depth-First 64 bit, which offer a tradeoff between scalable index construction, index size, and support of fast search. They differ in the order in which the index nodes are recorded but have similar performance. We compare our techniques with the best known existing techniques, which are based on suffix trees (TDD) or suffix arrays (ESA). The results of our extensive experiments show that while our proposed techniques have a slightly slower construction time for small sequences and larger index size compared to TDD, they outperform TDD in search. We further show that for very large sequences, such as the human genome (about 3GB), our techniques are superior to TDD due to the use of dynamic buffering and better index representation. Compared to the most search efficient in-memory indexing technique, ESA, our proposed techniques are slower in construction but have comparable index size and search performance. The main advantage of our techniques over ESA is that they are disk-based and can handle large sequences. 2007 Thesis NonPeerReviewed application/pdf http://spectrum.library.concordia.ca/975347/1/MR34725.pdf Thamildurai, Anand <http://spectrum.library.concordia.ca/view/creators/Thamildurai=3AAnand=3A=3A.html> (2007) Efficient and scalable indexing techniques for sequence data management. Masters thesis, Concordia University.
http://spectrum.library.concordia.ca/975347/ |
collection |
NDLTD |
format |
Others |
sources |
NDLTD |
description |
Sequence data is a rapidly growing type of data. New efficient and scalable techniques are needed to support fast access to this type of data. We study indexing techniques for sequence data, especially biological sequence data. The existing solutions for this type of data either support efficient index construction for long sequences or support fast search, but not both. We propose two new indexing techniques, Suffix Tree Top-Down 64 bit and Suffix Tree Depth-First 64 bit, which offer a tradeoff between scalable index construction, index size, and support for fast search. They differ in the order in which the index nodes are recorded but have similar performance. We compare our techniques with the best-known existing techniques, which are based on suffix trees (TDD) or suffix arrays (ESA). The results of our extensive experiments show that while our proposed techniques have a slightly slower construction time for small sequences and a larger index size than TDD, they outperform TDD in search. We further show that for very large sequences, such as the human genome (about 3 GB), our techniques are superior to TDD due to the use of dynamic buffering and a better index representation. Compared to the most search-efficient in-memory indexing technique, ESA, our proposed techniques are slower in construction but have comparable index size and search performance. The main advantage of our techniques over ESA is that they are disk-based and can handle large sequences. |
author |
Thamildurai, Anand |
title |
Efficient and scalable indexing techniques for sequence data management |
publishDate |
2007 |
url |
http://spectrum.library.concordia.ca/975347/1/MR34725.pdf Thamildurai, Anand <http://spectrum.library.concordia.ca/view/creators/Thamildurai=3AAnand=3A=3A.html> (2007) Efficient and scalable indexing techniques for sequence data management. Masters thesis, Concordia University. |