Efficient and scalable indexing techniques for sequence data management

Sequence data is one of the rapidly growing types of data. New efficient and scalable techniques are needed to support fast access to this type of data. We study indexing techniques for sequence data, especially biological sequence data. The existing solutions for this type of data either support ef...

Bibliographic Details
Main Author: Thamildurai, Anand
Format: Others
Published: 2007
Online Access: http://spectrum.library.concordia.ca/975347/1/MR34725.pdf
Thamildurai, Anand <http://spectrum.library.concordia.ca/view/creators/Thamildurai=3AAnand=3A=3A.html> (2007) Efficient and scalable indexing techniques for sequence data management. Masters thesis, Concordia University.
id ndltd-LACETR-oai-collectionscanada.gc.ca-QMG.975347
record_format oai_dc
http://spectrum.library.concordia.ca/975347/
collection NDLTD
format Others
sources NDLTD
description Sequence data is one of the most rapidly growing types of data, and new efficient and scalable techniques are needed to support fast access to it. We study indexing techniques for sequence data, especially biological sequence data. The existing solutions for this type of data support either efficient index construction for long sequences or fast search, but not both. We propose two new indexing techniques, Suffix Tree Top-Down 64 bit and Suffix Tree Depth-First 64 bit, which offer a tradeoff among scalable index construction, index size, and fast search. They differ in the order in which the index nodes are recorded but have similar performance. We compare our techniques with the best-known existing techniques, which are based on suffix trees (TDD) or suffix arrays (ESA). The results of our extensive experiments show that, while our proposed techniques have a slightly slower construction time for small sequences and a larger index size than TDD, they outperform TDD in search. We further show that for very large sequences, such as the human genome (about 3 GB), our techniques are superior to TDD due to the use of dynamic buffering and a better index representation. Compared to ESA, the most search-efficient in-memory indexing technique, our proposed techniques are slower in construction but have comparable index size and search performance. The main advantage of our techniques over ESA is that they are disk-based and can handle large sequences.
author Thamildurai, Anand
title Efficient and scalable indexing techniques for sequence data management
publishDate 2007
url http://spectrum.library.concordia.ca/975347/1/MR34725.pdf
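
The description above contrasts suffix-tree indexes with suffix-array indexes for substring search over sequences. As a rough illustration of the suffix-array side only, the sketch below builds a plain suffix array in memory and finds all occurrences of a pattern by binary search. It is not the thesis's Suffix Tree Top-Down 64 bit or Suffix Tree Depth-First 64 bit technique, nor the enhanced suffix array (ESA), which carries additional tables; the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is suitable only for short example strings, not genome-scale data.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted suffix order.

    Naive construction (assumption for illustration): sorting by the suffix
    strings themselves is simple but far too slow and memory-hungry for
    long sequences.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions at which pattern occurs in text.

    Suffixes that begin with pattern form one contiguous run in the suffix
    array, so two binary searches locate the run's boundaries.
    """
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    sequence = "GATTACAGATTACA"  # toy DNA-like string, not real data
    index = build_suffix_array(sequence)
    print(find_occurrences(sequence, index, "TTA"))
```

Each query costs a logarithmic number of prefix comparisons once the index exists, which is the general reason index-based search is preferred over rescanning the sequence; the construction cost and the in-memory limitation are the parts that disk-based techniques such as those described above aim to address.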