Efficient and scalable indexing techniques for sequence data management

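Sequence data is a rapidly growing type of data. New efficient and scalable techniques are needed to support fast access to this type of data. We study indexing techniques for sequence data, especially biological sequence data. The existing solutions for this type of data either support efficient index construction for long sequences or support fast search, but not both. We propose two new indexing techniques, Suffix Tree Top-Down 64 bit and Suffix Tree Depth-First 64 bit, which offer a tradeoff between scalable index construction, index size, and support for fast search. They differ in the order in which the index nodes are recorded but have similar performance. We compare our techniques with the best-known existing techniques, which are based on suffix trees (TDD) or suffix arrays (ESA). The results of our extensive experiments show that while our proposed techniques have a slightly slower construction time for small sequences and a larger index size compared to TDD, they outperform TDD in search. We further show that for very large sequences, such as the human genome (about 3 GB), our techniques are superior to TDD due to the use of dynamic buffering and better index representation. Compared to the most search-efficient in-memory indexing technique, ESA, our proposed techniques are slower in construction but have comparable index size and search performance. The main advantage of our techniques over ESA is that they are disk-based and can handle large sequences.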

Bibliographic Details
Main Author:	Thamildurai, Anand
Format:	Others
Published:	2007
Online Access:	http://spectrum.library.concordia.ca/975347/1/MR34725.pdf
Citation:	Thamildurai, Anand <http://spectrum.library.concordia.ca/view/creators/Thamildurai=3AAnand=3A=3A.html> (2007) Efficient and scalable indexing techniques for sequence data management. Masters thesis, Concordia University.

id	ndltd-LACETR-oai-collectionscanada.gc.ca-QMG.975347
record_format	oai_dc
http://spectrum.library.concordia.ca/975347/
collection	NDLTD
format	Others
sources	NDLTD
description	Sequence data is a rapidly growing type of data. New efficient and scalable techniques are needed to support fast access to this type of data. We study indexing techniques for sequence data, especially biological sequence data. The existing solutions for this type of data either support efficient index construction for long sequences or support fast search, but not both. We propose two new indexing techniques, Suffix Tree Top-Down 64 bit and Suffix Tree Depth-First 64 bit, which offer a tradeoff between scalable index construction, index size, and support for fast search. They differ in the order in which the index nodes are recorded but have similar performance. We compare our techniques with the best-known existing techniques, which are based on suffix trees (TDD) or suffix arrays (ESA). The results of our extensive experiments show that while our proposed techniques have a slightly slower construction time for small sequences and a larger index size compared to TDD, they outperform TDD in search. We further show that for very large sequences, such as the human genome (about 3 GB), our techniques are superior to TDD due to the use of dynamic buffering and better index representation. Compared to the most search-efficient in-memory indexing technique, ESA, our proposed techniques are slower in construction but have comparable index size and search performance. The main advantage of our techniques over ESA is that they are disk-based and can handle large sequences.
author	Thamildurai, Anand
title	Efficient and scalable indexing techniques for sequence data management
publishDate	2007
url	http://spectrum.library.concordia.ca/975347/1/MR34725.pdf

Similar Items