Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set

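Abstract

There is an enormous amount of information encoded in each genome – enough to create living, responsive and adaptive organisms. Raw sequence data alone is not enough to understand function, mechanisms or interactions. Changes in a single base pair can lead to disease, such as sickle-cell anemia, while some large megabase deletions have no apparent phenotypic effect. Genomic features are varied in their data types and annotation of these features is spread across multiple databases. Herein, we develop a method to automate exploration of genomes by iteratively exploring sequence data for correlations and building upon them. First, to integrate and compare different annotation sources, a sequence matrix (SM) is developed to contain position-dependent information. Second, a classification tree is developed for matrix row types, specifying how each data type is to be treated with respect to other data types for analysis purposes. Third, correlative analyses are developed to analyze features of each matrix row in terms of the other rows, guided by the classification tree as to which analyses are appropriate. A prototype was developed and was successful in detecting coinciding genomic features among genes, exons, repetitive elements and CpG islands.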
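
The abstract describes a sequence matrix whose rows hold position-dependent annotations, analysed row against row for coinciding features. The sketch below is only an illustration of that general idea, not the authors' SM format or prototype: it assumes every row is a simple present/absent flag per base position (the paper's classification tree covers more row types than this), and it scores coincidence with the usual support and confidence measures from association rule mining. All function names, the toy coordinates and the 0.8 threshold are hypothetical.

```python
from itertools import permutations


def build_row(length, intervals):
    """Return one matrix row: a per-position boolean list.

    Intervals are assumed to be 0-based and half-open: (start, end).
    """
    row = [False] * length
    for start, end in intervals:
        for pos in range(max(start, 0), min(end, length)):
            row[pos] = True
    return row


def build_matrix(length, annotations):
    """Build a toy sequence matrix: one boolean row per annotation type."""
    return {name: build_row(length, ivs) for name, ivs in annotations.items()}


def rule_stats(matrix, antecedent, consequent):
    """Support and confidence of the rule 'antecedent -> consequent'.

    support    = fraction of all positions covered by both rows
    confidence = fraction of antecedent positions also covered by consequent
    """
    a = matrix[antecedent]
    b = matrix[consequent]
    both = sum(1 for x, y in zip(a, b) if x and y)
    covered = sum(a)
    support = both / len(a)
    confidence = both / covered if covered else 0.0
    return support, confidence


def coinciding_features(matrix, min_confidence=0.8):
    """List every ordered row pair whose confidence meets the threshold."""
    rules = []
    for x, y in permutations(matrix, 2):
        support, confidence = rule_stats(matrix, x, y)
        if confidence >= min_confidence:
            rules.append((x, y, support, confidence))
    return rules


# Made-up coordinates on a 20-base toy sequence (illustration only).
toy = build_matrix(
    20,
    {
        "gene": [(2, 16)],
        "exon": [(4, 8), (12, 15)],
        "cpg_island": [(0, 4)],
    },
)

for x, y, support, confidence in coinciding_features(toy):
    print(f"{x} -> {y}: support={support:.2f}, confidence={confidence:.2f}")
```

With these toy coordinates, the only rule that clears the threshold is `exon -> gene`, because every exon position lies inside the gene while the reverse does not hold. A per-base boolean list is used here for clarity; a real chromosome-scale matrix would need a more compact interval-based representation.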

Bibliographic Details
Main Authors: Johnson David, Wren Jonathan D, Gruenwald Le
Format: Article
Language: English
Published: BMC 2005-07-01
Series: BMC Bioinformatics
Subjects: Data mining, genomics, format specifications, association rule discovery
id doaj-e7c20c68a5a04afb9c9fd5ba4adc43e0
record_format Article
spelling doaj-e7c20c68a5a04afb9c9fd5ba4adc43e0
2020-11-25T00:02:19Z
eng
BMC
BMC Bioinformatics
1471-2105
2005-07-01
6
Suppl 2
S2
10.1186/1471-2105-6-S2-S2
Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set
Johnson David
Wren Jonathan D
Gruenwald Le
Abstract: There is an enormous amount of information encoded in each genome – enough to create living, responsive and adaptive organisms. Raw sequence data alone is not enough to understand function, mechanisms or interactions. Changes in a single base pair can lead to disease, such as sickle-cell anemia, while some large megabase deletions have no apparent phenotypic effect. Genomic features are varied in their data types and annotation of these features is spread across multiple databases. Herein, we develop a method to automate exploration of genomes by iteratively exploring sequence data for correlations and building upon them. First, to integrate and compare different annotation sources, a sequence matrix (SM) is developed to contain position-dependent information. Second, a classification tree is developed for matrix row types, specifying how each data type is to be treated with respect to other data types for analysis purposes. Third, correlative analyses are developed to analyze features of each matrix row in terms of the other rows, guided by the classification tree as to which analyses are appropriate. A prototype was developed and was successful in detecting coinciding genomic features among genes, exons, repetitive elements and CpG islands.
Data mining
genomics
format specifications
association rule discovery
collection DOAJ
language English
format Article
sources DOAJ
author Johnson David
Wren Jonathan D
Gruenwald Le
spellingShingle Johnson David
Wren Jonathan D
Gruenwald Le
Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set
BMC Bioinformatics
Data mining
genomics
format specifications
association rule discovery
author_facet Johnson David
Wren Jonathan D
Gruenwald Le
author_sort Johnson David
title Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set
title_short Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set
title_full Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set
title_fullStr Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set
title_full_unstemmed Automating Genomic Data Mining via a Sequence-based Matrix Format and Associative Rule Set
title_sort automating genomic data mining via a sequence-based matrix format and associative rule set
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2005-07-01
description Abstract: There is an enormous amount of information encoded in each genome – enough to create living, responsive and adaptive organisms. Raw sequence data alone is not enough to understand function, mechanisms or interactions. Changes in a single base pair can lead to disease, such as sickle-cell anemia, while some large megabase deletions have no apparent phenotypic effect. Genomic features are varied in their data types and annotation of these features is spread across multiple databases. Herein, we develop a method to automate exploration of genomes by iteratively exploring sequence data for correlations and building upon them. First, to integrate and compare different annotation sources, a sequence matrix (SM) is developed to contain position-dependent information. Second, a classification tree is developed for matrix row types, specifying how each data type is to be treated with respect to other data types for analysis purposes. Third, correlative analyses are developed to analyze features of each matrix row in terms of the other rows, guided by the classification tree as to which analyses are appropriate. A prototype was developed and was successful in detecting coinciding genomic features among genes, exons, repetitive elements and CpG islands.
topic Data mining
genomics
format specifications
association rule discovery
work_keys_str_mv AT johnsondavid automatinggenomicdataminingviaasequencebasedmatrixformatandassociativeruleset
AT wrenjonathand automatinggenomicdataminingviaasequencebasedmatrixformatandassociativeruleset
AT gruenwaldle automatinggenomicdataminingviaasequencebasedmatrixformatandassociativeruleset
_version_ 1725438373323079680