Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology

Abstract. Background: DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical situations, and...

Bibliographic Details
Main Authors: Akutsu Tatsuya, Brown JB
Format: Article
Language: English
Published: BMC 2009-01-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/10/25
id doaj-c9e1c9ce336d424293edcba67e497e4e
record_format Article
spelling doaj-c9e1c9ce336d424293edcba67e497e4e 2020-11-25T01:30:36Z eng BMC BMC Bioinformatics 1471-2105 2009-01-01 10 1 25 10.1186/1471-2105-10-25 Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology Akutsu Tatsuya Brown JB
Abstract
Background: DNA repair is the general term for the collection of critical mechanisms that repair many forms of DNA damage such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical situations, and relatively few information-based approaches to extracting new DNA repair knowledge exist. As a first step, automatic detection of DNA repair proteins in genomes via informatics techniques is desirable; however, there are many forms of DNA repair and it is not a straightforward process to identify and classify repair proteins with a single optimal method. We perform a study of the ability of homology and machine learning-based methods to identify and classify DNA repair proteins, as well as scan vertebrate genomes for the presence of novel repair proteins. Combinations of primary sequence polypeptide frequency, secondary structure, and homology information are used as feature information for input to a Support Vector Machine (SVM).
Results: We find that SVM techniques are capable of identifying portions of DNA repair protein datasets without admitting false positives; at low levels of false positive tolerance, homology can also identify and classify proteins with good performance. Secondary structure information provides improved performance compared to using primary structure alone. Furthermore, we observe that machine learning methods incorporating homology information perform best when data is filtered by some clustering technique. Applying these methodologies to the scanning of multiple vertebrate genomes confirms a positive correlation between the size of a genome and the number of DNA repair protein transcripts it is likely to contain, and simultaneously suggests that all organisms have a non-zero minimum number of repair genes. In addition, the scan result clusters several organisms' repair abilities in an evolutionarily consistent fashion. Analysis also identifies several functionally unconfirmed proteins that are highly likely to be involved in the repair process. A new web service, INTREPED, has been made available for the immediate search and annotation of DNA repair proteins in newly sequenced genomes.
Conclusion: Despite complexity due to a multitude of repair pathways, combinations of sequence, structure, and homology with Support Vector Machines offer good methods in addition to existing homology searches for DNA repair protein identification and functional annotation. Most importantly, this study has uncovered relationships between the size of a genome and a genome's available repair repertoire, and offers a number of new predictions as well as a prediction service, both of which reduce the search time and cost for novel repair genes and proteins.
http://www.biomedcentral.com/1471-2105/10/25
collection DOAJ
language English
format Article
sources DOAJ
author Akutsu Tatsuya
Brown JB
spellingShingle Akutsu Tatsuya
Brown JB
Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology
BMC Bioinformatics
author_facet Akutsu Tatsuya
Brown JB
author_sort Akutsu Tatsuya
title Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology
title_short Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology
title_full Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology
title_fullStr Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology
title_full_unstemmed Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology
title_sort identification of novel dna repair proteins via primary sequence, secondary structure, and homology
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2009-01-01
url http://www.biomedcentral.com/1471-2105/10/25
work_keys_str_mv AT akutsutatsuya identificationofnoveldnarepairproteinsviaprimarysequencesecondarystructureandhomology
AT brownjb identificationofnoveldnarepairproteinsviaprimarysequencesecondarystructureandhomology
_version_ 1725091214390198272