Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology

Abstract. Background: DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical situations, and...

Bibliographic Details
Main Authors:	Akutsu Tatsuya, Brown JB
Format:	Article
Language:	English
Published:	BMC 2009-01-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/10/25

id	doaj-c9e1c9ce336d424293edcba67e497e4e
record_format	Article
spelling	doaj-c9e1c9ce336d424293edcba67e497e4e 2020-11-25T01:30:36Z eng BMC BMC Bioinformatics 1471-2105 2009-01-01 10 1 25 10.1186/1471-2105-10-25 Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology Akutsu Tatsuya Brown JB http://www.biomedcentral.com/1471-2105/10/25
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Akutsu Tatsuya Brown JB
title	Identification of novel DNA repair proteins via primary sequence, secondary structure, and homology
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2009-01-01
description	Abstract. Background: DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical situations, and relatively few information-based approaches to extracting new DNA repair knowledge exist. As a first step, automatic detection of DNA repair proteins in genomes via informatics techniques is desirable; however, there are many forms of DNA repair and it is not a straightforward process to identify and classify repair proteins with a single optimal method. We perform a study of the ability of homology and machine learning-based methods to identify and classify DNA repair proteins, as well as scan vertebrate genomes for the presence of novel repair proteins. Combinations of primary sequence polypeptide frequency, secondary structure, and homology information are used as feature information for input to a Support Vector Machine (SVM). Results: We find that SVM techniques are capable of identifying portions of DNA repair protein datasets without admitting false positives; at low levels of false positive tolerance, homology can also identify and classify proteins with good performance. Secondary structure information provides improved performance compared to using primary structure alone. Furthermore, we observe that machine learning methods incorporating homology information perform best when data are filtered by some clustering technique. Applying these methodologies to the scanning of multiple vertebrate genomes confirms a positive correlation between the size of a genome and the number of DNA repair protein transcripts it is likely to contain, and simultaneously suggests that all organisms have a non-zero minimum number of repair genes. In addition, the scan result clusters several organisms' repair abilities in an evolutionarily consistent fashion. Analysis also identifies several functionally unconfirmed proteins that are highly likely to be involved in the repair process. A new web service, INTREPED, has been made available for the immediate search and annotation of DNA repair proteins in newly sequenced genomes. Conclusion: Despite complexity due to a multitude of repair pathways, combinations of sequence, structure, and homology with Support Vector Machines offer good methods in addition to existing homology searches for DNA repair protein identification and functional annotation. Most importantly, this study has uncovered relationships between the size of a genome and a genome's available repair repertoire, and offers a number of new predictions as well as a prediction service, both of which reduce the search time and cost for novel repair genes and proteins.
url	http://www.biomedcentral.com/1471-2105/10/25

Similar Items