Core column prediction for protein multiple sequence alignments

Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the refer...

Bibliographic Details
Main Authors: DeBlasio, Dan, Kececioglu, John
Other Authors: Univ Arizona, Dept Comp Sci
Language: en
Published: BIOMED CENTRAL LTD 2017
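Subjects: Multiple sequence alignment; Core blocks; Alignment accuracy; Accuracy estimation; Parameter advising; Machine learning; Regression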
Online Access: http://hdl.handle.net/10150/623957
http://arizona.openrepository.com/arizona/handle/10150/623957
id ndltd-arizona.edu-oai-arizona.openrepository.com-10150-623957
record_format oai_dc
spelling ndltd-arizona.edu-oai-arizona.openrepository.com-10150-623957
2017-06-07T03:00:35Z
2017-04-19
Article
Core column prediction for protein multiple sequence alignments 2017, 12 (1) Algorithms for Molecular Biology
1748-7188
28435440
10.1186/s13015-017-0102-3
http://hdl.handle.net/10150/623957
http://arizona.openrepository.com/arizona/handle/10150/623957
Algorithms for Molecular Biology
en
http://almob.biomedcentral.com/articles/10.1186/s13015-017-0102-3
© The Author(s) 2017. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License.
BIOMED CENTRAL LTD
collection NDLTD
language en
sources NDLTD
topic Multiple sequence alignment
Core blocks
Alignment accuracy
Accuracy estimation
Parameter advising
Machine learning
Regression
description Background: In a computed protein multiple sequence alignment, the coreness of a column is the fraction of its substitutions that are in so-called core columns of the gold-standard reference alignment of its proteins. In benchmark suites of protein reference alignments, the core columns of the reference alignment are those that can be confidently labeled as correct, usually due to all residues in the column being sufficiently close in the spatial superposition of the known three-dimensional structures of the proteins. Typically the accuracy of a protein multiple sequence alignment that has been computed for a benchmark is only measured with respect to the core columns of the reference alignment. When computing an alignment in practice, however, a reference alignment is not known, so the coreness of its columns can only be predicted.
Results: We develop for the first time a predictor of column coreness for protein multiple sequence alignments. This allows us to predict which columns of a computed alignment are core, and hence better estimate the alignment's accuracy. Our approach to predicting coreness is similar to nearest-neighbor classification from machine learning, except we transform nearest-neighbor distances into a coreness prediction via a regression function, and we learn an appropriate distance function through a new optimization formulation that solves a large-scale linear programming problem. We apply our coreness predictor to parameter advising, the task of choosing parameter values for an aligner's scoring function to obtain a more accurate alignment of a specific set of sequences. We show that for this task, our predictor strongly outperforms other column-confidence estimators from the literature, and affords a substantial boost in alignment accuracy.
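The definition of coreness given in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes alignments are given as equal-length gapped strings with `-` as the gap character, that a "substitution" is a pair of residues placed in the same computed column, and that such a pair counts as core when both residues share a core column of the reference. The function names `column_residues` and `coreness`, the toy alignments, and the choice to score a column with fewer than two residues as 0.0 are all assumptions made for illustration.

```python
from itertools import combinations


def column_residues(alignment):
    """For each column, list the (sequence index, residue index) pairs in it."""
    counters = [0] * len(alignment)  # next residue index for each sequence
    columns = []
    for c in range(len(alignment[0])):
        column = []
        for s, row in enumerate(alignment):
            if row[c] != "-":
                column.append((s, counters[s]))
                counters[s] += 1
        columns.append(column)
    return columns


def coreness(computed, reference, core_columns):
    """Fraction of residue pairs in each computed column that are aligned
    together in a core column of the reference (true coreness, which
    requires the reference to be known)."""
    # Map every residue to the reference column that contains it.
    ref_col_of = {}
    for c, column in enumerate(column_residues(reference)):
        for residue in column:
            ref_col_of[residue] = c

    scores = []
    for column in column_residues(computed):
        pairs = list(combinations(column, 2))
        if not pairs:
            scores.append(0.0)  # assumption: no substitutions means coreness 0
            continue
        hits = sum(
            1
            for a, b in pairs
            if ref_col_of[a] == ref_col_of[b] and ref_col_of[a] in core_columns
        )
        scores.append(hits / len(pairs))
    return scores


# Toy data: three sequences (ACG, ACTG, ATG); reference columns 0 and 3 are core.
reference = ["AC-G", "ACTG", "A-TG"]
computed = ["ACG-", "ACTG", "AT-G"]
print(coreness(computed, reference, core_columns={0, 3}))
```

In practice the reference alignment is unavailable, which is why the record describes predicting this quantity rather than computing it directly; the sketch only shows what the predictor's target value is.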
author2 Univ Arizona, Dept Comp Sci
author DeBlasio, Dan
Kececioglu, John
title Core column prediction for protein multiple sequence alignments
publisher BIOMED CENTRAL LTD
publishDate 2017
url http://hdl.handle.net/10150/623957
http://arizona.openrepository.com/arizona/handle/10150/623957