A workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations

Abstract Background Small Proteins have received increasing attention in recent years. They have in particular been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins bec...


Bibliographic Details
Main Authors: John Anders, Hannes Petruschke, Nico Jehmlich, Sven-Bastiaan Haange, Martin von Bergen, Peter F Stadler
Format: Article
Language: English
Published: BMC 2021-05-01
Series: BMC Bioinformatics
Subjects: Small proteins; Metaproteogenomics; Peptide-spectrum matches; Microbial communities
Online Access: https://doi.org/10.1186/s12859-021-04159-8
id doaj-ef4f88052a0946e2b9f80c64ec97b611
record_format Article
spelling doaj-ef4f88052a0946e2b9f80c64ec97b611
2021-05-30T11:52:53Z
eng
BMC
BMC Bioinformatics
1471-2105
2021-05-01
Volume 22, Issue 1, Pages 1-20
10.1186/s12859-021-04159-8
A workflow to identify novel proteins based on the direct mapping of peptide-spectrum-matches to genomic locations
John Anders (0): Bioinformatics Group, Department of Computer Science, and Interdisciplinary Center for Bioinformatics, Universität Leipzig
Hannes Petruschke (1): Department of Molecular Systems Biology, Helmholtz Centre for Environmental Research - UFZ
Nico Jehmlich (2): Department of Molecular Systems Biology, Helmholtz Centre for Environmental Research - UFZ
Sven-Bastiaan Haange (3): Department of Molecular Systems Biology, Helmholtz Centre for Environmental Research - UFZ
Martin von Bergen (4): Institute of Biochemistry, Faculty of Life Sciences, University of Leipzig
Peter F Stadler (5): German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig and Competence Center for Scalable Data Services and Solutions Dresden-Leipzig and Leipzig Research Center for Civilization Diseases, University Leipzig
collection DOAJ
sources DOAJ
issn 1471-2105
description Abstract
Background Small proteins have received increasing attention in recent years. In particular, they have been implicated as signals contributing to the coordination of bacterial communities. In genome annotations they are often missing or hidden among large numbers of hypothetical proteins, because genome annotation pipelines often exclude short open reading frames or over-predict hypothetical proteins based on simple models. The validation of novel proteins, and in particular of small proteins (sProteins), therefore requires additional evidence. Proteogenomics is considered the gold standard for this purpose. It extends beyond established annotations and includes all possible open reading frames (ORFs) as potential sources of peptides, thus allowing the discovery of novel, unannotated proteins. Typically this results in large numbers of putative novel small proteins fraught with large fractions of false-positive predictions.
Results We observe that the number and quality of the peptide-spectrum matches (PSMs) that map to a candidate ORF can be highly informative for distinguishing proteins from spurious ORF annotations. We report here on a workflow that aggregates PSM quality information and local context into simple descriptors and reliably separates likely proteins from the large pool of false positives, i.e., most likely untranslated ORFs. We investigated the artificial gut microbiome model SIHUMIx, comprising eight different species, for which we validate 5114 proteins that have previously been annotated only as hypothetical ORFs. In addition, we identified 37 non-annotated protein candidates for which we found evidence at the proteomic and transcriptomic level. About half (19) of these candidates have close functional homologs in other species. Another 12 candidates have homologs designated as hypothetical proteins in other species. The remaining six candidates are short (< 100 AA) and are most likely bona fide novel proteins.
Conclusions The aggregation of PSM quality information for predicted ORFs provides a robust and efficient method to identify novel proteins in proteomics data. The workflow is in particular capable of identifying small proteins and frameshift variants. Since PSMs are explicitly mapped to genomic locations, it furthermore facilitates the integration of transcriptomics data and other sources of genome-level information.
topic Small proteins
Metaproteogenomics
Peptide-spectrum matches
Microbial communities