Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins

Abstract. Background: The evolution of the full repertoire of proteins encoded in a given genome is mostly driven by gene duplications, deletions, and sequence modifications of existing proteins. Indirect information about relative rates and other intrins...

Bibliographic Details
Main Authors: Yan Koon-Kiu, Axelsen Jacob, Maslov Sergei
Format: Article
Language: English
Published: BMC 2007-11-01
Series: Biology Direct
Online Access: http://www.biology-direct.com/content/2/1/32
id doaj-2cbb8a3d6b35459b9ab64577454d17f8
record_format Article
collection DOAJ
language English
format Article
sources DOAJ
author Yan Koon-Kiu
Axelsen Jacob
Maslov Sergei
spellingShingle Yan Koon-Kiu
Axelsen Jacob
Maslov Sergei
Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins
Biology Direct
author_facet Yan Koon-Kiu
Axelsen Jacob
Maslov Sergei
author_sort Yan Koon-Kiu
title Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins
title_short Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins
title_full Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins
title_fullStr Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins
title_full_unstemmed Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins
title_sort parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins
publisher BMC
series Biology Direct
issn 1745-6150
publishDate 2007-11-01
description <p>Abstract</p> <p>Background</p> <p>The evolution of the full repertoire of proteins encoded in a given genome is mostly driven by gene duplications, deletions, and sequence modifications of existing proteins. Indirect information about relative rates and other intrinsic parameters of these three basic processes is contained in the proteome-wide distribution of sequence identities of pairs of paralogous proteins.</p> <p>Results</p> <p>We introduce a simple mathematical framework, based on a stochastic birth-and-death model, that allows one to extract some of this information, and we apply it to the set of all pairs of paralogous proteins in <it>H. pylori</it>, <it>E. coli</it>, <it>S. cerevisiae</it>, <it>C. elegans</it>, <it>D. melanogaster</it>, and <it>H. sapiens</it>. The histogram of sequence identities <it>p</it> generated by an all-to-all alignment of all protein sequences encoded in a genome is well fitted by a power-law form ~ <it>p</it><sup>-<it>γ</it></sup>, with the exponent <it>γ</it> around 4 for the majority of organisms used in this study. This implies that the intra-protein variability of substitution rates is best described by the Gamma-distribution with the exponent <it>α</it> ≈ 0.33.
Different features of the shape of such histograms allow us to quantify the ratio between the genome-wide average deletion/duplication rates and the amino-acid substitution rate.</p> <p>Conclusion</p> <p>We separately measure the short-term ("raw") duplication and deletion rates <it>r</it><sup>*</sup><sub>dup</sub>, <it>r</it><sup>*</sup><sub>del</sub>, which include gene copies that will be removed soon after the duplication event, and their dramatically reduced long-term counterparts <it>r</it><sub>dup</sub>, <it>r</it><sub>del</sub>. A high deletion rate among recently duplicated proteins is consistent with a scenario in which they have not had enough time to significantly change their functional roles and thus are to a large degree disposable.
Systematic trends of each of the four duplication/deletion rates with the total number of genes in the genome were analyzed. All but the deletion rate of recent duplicates <it>r</it><sup>*</sup><sub>del</sub> were shown to systematically increase with <it>N</it><sub>genes</sub>. Abnormally flat shapes of sequence identity histograms observed for yeast and human are consistent with lineages leading to these organisms undergoing one or more whole-genome duplications. This interpretation is corroborated by our analysis of the genome of <it>Paramecium tetraurelia</it>, where the <it>p</it><sup>-4</sup> profile of the histogram is gradually restored by the successive removal of paralogs generated in its four known whole-genome duplication events.</p>
url http://www.biology-direct.com/content/2/1/32
work_keys_str_mv AT yankoonkiu parametersofproteomeevolutionfromhistogramsofaminoacidsequenceidentitiesofparalogousproteins
AT axelsenjacob parametersofproteomeevolutionfromhistogramsofaminoacidsequenceidentitiesofparalogousproteins
AT maslovsergei parametersofproteomeevolutionfromhistogramsofaminoacidsequenceidentitiesofparalogousproteins
_version_ 1725142753804812288
spelling doaj-2cbb8a3d6b35459b9ab64577454d17f8 2020-11-25T01:18:24Z eng BMC Biology Direct 1745-6150 2007-11-01 2 1 32 10.1186/1745-6150-2-32 Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins Yan Koon-Kiu Axelsen Jacob Maslov Sergei http://www.biology-direct.com/content/2/1/32