Size distribution of function-based human gene sets and the split–merge model

The sizes of paralogues—gene families produced by ancestral duplication—are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families where genes are grouped by a similar function or share a common property. The size distribution of Human Gene Nomenclat...

Bibliographic Details
Main Authors: Wentian Li, Oscar Fontanelli, Pedro Miramontes
Format: Article
Language: English
Published: The Royal Society 2016-01-01
Series: Royal Society Open Science
Subjects: gene family sizes; gene set sizes; power-law; beta rank function
Online Access: https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.160275
id doaj-bfe3e9569cfa442d93692ee857c8ca08
record_format Article
spelling doaj-bfe3e9569cfa442d93692ee857c8ca08 | 2020-11-25T03:09:37Z | eng | The Royal Society | Royal Society Open Science | 2054-5703 | 2016-01-01 | 3 | 8 | 10.1098/rsos.160275 | 160275 | Size distribution of function-based human gene sets and the split–merge model | Wentian Li | Oscar Fontanelli | Pedro Miramontes | https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.160275 | gene family sizes | gene set sizes | power-law | beta rank function
collection DOAJ
language English
format Article
sources DOAJ
author Wentian Li
Oscar Fontanelli
Pedro Miramontes
title Size distribution of function-based human gene sets and the split–merge model
publisher The Royal Society
series Royal Society Open Science
issn 2054-5703
publishDate 2016-01-01
description The sizes of paralogues, gene families produced by ancestral duplication, are known to follow a power-law distribution. We examine the size distribution of gene sets or gene families in which genes are grouped by similar function or by a shared property. The size distribution of Human Gene Nomenclature Committee (HGNC) gene sets deviates from the power law and is fitted much better by a beta rank function. We propose a simple mechanism that breaks a power-law size distribution through a combination of splitting and merging operations. The largest gene sets are split into two to account for subfunctional categories, and a small proportion of the other gene sets are merged into larger sets as new common themes are recognized. These operations are not uncommon for a curator of gene sets. A simulation shows that iterating these operations changes the size distribution of Ensembl paralogues and could lead to a distribution fitted by a beta rank function. We further illustrate the application of the beta rank function with the example of the distribution of transcription factors and drug target genes among HGNC gene families.
topic gene family sizes
gene set sizes
power-law
beta rank function
url https://royalsocietypublishing.org/doi/pdf/10.1098/rsos.160275
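
The split-merge mechanism summarized in the description can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' code: the starting sizes are a synthetic rank-size power law rather than Ensembl paralogue data, the cut point of a split is chosen uniformly at random, merged sets are paired at random, and all function names and parameters (`power_law_sizes`, `split_merge_step`, `merge_fraction`, and so on) are hypothetical. The `beta_rank` helper uses the form commonly quoted for the beta rank function, C (N + 1 - r)^b / r^a, which this record itself does not state.

```python
import random


def power_law_sizes(n_sets, largest, exponent):
    """Synthetic rank-size power law: size of rank r is about largest / r**exponent."""
    return [max(1, round(largest / r ** exponent)) for r in range(1, n_sets + 1)]


def beta_rank(r, n, c, a, b):
    """Commonly quoted beta rank form (assumed here): c * (n + 1 - r)**b / r**a.

    With b = 0 this reduces to a pure power law in the rank r.
    """
    return c * (n + 1 - r) ** b / r ** a


def split_merge_step(sizes, merge_fraction, rng):
    """One iteration: split the largest set in two, then merge a few other sets."""
    ordered = sorted(sizes, reverse=True)
    largest, rest = ordered[0], ordered[1:]

    # Split: cut the largest set into two non-empty parts (uniform cut is an assumption).
    if largest >= 2:
        cut = rng.randint(1, largest - 1)
        parts = [cut, largest - cut]
    else:
        parts = [largest]

    # Merge: pick a small proportion of the other sets and combine them in pairs.
    n_pick = int(len(rest) * merge_fraction)
    n_pick -= n_pick % 2  # keep an even count so every picked set has a partner
    picked = rng.sample(range(len(rest)), n_pick)
    picked_set = set(picked)
    untouched = [s for i, s in enumerate(rest) if i not in picked_set]
    merged = [rest[picked[i]] + rest[picked[i + 1]] for i in range(0, n_pick, 2)]

    return sorted(parts + untouched + merged, reverse=True)


def simulate(sizes, n_steps, merge_fraction, rng):
    """Apply the split-merge step repeatedly and return the ranked sizes."""
    current = list(sizes)
    for _ in range(n_steps):
        current = split_merge_step(current, merge_fraction, rng)
    return current


if __name__ == "__main__":
    rng = random.Random(0)
    start = power_law_sizes(n_sets=200, largest=500, exponent=1.0)
    end = simulate(start, n_steps=50, merge_fraction=0.02, rng=rng)
    print("top five sizes before:", start[:5])
    print("top five sizes after: ", end[:5])
```

Each step conserves the total number of genes, since splitting and merging only regroup them. Repeated splitting trims the head of the ranked distribution, which is the qualitative departure from a power law that the description attributes to curation. Fitting the resulting ranked sizes with `beta_rank` would require a separate regression step that is not shown here.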