Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model

Abstract. Background: Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene...

Bibliographic Details
Main Authors:	Zhai Chengxiang, Chee Brant, Ling Xu, Sarma Moushumi, He Xin, Schatz Bruce
Format:	Article
Language:	English
Published:	BMC 2010-05-01
Series:	BMC Bioinformatics
Online Access:	http://www.biomedcentral.com/1471-2105/11/272

id	doaj-5b810ff5be99433b95a103425a3ddf08
record_format	Article
doi	10.1186/1471-2105-11-272
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Zhai Chengxiang Chee Brant Ling Xu Sarma Moushumi He Xin Schatz Bruce
title	Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model
publisher	BMC
series	BMC Bioinformatics
issn	1471-2105
publishDate	2010-05-01
description	Abstract. Background: Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular, Gene Ontology (GO). However, the annotation of genes is a labor-intensive process, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered. Results: We propose a statistical method that uses the primary literature, i.e., free text, as the source to perform overrepresentation analysis. The method is based on a mixture-model statistical framework and addresses the methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment, and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results. Conclusions: We conclude that the current work will provide biologists with a tool that effectively complements the existing ones for overrepresentation analysis from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp
url	http://www.biomedcentral.com/1471-2105/11/272

Similar Items