Latent Semantic Indexing of PubMed abstracts for identification of transcription factor candidates from microarray derived gene sets

Abstract
Background
Identification of transcription factors (TFs) responsible for modulation of differentially expressed genes is a key step in deducing gene regulatory pathways. Most current methods identify TFs by searching for presence of DNA binding...


Bibliographic Details
Main Authors: Phan Vinhthuy, Heinrich Kevin, Roy Sujoy, Berry Michael W, Homayouni Ramin
Format: Article
Language: English
Published: BMC 2011-10-01
Series: BMC Bioinformatics
id doaj-a8c63c4be8ef41f78bd3aff9b5a4913b
record_format Article
spelling doaj-a8c63c4be8ef41f78bd3aff9b5a4913b; 2020-11-25T00:37:58Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2011-10-01; 12; Suppl 10; S19; 10.1186/1471-2105-12-S10-S19; Latent Semantic Indexing of PubMed abstracts for identification of transcription factor candidates from microarray derived gene sets; Phan Vinhthuy; Heinrich Kevin; Roy Sujoy; Berry Michael W; Homayouni Ramin
Abstract
Background
Identification of transcription factors (TFs) responsible for modulation of differentially expressed genes is a key step in deducing gene regulatory pathways. Most current methods identify TFs by searching for presence of DNA binding motifs in the promoter regions of co-regulated genes. However, this strategy may not always be useful as presence of a motif does not necessarily imply a regulatory role. Conversely, motif presence may not be required for a TF to regulate a set of genes. Therefore, it is imperative to include functional (biochemical and molecular) associations, such as those found in the biomedical literature, into algorithms for identification of putative regulatory TFs that might be explicitly or implicitly linked to the genes under investigation.
Results
In this study, we present a Latent Semantic Indexing (LSI) based text mining approach for identification and ranking of putative regulatory TFs from microarray derived differentially expressed genes (DEGs). Two LSI models were built using different term weighting schemes to devise pair-wise similarities between 21,027 mouse genes annotated in the Entrez Gene repository. Amongst these genes, 433 were designated TFs in the TRANSFAC database. The LSI derived TF-to-gene similarities were used to calculate TF literature enrichment p-values and rank the TFs for a given set of genes. We evaluated our approach using five different publicly available microarray datasets focusing on TFs Rel, Stat6, Ddit3, Stat5 and Nfic. In addition, for each of the datasets, we constructed gold standard TFs known to be functionally relevant to the study in question. Receiver Operating Characteristics (ROC) curves showed that the log-entropy LSI model outperformed the tf-normal LSI model and a benchmark co-occurrence based method for four out of five datasets, as well as motif searching approaches, in identifying putative TFs.
Conclusions
Our results suggest that our LSI based text mining approach can complement existing approaches used in systems biology research to decipher gene regulatory networks by providing putative lists of ranked TFs that might be explicitly or implicitly associated with sets of DEGs derived from microarray experiments. In addition, unlike motif searching approaches, LSI based approaches can reveal TFs that may indirectly regulate genes.
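The abstract mentions LSI models built with a log-entropy term weighting scheme and used to derive pair-wise gene similarities. The following is a minimal sketch of that general technique, not the authors' implementation: it applies the standard log-entropy weighting to a term-by-document count matrix, reduces it with a truncated SVD, and compares documents by cosine similarity. The toy counts, the document labels (`tf_doc`, `gene_a`, `gene_b`, `gene_c`), the function names, and the choice of `k = 2` are all assumptions made for illustration; the abstract does not state the exact weighting variant, rank, or similarity measure used in the study.

```python
import numpy as np

def log_entropy_weight(counts):
    """Apply standard log-entropy weighting to a terms x documents count matrix.

    Local weight: log2(1 + f_ij). Global weight: 1 + sum_j(p_ij * log2 p_ij) / log2(n),
    where p_ij = f_ij / sum_j f_ij and n is the number of documents (n > 1).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[1]
    row_sums = counts.sum(axis=1, keepdims=True)
    p = np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)
    plogp = np.zeros_like(p)
    mask = p > 0
    plogp[mask] = p[mask] * np.log2(p[mask])
    global_w = 1.0 + plogp.sum(axis=1) / np.log2(n_docs)
    return np.log2(1.0 + counts) * global_w[:, None]

def lsi_document_vectors(weighted, k):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    _, s, vt = np.linalg.svd(weighted, full_matrices=False)
    return vt[:k].T * s[:k]  # shape: documents x k

def cosine_similarity_matrix(vectors):
    """Pair-wise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = np.divide(vectors, norms, out=np.zeros_like(vectors), where=norms > 0)
    return unit @ unit.T

# Hypothetical toy data: rows are terms, columns are per-gene abstract collections.
terms = ["cytokine", "interleukin", "signaling", "collagen", "bone"]
docs = ["tf_doc", "gene_a", "gene_b", "gene_c"]
counts = np.array([
    [3, 2, 1, 0],  # cytokine
    [2, 3, 0, 0],  # interleukin
    [1, 1, 1, 1],  # signaling (evenly spread, so its global weight is 0)
    [0, 0, 0, 3],  # collagen
    [0, 0, 1, 2],  # bone
])

weighted = log_entropy_weight(counts)
vectors = lsi_document_vectors(weighted, k=2)
similarity = cosine_similarity_matrix(vectors)

# Rank the other documents by similarity to the TF document (column 0).
ranking = sorted(zip(docs[1:], similarity[0, 1:]), key=lambda item: -item[1])
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy example the term that appears equally in every document contributes nothing after weighting, which is the intended effect of the entropy-based global weight. How such similarities would be turned into the literature enrichment p-values mentioned in the abstract is not specified there and is not sketched here.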
collection DOAJ
language English
format Article
sources DOAJ
author Phan Vinhthuy
Heinrich Kevin
Roy Sujoy
Berry Michael W
Homayouni Ramin
title Latent Semantic Indexing of PubMed abstracts for identification of transcription factor candidates from microarray derived gene sets
publisher BMC
series BMC Bioinformatics
issn 1471-2105
publishDate 2011-10-01