A LDA-based approach to promoting ranking diversity for genomics information retrieval

Abstract: Background: In the biomedical domain, the volume of data and of genomics and biomedical publications is increasing tremendously. The wealth of information has led to an increasing amount of interest in and need for applying information ret...

Bibliographic Details
Main Authors:	Chen Yan, Yin Xiaoshi, Li Zhoujun, Hu Xiaohua, Huang Jimmy
Format:	Article
Language:	English
Published:	BMC 2012-06-01
Series:	BMC Genomics

id	doaj-7db4af8d73d6421cb8ae9b1808717bde
record_format	Article
spelling	doaj-7db4af8d73d6421cb8ae9b1808717bde 2020-11-25T01:59:01Z eng BMC BMC Genomics 1471-2164 2012-06-01 13 Suppl 3 S2 10.1186/1471-2164-13-S3-S2 A LDA-based approach to promoting ranking diversity for genomics information retrieval Chen Yan Yin Xiaoshi Li Zhoujun Hu Xiaohua Huang Jimmy
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Chen Yan Yin Xiaoshi Li Zhoujun Hu Xiaohua Huang Jimmy
author_sort	Chen Yan
title	A LDA-based approach to promoting ranking diversity for genomics information retrieval
title_sort	lda-based approach to promoting ranking diversity for genomics information retrieval
publisher	BMC
series	BMC Genomics
issn	1471-2164
publishDate	2012-06-01
description	Abstract. Background: In the biomedical domain, the volume of data and of genomics and biomedical publications is increasing tremendously. The wealth of information has led to an increasing amount of interest in and need for applying information retrieval techniques to access the scientific literature in genomics and related biomedical disciplines. In many cases, the desired information of a query asked by biologists is a list of a certain type of entities covering different aspects that are related to the question, such as cells, genes, diseases, proteins, mutations, etc. Hence, it is important for a biomedical IR system to be able to provide relevant and diverse answers to fulfill biologists' information needs. However, traditional IR models are concerned only with the relevance between retrieved documents and the user query, and do not take redundancy between retrieved documents into account. This leads to high redundancy and low diversity in the retrieved ranked lists. Results: In this paper, we propose an approach that employs a generative topic model, Latent Dirichlet Allocation (LDA), to promote ranking diversity for biomedical information retrieval. Unlike other approaches or models that consider aspects at the word level, our approach assumes that aspects should be identified by the topics of retrieved documents. We use the LDA model to discover the topic distribution of retrieved passages and the word distribution of each topic dimension, and then re-rank retrieval results by topic distribution similarity between passages within a sliding window of size N. We evaluate our approach on the TREC 2007 Genomics collection with two distinct IR baseline runs, and achieve an 8% improvement over the highest Aspect MAP reported in the TREC 2007 Genomics track. Conclusions: The proposed method is the first study to adopt a topic model for genomics information retrieval, and it demonstrates effectiveness in promoting ranking diversity as well as in improving the relevance of ranked lists in genomics search. Moreover, we propose a distance measure, a modified Euclidean distance, to quantify how much a passage can increase topical diversity by considering both topical importance and topical coefficient from LDA.
