An index-based algorithm for fast on-line query processing of latent semantic analysis.

Latent Semantic Analysis (LSA) is widely used to find documents that are semantically similar to a keyword query. Although LSA yields promising similarity results, existing LSA algorithms perform many unnecessary operations in similarity computation and candidate checking during on-line qu...


Bibliographic Details
Main Authors:	Mingxi Zhang, Pohan Li, Wei Wang
Format:	Article
Language:	English
Published:	Public Library of Science (PLoS) 2017-01-01
Series:	PLoS ONE
Online Access:	http://europepmc.org/articles/PMC5433746?pdf=render

id	doaj-acaf3c46c8564c54b6904c88e4a233aa
record_format	Article
spelling	doaj-acaf3c46c8564c54b6904c88e4a233aa 2020-11-24T21:35:36Z eng Public Library of Science (PLoS) PLoS ONE 1932-6203 2017-01-01 12 5 e0177523 10.1371/journal.pone.0177523 An index-based algorithm for fast on-line query processing of latent semantic analysis. Mingxi Zhang Pohan Li Wei Wang http://europepmc.org/articles/PMC5433746?pdf=render
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Mingxi Zhang Pohan Li Wei Wang
title	An index-based algorithm for fast on-line query processing of latent semantic analysis.
publisher	Public Library of Science (PLoS)
series	PLoS ONE
issn	1932-6203
publishDate	2017-01-01
description	Latent Semantic Analysis (LSA) is widely used to find documents that are semantically similar to a keyword query. Although LSA yields promising similarity results, existing LSA algorithms perform many unnecessary operations in similarity computation and candidate checking during on-line query processing, which is costly in time and cannot respond to query requests efficiently, especially when the dataset is large. In this paper, we study the efficiency of on-line query processing for LSA, aiming to search efficiently for the documents similar to a given query. We rewrite the LSA similarity equation in terms of an intermediate value called partial similarity, which is stored in a purpose-built index called the partial index. To reduce the search space, we give an approximate form of the similarity equation and then develop an efficient algorithm for building the partial index that skips partial similarities lower than a given threshold θ. Based on the partial index, we develop an efficient algorithm called ILSA to support fast on-line query processing. The given query is transformed into a pseudo-document vector, and the similarities between the query and candidate documents are computed by accumulating the partial similarities obtained from the index nodes that correspond to non-zero entries in the pseudo-document vector. Compared to the LSA algorithm, ILSA reduces the time cost of on-line query processing by pruning unpromising candidate documents and skipping operations that contribute little to the similarity scores. Extensive experiments comparing ILSA with LSA demonstrate the efficiency and effectiveness of the proposed algorithm.
url	http://europepmc.org/articles/PMC5433746?pdf=render
_version_	1725944955249098752
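
Illustrative sketch (not taken from the article): the abstract describes precomputing partial similarities, skipping those below a threshold θ when building the index, and answering a query by accumulating the stored values for its non-zero entries. The Python below shows that general idea on toy data. Everything in it is an assumption made for illustration: the function names, the hand-written two-dimensional latent vectors, the use of a plain inner product as the similarity, and the choice to key the index by term. The article's actual similarity equation and index layout may differ.

```python
from collections import defaultdict


def build_partial_index(term_vecs, doc_vecs, theta):
    """Precompute term-to-document partial similarities, skipping values below theta.

    term_vecs and doc_vecs map ids to latent-space vectors (assumed to come
    from an LSA decomposition done elsewhere; toy values are used below).
    """
    index = {}
    for term, tvec in term_vecs.items():
        postings = {}
        for doc_id, dvec in doc_vecs.items():
            partial = sum(t * d for t, d in zip(tvec, dvec))
            if partial >= theta:  # pruning step: small contributions are not stored
                postings[doc_id] = partial
        index[term] = postings
    return index


def query(index, query_weights, top_k=3):
    """Accumulate stored partial similarities for the non-zero query entries."""
    scores = defaultdict(float)
    for term, weight in query_weights.items():
        if weight == 0:
            continue  # zero entries contribute nothing and are skipped
        for doc_id, partial in index.get(term, {}).items():
            scores[doc_id] += weight * partial
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top_k]


# Hypothetical latent vectors for three terms and three documents.
TERM_VECS = {"index": (1.0, 0.0), "query": (0.8, 0.2), "protein": (0.0, 1.0)}
DOC_VECS = {"d1": (0.9, 0.1), "d2": (0.5, 0.5), "d3": (0.0, 1.0)}

INDEX = build_partial_index(TERM_VECS, DOC_VECS, theta=0.3)
RESULT = query(INDEX, {"index": 1.0, "query": 1.0})
print(RESULT)
```

In this toy run, document d3 never becomes a candidate for the query because both of its partial similarities fall below θ and are absent from the index, which is the kind of pruning the abstract attributes to ILSA. The scores are approximate by construction, since skipped entries are treated as zero.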

Similar Items