Geometric and topological approaches to semantic text retrieval.

Bibliographic Details
Other Authors: Li, Dandan.
Format: Others
Language: English, Chinese
Published: 2007
Online Access: http://library.cuhk.edu.hk/record=b6074419
http://repository.lib.cuhk.edu.hk/en/item/cuhk-344052
Description
Summary: With the vast amount of textual information available today, the task of designing effective and efficient retrieval methods has become both more important and more complex. The Basic Vector Space Model (BVSM) is well known in information retrieval. Unfortunately, it cannot retrieve all relevant documents, since it is based on literal term matching. The Generalized Vector Space Model (GVSM) and Latent Semantic Indexing (LSI) are two well-known semantic retrieval methods, each of which assumes some underlying latent semantic structure in the dataset. However, their assumptions about where that semantic structure is located are rather strong. Moreover, the performance of LSI can vary considerably from one dataset to another, and the questions of which characteristics of a dataset contribute to this difference, and why, are not yet fully understood. The present thesis focuses on providing answers to these two questions.

In the first part of this thesis, we present a new understanding of the latent semantic space of a dataset from the dual perspective, which relaxes the assumptions described above and leads naturally to a unified kernel function for a class of vector space models. New semantic analysis methods based on the unified kernel function are developed, which combine the advantages of LSI and GVSM. We also show that the new methods are stable with respect to the choice of rank: even if the selected rank is far from the optimal one, retrieval performance does not degrade much. The experimental results of our methods on the standard test sets are promising.

In the second part of this thesis, we propose that the mathematical structure of simplexes can be attached to a term-document matrix in the vector space model (VSM) for information retrieval. The Q-analysis devised by R. H. Atkin may then be applied to analyze the topological structure of the simplexes and of the corresponding dataset. Experimental results of this analysis reveal a correlation between the effectiveness of LSI and the topological structure of the dataset. Using the information obtained from the topological analysis, we develop a new query expansion method. Experimental results show that our method can enhance the performance of VSM on datasets for which LSI is not effective. Finally, the notion of homology is introduced into the topological analysis of datasets, and its possible relation to word sense disambiguation is studied through a simple example.

Li, Dandan.
"August 2007."
Adviser: Chung-Ping Kwong.
Source: Dissertation Abstracts International, Volume: 69-02, Section: B, page: 1108.
Thesis (Ph.D.)--Chinese University of Hong Kong, 2007.
Includes bibliographical references (p. 118-120).
Electronic reproduction. Hong Kong : Chinese University of Hong Kong, [2012] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Electronic reproduction. [Ann Arbor, MI] : ProQuest Information and Learning, [200-] System requirements: Adobe Acrobat Reader. Available via World Wide Web.
Abstract in English and Chinese.
School code: 1307.
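Illustrative note (not part of the catalog record or the thesis): the summary above contrasts BVSM, GVSM, and LSI and mentions a unified kernel function. The sketch below does not reproduce that unified kernel, which the record does not specify. It only shows one common textbook way to write the three baseline models as kernels, scoring a document d against a query q as q^T K d, with K = I for BVSM, K = A A^T for GVSM, and K = U_k U_k^T for LSI. The toy matrix, the term labels, and all function names are assumptions made for illustration.

```python
import numpy as np

# Toy term-document matrix A (rows = terms, columns = documents).
# The terms and counts are invented purely for illustration.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],  # car
    [0.0, 1.0, 1.0, 0.0],  # automobile
    [1.0, 1.0, 1.0, 0.0],  # engine
    [0.0, 0.0, 0.0, 1.0],  # flower
])


def kernel_scores(K, A, q):
    """Score every document (column of A) against query q as q^T K d."""
    return q @ K @ A


def bvsm_kernel(A):
    """Literal term matching: K is the identity on term space."""
    return np.eye(A.shape[0])


def gvsm_kernel(A):
    """Term-term co-occurrence kernel K = A A^T (a common GVSM formulation)."""
    return A @ A.T


def lsi_kernel(A, k):
    """Projection onto the top-k left singular vectors: K = U_k U_k^T."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    Uk = U[:, :k]
    return Uk @ Uk.T


# Query containing only the term "car".
q = np.array([1.0, 0.0, 0.0, 0.0])

bvsm = kernel_scores(bvsm_kernel(A), A, q)
gvsm = kernel_scores(gvsm_kernel(A), A, q)
lsi = kernel_scores(lsi_kernel(A, k=1), A, q)

# Document 1 (second column) mentions "automobile" and "engine" but not "car".
print("BVSM score for document 1:", bvsm[1])  # zero: no literal term overlap
print("GVSM score for document 1:", gvsm[1])  # positive: terms co-occur elsewhere
print("LSI  score for document 1:", lsi[1])   # positive: shared latent direction
```

In this toy example the second document shares no literal term with the query, so BVSM gives it a score of zero, while the GVSM and LSI kernels both give it a positive score because "car", "automobile", and "engine" co-occur in other documents. The unrelated "flower" document scores zero under all three, up to floating-point error in the SVD.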