Summary: | Master's === National Cheng Kung University === Department of Computer Science and Information Engineering (MS/PhD Program) === 101 === In the information age, the flood of heterogeneous and uncertain media data on the Internet makes it difficult for users to extract useful information within a short time. To save manpower and to tackle information extraction from massive document collections, technologies for document representation, word clustering and document indexing have become more important than ever. With such technologies, users can save considerable time in searching for what they want to know and in retrieving the documents that fit their interests. For this reason, unsupervised learning plays a crucial role in the construction of information systems, e.g. document summarization, text categorization and information retrieval. In the literature, statistical document representation based on latent topic models has had a strong impact on natural language processing and has been developed for many applications. However, the traditional topic model is constrained by the assumptions that (1) the number of topics is fixed, (2) different topics are independent, and (3) the topics of a document are selected along a single tree path. This dissertation presents a new approach that relaxes these three assumptions and builds a flexible topic model with adaptive topic selection from heterogeneous documents.
In this thesis, we start from the topic-based document model using latent Dirichlet allocation (LDA), where the latent topics are assumed to be independent. The predictive distribution of a new document is calculated without retraining the model. By considering the dependencies between topics, we extend LDA to the hierarchical LDA, or the nested Chinese restaurant process (nCRP), where a hierarchical tree of topics, ranging from the global topic at the root layer to specific topics at the leaf layer, is constructed for document representation. The tree nodes represent word clusters at different levels. We further relax the constraint that the topic model has a fixed number of topics. A Bayesian nonparametric method is developed for data representation, in which the model selection problem is tackled by generating new topics or mixture components without bound as more documents are observed. The infinite topic model is established without limiting the number of tree layers or tree branches. When new documents are enrolled, the complexity of the probability model grows gradually and automatically according to Bayesian nonparametric theory, so the issues of model selection and over-estimation are resolved. In addition, we relax the limitation of the nCRP, in which only the topics along a single tree path are used to represent a document. The Indian buffet process (IBP) and the tree-structured stick-breaking process are introduced. The nested IBP (nIBP), or the nCRP compounded with the IBP, is proposed to conduct flexible topic selection for word representation. A new inference procedure for the proposed nIBP is designed based on collapsed Gibbs sampling. We derive the posterior probabilities of the latent variables and use them to sample model parameters as well as to draw the associated branches or select the tree paths for document representation. In the experiments, different text corpora are used to evaluate the performance of the nIBP for document representation and document retrieval.
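The Bayesian nonparametric growth described above rests on the Chinese restaurant process, in which the number of clusters (topics) is not fixed in advance but grows with the data. A minimal simulation illustrates this behavior; the function name and parameters below are illustrative, not taken from the thesis:

```python
import random

def sample_crp(num_customers, alpha, seed=0):
    """Simulate the Chinese restaurant process (CRP): customer n joins an
    existing table with probability proportional to its occupancy, or opens
    a new table with probability proportional to the concentration alpha.
    The number of tables (topics) grows with the data instead of being
    fixed a priori."""
    rng = random.Random(seed)
    table_counts = []                      # occupancy of each table so far
    assignments = []
    for n in range(num_customers):
        weights = table_counts + [alpha]   # existing tables + one new table
        total = n + alpha                  # occupancies sum to n
        r = rng.uniform(0, total)
        acc, k = 0.0, 0
        for k, w in enumerate(weights):
            acc += w
            if r <= acc:
                break
        if k == len(table_counts):         # open a new table
            table_counts.append(1)
        else:
            table_counts[k] += 1
        assignments.append(k)
    return assignments, table_counts

assignments, tables = sample_crp(100, alpha=2.0)
```

Running this repeatedly shows that the number of occupied tables grows slowly (roughly logarithmically) with the number of customers, which is why model complexity can expand automatically as new documents arrive.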
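The key difference between the CRP and the IBP mentioned above is that the IBP lets each customer (document or word) hold several dishes (features) at once, which is what permits topic selection beyond a single tree path. A stdlib-only sketch of the IBP generative process follows; the helper names are assumptions for illustration:

```python
import math
import random

def sample_ibp(num_customers, alpha, seed=0):
    """Simulate the Indian buffet process (IBP): customer n takes each
    previously sampled dish k with probability m_k / n (m_k = number of
    earlier customers who took dish k), then samples Poisson(alpha / n)
    brand-new dishes.  Unlike the CRP, each customer can hold SEVERAL
    dishes at once, so a document may activate multiple topics or paths."""
    rng = random.Random(seed)

    def poisson(lam):                       # Knuth's method, stdlib only
        L, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= L:
                return k
            k += 1

    dish_counts = []                        # m_k for each dish sampled so far
    rows = []                               # binary feature row per customer
    for n in range(1, num_customers + 1):
        row = [1 if rng.random() < m / n else 0 for m in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        new = poisson(alpha / n)            # open brand-new dishes
        dish_counts.extend([1] * new)
        rows.append(row + [1] * new)
    return rows, dish_counts

rows, dishes = sample_ibp(50, alpha=3.0)
```

The resulting binary matrix is exactly the kind of flexible feature-selection structure that, when nested with the nCRP tree, allows a document to draw topics from more than one branch.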
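The collapsed Gibbs sampling used for inference can be illustrated on plain LDA, the starting point of this thesis. In the sketch below (an assumed minimal implementation, not the thesis code for the nIBP), the topic-word and document-topic parameters are integrated out and only the topic assignment of each word token is resampled from its posterior given all other assignments:

```python
import random

def lda_collapsed_gibbs(docs, num_topics, vocab_size,
                        alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampler for plain LDA.  docs is a list of documents,
    each a list of integer word ids in [0, vocab_size)."""
    rng = random.Random(seed)
    ndk = [[0] * num_topics for _ in docs]               # doc-topic counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    nk = [0] * num_topics                                # topic totals
    z = []                                               # topic assignments
    for d, doc in enumerate(docs):                       # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                              # remove current token
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # posterior over topics for this token, all others fixed
                probs = [(ndk[d][t] + alpha) *
                         (nkw[t][w] + beta) / (nk[t] + vocab_size * beta)
                         for t in range(num_topics)]
                r = rng.uniform(0, sum(probs))
                acc = 0.0
                for k, p in enumerate(probs):
                    acc += p
                    if r <= acc:
                        break
                z[d][i] = k                              # add token back
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return z, ndk, nkw

# toy corpus: word ids in a vocabulary of size 4
docs = [[0, 0, 1, 1], [2, 3, 2, 3], [0, 1, 0, 1]]
z, ndk, nkw = lda_collapsed_gibbs(docs, num_topics=2, vocab_size=4)
```

The nIBP inference in this thesis follows the same collapse-and-resample pattern, except that the sampled latent variables also include the tree branches and paths selected for each document.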
|