Summary: | Master's === National Cheng Kung University === Department of Computer Science and Information Engineering (Master's and Doctoral Program) === 96 === As the internet grows rapidly, the amount of available information has become too large and too varied to browse. Users must spend considerable time digging out the information they need. Automatic text summarization was developed to help extract the concepts or themes from large documents or web pages. There are two types of automatic summarization: one extracts the theme based on a user's query, while the other identifies the important points directly from the original articles. The former is applied in information retrieval, and the latter is used for text summarization.
In this thesis, we present a new automatic summarization technique based on the state-of-the-art latent Dirichlet allocation (LDA) model and apply it to text summarization. Unlike traditional vector space summarization methods, we adopt a sentence-based LDA (SLDA) model to derive the summary of a document. The SLDA method can tackle the problem of unseen words by sharing information among synonyms and co-occurring words, and it extracts the truly critical sentences from the given articles. Furthermore, the SLDA method generalizes easily to new documents, for which sentence selection can be computed without retraining the model. Using the trained model parameters, the proposed method is not only applicable to text summarization but also extensible to query summarization. Experimental results show that it performs better than other methods.
|
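As a rough illustration of how sentences in a new document might be scored from already-trained topic parameters without retraining, the following is a minimal sketch. It is not the SLDA model of the thesis. It assumes a fixed topic-word table `phi` (hypothetical toy values), estimates topic mixtures with a simple folding-in EM step, and ranks sentences by cosine similarity to the document's topic mixture. The function names, the `floor` probability for unseen words, and the scoring rule are all assumptions made for this example.

```python
import math


def infer_topic_mixture(words, phi, iters=50, floor=1e-6):
    """Estimate a topic mixture for `words` while keeping `phi` fixed.

    `phi` is a list of dicts, one per topic, mapping word -> probability.
    Words missing from a topic get the small probability `floor`
    (an assumption of this sketch, not a detail from the thesis).
    """
    k = len(phi)
    theta = [1.0 / k] * k
    if not words:
        return theta
    for _ in range(iters):
        counts = [0.0] * k
        for w in words:
            # E-step: responsibility of each topic for this word.
            weights = [theta[t] * phi[t].get(w, floor) for t in range(k)]
            total = sum(weights)
            for t in range(k):
                counts[t] += weights[t] / total
        # M-step: the new mixture is the average responsibility.
        theta = [c / len(words) for c in counts]
    return theta


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_sentences(sentences, phi):
    """Return sentence indices, most similar to the document topics first."""
    tokenized = [s.lower().split() for s in sentences]
    all_words = [w for sent in tokenized for w in sent]
    doc_theta = infer_topic_mixture(all_words, phi)
    scores = [cosine(infer_topic_mixture(sent, phi), doc_theta)
              for sent in tokenized]
    return sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    # Hypothetical "trained" topic-word probabilities for two topics.
    toy_phi = [
        {"match": 0.3, "team": 0.3, "goal": 0.3, "win": 0.1},
        {"stock": 0.4, "market": 0.4, "price": 0.2},
    ]
    toy_doc = ["team win match", "goal team match", "stock market price"]
    print(rank_sentences(toy_doc, toy_phi))
```

Because `phi` stays fixed, only the per-text mixture is re-estimated, which mirrors in a simplified way the idea of selecting sentences from new documents using previously trained parameters.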