Design and Implementation of the Topic Generation Methods for Document Summarization



Bibliographic Details
Main Authors: Hen-Yao Hsu, 許恒耀
Other Authors: Chu-Sing Yang
Format: Others
Language: en_US
Published: 2008
Online Access: http://ndltd.ncl.edu.tw/handle/38102566931806279773
Description
Summary: Master's thesis === National Cheng Kung University === Institute of Computer and Communication Engineering === 96 === In recent years, more and more Internet users have come to rely on search engines to find the information they need. However, the results they retrieve often contain a large amount of irrelevant material, and users often spend considerable time locating the information they actually need. To help Internet users find what they are looking for quickly, this thesis proposes DiSco (Distribution Scoring), an efficient algorithm that automatically builds summaries of the collection of documents returned by a search engine in response to a user query. Topics are the basic units of the summaries that DiSco generates. The main idea of DiSco is to consider the distribution of lexicons in a document, on the premise that this distribution reflects the differing importance of the lexicons. Using a scoring mechanism to weight individual lexicons, DiSco generates topic sets that serve as the summaries of the document based on those weights. To evaluate the proposed algorithm, the Reuters-21578 text categorization collection and the search results for hot queries from Google Trends are used in the simulations. Several measures, such as coverage, overlap, and computation time, are employed in the evaluation. To further improve efficiency, an alternative version of DiSco is designed and implemented, and the tradeoff between computation time and summarization quality is discussed. The simulation results indicate that the proposed algorithm, which is based on the distribution of lexicons, outperforms the existing algorithms it is compared with in terms of data coverage rate, data overlap rate, and computation time. The results also indicate that the topic sets generated by the proposed algorithm are better suited to document summarization when the goal is to obtain information quickly.
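
Note: this record does not state how the thesis defines coverage and overlap. The following is only a minimal sketch of one plausible reading, assuming that each topic is associated with a set of document IDs, that coverage is the fraction of documents reachable through at least one topic, and that overlap is the mean Jaccard similarity between the document sets of every pair of topics. The function names, the example topics, and the document IDs are hypothetical and are not taken from the thesis.

```python
from itertools import combinations


def coverage(topic_docs, all_docs):
    """Fraction of documents that appear under at least one topic.

    Assumed definition, not necessarily the one used in the thesis.
    """
    all_docs = set(all_docs)
    if not all_docs:
        return 0.0
    covered = set().union(*topic_docs.values()) & all_docs
    return len(covered) / len(all_docs)


def overlap(topic_docs):
    """Mean Jaccard similarity over all pairs of topics (0.0 means disjoint).

    Assumed definition, not necessarily the one used in the thesis.
    """
    pairs = list(combinations(topic_docs.values(), 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        union = a | b
        total += len(a & b) / len(union) if union else 0.0
    return total / len(pairs)


# Hypothetical example: two topics over a result list of five documents.
example_topics = {"oil prices": {1, 2, 3}, "opec": {3, 4}}
example_docs = range(1, 6)

print(coverage(example_topics, example_docs))
print(overlap(example_topics))
```

Under these assumed definitions, a good topic set has high coverage (few documents are left unreachable) and low overlap (topics do not repeat one another).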