Discovering Topics and Their Evolution in Corpus Based on Random-Walk with Restart and Named-Entity Feature


Bibliographic Details
Main Authors: Chi-Hao Chen, 陳期皓
Other Authors: I-En Liao
Format: Others
Language: zh-TW
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/27386888925155926267
Description
Summary: Master's === National Chung Hsing University === Department of Computer Science and Engineering === 102 === When making decisions, people usually need to collect information and topics related to their problems as quickly as possible. Because of the rapid development of the Web, the amount of data grows quickly, making it difficult to classify documents and find related topics/events manually within the desired time. In this thesis, we propose a topic detection and topic evolution modeling system based on random walk with restart (RWR) and named-entity features. The system automatically discovers the topics discussed in a corpus and generates topic/event evolution information based on the temporal and content similarity between topics/events. It uses information from Wikipedia pages to categorize document terms into five predefined named-entity classes and assigns different weights to those classes in order to generate more distinctive features for each document. After feature extraction, a graph generation module builds a document similarity graph using cosine similarity. An RWR-based clustering algorithm then detects the topics/events by grouping documents that share a topic. Finally, the system generates topic/event evolution information by considering the similarity of the average features of each topic's documents, the similarity of the representative documents' features, and the temporal similarity between topics. This topic evolution information can be used to support decision making. Experimental results on real-world data show that the proposed algorithm outperforms both K-Means and Latent Dirichlet Allocation (LDA) in terms of F1-measure. In addition, the RWR-based clustering algorithm provides better topic detection quality than the other approaches in its ability to handle multi-topic documents. The results also show that the proposed model generates suitable topic/event evolution information.
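
The two middle steps of the summary (building a cosine-similarity graph over documents, then scoring documents by random walk with restart) can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names, the similarity threshold, the restart probability of 0.15, the handling of isolated nodes, and the toy documents are all hypothetical. The term-weight dictionaries stand in for the named-entity-weighted features, which this record does not specify in detail.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similarity_graph(docs, threshold=0.1):
    """Symmetric weighted adjacency matrix; edges below `threshold` are dropped.

    The threshold value is an assumption made for this sketch.
    """
    n = len(docs)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            s = cosine(docs[i], docs[j])
            if s >= threshold:
                adj[i][j] = adj[j][i] = s
    return adj


def rwr(adj, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart from `seed`.

    Iterates p <- (1 - restart) * W p + restart * e_seed, where W is the
    column-normalised adjacency matrix. Returns the relevance of every
    node to the seed as a probability vector.
    """
    n = len(adj)
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    p = [0.0] * n
    p[seed] = 1.0
    for _ in range(max_iter):
        nxt = [0.0] * n
        for j in range(n):
            if col_sums[j] == 0.0:
                # Isolated node: send its mass back to the seed (assumption).
                nxt[seed] += (1.0 - restart) * p[j]
                continue
            for i in range(n):
                if adj[i][j]:
                    nxt[i] += (1.0 - restart) * p[j] * adj[i][j] / col_sums[j]
        nxt[seed] += restart
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Toy corpus (hypothetical): two documents per topic, no shared terms across topics.
docs = [
    {"taipei": 2.0, "election": 1.0},
    {"taipei": 1.0, "election": 2.0},
    {"typhoon": 2.0, "rain": 1.0},
    {"typhoon": 1.0, "rain": 1.0},
]
scores = rwr(similarity_graph(docs), seed=0)
```

In this toy corpus the walk started at document 0 never reaches documents 2 and 3, so their scores stay at zero and documents 0 and 1 would be grouped as one topic. How the thesis turns RWR scores into clusters, and how it handles multi-topic documents, is not described in this record.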