Discovering Topics and Their Evolution in Corpus Based on Random-Walk with Restart and Named-Entity Feature

Master's === National Chung Hsing University === Department of Computer Science and Engineering === 102 === In the decision-making process, people usually need to collect information and topics related to their problems as quickly as possible. Due to the rapid development of the Web, the amount of data grows quickly, making it difficult to classify and to f...

Bibliographic Details
Main Authors: Chi-Hao Chen, 陳期皓
Other Authors: I-En Liao
Format: Others
Language: zh-TW
Published: 2014
Online Access: http://ndltd.ncl.edu.tw/handle/27386888925155926267
id ndltd-TW-102NCHU5394063
record_format oai_dc
spelling ndltd-TW-102NCHU5394063 2017-08-27T04:29:50Z http://ndltd.ncl.edu.tw/handle/27386888925155926267 Discovering Topics and Their Evolution in Corpus Based on Random-Walk with Restart and Named-Entity Feature 基於隨機漫步與詞類權重加權之文件主題偵測與歷程分析模型 Chi-Hao Chen 陳期皓 Master's National Chung Hsing University Department of Computer Science and Engineering 102 I-En Liao 廖宜恩 2014 thesis 47 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's === National Chung Hsing University === Department of Computer Science and Engineering === 102 === In the decision-making process, people usually need to collect information and topics related to their problems as quickly as possible. Due to the rapid development of the Web, the amount of data grows quickly, making it difficult to classify documents and find the related topics/events manually within the desired time. In this thesis, we propose a topic detection and topic evolution modeling system based on random walk with restart (RWR) and named-entity features. The system automatically discovers the topics discussed within a corpus and generates topic/event evolution information based on the temporal and content similarity between topics/events. It uses information from Wikipedia's web pages to categorize document terms into five predefined named-entity classes and assigns different weights to those classes in order to generate more distinctive features for each document. After the features of each document are extracted, a graph generation module computes graph-based similarity between text documents using cosine similarity. The RWR-based clustering algorithm then detects the topics/events within those documents by aggregating documents that share a topic. Finally, the system generates the topic/event evolution information for the given documents by considering three measures: the similarity of the average features of the documents each topic contains, the similarity of the representative document features, and the temporal similarity between topics. This topic evolution information can be used to support the decision-making process. Experimental results on real-world data show that the proposed algorithm outperforms both K-Means and Latent Dirichlet Allocation (LDA) in terms of F1-measure. In addition, the RWR-based clustering algorithm provides better topic detection quality than the other approaches in its ability to handle multi-topic documents. The experimental results also show that the proposed model can generate suitable topic/event evolution information.
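The graph and random-walk steps named in the description can be illustrated with a minimal sketch. It is not the thesis implementation: the toy documents, their term weights (standing in for named-entity class weighting), the restart probability of 0.15, and the function names `cosine`, `transition_matrix`, and `rwr` are all assumptions made for illustration. The sketch builds a cosine-similarity graph over documents and computes RWR proximity scores from one seed document. How the thesis turns such scores into clusters is not stated in the abstract and is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def transition_matrix(docs):
    """Column-normalized cosine-similarity graph without self-loops."""
    n = len(docs)
    sim = [[0.0 if i == j else cosine(docs[i], docs[j]) for j in range(n)]
           for i in range(n)]
    for j in range(n):
        col = sum(sim[i][j] for i in range(n))
        if col:
            for i in range(n):
                sim[i][j] /= col
    return sim


def rwr(trans, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Iterate r = (1 - c) * W r + c * e until the scores stop changing."""
    n = len(trans)
    e = [1.0 if i == seed else 0.0 for i in range(n)]
    r = e[:]
    for _ in range(max_iter):
        nxt = [(1 - restart) * sum(trans[i][j] * r[j] for j in range(n))
               + restart * e[i] for i in range(n)]
        if sum(abs(x - y) for x, y in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


# Hypothetical toy corpus: each document maps terms to assumed weights.
docs = [
    {"election": 2.0, "taipei": 1.0},
    {"election": 1.0, "taipei": 2.0, "mayor": 1.0},
    {"typhoon": 2.0, "rain": 1.0},
    {"typhoon": 1.0, "rain": 2.0, "flood": 1.0},
]
scores = rwr(transition_matrix(docs), seed=0)
```

In this toy corpus the walk started at document 0 places all of its probability mass on documents 0 and 1, which share terms, and none on documents 2 and 3. Running the walk from each document in turn gives a proximity profile per document that a clustering step could then group.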
author2 I-En Liao
author_facet I-En Liao
Chi-Hao Chen
陳期皓
author Chi-Hao Chen
陳期皓
title Discovering Topics and Their Evolution in Corpus Based on Random-Walk with Restart and Named-Entity Feature
publishDate 2014
url http://ndltd.ncl.edu.tw/handle/27386888925155926267