A Study of Personal Name Disambiguation

Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 94 === In this thesis, we study the problem of personal name disambiguation. Many individuals share the same name. The objective of our work is to identify the different individuals mentioned in a set of documents and to cluster the documents into groups such that each gro...

Bibliographic Details
Main Authors:	Yu-Chuan Wei, 魏煜娟
Other Authors:	陳信希
Format:	Others
Language:	en_US
Published:	2006
Online Access:	http://ndltd.ncl.edu.tw/handle/06536909770938821172

id	ndltd-TW-094NTU05392037
record_format	oai_dc
spelling	ndltd-TW-094NTU05392037 2015-12-16T04:38:20Z http://ndltd.ncl.edu.tw/handle/06536909770938821172 A Study of Personal Name Disambiguation 人名歧義性分析之研究 Yu-Chuan Wei 魏煜娟 Master's National Taiwan University Graduate Institute of Computer Science and Information Engineering 94 In this thesis, we study the problem of personal name disambiguation. Many individuals share the same name. The objective of our work is to identify the different individuals mentioned in a set of documents and to cluster the documents into groups such that each group relates to one person. Two types of approaches are proposed and compared. In the multiple-classifier approach, several classifiers are integrated to disambiguate the denotations of personal names. Each classifier is built on one feature. Alternatives are proposed and substituted for three of the classifiers. In the single-classifier approach, the documents are clustered all at once. Different clustering algorithms and different features are considered and compared in this approach. In the test data sets, we address the degree of awareness of an entity (household name vs. general name), the source of the material (newswire vs. web pages), and web pages from different regions (mainland China vs. Taiwan). The experimental results for the multiple-classifier approach show that personal titles and communities are two strong cues for clustering. The first two classifiers achieve very high precision, and the last three classifiers improve recall at only some expense of precision. The average F-score increases gradually from the first classifier to the last one. The results of several alternatives show that clustering personal titles directly performs better than the two-step strategy, and that terms extracted from the full text seem to introduce a great deal of noise for name disambiguation. In the single-classifier approach, high performance is achieved when all types of features are applied. Expanding communities from the Web improves performance in both approaches. The alternative in the multiple-classifier approach achieves the best F-score, 70%, an increase of about 40% over the general name disambiguation method. We close with a discussion comparing the two proposed clustering algorithms and draw conclusions. 陳信希 2006 學位論文 ; thesis 76 en_US
collection	NDLTD
language	en_US
format	Others
sources	NDLTD
description	Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 94 === In this thesis, we study the problem of personal name disambiguation. Many individuals share the same name. The objective of our work is to identify the different individuals mentioned in a set of documents and to cluster the documents into groups such that each group relates to one person. Two types of approaches are proposed and compared. In the multiple-classifier approach, several classifiers are integrated to disambiguate the denotations of personal names. Each classifier is built on one feature. Alternatives are proposed and substituted for three of the classifiers. In the single-classifier approach, the documents are clustered all at once. Different clustering algorithms and different features are considered and compared in this approach. In the test data sets, we address the degree of awareness of an entity (household name vs. general name), the source of the material (newswire vs. web pages), and web pages from different regions (mainland China vs. Taiwan). The experimental results for the multiple-classifier approach show that personal titles and communities are two strong cues for clustering. The first two classifiers achieve very high precision, and the last three classifiers improve recall at only some expense of precision. The average F-score increases gradually from the first classifier to the last one. The results of several alternatives show that clustering personal titles directly performs better than the two-step strategy, and that terms extracted from the full text seem to introduce a great deal of noise for name disambiguation. In the single-classifier approach, high performance is achieved when all types of features are applied. Expanding communities from the Web improves performance in both approaches. The alternative in the multiple-classifier approach achieves the best F-score, 70%, an increase of about 40% over the general name disambiguation method. We close with a discussion comparing the two proposed clustering algorithms and draw conclusions.
author2	陳信希
author_facet	陳信希 Yu-Chuan Wei 魏煜娟
author	Yu-Chuan Wei 魏煜娟
spellingShingle	Yu-Chuan Wei 魏煜娟 A Study of Personal Name Disambiguation
author_sort	Yu-Chuan Wei
title	A Study of Personal Name Disambiguation
title_short	A Study of Personal Name Disambiguation
title_full	A Study of Personal Name Disambiguation
title_fullStr	A Study of Personal Name Disambiguation
title_full_unstemmed	A Study of Personal Name Disambiguation
title_sort	study of personal name disambiguation
publishDate	2006
url	http://ndltd.ncl.edu.tw/handle/06536909770938821172
work_keys_str_mv	AT yuchuanwei astudyofpersonalnamedisambiguation AT wèiyùjuān astudyofpersonalnamedisambiguation AT yuchuanwei rénmíngqíyìxìngfēnxīzhīyánjiū AT wèiyùjuān rénmíngqíyìxìngfēnxīzhīyánjiū AT yuchuanwei studyofpersonalnamedisambiguation AT wèiyùjuān studyofpersonalnamedisambiguation
_version_	1718149762848391168

Similar Items