A Study of Personal Name Disambiguation

Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 94 === In this thesis, we study the problem of personal name disambiguation. Many individuals share the same name. The objective of our work is to identify different individuals from a set of documents and cluster the documents in groups such that each gro...

Full description

Bibliographic Details
Main Authors: Yu-Chuan Wei, 魏煜娟
Other Authors: 陳信希
Format: Others
Language: en_US
Published: 2006
Online Access: http://ndltd.ncl.edu.tw/handle/06536909770938821172
id ndltd-TW-094NTU05392037
record_format oai_dc
spelling ndltd-TW-094NTU05392037 2015-12-16T04:38:20Z http://ndltd.ncl.edu.tw/handle/06536909770938821172 A Study of Personal Name Disambiguation 人名歧義性分析之研究 Yu-Chuan Wei 魏煜娟 Master's National Taiwan University Graduate Institute of Computer Science and Information Engineering 94 陳信希 2006 thesis 76 en_US
collection NDLTD
language en_US
format Others
sources NDLTD
description Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 94 === In this thesis, we study the problem of personal name disambiguation. Many individuals share the same name. The objective of our work is to identify the different individuals mentioned in a set of documents and to cluster the documents into groups such that each group relates to one person. Two types of approaches are proposed and compared. In the multiple-classifier approach, several classifiers are integrated to disambiguate the denotations of personal names. Each classifier is built on one feature. Alternatives are proposed as replacements in three of the classifiers. In the single-classifier approach, the documents are clustered all at once. Different clustering algorithms and different features are considered and compared in this approach. In the test data sets, we address the degree of awareness of an entity (household name vs. general name), the source of the material (newswire vs. web pages), and web pages from different areas (Mainland China vs. Taiwan). The experimental results for the multiple-classifier approach show that personal titles and communities are two strong cues for clustering. The first two classifiers achieve very high precision, and the last three classifiers improve recall at only some expense of precision. The average F-score increases gradually from the first classifier to the last. The results for several alternatives show that clustering personal titles directly performs better than the two-step strategy, and that terms extracted from the full text seem to introduce considerable noise for name disambiguation. In the single-classifier approach, high performance is achieved when all types of features are applied. Expanding communities from the Web improves performance in both approaches. The alternative in the multiple-classifier approach achieves the best F-score, 70%, an increase of about 40% over the general name disambiguation method. We close with a comparison of the two proposed clustering algorithms and our conclusions.
author2 陳信希
author Yu-Chuan Wei
魏煜娟
title A Study of Personal Name Disambiguation
publishDate 2006
url http://ndltd.ncl.edu.tw/handle/06536909770938821172