Latent Conceptual Analysis--Using Web data in Chinese to represent conceptual knowledge about word relations in a vector space model


Bibliographic Details
Main Authors: Qian-Rong Chang, 張虔榮
Other Authors: Shu-Kai Hsieh
Format: Others
Language: en_US
Published: 2012
Online Access: http://ndltd.ncl.edu.tw/handle/20311828580113346123
Description
Summary: Master's === National Taiwan Normal University === Department of English === 100 === In Natural Language Processing, lexical patterns are widely used in experiments that measure similarity among word relations. Despite their growing importance, these patterns are rarely examined in terms of which aspects they inherit from the word relations they are claimed to represent. This thesis proposes that lexical patterns exhibit the same conceptual nature as word relations: both display conceptual qualities when they are applied in language use. It further proposes that this conceptual nature can be captured and implemented in a computational model, latent conceptual analysis (LCA), to calculate similarity among the patterns. LCA is an automatic algorithm that relies on singular value decomposition (SVD) to reduce the high dimensionality that results from a large-scale corpus. In the thesis, 35 lexical patterns are generated semi-automatically, and each is sent to LCA as input so that its distance from the other 34 patterns can be determined. To validate the performance of LCA, its output is compared with that of a manual clustering method whose standards are based on principles applied in FrameNet. The comparison shows that LCA achieves a result similar to that of manual clustering. The approach is similar to those of Turney (2006) and Bollegala et al. (2009). However, instead of relying solely on frequency distribution, LCA also takes into account language users’ conceptual knowledge about lexical patterns. Because LCA uses Web content as its corpus, the dynamic and constantly changing nature of data collected from the Web can sometimes affect its performance. It is therefore suggested that future studies applying LCA collect data over a long period to alleviate this problem.
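The summary does not give LCA's actual procedure, so the following is only a minimal sketch of the general technique it names: reduce a pattern-by-context count matrix with a truncated SVD, then compare patterns by cosine similarity in the reduced space. The function name `pattern_similarities`, the choice of `k`, and the toy counts are all assumptions made for illustration, not data or code from the thesis.

```python
import numpy as np

def pattern_similarities(counts, k):
    """Cosine similarity between rows of `counts` after a rank-k SVD projection.

    Hypothetical helper: rows stand for lexical patterns, columns for the
    contexts (e.g. word pairs) in which they were observed.
    """
    counts = np.asarray(counts, dtype=float)
    # Thin SVD: counts = U @ diag(s) @ Vt, singular values in descending order.
    U, s, _ = np.linalg.svd(counts, full_matrices=False)
    # Keep the k strongest latent dimensions for each pattern.
    reduced = U[:, :k] * s[:k]
    # Normalise rows so that the dot product equals the cosine.
    norms = np.linalg.norm(reduced, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for empty rows
    unit = reduced / norms
    return unit @ unit.T

# Toy, made-up counts: 4 patterns x 5 contexts.
counts = np.array([
    [10, 8,  0, 1, 0],
    [ 9, 7,  1, 0, 0],
    [ 0, 1, 12, 9, 7],
    [ 1, 0, 10, 8, 6],
])
sim = pattern_similarities(counts, k=2)
print(np.round(sim, 3))
```

In this toy matrix the first two rows share most of their contexts, as do the last two, so the similarity matrix should show two tight pairs. A distance between patterns can be taken as one minus the cosine.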