Summary: | Master's === National Tsing Hua University === Institute of Information Systems and Applications === 96 === New computational tools for extracting collocations are a great boon to language learners and lexicographers alike. This paper proposes a new method for organizing the very large number of collocates that these tools return into semantic thesaurus categories. The approach introduces a thesaurus-based semantic classification model that automatically learns semantic relations in order to classify adjective-noun (A-N) and verb-noun (V-N) collocations into different categories. Because they are most relevant to language learners, the research focuses on A-N and V-N collocation pairs, the frequent patterns of collocation errors. Our model uses a random walk over the vertices and edges of a weighted graph derived from WordNet semantic relations. We compute a stationary distribution over semantic labels via an iterative graph algorithm. The semantic label of a collocate is scored by a novel divergence measure, which imposes a thesaurus structure on collocation reference tools. In our experiments, the resulting semantic relatedness measure is the WordNet-based measure most highly correlated with human similarity judgments. The evaluation is conducted on a set of collocations, taken from the collocation usage boxes of the Macmillan English Dictionary, whose collocates involve varying levels of abstractness. We also present an experimental evaluation on a collection of 150 multiple-choice questions from the TOEFL synonym test, which is commonly used as a similarity benchmark. The experimental results show that a thesaurus structure is successfully imposed, that it helps enhance collocation production for L2 learners, and that the approach significantly outperforms existing collocation reference tools. The resulting semantic classification is closely consistent with human judgments, which serve as fairly refined examples for evaluating the model. 
The methodology improves the performance of collocation reference tools and imposes semantic structure on collocations, which is a good starting point for a much improved and more useful presentation of collocations. It also proves to have positive consequences for the robustness of semantic classification of collocations, an attractive feature for organizing broad-coverage machine-readable data that can be merged for catalogued uses in natural language processing.
|
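The random walk and iteratively computed stationary distribution mentioned in the summary can be illustrated with a minimal sketch. This is not the thesis's model: the toy graph, its edge weights, the function name `stationary_distribution`, and the use of a restart probability (a random walk with restart, chosen here so the iteration is guaranteed to converge) are all assumptions made for illustration, and the divergence-based label scoring is not shown.

```python
def stationary_distribution(graph, seed, restart=0.15, tol=1e-10, max_iter=1000):
    """Random walk with restart on a weighted graph, solved by power iteration.

    graph: dict mapping node -> {neighbour: positive edge weight}; every node
           must have at least one outgoing edge (hypothetical toy format).
    seed:  node the walker jumps back to with probability `restart`.
    """
    nodes = sorted(graph)
    # Row-normalise edge weights into transition probabilities.
    trans = {}
    for u in nodes:
        total = sum(graph[u].values())
        trans[u] = {v: w / total for v, w in graph[u].items()}

    # Start with all probability mass on the seed node.
    dist = {u: 1.0 if u == seed else 0.0 for u in nodes}
    for _ in range(max_iter):
        # Restart mass goes to the seed; the rest follows the edges.
        nxt = {u: (restart if u == seed else 0.0) for u in nodes}
        for u in nodes:
            for v, p in trans[u].items():
                nxt[v] += (1.0 - restart) * dist[u] * p
        delta = sum(abs(nxt[u] - dist[u]) for u in nodes)
        dist = nxt
        if delta < tol:  # stop once the distribution no longer changes
            break
    return dist


# Hypothetical toy graph; the weights stand in for relation strengths.
graph = {
    "strong": {"powerful": 2.0, "intense": 1.0},
    "powerful": {"strong": 2.0, "mighty": 1.0},
    "intense": {"strong": 1.0},
    "mighty": {"powerful": 1.0},
}
dist = stationary_distribution(graph, "strong")
for word in sorted(dist, key=dist.get, reverse=True):
    print(f"{word}: {dist[word]:.3f}")
```

In this sketch, nodes that are more strongly connected to the seed end up with more probability mass, which is the kind of relatedness signal a graph-based classifier could then compare across candidate semantic labels.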