Summary: | Master's === National Taiwan University === Graduate Institute of Computer Science and Information Engineering === 104 === This thesis investigates relation extraction, which learns semantic relations between concept pairs from text, as an approach to mining commonsense knowledge. To achieve good performance, state-of-the-art supervised learning requires a large labeled training set, which is often expensive to prepare. As an alternative, distant supervision, a semi-supervised learning method, was adopted to extract relations from unlabeled corpora. For any given relation in a knowledge base, a training set consisting of a large number of sentences can be weakly labeled automatically from the set of concept pairs known to hold that relation.
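The weak labeling step described above can be sketched as follows. This is only an illustration under stated assumptions, not the thesis's implementation: the function name `weak_label`, the English toy sentences, and the plain substring test are all hypothetical, and a Chinese corpus would need word segmentation before matching.

```python
def weak_label(sentences, seed_pairs):
    """Weakly label sentences with a relation via distant supervision.

    A sentence receives a relation label when it mentions both concepts
    of a seed pair. Substring matching is a simplifying assumption.
    """
    labeled = []
    for sentence in sentences:
        for (head, tail), relation in seed_pairs.items():
            if head in sentence and tail in sentence:
                labeled.append((sentence, head, tail, relation))
    return labeled


# Hypothetical seed pairs in the style of a commonsense knowledge base.
seed_pairs = {
    ("bird", "fly"): "CapableOf",
    ("book", "library"): "AtLocation",
}

sentences = [
    "A bird can fly across the lake.",
    "The bird did not fly because its wing was hurt.",
    "She returned the book to the library.",
    "He bought a book yesterday.",
]

labeled = weak_label(sentences, seed_pairs)
for item in labeled:
    print(item)
```

The second sentence is labeled "CapableOf" even though it does not express that the bird is able to fly on that occasion, which shows how a co-occurrence heuristic of this kind can attach a label to a sentence that does not support it.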
Labels generated with such heuristics can be quite noisy. When the sources of the sentences in the training set are not correlated with the knowledge base, the automatic labeling mechanism is unreliable. Instead of assuming that all sentences in the training set are labeled correctly, multiple instance learning learns from bags of instances, under the assumption that each positive bag contains at least one positive instance while negative bags contain only negative instances.
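The bag-level assumption can be illustrated with a minimal sketch. The names `build_bags`, `bag_score`, and `predict_bag`, the threshold of 0.5, and the instance scores are hypothetical. Taking the maximum instance score as the bag score is one common way to encode the "at least one positive instance" assumption, and it is not necessarily the algorithm used in the thesis.

```python
from collections import defaultdict


def build_bags(labeled_sentences):
    """Group weakly labeled sentences into bags, one bag per concept pair."""
    bags = defaultdict(list)
    for sentence, pair in labeled_sentences:
        bags[pair].append(sentence)
    return dict(bags)


def bag_score(instance_scores):
    """Score a bag by its highest-scoring instance.

    A bag is positive if at least one instance is positive, so the
    maximum instance score stands in for the bag-level score.
    """
    return max(instance_scores)


def predict_bag(instance_scores, threshold=0.5):
    """Predict a bag as positive when its best instance passes the threshold."""
    return bag_score(instance_scores) >= threshold


# Hypothetical weakly labeled sentences and their concept pairs.
labeled_sentences = [
    ("A bird can fly across the lake.", ("bird", "fly")),
    ("The bird did not fly because its wing was hurt.", ("bird", "fly")),
    ("She returned the book to the library.", ("book", "library")),
]
bags = build_bags(labeled_sentences)

# Hypothetical instance scores from some sentence-level classifier.
positive_bag = predict_bag([0.1, 0.8, 0.2])
negative_bag = predict_bag([0.1, 0.3])
```

Under this formulation a single convincing sentence is enough to make a bag positive, so mislabeled sentences in the same bag do not have to be treated as correct.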
We conducted experiments on relation extraction in Chinese using concept pairs in ConceptNet, a commonsense knowledge base, as the seeds for labeling a set of predefined relations. The training bags were generated from the Sinica Corpus. The performance of multiple instance learning was compared with that of single-instance learning and a few other learning algorithms. Our experiments extracted new pairs for the relations “AtLocation”, “CapableOf”, “HasProperty” and “IsA”. This study showed that a knowledge base can be extended with new concept pairs extracted from a separate corpus using the proposed approach.
|