Unsupervised learning of multiword expressions


Bibliographic Details
Main Author: Korkontzelos, Ioannis
Published: University of York 2010
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.535066
Description
Summary: Multiword expressions are expressions consisting of two or more words that correspond to some conventional way of saying things (Manning & Schütze 1999). Due to the idiomatic nature of many of them and their high frequency of occurrence in all sorts of text, they cause problems in many Natural Language Processing (NLP) applications and are frequently responsible for their shortcomings. Efficiently recognising multiword expressions and deciding the degree of their idiomaticity would be useful to all applications that require some degree of semantic processing, such as question answering, summarisation, parsing, language modelling and language generation. In this thesis we investigate the issues of recognising multiword expressions, domain-specific or not, and of deciding whether they are idiomatic. Moreover, we examine the extent to which multiword expressions can contribute to a basic NLP task such as shallow parsing, and ways in which the defining property of multiword expressions, idiomaticity, can be employed to define a novel task for Compositional Distributional Semantics (CDS). The results show that it is possible to recognise multiword expressions and decide their compositionality in an unsupervised manner, based on co-occurrence statistics and distributional semantics. Further, multiword expressions are beneficial for other fundamental applications of Natural Language Processing, either by direct integration or as an evaluation tool. In particular, termhood-based methods, which rely on nestedness information, are shown to outperform unithood-based methods, which measure the strength of association among the constituents of a multiword candidate term. A simple heuristic proved to perform better than more sophisticated methods. A new graph-based algorithm employing sense induction is proposed to address multiword expression compositionality and is shown to perform better than a standard vector space model. Its parameters were estimated by an unsupervised scheme based on graph connectivity. Multiword expressions are shown to contribute to shallow parsing. Moreover, they are used to define a new evaluation task for distributional semantic composition models.
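
To make the termhood/unithood distinction in the summary concrete, the Python sketch below contrasts a unithood-style association measure (pointwise mutual information between a candidate's constituents) with a simplified termhood-style score in the spirit of C-value, which discounts occurrences nested inside longer candidates. This is only an illustration of the two families of measures; the specific formulas, counts and candidate terms are hypothetical and are not taken from the thesis.

    import math
    from collections import Counter

    # Toy corpus statistics; all counts are illustrative only.
    unigram_counts = Counter({"machine": 120, "learning": 150})
    bigram_counts = Counter({("machine", "learning"): 60})
    total_tokens = 10_000

    def pmi(bigram):
        """Unithood-style score: pointwise mutual information between the
        constituents of a two-word candidate term."""
        w1, w2 = bigram
        p_joint = bigram_counts[bigram] / total_tokens
        return math.log2(p_joint / ((unigram_counts[w1] / total_tokens) *
                                    (unigram_counts[w2] / total_tokens)))

    def nestedness_score(candidate, candidate_counts):
        """Termhood-style score in the spirit of C-value: reward a candidate's
        own frequency but discount occurrences nested inside longer candidates."""
        freq = candidate_counts[candidate]
        longer = [c for c in candidate_counts
                  if len(c) > len(candidate) and any(
                      c[i:i + len(candidate)] == candidate
                      for i in range(len(c) - len(candidate) + 1))]
        penalty = sum(candidate_counts[c] for c in longer) / len(longer) if longer else 0.0
        return math.log2(len(candidate) + 1) * (freq - penalty)

    candidate_counts = Counter({
        ("machine", "learning"): 60,
        ("machine", "learning", "system"): 15,
        ("statistical", "machine", "learning"): 10,
    })

    print(round(pmi(("machine", "learning")), 2))                                  # association strength
    print(round(nestedness_score(("machine", "learning"), candidate_counts), 2))   # nestedness-adjusted termhood

The "standard vector space model" that the proposed graph-based algorithm is compared against can likewise be sketched as a compositionality baseline: compose the constituents' distributional vectors (here by simple addition) and measure how similar the result is to the phrase's own distributional vector, with low similarity suggesting an idiomatic expression. The vectors below are made up for illustration; they do not reproduce the thesis's data or its sense-induction method.

    import numpy as np

    def compositionality(phrase_vec, constituent_vecs):
        """Vector space baseline: compose constituent vectors by addition and
        compare with the observed phrase vector; low cosine similarity suggests
        the expression is idiomatic (non-compositional)."""
        composed = np.sum(constituent_vecs, axis=0)
        return float(np.dot(composed, phrase_vec) /
                     (np.linalg.norm(composed) * np.linalg.norm(phrase_vec)))

    # Hypothetical 4-dimensional distributional vectors, for illustration only.
    red = np.array([0.9, 0.1, 0.0, 0.2])
    tape = np.array([0.1, 0.8, 0.1, 0.0])
    red_tape = np.array([0.0, 0.1, 0.9, 0.7])  # idiomatic sense: "bureaucracy"

    print(round(compositionality(red_tape, [red, tape]), 2))  # low value, so likely idiomatic

In the thesis, the graph-based algorithm employing sense induction is reported to outperform this kind of vector space baseline; the sketch only shows the baseline being compared against.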