Collocation extraction : a generic substitution-based approach


Bibliographic Details
Main Author: Pearce, Darren Michael
Published: University of Sussex 2009
Online Access: http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.496797
Description
Summary: One of the fundamental aspects of any natural language is the set of words used within it. In addition to knowing how individual words can be combined to communicate meaning, competent language users also know a large number of specific word combinations whose grammatical or distributional behaviour or meaning is idiosyncratic. This research is concerned with computational aspects of one important type of word combination: collocation. There is no agreed formal definition of collocation, but it can be informally characterised as a sequence of words that occurs more often than would be expected by chance and whose combination tends to produce an element of added meaning. One of the often-cited characteristics of collocations is that they restrict substitution for their constituent words. This thesis develops a generic framework for the extraction of collocations that exploits this restriction. Experiments exploring the performance of such techniques use frequency counts derived from the WWW as well as large amounts of analysed text from conventional corpora, and show that substitution-based techniques can outperform many existing approaches to collocation extraction. The thesis concludes with a discussion of the many ways in which further research can leverage the genericity of the framework and utilise substitution for collocation extraction.
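The substitution restriction mentioned in the summary can be illustrated with a small sketch. This is not the framework developed in the thesis, whose details are not given in this record; it is one simple way such a measure might look. The function name `substitution_score`, the substitute lists and all frequency counts below are hypothetical and invented purely for illustration. The idea is that a word pair behaves like a collocation when it is far more frequent than the variants obtained by swapping in near-synonyms.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Hypothetical toy frequency counts (invented, not taken from any corpus).
COUNTS: Dict[Pair, int] = {
    ("strong", "tea"): 90,
    ("powerful", "tea"): 10,
    ("strong", "car"): 20,
    ("powerful", "car"): 80,
}

# Hypothetical substitute sets (e.g. near-synonyms from a thesaurus).
SUBSTITUTES: Dict[str, List[str]] = {
    "strong": ["powerful"],
    "powerful": ["strong"],
}


def substitution_score(
    pair: Pair,
    counts: Dict[Pair, int],
    substitutes: Dict[str, List[str]],
) -> float:
    """Share of occurrences held by `pair` among itself and its substituted variants.

    A value near 1.0 means substitutes rarely replace the original words,
    which is the behaviour expected of a collocation under this toy measure.
    """
    w1, w2 = pair
    original = counts.get(pair, 0)
    # Build variants by substituting each constituent word in turn.
    variants = [(s, w2) for s in substitutes.get(w1, [])]
    variants += [(w1, s) for s in substitutes.get(w2, [])]
    total = original + sum(counts.get(v, 0) for v in variants)
    return original / total if total else 0.0


if __name__ == "__main__":
    for candidate in [("strong", "tea"), ("strong", "car")]:
        score = substitution_score(candidate, COUNTS, SUBSTITUTES)
        print(candidate, round(score, 2))
```

With these invented counts, "strong tea" scores higher than "strong car" because "powerful tea" is rare while "powerful car" is common. In practice the counts could come from a corpus or from web frequency estimates, as the summary describes.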