Measuring Semantic Distance using Distributional Profiles of Concepts

Semantic distance is a measure of how close or distant in meaning two units of language are. A large number of important natural language problems, including machine translation and word sense disambiguation, can be viewed as semantic distance problems. The two dominant approaches to estimating semantic distance are WordNet-based semantic measures and corpus-based distributional measures. In this thesis, I compare the two, both qualitatively and quantitatively, and identify the limitations of each. This thesis argues that semantic distance is essentially a property of concepts (rather than words) and that two concepts are semantically close if they occur in similar contexts. Rather than building co-occurrence (distributional) profiles of words, as the distributional hypothesis suggests, I argue that distributional profiles of concepts (DPCs) can be used to infer the semantic properties of concepts and, indeed, to estimate semantic distance more accurately.

I propose a new hybrid approach to calculating semantic distance that combines corpus statistics with a published thesaurus (the Macquarie Thesaurus). The algorithm estimates the DPCs using the categories of the thesaurus as very coarse concepts and, notably, without requiring any sense-annotated data. Even though representing the vocabulary of a language with only about 1000 concepts may seem drastic, I show that the method achieves results better than the state of the art on a number of natural language tasks. I show how cross-lingual DPCs can be created by combining text in one language with a thesaurus from another. Using these cross-lingual DPCs, we can solve problems in one, possibly resource-poor, language using a knowledge source from another, possibly resource-rich, language. I show that the approach is also useful in tasks that inherently involve two or more languages, such as machine translation and multilingual text summarization.

The proposed approach is computationally inexpensive; it can estimate both semantic relatedness and semantic similarity, and it can be applied to all parts of speech. Extensive experiments on ranking word pairs by semantic distance, real-word spelling correction, solving Reader's Digest word choice problems, determining word sense dominance, word sense disambiguation, and word translation show that the new approach is markedly superior to previous ones.
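
The abstract describes the approach only at a high level. As a rough illustration of the central data structure, the sketch below builds distributional profiles of concepts from plain text plus a thesaurus that maps words to coarse categories, then compares two concepts by the cosine of their profiles. This is a minimal sketch under assumed inputs, not the thesis's actual algorithm: the toy thesaurus, example sentences, and function names are invented for illustration, and the thesis applies further refinements to the raw co-occurrence counts that are omitted here.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical thesaurus: maps each word to the coarse categories (concepts)
# it is listed under. In the thesis this role is played by the roughly 1000
# categories of the Macquarie Thesaurus; the toy entries here are invented.
thesaurus = {
    "bank":  {"FINANCE", "RIVER"},
    "money": {"FINANCE"},
    "loan":  {"FINANCE"},
    "river": {"RIVER"},
    "water": {"RIVER"},
    "shore": {"RIVER"},
}

def build_dpcs(sentences):
    """Build distributional profiles of concepts (DPCs): for every thesaurus
    category, count the words that co-occur with any word listed under it.
    No sense-annotated data is needed; an ambiguous word such as 'bank'
    simply contributes its context to every category it belongs to."""
    dpc = defaultdict(lambda: defaultdict(int))  # category -> context word -> count
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for category in thesaurus.get(word, ()):
                for j, context_word in enumerate(tokens):
                    if j != i:
                        dpc[category][context_word] += 1
    return dpc

def cosine(profile_a, profile_b):
    """Cosine similarity between two co-occurrence profiles."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[w] * profile_b[w] for w in shared)
    norm_a = sqrt(sum(v * v for v in profile_a.values()))
    norm_b = sqrt(sum(v * v for v in profile_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

sentences = [
    ["the", "bank", "approved", "the", "loan"],
    ["money", "was", "kept", "in", "the", "bank"],
    ["the", "river", "bank", "was", "covered", "in", "water"],
    ["they", "walked", "along", "the", "shore", "of", "the", "river"],
]
dpcs = build_dpcs(sentences)
print(cosine(dpcs["FINANCE"], dpcs["RIVER"]))
```

Running the example prints a single number: the cosine between the FINANCE and RIVER profiles, i.e., a concept-level (rather than word-level) estimate of semantic closeness of the kind the abstract refers to.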

Bibliographic Details
Main Author: Mohammad, Saif
Other Authors: Hirst, Graeme
Format: Others
Language: en_US
Published: 2008
Subjects: Computational Linguistics; Natural Language Processing; Lexical semantics; semantic distance; distributional similarity; semantic similarity; semantic relatedness; word–concept co-occurrence matrix; distributional profiles of concepts; thesaurus; corpus-based techniques; word senses; cross-lingual techniques; word sense dominance; word sense disambiguation; WordNet
Online Access: http://hdl.handle.net/1807/11238