Extraction of linguistic resources from multilingual corpora and their exploitation

The increasing availability of on-line and off-line multilingual resources, along with developments in the automatic tools that can process this information, such as GIZA++ (Och & Ney 2003), has made it possible to build new multilingual resources that can be used for NLP/IR tasks. Lexico...

Bibliographic Details
Main Author:	Shahid, Ahmad
Other Authors:	Kazakov, Dimitar
Published:	University of York 2012
Subjects:	005
Online Access:	http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.550283

id	ndltd-bl.uk-oai-ethos.bl.uk-550283
record_format	oai_dc
spelling	ndltd-bl.uk-oai-ethos.bl.uk-550283 | 2017-10-04T03:20:50Z | Extraction of linguistic resources from multilingual corpora and their exploitation | Shahid, Ahmad | Kazakov, Dimitar | 2012 | The increasing availability of on-line and off-line multilingual resources, along with developments in the automatic tools that can process this information, such as GIZA++ (Och & Ney 2003), has made it possible to build new multilingual resources that can be used for NLP/IR tasks. Lexicon generation is one such task; done by hand, it is expensive in both human and capital costs. Generation of multilingual lexicons can now be automated, as is done in this research work. Wikipedia, an on-line multilingual resource, was employed to automatically build multilingual lexicons using simple search strategies. The Europarl parallel corpus (Koehn 2002) was used to create multilingual sets of synonyms, which were later used to carry out the task of Word Sense Disambiguation (WSD) on the original corpus from which they were derived. The theoretical analysis of the methodology validated our approach. The multilingual sets of synonyms were then used to learn unsupervised models of word morphology in the individual languages. The set of experiments we carried out, along with another unsupervised technique, was evaluated against the gold standard. Our results compared very favorably with the other approach. The combination of the two approaches gave even better results. | 005 | University of York | http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.550283 | http://etheses.whiterose.ac.uk/2111/ | Electronic Thesis or Dissertation
collection	NDLTD
sources	NDLTD
topic	005
spellingShingle	005 Shahid, Ahmad Extraction of linguistic resources from multilingual corpora and their exploitation
description	The increasing availability of on-line and off-line multilingual resources, along with developments in the automatic tools that can process this information, such as GIZA++ (Och & Ney 2003), has made it possible to build new multilingual resources that can be used for NLP/IR tasks. Lexicon generation is one such task; done by hand, it is expensive in both human and capital costs. Generation of multilingual lexicons can now be automated, as is done in this research work. Wikipedia, an on-line multilingual resource, was employed to automatically build multilingual lexicons using simple search strategies. The Europarl parallel corpus (Koehn 2002) was used to create multilingual sets of synonyms, which were later used to carry out the task of Word Sense Disambiguation (WSD) on the original corpus from which they were derived. The theoretical analysis of the methodology validated our approach. The multilingual sets of synonyms were then used to learn unsupervised models of word morphology in the individual languages. The set of experiments we carried out, along with another unsupervised technique, was evaluated against the gold standard. Our results compared very favorably with the other approach. The combination of the two approaches gave even better results.
author2	Kazakov, Dimitar
author_facet	Kazakov, Dimitar Shahid, Ahmad
author	Shahid, Ahmad
author_sort	Shahid, Ahmad
title	Extraction of linguistic resources from multilingual corpora and their exploitation
title_short	Extraction of linguistic resources from multilingual corpora and their exploitation
title_full	Extraction of linguistic resources from multilingual corpora and their exploitation
title_fullStr	Extraction of linguistic resources from multilingual corpora and their exploitation
title_full_unstemmed	Extraction of linguistic resources from multilingual corpora and their exploitation
title_sort	extraction of linguistic resources from multilingual corpora and their exploitation
publisher	University of York
publishDate	2012
url	http://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.550283
work_keys_str_mv	AT shahidahmad extractionoflinguisticresourcesfrommultilingualcorporaandtheirexploitation
_version_	1718543046604226560

Similar Items