Choosing the most reasonable split of a compound word using Wikipedia

The purpose of this master's thesis is to use the category taxonomy of Wikipedia to determine the most reasonable split among the suggestions generated by an independent compound word splitter. The articles a word was found in can be seen as a group of contexts the word can occur in and also di...

Bibliographic Details
Main Author:	Le, Yvonne
Format:	Others
Language:	English
Published:	KTH, Skolan för datavetenskap och kommunikation (CSC) 2017
Subjects:	Compound splitting; compounding; Computer Sciences; Datavetenskap (datalogi)
Online Access:	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-202310

id	ndltd-UPSALLA1-oai-DiVA.org-kth-202310
record_format	oai_dc
spelling	ndltd-UPSALLA1-oai-DiVA.org-kth-202310; 2018-01-14T05:10:59Z; Choosing the most reasonable split of a compound word using Wikipedia; eng; Val av den rimligaste delningen av ett sammansatt ord med hjälp av Wikipedia; Le, Yvonne; KTH, Skolan för datavetenskap och kommunikation (CSC); 2017; Compound splitting; compounding; Computer Sciences; Datavetenskap (datalogi); Student thesis; info:eu-repo/semantics/bachelorThesis; text; http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-202310; application/pdf; info:eu-repo/semantics/openAccess
collection	NDLTD
language	English
format	Others
sources	NDLTD
topic	Compound splitting; compounding; Computer Sciences; Datavetenskap (datalogi)
spellingShingle	Compound splitting compounding Computer Sciences Datavetenskap (datalogi) Le, Yvonne Choosing the most reasonable split of a compound word using Wikipedia
description	The purpose of this master's thesis is to use the category taxonomy of Wikipedia to determine the most reasonable split among the suggestions generated by an independent compound word splitter. The articles in which a word is found can be seen as a group of contexts in which the word can occur, and also as different representations of the word; that is, each article is one representation of the word. Instead of analysing only the data of each single article, the intention is to find more data for each representation/context to analyse. The idea is to expand each article representing one context by including related articles in the same category. Two notions of a “reasonable split” were studied: in the first case a split consists of only two parts, and in the second case of an unlimited number of parts. The approach is well suited to choosing the correct split out of several suggestions but unsuitable for identifying compound words: more often than not it decides not to split a compound word. It is also highly dependent on the compound words appearing in Wikipedia. === Swedish abstract (translated): The purpose of this degree project is to select the most reasonable split of a compound word using Wikipedia's category taxonomy. Suggestions for different splits are generated by an independent, ready-made algorithm. The articles in which a word occurs can be seen as a group of contexts in which the word can occur and as different representations of the word. The intention is to find more data for each representation/context to analyse, instead of analysing only the article the word was found in. The idea to be tested is to expand each article representing a context by including related articles in the same category. Two different views of “reasonable splits” were studied. The first case was to split compound words into only two parts, and the second case was to split them into an indefinite number of parts. The method proved to excel at choosing the right split once it made an attempt. A major drawback was that it often chose not to split compounds even though it should have. The method is highly dependent on the compounds being present in Wikipedia.
author	Le, Yvonne
author_facet	Le, Yvonne
author_sort	Le, Yvonne
title	Choosing the most reasonable split of a compound word using Wikipedia
title_short	Choosing the most reasonable split of a compound word using Wikipedia
title_full	Choosing the most reasonable split of a compound word using Wikipedia
title_fullStr	Choosing the most reasonable split of a compound word using Wikipedia
title_full_unstemmed	Choosing the most reasonable split of a compound word using Wikipedia
title_sort	choosing the most reasonable split of a compound word using wikipedia
publisher	KTH, Skolan för datavetenskap och kommunikation (CSC)
publishDate	2017
url	http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-202310
work_keys_str_mv	AT leyvonne choosingthemostreasonablesplitofacompoundwordusingwikipedia AT leyvonne valavdenrimligastedelningenavettsammansattordmedhjalpavwikipedia
_version_	1718609726930944000

Similar Items