Outomatiese genreklassifikasie vir hulpbronskaars tale / Dirk Snyman

When working in the field of text processing, metadata about a particular text plays an important role. Metadata is often generated using automatic text classification systems, which classify a text into one or more predefined classes or categories based on its contents. One of the dimensions by...

Bibliographic Details
Main Author:	Snyman, Dirk Petrus
Language:	other
Published:	North-West University 2014
Subjects:	Genre classification; Resource scarce languages; Machine learning; Technology recycling; Human language technology; Natural language processing; Genreklassifikasie; Hulpbronskaars tale; Masjienleer; Tegnologieherwinning; Mensetaaltegnologie; Natuurliketaalprosessering
Online Access:	http://hdl.handle.net/10394/10209

id	ndltd-NWUBOLOKA1-oai-dspace.nwu.ac.za-10394-10209
record_format	oai_dc
collection	NDLTD
language	other
sources	NDLTD
topic	Genre classification; Resource scarce languages; Machine learning; Technology recycling; Human language technology; Natural language processing; Genreklassifikasie; Hulpbronskaars tale; Masjienleer; Tegnologieherwinning; Mensetaaltegnologie; Natuurliketaalprosessering
description	When working in the field of text processing, metadata about a particular text plays an important role. Metadata is often generated using automatic text classification systems, which classify a text into one or more predefined classes or categories based on its contents. One of the dimensions by which a text can be classified is its genre. This study proposes the development of an automatic genre classification system in a resource-scarce environment. The study aims to: i) investigate the techniques and approaches that are generally used for automatic genre classification systems, and identify the best approach for Afrikaans (a resource-scarce language), ii) transfer this approach to other indigenous South African resource-scarce languages, and iii) investigate the effectiveness of technology recycling for closely related languages in a resource-scarce environment. To achieve the first goal, five machine learning approaches that are generally used for text classification were identified from the literature, together with five common approaches to feature extraction. Two different approaches to the identification of genre classes are presented. The machine learning, feature extraction and genre class identification approaches were used in a series of experiments to identify the best approach to genre classification for a resource-scarce language. The best combination is identified as the multinomial naïve Bayes algorithm, using a bag-of-words approach as features to classify texts into three abstract classes. This results in an f-score (performance measure) of 0.929, and it was subsequently shown that this approach can be successfully applied to other indigenous South African languages.
To investigate the viability of technology recycling for genre classification systems for closely related languages, Dutch test data was classified using an Afrikaans genre classification system, and it is shown that this approach works well. A pre-processing step was implemented in which a machine translation system translates the Dutch texts before classification, to increase the compatibility between Afrikaans and Dutch. This results in an f-score of 0.577, indicating that technology recycling between closely related languages has merit. This approach can be used to promote and fast-track the development of genre classification systems in a resource-scarce environment. Thesis: MA (Linguistics and Literary Theory), North-West University, Potchefstroom Campus, 2013
author	Snyman, Dirk Petrus
title	Outomatiese genreklassifikasie vir hulpbronskaars tale / Dirk Snyman
publisher	North-West University
publishDate	2014
url	http://hdl.handle.net/10394/10209

Similar Items