Exploring Edit Distance for Normalising Out-of-Vocabulary Malay Words on Social Media


Bibliographic Details
Main Authors: Raja Roza Athirah, Lay-Ki Soon, Su-Cheng Haw
Format: Article
Language: English
Published: EDP Sciences 2019-01-01
Series: MATEC Web of Conferences
Online Access: https://doi.org/10.1051/matecconf/201925503001
id doaj-6aafc625aaf146a3a2ca88d9f57f2129
record_format Article
spelling doaj-6aafc625aaf146a3a2ca88d9f57f2129; 2021-02-02T02:11:23Z; eng; EDP Sciences; MATEC Web of Conferences; 2261-236X; 2019-01-01; 255; 03001; 10.1051/matecconf/201925503001; matecconf_eaaic2018_03001; Exploring Edit Distance for Normalising Out-of-Vocabulary Malay Words on Social Media; Raja Roza Athirah, Lay-Ki Soon, Su-Cheng Haw; Faculty of Computing and Informatics, Multimedia University; https://doi.org/10.1051/matecconf/201925503001
collection DOAJ
language English
format Article
sources DOAJ
author Raja Roza Athirah
Lay-Ki Soon
Su-Cheng Haw
title Exploring Edit Distance for Normalising Out-of-Vocabulary Malay Words on Social Media
publisher EDP Sciences
series MATEC Web of Conferences
issn 2261-236X
publishDate 2019-01-01
description Users interact using shortened words and abbreviations, which results in messages full of noisy words that are not recognised by the system's knowledge. The aim of this research is to overcome the limitations that still hinder progress in normalising noisy Malay words from social media platforms. The test data comprises 25,000 items: 15,000 tweets from Twitter and 10,000 comments from Facebook. Pre-processing steps were carried out to clean the entire dataset, which consists of 179,786 unique words. 36,587 out-of-vocabulary (OOV) Malay terms were then extracted and checked against an in-vocabulary (IV) Malay corpus using the Levenshtein edit distance formula and character manipulation rules. The resultant output is 3,964 unique IV Malay words. Based on the results, the use of edit distance and rules can be further improved to advance the normalisation of the ever-changing colloquial terms of the Malay language.
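The matching step named in the description, comparing an OOV term against an IV corpus by Levenshtein edit distance, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy vocabulary, the sample token, and the distance threshold of 2 are all assumptions made for illustration, and the character manipulation rules mentioned in the description are not modelled here.

```python
def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a  # distance is symmetric; keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # drop ch_a
                current[j - 1] + 1,      # drop ch_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def closest_iv_words(oov_word, iv_vocabulary, max_distance=2):
    """Return (distance, word) pairs within max_distance, nearest first.

    max_distance is an assumed threshold, not a value from the paper.
    """
    scored = [(levenshtein(oov_word, word), word) for word in iv_vocabulary]
    return sorted((d, w) for d, w in scored if d <= max_distance)


# Hypothetical toy data, not drawn from the paper's corpus.
iv_vocabulary = ["tidak", "sudah", "boleh", "makan"]
print(closest_iv_words("tdk", iv_vocabulary))
```

A linear scan like this costs one distance computation per IV word for each OOV term; for a corpus the size reported in the description, some form of candidate pruning (for example, by word length) would probably be needed.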
url https://doi.org/10.1051/matecconf/201925503001