Diacritic restoration of Turkish tweets with word2vec
Social media platforms such as Twitter have grown at a tremendous pace in recent years and have become an important source of data in countless fields. This has attracted researchers, and many machine learning and natural language processing studies were conducted o...
Main Authors: | Zeynep Ozer, Ilyas Ozer, Oguz Findik |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier, 2018-12-01 |
Series: | Engineering Science and Technology, an International Journal |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2215098618308668 |
id |
doaj-b3ad4483886d449786d4bc86a722f493 |
---|---|
record_format |
Article |
spelling |
doaj-b3ad4483886d449786d4bc86a722f493; 2020-11-24T21:06:12Z; eng; Elsevier; Engineering Science and Technology, an International Journal; ISSN 2215-0986; 2018-12-01; vol. 21, no. 6, pp. 1120-1127; Diacritic restoration of Turkish tweets with word2vec; Zeynep Ozer (corresponding author), Ilyas Ozer, Oguz Findik; Karabuk University, Karabuk, Turkey (all three authors); http://www.sciencedirect.com/science/article/pii/S2215098618308668 |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Zeynep Ozer Ilyas Ozer Oguz Findik |
author_sort |
Zeynep Ozer |
title |
Diacritic restoration of Turkish tweets with word2vec |
publisher |
Elsevier |
series |
Engineering Science and Technology, an International Journal |
issn |
2215-0986 |
publishDate |
2018-12-01 |
description |
Social media platforms such as Twitter have grown at a tremendous pace in recent years and have become an important source of data in countless fields. This has attracted researchers, and many machine learning and natural language processing studies have been conducted on social media data. However, the language used on social media contains far more noise than formal written language. In this article, we present a study on diacritic restoration, one of the major difficulties of social media text normalization, in order to reduce this noise. Diacritics are marks that change the sound values of letters and are used in many languages besides Turkish. We propose a three-step model for the diacritic restoration problem. In the first step, a candidate word generator produces possible word forms; in the second, a language validator selects the valid word forms; in the final step, Word2vec is used to create vector representations of the words, and the most appropriate word is chosen by cosine similarity. The proposed method was tested on two ad hoc datasets and on a real dataset. Experiments on the small ad hoc dataset and the real dataset gave a relative error reduction of 37.8% with an average performance of 94.5%. In addition, tests on more than 6 million words of the large ad hoc dataset yielded an error rate of 3.9%. Furthermore, the method was evaluated on a binary classification problem over highway traffic data to measure its effect on classification performance, and a 3.1% increase in classification performance was achieved. Keywords: Text mining, Diacritics restoration, Twitter, Tweet normalization |
url |
http://www.sciencedirect.com/science/article/pii/S2215098618308668 |
_version_ |
1716766418388647936 |
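
The three-step model summarized in the description (candidate generation, language validation, Word2vec-based selection by cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no implementation details, so the function names, the lexicon-lookup validator, and the tiny hand-written vectors below are all assumptions made for illustration. A real system would presumably use a proper Turkish validator and vectors trained on a large corpus.

```python
from itertools import product
from math import sqrt

# ASCII letters that may stand for a Turkish letter with a diacritic.
VARIANTS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}

# Hypothetical stand-ins: a tiny lexicon and hand-made 2-d "embeddings".
LEXICON = {"acı", "açı", "güneş", "biber", "üçgen"}
VECTORS = {
    "acı": [0.9, 0.1],    # "bitter / pain"
    "açı": [0.1, 0.9],    # "angle"
    "biber": [0.8, 0.2],  # "pepper"
    "üçgen": [0.2, 0.8],  # "triangle"
}


def generate_candidates(word):
    """Step 1: expand every ambiguous letter into all of its possible forms."""
    options = [VARIANTS.get(ch, ch) for ch in word]
    return ["".join(chars) for chars in product(*options)]


def validate(candidates, lexicon):
    """Step 2: keep only candidates the validator accepts (here: a set lookup)."""
    return [w for w in candidates if w in lexicon]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def choose(valid, context, vectors):
    """Step 3: pick the valid candidate closest to the mean context vector."""
    if not valid:
        return None
    ctx = [vectors[w] for w in context if w in vectors]
    if len(valid) == 1 or not ctx:
        return valid[0]
    mean = [sum(col) / len(ctx) for col in zip(*ctx)]
    return max(valid, key=lambda w: cosine(vectors[w], mean) if w in vectors else -1.0)


def restore(word, context, lexicon=LEXICON, vectors=VECTORS):
    """Run all three steps; fall back to the input if nothing validates."""
    valid = validate(generate_candidates(word), lexicon)
    return choose(valid, context, vectors) or word


print(restore("aci", ["biber"]))   # context about food
print(restore("aci", ["üçgen"]))   # context about geometry
```

In this toy setup the ASCII form `aci` has two valid restorations, and the context word decides between them. The exhaustive expansion in step 1 grows exponentially with the number of ambiguous letters, which is one reason a validation step before the similarity comparison is useful.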