Diacritic restoration of Turkish tweets with word2vec

Social media platforms such as Twitter have grown at a tremendous pace in recent years and have become an important source of data for countless fields. This has attracted the interest of researchers, and many studies on machine learning and natural language processing have been conducted o...

Bibliographic Details
Main Authors: Zeynep Ozer, Ilyas Ozer, Oguz Findik
Format: Article
Language: English
Published: Elsevier 2018-12-01
Series: Engineering Science and Technology, an International Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S2215098618308668
id doaj-b3ad4483886d449786d4bc86a722f493
record_format Article
spelling doaj-b3ad4483886d449786d4bc86a722f493 | 2020-11-24T21:06:12Z | eng | Elsevier | Engineering Science and Technology, an International Journal | 2215-0986 | 2018-12-01 | vol. 21, no. 6, pp. 1120-1127 | Diacritic restoration of Turkish tweets with word2vec | Zeynep Ozer (corresponding author; Karabuk University, Karabuk, Turkey), Ilyas Ozer (Karabuk University, Karabuk, Turkey), Oguz Findik (Karabuk University, Karabuk, Turkey) | http://www.sciencedirect.com/science/article/pii/S2215098618308668
collection DOAJ
language English
format Article
sources DOAJ
author Zeynep Ozer
Ilyas Ozer
Oguz Findik
title Diacritic restoration of Turkish tweets with word2vec
publisher Elsevier
series Engineering Science and Technology, an International Journal
issn 2215-0986
publishDate 2018-12-01
description Social media platforms such as Twitter have grown at a tremendous pace in recent years and have become an important source of data for countless fields. This has attracted the interest of researchers, and many studies on machine learning and natural language processing have been conducted on social media data. However, the language used in social media contains far more noise than formal written language. In this article, we present a study on diacritic restoration, one of the major difficulties of social media text normalization, in order to reduce the noise problem. Diacritics are marks that change the sound values of letters and are used in many languages besides Turkish. We propose a 3-step model to address the diacritic restoration problem. In the first step, a candidate word generator produces possible word forms; in the second step, a language validator selects the valid word forms; and in the final step, Word2vec is used to create vector representations of the words and to choose the most appropriate word using cosine similarity. The proposed method was tested on two ad-hoc created datasets and a real dataset. Experiments on the small ad-hoc created dataset and the real dataset yielded a relative error reduction of 37.8% with an average performance of 94.5%. In addition, tests on more than 6 M words from the large ad-hoc created dataset yielded an error rate of 3.9%. Furthermore, the proposed method was tested on a binary classification problem consisting of highway traffic data in order to evaluate its effect on classification performance, and a 3.1% increase in classification performance was achieved. Keywords: Text mining, Diacritics restoration, Twitter, Tweet normalization
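The three steps named in the description (candidate generation, validation, similarity-based selection) can be illustrated with a minimal sketch. This is not the authors' implementation: the letter table, the set-based lexicon standing in for the language validator, the hand-written two-dimensional vectors standing in for trained Word2vec embeddings, and all function names are assumptions made for illustration only.

```python
from itertools import product
from math import sqrt

# Assumed mapping: ASCII letter -> possible Turkish letters (lowercase only).
VARIANTS = {"c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü"}

# Toy lexicon and toy 2-d vectors; hypothetical stand-ins for a real
# validator and for embeddings trained with Word2vec.
LEXICON = {"oldu", "öldü", "hasta", "toplantı"}
VECTORS = {
    "oldu": [0.9, 0.1],      # "happened"
    "öldü": [0.1, 0.9],      # "died"
    "hasta": [0.2, 0.8],     # "patient"
    "toplantı": [0.8, 0.2],  # "meeting"
}

def generate_candidates(word):
    """Step 1: all forms obtained by optionally restoring each ambiguous letter."""
    choices = [VARIANTS.get(ch, ch) for ch in word]
    return ["".join(combo) for combo in product(*choices)]

def validate(candidates, lexicon):
    """Step 2: keep only the candidates that the lexicon accepts."""
    return [c for c in candidates if c in lexicon]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def choose(candidates, context, vectors):
    """Step 3: pick the candidate closest to the mean vector of the context."""
    if not candidates:
        return None
    ctx = [vectors[w] for w in context if w in vectors]
    if not ctx:
        return candidates[0]
    mean = [sum(col) / len(ctx) for col in zip(*ctx)]
    scored = [(cosine(vectors[c], mean), c) for c in candidates if c in vectors]
    return max(scored)[1] if scored else candidates[0]

def restore(word, context, lexicon, vectors):
    """Run the three steps; fall back to the input if nothing validates."""
    valid = validate(generate_candidates(word), lexicon)
    return choose(valid, context, vectors) or word
```

With these toy vectors, `restore("oldu", ["hasta"], LEXICON, VECTORS)` selects "öldü", while the context `["toplantı"]` keeps "oldu". Averaging the context vectors is one simple design choice; the description does not say how the context is aggregated.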
url http://www.sciencedirect.com/science/article/pii/S2215098618308668