Features of Distributional Method for Indonesian Word Clustering

We describe the results of a study to determine the best features for the EWSB (Extended Word Similarity Based) algorithm. EWSB is a word clustering algorithm that can be used for all languages with a common feature. We propose four alternative features that can be used for word similarity computation...

Bibliographic Details
Main Author: Herry Sujaini
Format: Article
Language: Indonesian
Published: Universitas Tanjungpura 2019-08-01
Series: JEPIN (Jurnal Edukasi dan Penelitian Informatika)
Subjects: n-gram; word clustering; word similarity; ewsb
Online Access: http://jurnal.untan.ac.id/index.php/jepin/article/view/33049
id doaj-3fbc07128b8b4299b66e88419daa00ff
record_format Article
spelling doaj-3fbc07128b8b4299b66e88419daa00ff | 2020-11-25T02:53:06Z | ind | Universitas Tanjungpura | JEPIN (Jurnal Edukasi dan Penelitian Informatika) | 2460-0741 | 2548-9364 | 2019-08-01 | 5 | 2 | 164 | 170 | 10.26418/jp.v5i2.33049 | 25783 | Features of Distributional Method for Indonesian Word Clustering | Herry Sujaini | 0 | Universitas Tanjungpura | We describe the results of a study to determine the best features for the EWSB (Extended Word Similarity Based) algorithm. EWSB is a word clustering algorithm that can be used for all languages with a common feature. We propose four alternative features that can be used for word similarity computation and experiment on the Indonesian language to determine the best feature format for that language. We find that the best feature for Indonesian EWSB is the t w w' format (3-gram) with 0 (zero) word relation. Moreover, 3-grams outperform 4-grams for all of the proposed features: the average recall with 3-grams is 83.50%, while the average recall with 4-grams is 57.25%. | http://jurnal.untan.ac.id/index.php/jepin/article/view/33049 | n-gram | word clustering | word similarity | ewsb
collection DOAJ
language Indonesian
format Article
sources DOAJ
author Herry Sujaini
title Features of Distributional Method for Indonesian Word Clustering
publisher Universitas Tanjungpura
series JEPIN (Jurnal Edukasi dan Penelitian Informatika)
issn 2460-0741
2548-9364
publishDate 2019-08-01
description We describe the results of a study to determine the best features for the EWSB (Extended Word Similarity Based) algorithm. EWSB is a word clustering algorithm that can be used for all languages with a common feature. We propose four alternative features that can be used for word similarity computation and experiment on the Indonesian language to determine the best feature format for that language. We find that the best feature for Indonesian EWSB is the t w w' format (3-gram) with 0 (zero) word relation. Moreover, 3-grams outperform 4-grams for all of the proposed features: the average recall with 3-grams is 83.50%, while the average recall with 4-grams is 57.25%.
topic n-gram
word clustering
word similarity
ewsb
url http://jurnal.untan.ac.id/index.php/jepin/article/view/33049