Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili

Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, a low-resource yet widely spoken language in East and Central Africa. This study proposed novel word embeddings from...


Bibliographic Details
Main Authors: Casper S. Shikali, Zhou Sijie, Liu Qihe, Refuoe Mokhosi
Format: Article
Language: English
Published: MDPI AG 2019-09-01
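Series: Applied Sciences
Subjects: syllabic alphabet; word representation vectors; deep learning; syllable-aware language model; perplexity; word analogy
Online Access: https://www.mdpi.com/2076-3417/9/18/3648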
id doaj-c2e7c75c162e45d9ada775aef0be9c65
record_format Article
spelling doaj-c2e7c75c162e45d9ada775aef0be9c65; 2020-11-25T02:07:16Z; eng; MDPI AG; Applied Sciences; ISSN 2076-3417; 2019-09-01; volume 9, issue 18, article 3648; DOI 10.3390/app9183648 (app9183648)
Author affiliation (all four authors): School of Information and Software Engineering, University of Electronic Science and Technology of China, Xiyuan Ave, West Hi-Tech Zone, Chengdu 611731, China
collection DOAJ
language English
format Article
sources DOAJ
author Casper S. Shikali
Zhou Sijie
Liu Qihe
Refuoe Mokhosi
title Better Word Representation Vectors Using Syllabic Alphabet: A Case Study of Swahili
publisher MDPI AG
series Applied Sciences
issn 2076-3417
publishDate 2019-09-01
description Deep learning has been used extensively in natural language processing, with sub-word representation vectors playing a critical role. However, this cannot be said of Swahili, a low-resource yet widely spoken language in East and Central Africa. This study proposed novel word embeddings from syllable embeddings (WEFSE) for Swahili to address the problem of word representation for agglutinative and syllabic-based languages. Inspired by how Swahili is taught in beginner classes, we encoded the syllables of each word, rather than its characters, character n-grams or morphemes, and generated quality word embeddings using a convolutional neural network. The quality of WEFSE was demonstrated by state-of-the-art results for the syllable-aware language model on both the small dataset (perplexity of 31.229) and the medium dataset (perplexity of 45.859), outperforming character-aware language models. We further evaluated the word embeddings using a word analogy task. To the best of our knowledge, syllabic alphabets have not previously been used to compose word representation vectors. The main contributions of the study are therefore a syllabic alphabet, WEFSE, a syllable-aware language model and a word analogy dataset for Swahili.
topic syllabic alphabet
word representation vectors
deep learning
syllable-aware language model
perplexity
word analogy
url https://www.mdpi.com/2076-3417/9/18/3648