A hybrid approach to Pali Sandhi segmentation using BiLSTM and rule-based analysis
Pali Sandhi is a phonetic transformation from two words into a new word. The phonemes of the neighbouring words are changed and merged. Pali Sandhi word segmentation is more challenging than Thai word segmentation because Pali is a highly inflected language. This study proposes a novel approach that...
Main Authors: | Klangjai Tammanam, Nuttachot Promrit, Sajjaporn Waijanya |
---|---|
Format: | Article |
Language: | English |
Published: | Khon Kaen University 2021-07-01 |
Series: | Engineering and Applied Science Research |
Subjects: | bilstm; pali sandhi; thai pali; rule base; pali sandhi splitting |
Online Access: | https://ph01.tci-thaijo.org/index.php/easr/article/download/243815/166489/ |
id |
doaj-4314028e98404979931b179463334755 |
---|---|
record_format |
Article |
spelling |
doaj-4314028e98404979931b179463334755 2021-07-12T04:17:45Z eng Khon Kaen University Engineering and Applied Science Research 2539-6161 2539-6218 2021-07-01 48 5 614 626 A hybrid approach to Pali Sandhi segmentation using BiLSTM and rule-based analysis Klangjai Tammanam Nuttachot Promrit Sajjaporn Waijanya Pali Sandhi is a phonetic transformation from two words into a new word. The phonemes of the neighbouring words are changed and merged. Pali Sandhi word segmentation is more challenging than Thai word segmentation because Pali is a highly inflected language. This study proposes a novel approach that predicts splitting locations by classifying the sample Sandhi words into five classes with a bidirectional long short-term memory model. We applied the classified rules to rectify the words from the splitting locations. We identified 6,345 Pali Sandhi words from Dhammapada Atthakatha. We evaluated the performance of our proposed model on the basis of the accuracy of the splitting locations and compared the results with the dataset. Results showed that 92.20% of the splitting locations were correct, 1.10% of the Pali Sandhi words were predicted as non-splitting location words and 5.83% were not matched with the answers (incomplete segmentation). https://ph01.tci-thaijo.org/index.php/easr/article/download/243815/166489/ bilstm pali sandhi thai pali rule base pali sandhi splitting |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Klangjai Tammanam Nuttachot Promrit Sajjaporn Waijanya |
title |
A hybrid approach to Pali Sandhi segmentation using BiLSTM and rule-based analysis |
publisher |
Khon Kaen University |
series |
Engineering and Applied Science Research |
issn |
2539-6161 2539-6218 |
publishDate |
2021-07-01 |
description |
Pali Sandhi is a phonetic transformation from two words into a new word. The phonemes of the neighbouring words are changed and merged. Pali Sandhi word segmentation is more challenging than Thai word segmentation because Pali is a highly inflected language. This study proposes a novel approach that predicts splitting locations by classifying the sample Sandhi words into five classes with a bidirectional long short-term memory model. We applied the classified rules to rectify the words from the splitting locations. We identified 6,345 Pali Sandhi words from Dhammapada Atthakatha. We evaluated the performance of our proposed model on the basis of the accuracy of the splitting locations and compared the results with the dataset. Results showed that 92.20% of the splitting locations were correct, 1.10% of the Pali Sandhi words were predicted as non-splitting location words and 5.83% were not matched with the answers (incomplete segmentation). |
topic |
bilstm pali sandhi thai pali rule base pali sandhi splitting |
url |
https://ph01.tci-thaijo.org/index.php/easr/article/download/243815/166489/ |