Mathematical modelling of some aspects of stressing a Lithuanian text

This dissertation deals with one component of a speech synthesizer, automatic stressing of a text, and with two related tasks: disambiguation of homographs (words that can be stressed in several ways) and a search for clitics (unstressed words). The method, which by means of decisio...

Bibliographic Details
Main Author:	Anbinderis, Tomas
Other Authors:	Ivanauskas, Feliksas
Format:	Doctoral Thesis
Language:	English
Published:	Lithuanian Academic Libraries Network (LABT) 2010
Subjects:	Informatics; Clitics; Homographs; Text stressing; Text-to-speech synthesis; Klitikai; Homografai; Teksto kirčiavimas; Balso sintezė
Online Access:	http://vddb.laba.lt/fedora/get/LT-eLABa-0001:E.02~2010~D_20100702_105219-07956/DS.005.1.01.ETD

id	ndltd-LABT_ETD-oai-elaba.lt-LT-eLABa-0001-E.02~2010~D_20100702_105219-07956
record_format	oai_dc
spelling	ndltd-LABT_ETD-oai-elaba.lt-LT-eLABa-0001-E.02~2010~D_20100702_105219-07956; 2014-01-16T03:39:21Z; 2010-07-02; eng; Informatics; Anbinderis, Tomas; Mathematical modelling of some aspects of stressing a Lithuanian text; Kai kurių lietuvių kalbos teksto kirčiavimo aspektų matematinis modeliavimas; Lithuanian Academic Libraries Network (LABT); Clitics; Homographs; Text stressing; Text-to-speech synthesis; Klitikai; Homografai; Teksto kirčiavimas; Balso sintezė; Doctoral thesis; Ivanauskas, Feliksas; Baronas, Romas; Kleiza, Vytautas; Girdenis, Aleksas Stanislovas; Sapagovas, Mifodijus; Bareiša, Eduardas; Vaicekauskas, Rimantas; Kasparaitis, Pijus; Vilnius University; Vilnius University; http://vddb.laba.lt/obj/LT-eLABa-0001:E.02~2010~D_20100702_105219-07956; LT-eLABa-0001:E.02~2010~D_20100702_105219-07956; VU-nmzaudborep-20100524-200917; http://vddb.laba.lt/fedora/get/LT-eLABa-0001:E.02~2010~D_20100702_105219-07956/DS.005.1.01.ETD; Unrestricted; application/pdf
collection	NDLTD
language	English
format	Doctoral Thesis
sources	NDLTD
topic	Informatics; Clitics; Homographs; Text stressing; Text-to-speech synthesis; Klitikai; Homografai; Teksto kirčiavimas; Balso sintezė
spellingShingle	Informatics; Clitics; Homographs; Text stressing; Text-to-speech synthesis; Klitikai; Homografai; Teksto kirčiavimas; Balso sintezė; Anbinderis, Tomas; Mathematical modelling of some aspects of stressing a Lithuanian text
description	This dissertation deals with one component of a speech synthesizer, automatic stressing of a text, and with two related tasks: disambiguation of homographs (words that are spelled identically but can be stressed in several ways) and a search for clitics (unstressed words that attach to an adjacent word). A method that uses decision trees to find letter sequences that unambiguously determine the stressing of a word was applied to stress Lithuanian text. The decision trees were built from a large corpus of stressed words. Stressing rules based on letter sequences at the beginning, in the middle, and at the end of a word were formulated. The proposed stressing algorithm reaches an accuracy of about 95.5%. The homograph disambiguation algorithm proposed by the author is based on the usage frequencies of lexemes and morphological features, obtained from a corpus of about one million words. Such methods had not previously been used for Lithuanian. The work shows that the frequencies of morphological features matter more than the frequencies of lexemes. The proposed algorithm selects the correct stress variant with an accuracy of 85.01%. For the clitic search, the author proposes four types of methods, based on: 1) recognition of combinational forms, 2) the statistical stressed/unstressed frequency of a word, 3) certain grammar rules, and 4) the distribution of stress across adjacent words (rhythm). It is explained how to combine all the methods into a single algorithm. Applied to the test data, this algorithm gave a ratio of errors to all words of 4.1% and a ratio of errors to unstressed words of 18.8%.
author2	Ivanauskas, Feliksas
author_facet	Ivanauskas, Feliksas; Anbinderis, Tomas
author	Anbinderis, Tomas
author_sort	Anbinderis, Tomas
title	Mathematical modelling of some aspects of stressing a Lithuanian text
publisher	Lithuanian Academic Libraries Network (LABT)
publishDate	2010
url	http://vddb.laba.lt/fedora/get/LT-eLABa-0001:E.02~2010~D_20100702_105219-07956/DS.005.1.01.ETD
work_keys_str_mv	AT anbinderistomas mathematicalmodellingofsomeaspectsofstressingalithuaniantext AT anbinderistomas kaikuriulietuviukalbostekstokirciavimoaspektumatematinismodeliavimas
_version_	1716624403521863680
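
The two clitic-search error ratios given in the description (4.1% of all words, 18.8% of unstressed words) can be related by simple arithmetic. The sketch below is not taken from the dissertation. It assumes that both ratios count the same errors on the same test data, in which case their quotient gives the share of unstressed words in that data. The function name implied_unstressed_share is hypothetical.

```python
def implied_unstressed_share(errors_per_word: float,
                             errors_per_unstressed: float) -> float:
    """Share of unstressed words implied by the two error ratios.

    Assumption (not stated in the record): both ratios use the same
    error count E, so (E / N_all) / (E / N_unstressed) = N_unstressed / N_all.
    """
    if errors_per_unstressed <= 0:
        raise ValueError("errors_per_unstressed must be positive")
    return errors_per_word / errors_per_unstressed


# Ratios reported for the test data in the description field.
share = implied_unstressed_share(0.041, 0.188)
print(f"Implied share of unstressed words: {share:.1%}")
```

Under that assumption, roughly one word in five in the test data would be unstressed, which explains why the same errors look much larger when measured against unstressed words only.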

Similar Items