Summary: | An important core technology in the development of human language technology
applications is an automatic morphological analyser. Such a morphological analyser
consists of various modules, one of which is a tokeniser. At present no tokeniser
exists for Afrikaans and it has therefore been impossible to develop a morphological
analyser for Afrikaans. Thus, in this research project such a tokeniser is being developed,
and the project therefore has two objectives: i) to postulate a tag set for integrated
tokenisation, and ii) to develop an algorithm for integrated tokenisation.
In order to achieve the first objective, a tag set for the tagging of sentences, named entities,
words, abbreviations and punctuation is proposed specifically for the annotation
of Afrikaans texts. It consists of 51 tags, which can be expanded in future in order to
establish a larger, more specific tag set. The postulated tag set can also be simplified
according to the level of specificity required by the user.
It is subsequently shown that an effective tokeniser cannot be developed using only
linguistic or only statistical methods. This is due to the complexity of the task: rule-based
modules should be used for certain processes (for example sentence recognition),
while other processes (for example named-entity recognition) can only be executed
successfully by means of a machine-learning module. It is argued that a hybrid
system (a system where rule-based and statistical components are integrated) would
achieve the best results on Afrikaans tokenisation.
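The hybrid design described above can be illustrated with a minimal sketch. It is not the system developed in the thesis: the abbreviation list, the gazetteer, the four tag names and the function names are all hypothetical, and the machine-learning module is replaced by a toy callable where a trained classifier would be plugged in.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical resources; a real system would use far larger lists.
ABBREVIATIONS = {"mnr.", "dr.", "prof.", "bv."}
GAZETTEER = {"Potchefstroom"}


def split_sentences(text: str) -> List[str]:
    """Rule-based step: end a sentence after . ! ? unless the token is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


def split_tokens(sentence: str) -> List[str]:
    """Rule-based step: keep known abbreviations whole, split punctuation off other words."""
    tokens: List[str] = []
    for chunk in sentence.split():
        if chunk.lower() in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.extend(re.findall(r"\w+(?:-\w+)*|[^\w\s]", chunk))
    return tokens


def toy_ne_classifier(tokens: List[str], i: int) -> bool:
    """Stand-in for the statistical module (assumption: any trained classifier fits here)."""
    return tokens[i] in GAZETTEER or (i > 0 and tokens[i][:1].isupper())


def tokenise(
    text: str,
    is_named_entity: Callable[[List[str], int], bool] = toy_ne_classifier,
) -> List[List[Tuple[str, str]]]:
    """Combine the rule-based steps with the pluggable classifier; tags are illustrative."""
    result = []
    for sentence in split_sentences(text):
        tokens = split_tokens(sentence)
        tagged = []
        for i, tok in enumerate(tokens):
            if tok.lower() in ABBREVIATIONS:
                tag = "ABBR"
            elif not any(ch.isalnum() for ch in tok):
                tag = "PUNCT"
            elif is_named_entity(tokens, i):
                tag = "NE"
            else:
                tag = "WORD"
            tagged.append((tok, tag))
        result.append(tagged)
    return result
```

Calling `tokenise("Mnr. Botha woon in Potchefstroom. Hy werk daar.")` yields two tagged sentences. Passing the classifier in as a parameter keeps the rule-based and statistical components separate, so either can be replaced without touching the other.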
Various rule-based and statistical techniques, including a TiMBL-based classifier, are
then employed to develop such a hybrid tokeniser for Afrikaans. The final tokeniser
achieves an f-score of 97.25% when the complete set of tags is used. For sentence
recognition an f-score of 100% is achieved. The tokeniser also recognises 81.39% of
named entities. When a simplified tag set (consisting of only 12 tags) is used to annotate
named entities, the f-score rises to 94.74%.
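For readers unfamiliar with the metric, the following sketch shows how an f-score is conventionally computed from precision and recall. The summary does not state which weighting was used, so the balanced default (beta = 1) is an assumption, and the function name is illustrative.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall; beta = 1 gives the balanced F1."""
    precision = true_pos / (true_pos + false_pos) if (true_pos + false_pos) else 0.0
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With 90 correct tags, 10 spurious tags and 10 missed tags, precision and recall are both 0.9, so the balanced score is also 0.9.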
The conclusion of the study is that a hybrid approach is indeed suitable for Afrikaans
sentencisation, named-entity recognition and tokenisation. The tokeniser will improve
if it is trained with more data, while the expansion of gazetteers as well as the
tag set will also lead to a more accurate system. === Thesis (M.A. (Applied Language and Literary Studies))--North-West University, Potchefstroom Campus, 2006.
|