Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning

Language technology is when a computer processes human languages in some way. Since human languages are irregular and hard to define in detail, this is often difficult. Despite this, good results can frequently be achieved. Creating these systems often requires a lot of manual work, though. While t...


Bibliographic Details
Main Author: Sjöbergh, Jonas
Format: Doctoral Thesis
Language: English
Published: KTH, Numerisk Analys och Datalogi, NADA 2006
Subjects: computer science; Datalogi
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4023
http://nbn-resolving.de/urn:isbn:91-7178-356-3
id ndltd-UPSALLA1-oai-DiVA.org-kth-4023
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-kth-4023
2013-01-08T13:06:38Z
Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning
eng
Sjöbergh, Jonas
KTH, Numerisk Analys och Datalogi, NADA
Stockholm : KTH
2006
computer science
Computer science
Datalogi
QC 20100920
Doctoral thesis, monograph
info:eu-repo/semantics/doctoralThesis
text
http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4023
urn:isbn:91-7178-356-3
Trita-CSC-A, 1653-5723 ; 2006:6
application/pdf
info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Doctoral Thesis
sources NDLTD
topic computer science
Computer science
Datalogi
spellingShingle computer science
Computer science
Datalogi
Sjöbergh, Jonas
Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning
description Language technology is when a computer processes human languages in some way. Since human languages are irregular and hard to define in detail, this is often difficult. Despite this, good results can frequently be achieved. Creating these systems often requires a lot of manual work, though. While this usually gives good results, it is not always desirable. For smaller languages the resources for manual work might not be available, since such work is usually time-consuming and expensive. This thesis discusses methods for language processing where manual work is kept to a minimum. Instead, the computer does most of the work. This usually means basing the language processing methods on statistical information. Methods of this kind can normally be applied to languages other than those they were originally developed for, without requiring much manual work for the language transition. The first half of the thesis mainly deals with methods that are useful as tools for other language processing methods. Ways to improve part-of-speech tagging, which is an important part of many language processing systems, without using manual work are examined. Statistical methods for the analysis of compound words, also useful in language processing, are discussed as well. The first part is rounded off by a presentation of methods for evaluating language processing systems. As languages are not very clearly defined, it is hard to prove that a system does anything useful. It is therefore very important to evaluate systems to see whether they are useful. Evaluation usually entails manual work, but in this thesis two methods requiring minimal manual work are presented. One reuses a manually developed resource to evaluate properties other than those it was originally intended for, with no extra work. The other method shows how to calculate an estimate of the system performance without using any manual work at all.
In the second half of the thesis, language technology tools that are in themselves useful for a human user are presented. These include statistical methods for detecting errors in texts. These methods complement traditional methods, which are based on manually written error detection rules, for instance by being able to detect errors that the rule writer could not imagine writers making. Two methods for automatic summarization are also presented. One is based on comparing the overall impression of the summary to that of the original text, using statistical methods for measuring the contents of a text. The second method tries to mitigate the common problem of very sudden topic shifts in automatically generated summaries. After this, a modified method is presented for automatically creating a lexicon between two languages by using lexicons to a common intermediary language. This type of method is useful since many language pairs in the world lack a lexicon, while many languages have lexicons available with translations to one of the larger languages of the world, for instance English. The modifications were intended to improve the coverage of the lexicon, possibly at the cost of lower translation quality. Finally, a program for generating puns in Japanese is presented. The generated puns are not very funny; the main purpose of the program is to test the hypothesis that using "bad words" makes things a little bit funnier.
QC 20100920
author Sjöbergh, Jonas
author_facet Sjöbergh, Jonas
author_sort Sjöbergh, Jonas
title Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning
title_short Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning
title_full Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning
title_fullStr Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning
title_full_unstemmed Language Technology for the Lazy : Avoiding Work by Using Statistics and Machine Learning
title_sort language technology for the lazy : avoiding work by using statistics and machine learning
publisher KTH, Numerisk Analys och Datalogi, NADA
publishDate 2006
url http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-4023
http://nbn-resolving.de/urn:isbn:91-7178-356-3
work_keys_str_mv AT sjoberghjonas languagetechnologyforthelazyavoidingworkbyusingstatisticsandmachinelearning
_version_ 1716508952200478720
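
The description mentions building a lexicon between two languages from their lexicons to a common intermediary language. The following is a minimal sketch of that general idea only, not the modified method of the thesis, which the record does not detail. The function names (invert, pivot_lexicon), the ranking by number of shared pivot translations, and the toy Swedish/Japanese/English entries are all assumptions made for illustration.

```python
from collections import defaultdict


def invert(lexicon):
    """Map each pivot-language word to the source words that translate to it."""
    inverted = defaultdict(set)
    for word, translations in lexicon.items():
        for pivot_word in translations:
            inverted[pivot_word].add(word)
    return inverted


def pivot_lexicon(a_to_pivot, b_to_pivot):
    """Pair words of languages A and B that share a pivot translation.

    Returns {a_word: [(b_word, shared_count), ...]}, with candidates
    sorted by descending shared_count and then alphabetically.
    """
    pivot_to_b = invert(b_to_pivot)
    result = {}
    for a_word, pivots in a_to_pivot.items():
        counts = defaultdict(int)
        for pivot_word in pivots:
            for b_word in pivot_to_b.get(pivot_word, ()):
                counts[b_word] += 1
        if counts:
            result[a_word] = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return result


# Toy lexicons, invented for this example (English is the pivot).
swedish_to_english = {"hund": {"dog", "hound"}, "tak": {"roof", "ceiling"}}
japanese_to_english = {"inu": {"dog"}, "yane": {"roof"}, "tenjou": {"ceiling"}}

lexicon = pivot_lexicon(swedish_to_english, japanese_to_english)
for word, candidates in lexicon.items():
    print(word, candidates)
```

Keeping every candidate that shares at least one pivot word favours coverage; requiring a higher shared_count would trade coverage for precision, which is the kind of coverage/quality trade-off the description alludes to.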