An Unsupervised Approach to Detecting and Correcting Errors in Text

In practice, most approaches for text error detection and correction are based on a conventional domain-dependent background dictionary that represents a fixed and static collection of correct words of a given language and, as a result, satisfactory correction can only be achieved if the dictionary...

Bibliographic Details
Main Author: Islam, Md Aminul
Language: en
Published: 2011
Subjects: Text Error Detection; Spelling Error; Google n-gram; Unsupervised; Text Error Correction
Online Access: http://hdl.handle.net/10393/20049
id ndltd-LACETR-oai-collectionscanada.gc.ca-OOU-OLD.-20049
record_format oai_dc
spelling ndltd-LACETR-oai-collectionscanada.gc.ca-OOU-OLD.-20049; 2013-04-05T03:20:43Z; An Unsupervised Approach to Detecting and Correcting Errors in Text; Islam, Md Aminul; Text Error Detection; Spelling Error; Google n-gram; Unsupervised; Text Error Correction; 2011-06-01T19:11:01Z; 2011-06-01T19:11:01Z; 2011; 2011-06-01; Thèse / Thesis; http://hdl.handle.net/10393/20049; en
collection NDLTD
language en
sources NDLTD
topic Text Error Detection
Spelling Error
Google n-gram
Unsupervised
Text Error Correction
description In practice, most approaches to text error detection and correction are based on a conventional domain-dependent background dictionary that represents a fixed and static collection of correct words of a given language. As a result, satisfactory correction can be achieved only if the dictionary covers most tokens of the underlying correct text. Moreover, most approaches to text correction handle only one or, at best, a few types of errors. The purpose of this thesis is to propose an unsupervised approach to detecting and correcting text errors that can compete with supervised approaches, and to answer the following questions: Can an unsupervised approach efficiently detect and correct a text containing multiple errors of both syntactic and semantic nature? What is the magnitude of error coverage, in terms of the number of errors that can be corrected? We conclude that (1) it is possible for an unsupervised approach to efficiently detect and correct a text containing multiple errors of both syntactic and semantic nature. Error types include real-word spelling errors, typographical errors, lexical choice errors, unwanted words, missing words, prepositional errors, article errors, punctuation errors, and many grammatical errors (e.g., errors in agreement and verb formation). (2) The magnitude of error coverage, in terms of the number of errors that can be corrected, is almost double the number of correct words in the text. Although this is not the upper limit, it is what is practically feasible. We use engineering approaches to answer the first question and theoretical approaches to answer and support the second question. We show that finding inherent properties of a correct text using a corpus in the form of an n-gram data set is more appropriate and practical than using other approaches to detecting and correcting errors.
Instead of using rule-based approaches and dictionaries, we argue that a corpus can effectively be used to infer the properties of these types of errors, and to detect and correct them. We test the robustness of the proposed approach separately for some individual error types, and then for all types of errors. The approach is language-independent: it can be applied to other languages as long as n-grams are available. The results of this thesis thus suggest that unsupervised approaches, which are often dismissed in favor of supervised ones in many Natural Language Processing (NLP) tasks, may present an interesting array of NLP-related problem-solving strengths.
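The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it describes: using n-gram counts, rather than a dictionary or rules, to detect and correct a real-word error. The trigram counts, the function names (`edit_distance`, `context_score`, `correct`), and the edit-distance threshold are all hypothetical illustrations and are not taken from the thesis or from the Google n-gram data set.

```python
from collections import Counter

# Hypothetical toy trigram counts standing in for a large n-gram data set.
TRIGRAM_COUNTS = Counter({
    ("this", "is", "more"): 500,
    ("is", "more", "than"): 800,
    ("more", "than", "one"): 1200,
    ("than", "one", "word"): 300,
    ("more", "then", "one"): 3,
    ("and", "then", "one"): 200,
})

# Candidate replacements come from the n-gram data itself, not a dictionary.
VOCAB = sorted({w for gram in TRIGRAM_COUNTS for w in gram})


def edit_distance(a, b):
    """Plain Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def context_score(tokens, i, candidate):
    """Sum the counts of every trigram that covers position i
    when tokens[i] is replaced by candidate."""
    trial = tokens[:i] + [candidate] + tokens[i + 1:]
    total = 0
    for start in range(max(0, i - 2), min(i, len(trial) - 3) + 1):
        total += TRIGRAM_COUNTS[tuple(trial[start:start + 3])]
    return total


def correct(tokens, max_distance=1):
    """Replace a token when a similar word fits its context better.
    Ties are resolved in favour of the original word."""
    out = list(tokens)
    for i, word in enumerate(tokens):
        candidates = [w for w in VOCAB
                      if edit_distance(word, w) <= max_distance]
        out[i] = max(candidates + [word],
                     key=lambda w: (context_score(out, i, w), w == word))
    return out


if __name__ == "__main__":
    print(" ".join(correct("this is more then one word".split())))
```

In this toy setting "then" is a valid word, so a dictionary lookup would not flag it; only the surrounding trigram counts suggest "than" as the better fit, while "and then one" is left unchanged because its own context supports it.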
author Islam, Md Aminul
title An Unsupervised Approach to Detecting and Correcting Errors in Text
publishDate 2011
url http://hdl.handle.net/10393/20049