An Unsupervised Approach to Detecting and Correcting Errors in Text

In practice, most approaches for text error detection and correction are based on a conventional domain-dependent background dictionary that represents a fixed and static collection of correct words of a given language and, as a result, satisfactory correction can only be achieved if the dictionary...

Bibliographic Details
Main Author:	Islam, Md Aminul
Language:	en
Published:	2011
Subjects:	Text Error Detection; Spelling Error; Google n-gram; Unsupervised; Text Error Correction
Online Access:	http://hdl.handle.net/10393/20049

id	ndltd-LACETR-oai-collectionscanada.gc.ca-OOU-OLD.-20049
record_format	oai_dc
spelling	ndltd-LACETR-oai-collectionscanada.gc.ca-OOU-OLD.-20049 | 2013-04-05T03:20:43Z | An Unsupervised Approach to Detecting and Correcting Errors in Text | Islam, Md Aminul | Text Error Detection | Spelling Error | Google n-gram | Unsupervised | Text Error Correction | 2011-06-01T19:11:01Z | 2011-06-01T19:11:01Z | 2011 | 2011-06-01 | Thèse / Thesis | http://hdl.handle.net/10393/20049 | en
collection	NDLTD
language	en
sources	NDLTD
topic	Text Error Detection; Spelling Error; Google n-gram; Unsupervised; Text Error Correction
description	In practice, most approaches for text error detection and correction are based on a conventional domain-dependent background dictionary that represents a fixed and static collection of correct words of a given language and, as a result, satisfactory correction can only be achieved if the dictionary covers most tokens of the underlying correct text. Moreover, most approaches for text correction handle only one or, at best, a very few types of errors. The purpose of this thesis is to propose an unsupervised approach to detecting and correcting text errors that can compete with supervised approaches and answer the following questions: Can an unsupervised approach efficiently detect and correct a text containing multiple errors of both syntactic and semantic nature? What is the magnitude of error coverage, in terms of the number of errors that can be corrected? We conclude that (1) it is possible for an unsupervised approach to efficiently detect and correct a text containing multiple errors of both syntactic and semantic nature. Error types include: real-word spelling errors, typographical errors, lexical choice errors, unwanted words, missing words, prepositional errors, article errors, punctuation errors, and many grammatical errors (e.g., errors in agreement and verb formation). (2) The magnitude of error coverage, in terms of the number of errors that can be corrected, is almost double the number of correct words in the text. Although this is not the upper limit, this is what is practically feasible. We use engineering approaches to answer the first question and theoretical approaches to answer and support the second question. We show that finding inherent properties of a correct text using a corpus in the form of an n-gram data set is more appropriate and practical than using other approaches to detecting and correcting errors.
Instead of using rule-based approaches and dictionaries, we argue that a corpus can effectively be used to infer the properties of these types of errors, and to detect and correct these errors. We test the robustness of the proposed approach separately for some individual error types, and then for all types of errors. The approach is language-independent: it can be applied to other languages, as long as n-grams are available. The results of this thesis thus suggest that unsupervised approaches, which are often dismissed in favor of supervised ones in the context of many Natural Language Processing (NLP) related tasks, may present an interesting array of NLP-related problem-solving strengths.
author	Islam, Md Aminul
author_facet	Islam, Md Aminul
author_sort	Islam, Md Aminul
title	An Unsupervised Approach to Detecting and Correcting Errors in Text
title_short	An Unsupervised Approach to Detecting and Correcting Errors in Text
title_full	An Unsupervised Approach to Detecting and Correcting Errors in Text
title_fullStr	An Unsupervised Approach to Detecting and Correcting Errors in Text
title_full_unstemmed	An Unsupervised Approach to Detecting and Correcting Errors in Text
title_sort	unsupervised approach to detecting and correcting errors in text
publishDate	2011
url	http://hdl.handle.net/10393/20049
