Implications of Punctuation Mark Normalization on Text Retrieval

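This research investigated issues related to normalizing punctuation marks from a text retrieval perspective. A punctuated-centric approach was undertaken by exploring changes in meanings, whitespaces, words retrievability, and other issues related to normalizing punctuation marks. To investigate punctuation normalization issues, various frequency counts of punctuation marks and punctuation patterns were conducted using the text drawn from the Gutenberg Project archive and the Usenet Newsgroup archive. A number of useful punctuation mark types that could aid in analyzing punctuation marks were discovered. This study identified two types of punctuation normalization procedures: (1) lexical independent (LI) punctuation normalization and (2) lexical oriented (LO) punctuation normalization. Using these two types of punctuation normalization procedures, this study discovered various effects of punctuation normalization in terms of different search query types. By analyzing the punctuation normalization problem in this manner, a wide range of issues were discovered such as: the need to define different types of searching, to disambiguate the role of punctuation marks, to normalize whitespaces, and indexing of punctuated terms. This study concluded that to achieve the most positive effect in a text retrieval environment, normalizing punctuation marks should be based on an extensive systematic analysis of punctuation marks and punctuation patterns and their related factors. The results of this study indicate that there were many challenges due to complexity of language. Further, this study recommends avoiding a simplistic approach to punctuation normalization.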
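As a rough illustration of why punctuation normalization can change whitespace and word retrievability, the sketch below compares two deliberately simplistic strategies: deleting punctuation, and replacing it with a space. The function names and sample terms are hypothetical, and neither strategy is taken from the thesis; in particular, they are not the LI or LO procedures the study defines.

```python
import string

# ASCII punctuation only; this character set is an assumption for the sketch.
_PUNCT = string.punctuation


def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character, leaving whitespace untouched."""
    return text.translate(str.maketrans("", "", _PUNCT))


def punctuation_to_space(text: str) -> str:
    """Replace each ASCII punctuation character with a space, then collapse whitespace runs."""
    spaced = text.translate(str.maketrans(_PUNCT, " " * len(_PUNCT)))
    return " ".join(spaced.split())


def tokens(text: str, normalize) -> list[str]:
    """Normalize, lowercase, and split on whitespace to get index/query terms."""
    return normalize(text).lower().split()


if __name__ == "__main__":
    # Hypothetical sample terms chosen to show how the two strategies diverge.
    for term in ["U.S.A.", "e-mail", "can't", "C++", "C"]:
        print(
            f"{term!r:10}",
            "strip:", tokens(term, strip_punctuation),
            "space:", tokens(term, punctuation_to_space),
        )
```

Under these assumptions, "e-mail" becomes one term with the first strategy and two with the second, so a query only matches if it is normalized the same way as the index. "C++" and "C" collapse to the same term under both strategies, which is one way a simplistic approach can lose distinctions a searcher cares about.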

Bibliographic Details
Main Author: Kim, Eungi
Other Authors: Moen, William E.
Format: Others
Language: English
Published: University of North Texas 2013
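Subjects: Punctuation marks; text retrieval; normalization; information retrieval; non-alphanumeric characters
Online Access: https://digital.library.unt.edu/ark:/67531/metadc500160/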
id ndltd-unt.edu-info-ark-67531-metadc500160
record_format oai_dc
spelling ndltd-unt.edu-info-ark-67531-metadc500160
2017-03-17T08:41:07Z
Implications of Punctuation Mark Normalization on Text Retrieval
Kim, Eungi
Punctuation marks
text retrieval
normalization
information retrieval
non-alphanumeric characters
This research investigated issues related to normalizing punctuation marks from a text retrieval perspective. A punctuated-centric approach was undertaken by exploring changes in meanings, whitespaces, words retrievability, and other issues related to normalizing punctuation marks. To investigate punctuation normalization issues, various frequency counts of punctuation marks and punctuation patterns were conducted using the text drawn from the Gutenberg Project archive and the Usenet Newsgroup archive. A number of useful punctuation mark types that could aid in analyzing punctuation marks were discovered. This study identified two types of punctuation normalization procedures: (1) lexical independent (LI) punctuation normalization and (2) lexical oriented (LO) punctuation normalization. Using these two types of punctuation normalization procedures, this study discovered various effects of punctuation normalization in terms of different search query types. By analyzing the punctuation normalization problem in this manner, a wide range of issues were discovered such as: the need to define different types of searching, to disambiguate the role of punctuation marks, to normalize whitespaces, and indexing of punctuated terms. This study concluded that to achieve the most positive effect in a text retrieval environment, normalizing punctuation marks should be based on an extensive systematic analysis of punctuation marks and punctuation patterns and their related factors. The results of this study indicate that there were many challenges due to complexity of language. Further, this study recommends avoiding a simplistic approach to punctuation normalization.
University of North Texas
Moen, William E.
O’Connor, Brian C.
Wasson, Christina
2013-08
Thesis or Dissertation
Text
https://digital.library.unt.edu/ark:/67531/metadc500160/
ark: ark:/67531/metadc500160
English
Public
Kim, Eungi
Copyright
Copyright is held by the author, unless otherwise noted. All rights Reserved.
collection NDLTD
language English
format Others
sources NDLTD
topic Punctuation marks
text retrieval
normalization
information retrieval
non-alphanumeric characters
spellingShingle Punctuation marks
text retrieval
normalization
information retrieval
non-alphanumeric characters
Kim, Eungi
Implications of Punctuation Mark Normalization on Text Retrieval
description This research investigated issues related to normalizing punctuation marks from a text retrieval perspective. A punctuated-centric approach was undertaken by exploring changes in meanings, whitespaces, words retrievability, and other issues related to normalizing punctuation marks. To investigate punctuation normalization issues, various frequency counts of punctuation marks and punctuation patterns were conducted using the text drawn from the Gutenberg Project archive and the Usenet Newsgroup archive. A number of useful punctuation mark types that could aid in analyzing punctuation marks were discovered. This study identified two types of punctuation normalization procedures: (1) lexical independent (LI) punctuation normalization and (2) lexical oriented (LO) punctuation normalization. Using these two types of punctuation normalization procedures, this study discovered various effects of punctuation normalization in terms of different search query types. By analyzing the punctuation normalization problem in this manner, a wide range of issues were discovered such as: the need to define different types of searching, to disambiguate the role of punctuation marks, to normalize whitespaces, and indexing of punctuated terms. This study concluded that to achieve the most positive effect in a text retrieval environment, normalizing punctuation marks should be based on an extensive systematic analysis of punctuation marks and punctuation patterns and their related factors. The results of this study indicate that there were many challenges due to complexity of language. Further, this study recommends avoiding a simplistic approach to punctuation normalization.
author2 Moen, William E.
author_facet Moen, William E.
Kim, Eungi
author Kim, Eungi
author_sort Kim, Eungi
title Implications of Punctuation Mark Normalization on Text Retrieval
title_short Implications of Punctuation Mark Normalization on Text Retrieval
title_full Implications of Punctuation Mark Normalization on Text Retrieval
title_fullStr Implications of Punctuation Mark Normalization on Text Retrieval
title_full_unstemmed Implications of Punctuation Mark Normalization on Text Retrieval
title_sort implications of punctuation mark normalization on text retrieval
publisher University of North Texas
publishDate 2013
url https://digital.library.unt.edu/ark:/67531/metadc500160/
work_keys_str_mv AT kimeungi implicationsofpunctuationmarknormalizationontextretrieval
_version_ 1718432252820455424