An investigation of the linguistic characteristics of Japanese information retrieval

This dissertation examines and demonstrates the effective use of linguistic knowledge in information retrieval (IR) technology. This linguistic IR research has a long history of serious but unfortunately often unsuccessful endeavors, but our retrieval experiments generally confirmed a significant pe...


Bibliographic Details
Main Author: Fujii, Hideo
Language: ENG
Published: ScholarWorks@UMass Amherst 1998
Subjects: Computer science
Online Access: https://scholarworks.umass.edu/dissertations/AAI9823737
id ndltd-UMASS-oai-scholarworks.umass.edu-dissertations-2988
record_format oai_dc
spelling ndltd-UMASS-oai-scholarworks.umass.edu-dissertations-2988 2020-12-02T14:29:17Z An investigation of the linguistic characteristics of Japanese information retrieval Fujii, Hideo This dissertation examines and demonstrates the effective use of linguistic knowledge in information retrieval (IR) technology. Linguistic IR research has a long history of serious but often unsuccessful endeavors; our retrieval experiments, however, generally confirmed a significant performance improvement from these linguistic techniques. The experiments were conducted on a Japanese corpus, so this research also serves as a case study of "linguistic information retrieval" for Japanese, as opposed to English, which has traditionally been the predominant language of study. The methodology adopted in this study is called the grammatical paraphrasing paradigm for query formulation, which translates a formal grammatical relationship into a retrieval strategy. To realize this paradigm, and based on the theory of generative grammar, we developed a class of query strategies that are applied to a sentence in a base query and that cover various valency structures, such as transitivity or intransitivity in the lexicon, or causativization or passivization in syntax. We call this class of strategies valency control strategies. The most distinctive advantage of this method is that it affords two contingent sets of dichotomous views. The first is the valency dichotomy, which reveals the difference in strategic gain between the monovalent (i.e., intransitive and passive) and bivalent (i.e., transitive and causative) strategies. The second is the dichotomy within a system of linguistic components, where the lexical and syntactical modules have separate retrieval mechanisms.
After developing the general framework of valency control strategies from a linguistic background, especially the phenomenon of transitivity alternations, which are extensive in Japanese, we examined its effectiveness in a series of experiments. We found three important results. First, overall, most valency control query strategies considerably improved precision. This means that linguistic knowledge is a highly valuable knowledge source in information retrieval. Second, in the valency dichotomy, the bivalent strategy improved performance, but the monovalent strategy degraded it. This result indicates the usefulness of formally definable grammatical strategies in information retrieval. Third, in the linguistic module dichotomy, despite the conventional wisdom that emphasizes local morpho-lexical information, the syntactical method was effective, as was the lexical method. Two additional experiments, on potentialization and on verbal nouns, were also carried out. The potential query strategy on verbs, which does not change the valency, showed a moderate performance improvement, falling between the bivalent and monovalent strategies. The verbal noun strategies performed less encouragingly than the verb strategies. The genitive verbal noun strategy showed a particularly clear degradation, which probably reflects earlier findings in the literature that phrase recognition achieves only limited retrieval improvement. Finally, this research has a strong practical implication. We ran two sets of experiments: one with the relevance feedback method, the other with the automatic query generation method. The automatic method worked roughly as well as the relevance feedback method. This suggests that our method has significant practical applications because it does not rely on relevance information to improve query performance.
1998-01-01T08:00:00Z text https://scholarworks.umass.edu/dissertations/AAI9823737 Doctoral Dissertations Available from Proquest ENG ScholarWorks@UMass Amherst Computer science
collection NDLTD
language ENG
sources NDLTD
topic Computer science
author Fujii, Hideo
title An investigation of the linguistic characteristics of Japanese information retrieval
publisher ScholarWorks@UMass Amherst
publishDate 1998
url https://scholarworks.umass.edu/dissertations/AAI9823737