Experimenting with automatic concept identification in documents

Bibliographic Details
Main Author: Mi, Andy Lu
Format: Others
Language: English
Published: 2009
Online Access: http://hdl.handle.net/2429/14359
id ndltd-UBC-oai-circle.library.ubc.ca-2429-14359
record_format oai_dc
spelling ndltd-UBC-oai-circle.library.ubc.ca-2429-14359 2018-01-05T17:37:13Z Experimenting with automatic concept identification in documents Mi, Andy Lu
Business, Sauder School of; Management Information Systems, Division of; Graduate
2009-10-29T17:55:54Z 2009-10-29T17:55:54Z 2003 2003-11 Text Thesis/Dissertation http://hdl.handle.net/2429/14359 eng
For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use.
5135191 bytes application/pdf
collection NDLTD
language English
format Others
sources NDLTD
description Many businesses in a wide range of industries are collecting and using large sets of text documents with information on their own operations and their commercial environments, and consequently they are becoming increasingly interested in tools like Automatic Concept Identification to help them manage their data files. This paper expands research into domain concepts and the processes of identifying them automatically from document collections with unrestricted yet narrowly defined domains. An automatic, scalable and consistent model of concept identification is proposed, integrating automatic text indexing techniques (for example stop-wording, stemming and phrase formation), a newly developed affinity measure, and Agglomerative Hierarchical Clustering techniques. To test the results of the proposed approach quantitatively, a system based on the proposed model has been developed and implemented, and three sensitivity studies have been conducted against three collections of technical white papers. This study contributes to the development of a word pair-wise affinity measure based on word co-occurrence, the distance between words being evaluated, and a variety of selection criteria and thresholds for index terms (e.g. Total Frequency and Document Frequency). This study's results concerning concept identification demonstrate that the proposed model generally delivers positive concept identification outcomes. The results of the sensitivity studies provide empirical evidence regarding the effects on concept identification outcomes generated by different index term selection thresholds, different sizes of co-occurrence windows, and different characteristics of document collections.
Business, Sauder School of; Management Information Systems, Division of; Graduate
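The pipeline outlined in the description (index terms, a pair-wise affinity based on co-occurrence and word distance, then agglomerative clustering) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the record does not give the thesis's actual affinity formula, so a simple sum of 1/distance inside a fixed co-occurrence window stands in for it, stemming and phrase formation are omitted, and the names `index_terms`, `affinity`, `cluster`, `window` and `threshold` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Toy stop-word list; a real system would use a much larger one (assumption).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}


def index_terms(text):
    """Lowercase, split on whitespace, keep alphabetic non-stop words."""
    return [w for w in text.lower().split() if w.isalpha() and w not in STOP_WORDS]


def affinity(docs, window=3):
    """Stand-in affinity (NOT the thesis's measure): for each pair of distinct
    terms at most `window` positions apart, add 1/distance to the pair's score."""
    scores = defaultdict(float)
    for doc in docs:
        terms = index_terms(doc)
        for i, left in enumerate(terms):
            for j in range(i + 1, min(i + window + 1, len(terms))):
                if left != terms[j]:
                    scores[frozenset((left, terms[j]))] += 1.0 / (j - i)
    return scores


def cluster(scores, threshold):
    """Average-link agglomerative clustering: repeatedly merge the two clusters
    with the highest mean pair-wise affinity until it falls below `threshold`."""
    clusters = [frozenset([t]) for t in sorted({t for pair in scores for t in pair})]

    def link(a, b):
        total = sum(scores.get(frozenset((x, y)), 0.0) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, a, b = max(
            ((link(a, b), a, b) for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        if best < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


if __name__ == "__main__":
    docs = [
        "neural network training",
        "neural network inference",
        "database index tuning",
    ]
    for concept in cluster(affinity(docs), threshold=0.4):
        print(sorted(concept))
```

In this sketch, `window` plays the role of the co-occurrence window size and `threshold` the role of a stopping criterion; the sensitivity studies mentioned in the description vary parameters of this kind, though not necessarily these exact ones.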
author Mi, Andy Lu
title Experimenting with automatic concept identification in documents
publishDate 2009
url http://hdl.handle.net/2429/14359