Experimenting with automatic concept identification in documents

Bibliographic Details
Main Author: Mi, Andy Lu
Format: Others
Language: English
Published: 2009
Online Access: http://hdl.handle.net/2429/14359
Description
Summary: Many businesses in a wide range of industries are collecting and using large sets of text documents with information on their own operations and their commercial environments, and consequently they are becoming increasingly interested in tools like Automatic Concept Identification to help them manage their data files. This paper expands research into domain concepts and the processes of identifying them automatically from document collections with unrestricted yet narrowly defined domains. An automatic, scalable and consistent model of concept identification is proposed, integrating automatic text indexing techniques (for example stop-wording, stemming and phrase formation), a newly developed affinity measure, and Agglomerative Hierarchical Clustering techniques. To test the results of the proposed approach quantitatively, a system based on the proposed model has been developed and implemented, and three sensitivity studies have been conducted against three collections of technical white papers. This study contributes to the development of a word pair-wise affinity measure based on word co-occurrence, the distance between words being evaluated, and a variety of selection criteria and thresholds for index terms (e.g. Total Frequency and Document Frequency). This study's results concerning concept identification demonstrate that the proposed model generally delivers positive concept identification outcomes. The results of the sensitivity studies provide empirical evidence regarding the effects on concept identification outcomes generated by different index term selection thresholds, different sizes of co-occurrence windows, and different characteristics of document collections.
Business, Sauder School of; Management Information Systems, Division of; Graduate
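The pipeline named in the summary (index term selection, a pair-wise affinity over a co-occurrence window, then agglomerative clustering) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the record does not give the affinity formula, so the inverse-distance weighting below is an assumption, stemming and phrase formation are omitted, and every name (`index_terms`, `affinities`, `cluster`, `window`, `min_total_freq`, `threshold`) is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Tiny illustrative stop-word list (an assumption, not the thesis's list).
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def index_terms(text):
    """Lowercase, strip basic punctuation, and drop stop words (no stemming)."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return [w for w in words if w and w not in STOP_WORDS]

def affinities(docs, window=3, min_total_freq=2):
    """Assumed affinity: sum of 1/distance over co-occurrences inside the window.

    Terms below the Total Frequency threshold are discarded afterwards.
    """
    total_freq = defaultdict(int)
    raw = defaultdict(float)
    for doc in docs:
        terms = index_terms(doc)
        for t in terms:
            total_freq[t] += 1
        for i, a in enumerate(terms):
            for d in range(1, window + 1):
                if i + d < len(terms) and terms[i + d] != a:
                    pair = tuple(sorted((a, terms[i + d])))
                    raw[pair] += 1.0 / d
    keep = {t for t, f in total_freq.items() if f >= min_total_freq}
    kept = {p: s for p, s in raw.items() if p[0] in keep and p[1] in keep}
    return keep, kept

def cluster(terms, aff, threshold=1.0):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two clusters with the highest mean pair-wise
    affinity and stops once the best available link falls below threshold.
    """
    clusters = [frozenset([t]) for t in sorted(terms)]

    def link(c1, c2):
        total = sum(aff.get(tuple(sorted((a, b))), 0.0) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = max(combinations(clusters, 2), key=lambda p: link(*p))
        if link(*best) < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return clusters

docs = [
    "data warehouse design",
    "data warehouse query",
    "network security policy",
    "network security audit",
]
terms, aff = affinities(docs)
concepts = cluster(terms, aff)
```

In this toy collection, raising `min_total_freq` or shrinking `window` changes which terms survive and which pairs receive any affinity, which is the kind of effect the sensitivity studies in the summary examine.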