Experimenting with automatic concept identification in documents
Many businesses in a wide range of industries are collecting and using large sets of text documents with information on their own operations and their commercial environments, and consequently they are becoming increasingly interested in tools like Automatic Concept Identification to help them ma...
Main Author: | Mi, Andy Lu |
---|---|
Format: | Others |
Language: | English |
Published: | 2009 |
Online Access: | http://hdl.handle.net/2429/14359 |
id |
ndltd-UBC-oai-circle.library.ubc.ca-2429-14359 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-UBC-oai-circle.library.ubc.ca-2429-14359 2018-01-05T17:37:13Z Experimenting with automatic concept identification in documents Mi, Andy Lu Business, Sauder School of Management Information Systems, Division of Graduate 2009-10-29T17:55:54Z 2009-10-29T17:55:54Z 2003 2003-11 Text Thesis/Dissertation http://hdl.handle.net/2429/14359 eng For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. 5135191 bytes application/pdf |
collection |
NDLTD |
language |
English |
format |
Others |
sources |
NDLTD |
description |
Many businesses in a wide range of industries are collecting and using large sets of text
documents with information on their own operations and their commercial environments,
and consequently they are becoming increasingly interested in tools like Automatic
Concept Identification to help them manage their data files. This paper expands research
into domain concepts and the processes of identifying them automatically from document
collections with unrestricted yet narrowly defined domains. An automatic, scalable and
consistent model of concept identification is proposed, integrating automatic text
indexing techniques (for example stop-wording, stemming and phrase formation), a
newly developed affinity measure, and Agglomerative Hierarchical Clustering
techniques. To test the results of the proposed approach quantitatively, a system based on
the proposed model has been developed and implemented, and three sensitivity studies
have been conducted against three collections of technical white papers.
This study contributes to the development of a word pair-wise affinity measure based on
word co-occurrence, the distance between words being evaluated, and a variety of
selection criteria and thresholds for index terms (e.g. Total Frequency and Document
Frequency). This study's results concerning concept identification demonstrate that the
proposed model generally delivers positive concept identification outcomes. The results
of the sensitivity studies provide empirical evidence regarding the effects on concept
identification outcomes generated by different index term selection thresholds, different
sizes of co-occurrence windows, and different characteristics of document collections.
Business, Sauder School of; Management Information Systems, Division of; Graduate |
author |
Mi, Andy Lu |
title |
Experimenting with automatic concept identification in documents |
publishDate |
2009 |
url |
http://hdl.handle.net/2429/14359 |
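
The description in this record names the stages of the proposed model (index term selection by Total Frequency and Document Frequency thresholds, a pairwise affinity measure built from word co-occurrence and word distance, and Agglomerative Hierarchical Clustering) but does not give their formulas. The following is only a minimal sketch of how such a pipeline might fit together, not the thesis's implementation. Everything specific in it is an assumption made for illustration: the toy stop-word list, the 1/distance weighting, average linkage, the threshold values, and all function and parameter names. Stemming and phrase formation are left out for brevity.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Toy stop-word list (assumption; a real system would use a much larger one).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def tokenize(text):
    """Lower-case, split on whitespace, strip punctuation, drop stop words."""
    words = (w.strip(".,;:()") for w in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]

def select_index_terms(docs, min_total_freq=2, min_doc_freq=2):
    """Keep terms that meet both a Total Frequency and a Document Frequency threshold."""
    total = Counter(t for d in docs for t in d)
    doc_freq = Counter(t for d in docs for t in set(d))
    return {t for t in total
            if total[t] >= min_total_freq and doc_freq[t] >= min_doc_freq}

def affinity(docs, terms, window=3):
    """Sum 1/distance for each pair of index terms within `window` tokens.

    Distance is counted in token positions after stop-word removal.
    The 1/distance weighting is a hypothetical stand-in for the
    affinity measure, which the record does not specify.
    """
    scores = defaultdict(float)
    for doc in docs:
        for i, left in enumerate(doc):
            for j in range(i + 1, min(i + window + 1, len(doc))):
                right = doc[j]
                if left in terms and right in terms and left != right:
                    scores[frozenset((left, right))] += 1.0 / (j - i)
    return scores

def cluster(terms, scores, min_affinity=1.0):
    """Average-linkage agglomerative clustering on affinity scores.

    Repeatedly merges the two clusters with the highest mean pairwise
    affinity and stops when no pair reaches `min_affinity`.
    """
    clusters = [frozenset([t]) for t in sorted(terms)]

    def link(a, b):
        pair_sum = sum(scores.get(frozenset((x, y)), 0.0) for x in a for y in b)
        return pair_sum / (len(a) * len(b))

    while len(clusters) > 1:
        a, b = max(combinations(clusters, 2), key=lambda pair: link(*pair))
        if link(a, b) < min_affinity:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Usage on a tiny made-up collection (not data from the thesis).
docs = [tokenize(t) for t in [
    "Text mining finds concepts in text.",
    "Text mining is scalable.",
    "Network routing protocols.",
    "Network routing tables.",
]]
terms = select_index_terms(docs)
scores = affinity(docs, terms, window=3)
concepts = cluster(terms, scores)
print(concepts)
```

Raising `min_total_freq` or `min_doc_freq`, or changing `window`, changes which terms survive and how strongly they link. Those are the kinds of parameters the record says the three sensitivity studies varied, although the values used there are not given here.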