Improving the Decision Value of Hierarchical Text Clustering Using Term Overlap Detection

Humans are used to expressing themselves in written language, which provides a medium for describing our experiences in detail and with individuality. Even though documents provide a rich source of information, it becomes very difficult to identify, extract, summarize and se...

Bibliographic Details
Main Authors: Nilupulee Nathawitharana, Damminda Alahakoon, Sumith Matharage
Format: Article
Language: English
Published: Australasian Association for Information Systems 2015-09-01
Series: Australasian Journal of Information Systems
Subjects: Term overlap; Growing Self Organizing Map; Hierarchical clustering; Text document clustering
Online Access: http://journal.acs.org.au/index.php/ajis/article/view/1180
id doaj-a0ebc6158491468e9438889d30cdbcd9
record_format Article
spelling doaj-a0ebc6158491468e9438889d30cdbcd9; 2021-08-02T10:53:44Z; eng; Australasian Association for Information Systems; Australasian Journal of Information Systems; 1449-8618; 1449-8618; 2015-09-01; 19; 0; 10.3127/ajis.v19i0.1180; 542; Improving the Decision Value of Hierarchical Text Clustering Using Term Overlap Detection; Nilupulee Nathawitharana (0); Damminda Alahakoon (1); Sumith Matharage; La Trobe University; La Trobe University; http://journal.acs.org.au/index.php/ajis/article/view/1180; Term overlap; Growing Self Organizing Map; Hierarchical clustering; Text document clustering
collection DOAJ
language English
format Article
sources DOAJ
author Nilupulee Nathawitharana
Damminda Alahakoon
Sumith Matharage
title Improving the Decision Value of Hierarchical Text Clustering Using Term Overlap Detection
publisher Australasian Association for Information Systems
series Australasian Journal of Information Systems
issn 1449-8618
publishDate 2015-09-01
description Humans are used to expressing themselves in written language, which provides a medium for describing our experiences in detail and with individuality. Even though documents provide a rich source of information, it becomes very difficult to identify, extract, summarize and search that information when vast numbers of documents are collected, especially over time. Document clustering is a technique that has been widely used to group documents based on similarity of content as represented by the words used. Once key groups are identified, further drill-down into sub-groupings is facilitated by hierarchical clustering. Clustering and hierarchical clustering are very useful when applied to numerical and categorical data, and cluster accuracy and purity measures exist to evaluate the outcomes of a clustering exercise. Although the same measures have been applied to text clustering, text clusters are based on words or terms, which can be repeated across documents associated with different topics. Text data therefore cannot be considered a direct ‘coding’ of a particular experience or situation, in contrast to numerical and categorical data, and term overlap is a very common characteristic in text clustering. In this paper we propose a new technique and methodology for capturing term overlap from text documents, highlight the different situations such overlap could signify, and discuss why such understanding is important for obtaining value from text clustering. Experiments were conducted on a widely used text document collection, where the proposed methodology allowed exploring the term diversity of the collection and obtaining clusters with minimum term overlap.
topic Term overlap
Growing Self Organizing Map
Hierarchical clustering
Text document clustering
url http://journal.acs.org.au/index.php/ajis/article/view/1180