Bisecting Document Clustering Using Model-Based Methods

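We all have access to large collections of digital text documents, which are useful only if we can make sense of them all and distill important information from them. Good document clustering algorithms that organize such information automatically in meaningful ways can make a difference in how effective we are at using that information. In this paper we use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs Sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human-annotated document classes when compared to the existing base clustering algorithms, with marked improvement in two out of three data sets.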
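
Illustrative note: the record's description names EM on a mixture of multinomials as one base algorithm for bisecting. The sketch below is a minimal reading of that idea, not the thesis implementation. The function names `em_split` and `bisecting_em`, the smoothing constant `alpha`, the random soft initialisation, and the rule of always splitting the largest cluster are all assumptions made for illustration. The Gibbs Sampling variant and the final EM refinement step mentioned in the description are not shown.

```python
import math
import random


def em_split(docs, vocab_size, iters=50, seed=0, alpha=0.01):
    """Split docs (lists of word counts) in two with EM on a
    two-component mixture of multinomials; returns a 0/1 label per doc."""
    rng = random.Random(seed)
    n = len(docs)
    # Assumed initialisation: random soft responsibilities near 0.5.
    resp = []
    for _ in range(n):
        r = rng.uniform(0.25, 0.75)
        resp.append([r, 1.0 - r])
    for _ in range(iters):
        # M-step: mixing weights and smoothed word probabilities (log space).
        log_pi, log_theta = [], []
        for c in range(2):
            weight = sum(r[c] for r in resp)
            log_pi.append(math.log((weight + alpha) / (n + 2 * alpha)))
            counts = [alpha] * vocab_size
            for r, doc in zip(resp, docs):
                for w, x in enumerate(doc):
                    counts[w] += r[c] * x
            total = sum(counts)
            log_theta.append([math.log(v / total) for v in counts])
        # E-step: posterior over the two components for each document.
        for i, doc in enumerate(docs):
            scores = [
                log_pi[c] + sum(x * log_theta[c][w] for w, x in enumerate(doc))
                for c in range(2)
            ]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            z = sum(exps)
            resp[i] = [e / z for e in exps]
    return [0 if r[0] >= r[1] else 1 for r in resp]


def bisecting_em(docs, vocab_size, k, seed=0):
    """Repeatedly bisect the largest cluster (an assumed selection rule)
    until k clusters exist; returns lists of document indices."""
    clusters = [list(range(len(docs)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # largest cluster
        if len(target) < 2:
            clusters.append(target)
            break
        labels = em_split([docs[i] for i in target], vocab_size, seed=seed)
        left = [i for i, z in zip(target, labels) if z == 0]
        right = [i for i, z in zip(target, labels) if z == 1]
        if not left or not right:
            # Degenerate split: stop rather than loop forever.
            clusters.append(target)
            break
        clusters += [left, right]
    return clusters


if __name__ == "__main__":
    # Hypothetical toy corpus: rows are documents, columns are word counts.
    toy = [[6, 5, 0, 0], [5, 6, 0, 0], [0, 0, 6, 5], [0, 0, 5, 6]]
    print(bisecting_em(toy, vocab_size=4, k=2))
```

Splitting the largest cluster is only one possible choice; a likelihood-based or cohesion-based criterion could be substituted without changing the rest of the sketch.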

Bibliographic Details
Main Author: Davis, Aaron Samuel
Format: Others
Published: BYU ScholarsArchive 2009
Subjects: document clustering; text mining; model-based; Computer Sciences
Online Access: https://scholarsarchive.byu.edu/etd/1938
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=2937&context=etd
id ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-2937
record_format oai_dc
spelling ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-29372021-09-01T05:01:31Z Bisecting Document Clustering Using Model-Based Methods Davis, Aaron Samuel We all have access to large collections of digital text documents, which are useful only if we can make sense of them all and distill important information from them. Good document clustering algorithms that organize such information automatically in meaningful ways can make a difference in how effective we are at using that information. In this paper we use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs Sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human annotated document classes when compared to the existing base clustering algorithms, with marked improvement in two out of three data sets. 2009-12-09T08:00:00Z text application/pdf https://scholarsarchive.byu.edu/etd/1938 https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=2937&context=etd http://lib.byu.edu/about/copyright/ Theses and Dissertations BYU ScholarsArchive document clustering text mining model-based Computer Sciences
collection NDLTD
format Others
sources NDLTD
topic document clustering
text mining
model-based
Computer Sciences
description We all have access to large collections of digital text documents, which are useful only if we can make sense of them all and distill important information from them. Good document clustering algorithms that organize such information automatically in meaningful ways can make a difference in how effective we are at using that information. In this paper we use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs Sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human annotated document classes when compared to the existing base clustering algorithms, with marked improvement in two out of three data sets.
author Davis, Aaron Samuel
title Bisecting Document Clustering Using Model-Based Methods
publisher BYU ScholarsArchive
publishDate 2009
url https://scholarsarchive.byu.edu/etd/1938
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=2937&context=etd