Bisecting Document Clustering Using Model-Based Methods
We all have access to large collections of digital text documents, which are useful only if we can make sense of them all and distill important information from them. Good document clustering algorithms that organize such information automatically in meaningful ways can make a difference in how effe...
Main Author: | Davis, Aaron Samuel |
---|---|
Format: | Others |
Published: | BYU ScholarsArchive, 2009 |
Subjects: | document clustering; text mining; model-based; Computer Sciences |
Online Access: | https://scholarsarchive.byu.edu/etd/1938 https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=2937&context=etd |
id |
ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-2937 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-2937; 2021-09-01T05:01:31Z; Bisecting Document Clustering Using Model-Based Methods; Davis, Aaron Samuel; 2009-12-09T08:00:00Z; text; application/pdf; https://scholarsarchive.byu.edu/etd/1938; https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=2937&context=etd; http://lib.byu.edu/about/copyright/; Theses and Dissertations; BYU ScholarsArchive; document clustering; text mining; model-based; Computer Sciences |
collection |
NDLTD |
format |
Others |
sources |
NDLTD |
topic |
document clustering text mining model-based Computer Sciences |
description |
We all have access to large collections of digital text documents, which are useful only if we can make sense of them all and distill important information from them. Good document clustering algorithms that organize such information automatically in meaningful ways can make a difference in how effective we are at using that information. In this paper we use model-based document clustering algorithms as a base for bisecting methods in order to identify increasingly cohesive clusters from larger, more diverse clusters. We specifically use the EM algorithm and Gibbs Sampling on a mixture of multinomials as the base clustering algorithms on three data sets. Additionally, we apply a refinement step, using EM, to the final output of each clustering technique. Our results show improved agreement with human-annotated document classes when compared to the existing base clustering algorithms, with marked improvement in two out of three data sets. |
author |
Davis, Aaron Samuel |
title |
Bisecting Document Clustering Using Model-Based Methods |
publisher |
BYU ScholarsArchive |
publishDate |
2009 |
url |
https://scholarsarchive.byu.edu/etd/1938 https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=2937&context=etd |