Probabilistic Explicit Topic Modeling
Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information from knowledge sources such as Wikipedia, and it is difficult for humans to interpret due to a lack of meaningful topic labels.

This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives its inference method from first principles. Both methods approximate topic-word distributions a priori using word distributions from Wikipedia articles, with each article corresponding to one topic and the article title serving as the topic label. By this means, LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output.

We assess the effectiveness of LDA-STWD and EDA on three tasks: document classification, topic label generation, and document label generation. Label quality is quantified through user studies. We show that a competing non-probabilistic explicit topic model handily beats both LDA-STWD and EDA as a dimensionality reduction technique in a document classification task. Surprisingly, we find that topic labels from another approach, which combines LDA with post hoc topic labeling (called LDA+Lau), are preferred on one corpus over topic labels prespecified from Wikipedia. Finally, we show that LDA-STWD improves substantially upon the performance of the state of the art in document labeling.
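To make the LDA-STWD idea concrete, here is a minimal sketch of collapsed Gibbs sampling in which the topic-word counts are frozen to values precomputed from Wikipedia articles (one topic per article, the article title as the topic label), so only document-topic counts are resampled. The two-topic toy vocabulary, variable names, and hyperparameter values are illustrative assumptions, not taken from the thesis:

```python
# Illustrative sketch (not the thesis implementation) of the LDA-STWD idea:
# collapsed Gibbs sampling for LDA with the topic-word counts held static,
# as if precomputed from Wikipedia article word counts.
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: rows are "topics" (Wikipedia articles), columns are words.
topic_labels = ["Baseball", "Astronomy"]          # article titles = topic labels
vocab = ["bat", "ball", "pitch", "star", "orbit", "planet"]
static_counts = np.array([[40, 55, 30, 1, 0, 0],   # word counts in "Baseball"
                          [0, 2, 1, 60, 35, 45]],  # word counts in "Astronomy"
                         dtype=float)

K, V = static_counts.shape
alpha, beta = 0.1, 0.01
# Because the counts never change, the topic-word factor is a constant matrix.
phi = (static_counts + beta) / (static_counts.sum(axis=1, keepdims=True) + V * beta)

def gibbs_stwd(doc, iters=200):
    """Resample topic assignments for one document (a list of word ids)."""
    z = rng.integers(K, size=len(doc))             # random initial assignments
    ndk = np.bincount(z, minlength=K).astype(float)  # document-topic counts
    for _ in range(iters):
        for i, w in enumerate(doc):
            ndk[z[i]] -= 1                         # remove current assignment
            p = (ndk + alpha) * phi[:, w]          # collapsed conditional
            z[i] = rng.choice(K, p=p / p.sum())
            ndk[z[i]] += 1
    return ndk / ndk.sum()                         # document-topic proportions

doc = [vocab.index(w) for w in ["ball", "pitch", "bat", "ball"]]
theta = gibbs_stwd(doc)
print(topic_labels[int(np.argmax(theta))])         # expected: "Baseball"
```

Because the topic-word factor is constant, the per-token conditional reduces to the document-topic count (plus alpha) times a fixed word probability under each topic, which is why the approach can reuse standard Gibbs sampling machinery with only a minimal change.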
Main Author: | Hansen, Joshua Aaron |
---|---|
Format: | Others |
Published: | BYU ScholarsArchive, 2013 |
Subjects: | topic modeling; machine learning; Wikipedia; Computer Sciences |
Online Access: | https://scholarsarchive.byu.edu/etd/4027 https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=5026&context=etd |