Probabilistic Explicit Topic Modeling

Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information in knowledge sources such as Wikipedia and is difficult for humans to interpret due to a lack of meaningful topic labels. This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives the inference method from first principles. Both methods approximate topic-word distributions a priori using word distributions from Wikipedia articles, with each article corresponding to one topic and the article title serving as the topic label. By this means, LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess the effectiveness of LDA-STWD and EDA on three tasks: document classification, topic label generation, and document label generation. Label quality is quantified through user studies. We show that a competing non-probabilistic explicit topic model handily beats both LDA-STWD and EDA as a dimensionality reduction technique in a document classification task. Surprisingly, we find that topic labels from another approach using LDA and post hoc topic labeling (called LDA+Lau) are preferred on one corpus over topic labels prespecified from Wikipedia. Finally, we show that LDA-STWD improves substantially upon the performance of the state of the art in document labeling.

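The abstract's key mechanism: topic-word counts are precomputed from Wikipedia articles (one topic per article, the title doubling as the topic label) and held fixed while the existing collapsed Gibbs sampler updates only document-topic assignments. Below is a minimal sketch of that idea under illustrative assumptions; it is not the thesis code, and the toy "articles" and all names are stand-ins.

```python
# A minimal sketch of the LDA-STWD idea described above, NOT the thesis
# code: topic-word counts are precomputed from Wikipedia-style articles
# (one topic per article, the title doubling as the topic label) and held
# fixed while a collapsed Gibbs sampler updates only the document-topic
# counts. All names and the toy "articles" are illustrative assumptions.
import random
from collections import Counter

def static_topic_word_counts(articles):
    """articles: dict mapping article title -> list of tokens."""
    labels = list(articles)
    counts = [Counter(articles[t]) for t in labels]   # fixed topic-word counts
    totals = [sum(c.values()) for c in counts]        # fixed per-topic totals
    return labels, counts, totals

def gibbs_stwd(docs, counts, totals, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling with static topic-word counts.

    Unlike standard LDA, the topic-word factor never changes, so topic k
    denotes the same Wikipedia article in every run and on every corpus."""
    rng = random.Random(seed)
    K = len(counts)
    V = len({w for c in counts for w in c})
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # token topics
    ndk = [[0] * K for _ in docs]                          # doc-topic counts
    for d, doc in enumerate(docs):
        for i, _ in enumerate(doc):
            ndk[d][z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                ndk[d][z[d][i]] -= 1
                # Document-topic factor is the usual collapsed-LDA term;
                # topic-word factor uses the precomputed static counts.
                weights = [(ndk[d][k] + alpha)
                           * (counts[k][w] + beta) / (totals[k] + beta * V)
                           for k in range(K)]
                r = rng.random() * sum(weights)
                k_new, acc = 0, weights[0]
                while acc < r:
                    k_new += 1
                    acc += weights[k_new]
                z[d][i] = k_new
                ndk[d][k_new] += 1
    return ndk

# Toy stand-ins for Wikipedia articles; titles double as topic labels.
articles = {
    "Machine learning": "model data training model inference data".split(),
    "Basketball": "ball court team score ball game".split(),
}
labels, counts, totals = static_topic_word_counts(articles)
docs = ["data model training".split(), "team score ball game".split()]
print(labels, gibbs_stwd(docs, counts, totals))
```

Freezing the topic-word counts is the design choice doing the work: it is what lets article titles act as stable, human-readable topic labels, addressing the nonidentifiability and uninterpretability problems the abstract raises.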

Bibliographic Details
Main Author: Hansen, Joshua Aaron
Format: Others
Published: BYU ScholarsArchive 2013
Subjects: topic modeling; machine learning; Wikipedia; Computer Sciences
Online Access:https://scholarsarchive.byu.edu/etd/4027
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=5026&context=etd
id ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-5026
record_format oai_dc
spelling ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-50262019-05-16T03:20:05Z Probabilistic Explicit Topic Modeling Hansen, Joshua Aaron Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information in knowledge sources such as Wikipedia and is difficult for humans to interpret due to a lack of meaningful topic labels. This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives the inference method from first principles. Both methods approximate topic-word distributions a priori using word distributions from Wikipedia articles, with each article corresponding to one topic and the article title serving as the topic label. By this means, LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess the effectiveness of LDA-STWD and EDA on three tasks: document classification, topic label generation, and document label generation. Label quality is quantified through user studies. We show that a competing non-probabilistic explicit topic model handily beats both LDA-STWD and EDA as a dimensionality reduction technique in a document classification task. Surprisingly, we find that topic labels from another approach using LDA and post hoc topic labeling (called LDA+Lau) are preferred on one corpus over topic labels prespecified from Wikipedia. Finally, we show that LDA-STWD improves substantially upon the performance of the state of the art in document labeling. 2013-04-21T07:00:00Z text application/pdf https://scholarsarchive.byu.edu/etd/4027 https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=5026&context=etd http://lib.byu.edu/about/copyright/ All Theses and Dissertations BYU ScholarsArchive topic modeling machine learning Wikipedia Computer Sciences
collection NDLTD
format Others
sources NDLTD
topic topic modeling
machine learning
Wikipedia
Computer Sciences
spellingShingle topic modeling
machine learning
Wikipedia
Computer Sciences
Hansen, Joshua Aaron
Probabilistic Explicit Topic Modeling
description Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information in knowledge sources such as Wikipedia and is difficult for humans to interpret due to a lack of meaningful topic labels. This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives the inference method from first principles. Both methods approximate topic-word distributions a priori using word distributions from Wikipedia articles, with each article corresponding to one topic and the article title serving as the topic label. By this means, LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess the effectiveness of LDA-STWD and EDA on three tasks: document classification, topic label generation, and document label generation. Label quality is quantified through user studies. We show that a competing non-probabilistic explicit topic model handily beats both LDA-STWD and EDA as a dimensionality reduction technique in a document classification task. Surprisingly, we find that topic labels from another approach using LDA and post hoc topic labeling (called LDA+Lau) are preferred on one corpus over topic labels prespecified from Wikipedia. Finally, we show that LDA-STWD improves substantially upon the performance of the state of the art in document labeling.
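A hedged restatement of the sampler change this description implies, in generic collapsed-Gibbs notation (the symbols are assumptions of this sketch, not taken from the thesis):

\[
P(z_{di} = k \mid \mathbf{z}_{-di}, \mathbf{w}) \;\propto\; \left(n_{dk}^{-di} + \alpha\right) \cdot \frac{\tilde{N}_{k,w_{di}} + \beta}{\tilde{N}_{k} + \beta V}
\]

where \(\tilde{N}_{k,w}\) is the fixed count of word \(w\) in the Wikipedia article defining topic \(k\), \(\tilde{N}_{k}\) is that article's length, \(n_{dk}^{-di}\) counts document \(d\)'s tokens currently assigned to topic \(k\) excluding position \(i\), and \(V\) is the vocabulary size. Standard LDA would use corpus-dependent counts \(N_{kw}\) that change every sweep in place of \(\tilde{N}_{k,w}\); holding them fixed is what makes topics identifiable across corpora and runs.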
author Hansen, Joshua Aaron
author_facet Hansen, Joshua Aaron
author_sort Hansen, Joshua Aaron
title Probabilistic Explicit Topic Modeling
title_short Probabilistic Explicit Topic Modeling
title_full Probabilistic Explicit Topic Modeling
title_fullStr Probabilistic Explicit Topic Modeling
title_full_unstemmed Probabilistic Explicit Topic Modeling
title_sort probabilistic explicit topic modeling
publisher BYU ScholarsArchive
publishDate 2013
url https://scholarsarchive.byu.edu/etd/4027
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=5026&context=etd
work_keys_str_mv AT hansenjoshuaaaron probabilisticexplicittopicmodeling
_version_ 1719185267071385600