Probabilistic Explicit Topic Modeling

Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information in knowledge sources such as Wikipedia and is difficult for humans to interpret due to a lack of meaningful topic labels. This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives the inference method from first principles. Both methods approximate topic-word distributions a priori using word distributions from Wikipedia articles, with each article corresponding to one topic and the article title serving as the topic label. By this means, LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess the effectiveness of LDA-STWD and EDA on three tasks: document classification, topic label generation, and document label generation. Label quality is quantified through user studies. We show that a competing non-probabilistic explicit topic model handily beats both LDA-STWD and EDA as a dimensionality reduction technique in a document classification task. Surprisingly, we find that topic labels from another approach using LDA and post hoc topic labeling (called LDA+Lau) are preferred on one corpus over topic labels prespecified from Wikipedia. Finally, we show that LDA-STWD improves substantially upon the performance of the state of the art in document labeling.

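The abstract's key mechanism: topic-word counts are precomputed from Wikipedia articles (one topic per article, the title doubling as the topic label) and held fixed while the existing collapsed Gibbs sampler updates only document-topic assignments. Below is a minimal sketch of that idea under illustrative assumptions; it is not the thesis code, and the toy "articles" and all names are stand-ins.

```python
# A minimal sketch of the LDA-STWD idea described above, NOT the thesis
# code: topic-word counts are precomputed from Wikipedia-style articles
# (one topic per article, the title doubling as the topic label) and held
# fixed while a collapsed Gibbs sampler updates only the document-topic
# counts. All names and the toy "articles" are illustrative assumptions.
import random
from collections import Counter

def static_topic_word_counts(articles):
    """articles: dict mapping article title -> list of tokens."""
    labels = list(articles)
    counts = [Counter(articles[t]) for t in labels]   # fixed topic-word counts
    totals = [sum(c.values()) for c in counts]        # fixed per-topic totals
    return labels, counts, totals

def gibbs_stwd(docs, counts, totals, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling with static topic-word counts.

    Unlike standard LDA, the topic-word factor never changes, so topic k
    denotes the same Wikipedia article in every run and on every corpus."""
    rng = random.Random(seed)
    K = len(counts)
    V = len({w for c in counts for w in c})
    z = [[rng.randrange(K) for _ in doc] for doc in docs]  # token topics
    ndk = [[0] * K for _ in docs]                          # doc-topic counts
    for d, doc in enumerate(docs):
        for i, _ in enumerate(doc):
            ndk[d][z[d][i]] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                ndk[d][z[d][i]] -= 1
                # Document-topic factor is the usual collapsed-LDA term;
                # topic-word factor uses the precomputed static counts.
                weights = [(ndk[d][k] + alpha)
                           * (counts[k][w] + beta) / (totals[k] + beta * V)
                           for k in range(K)]
                r = rng.random() * sum(weights)
                k_new, acc = 0, weights[0]
                while acc < r:
                    k_new += 1
                    acc += weights[k_new]
                z[d][i] = k_new
                ndk[d][k_new] += 1
    return ndk

# Toy stand-ins for Wikipedia articles; titles double as topic labels.
articles = {
    "Machine learning": "model data training model inference data".split(),
    "Basketball": "ball court team score ball game".split(),
}
labels, counts, totals = static_topic_word_counts(articles)
docs = ["data model training".split(), "team score ball game".split()]
print(labels, gibbs_stwd(docs, counts, totals))
```

Freezing the topic-word counts is the design choice doing the work: it is what lets article titles act as stable, human-readable topic labels, addressing the nonidentifiability and uninterpretability problems the abstract raises.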

Bibliographic Details
Main Author: Hansen, Joshua Aaron
Format: Others
Published: BYU ScholarsArchive 2013
Subjects: topic modeling; machine learning; Wikipedia; Computer Sciences
Online Access:https://scholarsarchive.byu.edu/etd/4027
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=5026&context=etd
id ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-5026
record_format oai_dc
spelling ndltd-BGMYU2-oai-scholarsarchive.byu.edu-etd-50262019-05-16T03:20:05Z Probabilistic Explicit Topic Modeling Hansen, Joshua Aaron Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information in knowledge sources such as Wikipedia and is difficult for humans to interpret due to a lack of meaningful topic labels. This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives the inference method from first principles. Both methods approximate topic-word distributions a priori using word distributions from Wikipedia articles, with each article corresponding to one topic and the article title serving as the topic label. By this means, LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess the effectiveness of LDA-STWD and EDA on three tasks: document classification, topic label generation, and document label generation. Label quality is quantified through user studies. We show that a competing non-probabilistic explicit topic model handily beats both LDA-STWD and EDA as a dimensionality reduction technique in a document classification task. Surprisingly, we find that topic labels from another approach using LDA and post hoc topic labeling (called LDA+Lau) are preferred on one corpus over topic labels prespecified from Wikipedia. Finally, we show that LDA-STWD improves substantially upon the performance of the state of the art in document labeling. 2013-04-21T07:00:00Z text application/pdf https://scholarsarchive.byu.edu/etd/4027 https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=5026&context=etd http://lib.byu.edu/about/copyright/ All Theses and Dissertations BYU ScholarsArchive topic modeling machine learning Wikipedia Computer Sciences
collection NDLTD
format Others
sources NDLTD
topic topic modeling
machine learning
Wikipedia
Computer Sciences
spellingShingle topic modeling
machine learning
Wikipedia
Computer Sciences
Hansen, Joshua Aaron
Probabilistic Explicit Topic Modeling
description Latent Dirichlet Allocation (LDA) is widely used for automatic discovery of latent topics in document corpora. However, output from analysis using an LDA topic model suffers from a lack of identifiability between topics, not only across corpora but across runs of the algorithm. The output is also isolated from enriching information in knowledge sources such as Wikipedia and is difficult for humans to interpret due to a lack of meaningful topic labels. This thesis introduces two methods for probabilistic explicit topic modeling that address these issues: Latent Dirichlet Allocation with Static Topic-Word Distributions (LDA-STWD) and Explicit Dirichlet Allocation (EDA). LDA-STWD directly substitutes precomputed counts for LDA topic-word counts, leveraging existing Gibbs sampler inference. EDA defines an entirely new explicit topic model and derives the inference method from first principles. Both methods approximate topic-word distributions a priori using word distributions from Wikipedia articles, with each article corresponding to one topic and the article title serving as the topic label. By this means, LDA-STWD and EDA overcome the nonidentifiability, isolation, and uninterpretability of LDA output. We assess the effectiveness of LDA-STWD and EDA on three tasks: document classification, topic label generation, and document label generation. Label quality is quantified through user studies. We show that a competing non-probabilistic explicit topic model handily beats both LDA-STWD and EDA as a dimensionality reduction technique in a document classification task. Surprisingly, we find that topic labels from another approach using LDA and post hoc topic labeling (called LDA+Lau) are preferred on one corpus over topic labels prespecified from Wikipedia. Finally, we show that LDA-STWD improves substantially upon the performance of the state of the art in document labeling.
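A hedged restatement of the sampler change this description implies, in generic collapsed-Gibbs notation (the symbols are assumptions of this sketch, not taken from the thesis):

\[
P(z_{di} = k \mid \mathbf{z}_{-di}, \mathbf{w}) \;\propto\; \left(n_{dk}^{-di} + \alpha\right) \cdot \frac{\tilde{N}_{k,w_{di}} + \beta}{\tilde{N}_{k} + \beta V}
\]

where \(\tilde{N}_{k,w}\) is the fixed count of word \(w\) in the Wikipedia article defining topic \(k\), \(\tilde{N}_{k}\) is that article's length, \(n_{dk}^{-di}\) counts document \(d\)'s tokens currently assigned to topic \(k\) excluding position \(i\), and \(V\) is the vocabulary size. Standard LDA would use corpus-dependent counts \(N_{kw}\) that change every sweep in place of \(\tilde{N}_{k,w}\); holding them fixed is what makes topics identifiable across corpora and runs.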
author Hansen, Joshua Aaron
author_facet Hansen, Joshua Aaron
author_sort Hansen, Joshua Aaron
title Probabilistic Explicit Topic Modeling
title_short Probabilistic Explicit Topic Modeling
title_full Probabilistic Explicit Topic Modeling
title_fullStr Probabilistic Explicit Topic Modeling
title_full_unstemmed Probabilistic Explicit Topic Modeling
title_sort probabilistic explicit topic modeling
publisher BYU ScholarsArchive
publishDate 2013
url https://scholarsarchive.byu.edu/etd/4027
https://scholarsarchive.byu.edu/cgi/viewcontent.cgi?article=5026&context=etd
work_keys_str_mv AT hansenjoshuaaaron probabilisticexplicittopicmodeling
_version_ 1719185267071385600