Semi-supervised topic models applied to mathematical document classification

Our objective is to build a mathematical document classifier: a machine which for a given mathematical document $\mathbf{x}$, determines the mathematical subject area $\cc$. In particular, we wish to construct the function $f$ such that $f(\mathbf{x}, \TTheta) = \cc$ where $f$ requires the possibly...

Bibliographic Details
Main Author: Evans, Ieuan
Other Authors: Davenport, James ; Hall, Peter
Published: University of Bath 2017
Subjects:
004
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.715299
id ndltd-bl.uk-oai-ethos.bl.uk-715299
record_format oai_dc
spelling ndltd-bl.uk-oai-ethos.bl.uk-715299 2019-03-14T03:24:50Z Electronic Thesis or Dissertation
collection NDLTD
sources NDLTD
topic 004
description Our objective is to build a mathematical document classifier: a machine which, for a given mathematical document $\mathbf{x}$, determines the mathematical subject area $\cc$. In particular, we wish to construct the function $f$ such that $f(\mathbf{x}, \TTheta) = \cc$, where $f$ requires the possibly unknown parameters $\TTheta$, which may be estimated using an existing corpus of labelled documents. The novelty here is that our proposed classifiers will observe a mathematical document over dual vocabularies: as a collection of both words and mathematical symbols. In this thesis, we predominantly review the claim made in \cite{Watt} that mathematical document classification is possible via symbol frequency analysis. In particular, we investigate whether this claim is justified, since \cite{Watt} contains no experimental evidence to support it. Furthermore, we extend this research and investigate whether the inclusion of mathematical notational information improves classification accuracy over the existing single-vocabulary approaches. To do so, we review a selection of machine learning methods for document classification, refine and extend these models to incorporate mathematical notational information, and investigate whether the extended models yield higher classification performance than the existing word-only versions. In this research, we develop the novel mathematical document models ``Dual Latent Dirichlet Allocation'' and ``Dual Pachinko Allocation'', which are extensions of the existing topic models ``Latent Dirichlet Allocation'' and ``Pachinko Allocation'' respectively. Our proposed models observe mathematical documents over two separate vocabularies (words and mathematical symbols). Furthermore, we present Online Variational Bayes for Pachinko Allocation and for our proposed models, to allow for fast parameter estimation over a single pass of the data. We perform a systematic analysis of these models and verify the claims made in \cite{Watt}. Furthermore, we observe that the inclusion of symbol data via Dual Pachinko Allocation only yields an increase in classification performance over the single-vocabulary variants and the prior art in this field.
author2 Davenport, James ; Hall, Peter
author Evans, Ieuan
title Semi-supervised topic models applied to mathematical document classification
publisher University of Bath
publishDate 2017
url https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.715299