Semi-supervised topic models applied to mathematical document classification

Bibliographic Details
Main Author: Evans, Ieuan
Other Authors: Davenport, James ; Hall, Peter
Published: University of Bath 2017
Subjects:
004
Online Access: https://ethos.bl.uk/OrderDetails.do?uin=uk.bl.ethos.715299
Description
Summary: Our objective is to build a mathematical document classifier: a machine which, for a given mathematical document $\mathbf{x}$, determines the mathematical subject area $\cc$. In particular, we wish to construct the function $f$ such that $f(\mathbf{x}, \TTheta) = \cc$, where $f$ requires the possibly unknown parameters $\TTheta$, which may be estimated using an existing corpus of labelled documents. The novelty here is that our proposed classifiers observe a mathematical document over dual vocabularies, that is, as a collection of both words and mathematical symbols. In this thesis, we predominantly review the claim made in \cite{Watt} that mathematical document classification is possible via symbol frequency analysis. In particular, we investigate whether this claim is justified, since \cite{Watt} contains no experimental evidence to support it. Furthermore, we extend this research and investigate whether the inclusion of mathematical notational information improves classification accuracy over the existing single-vocabulary approaches. To do so, we review a selection of machine learning methods for document classification, refine and extend these models to incorporate mathematical notational information, and investigate whether the extended models yield higher classification performance than the existing word-only versions. In this research, we develop the novel mathematical document models ``Dual Latent Dirichlet Allocation'' and ``Dual Pachinko Allocation'', which are extensions of the existing topic models ``Latent Dirichlet Allocation'' and ``Pachinko Allocation'' respectively. Our proposed models observe mathematical documents over two separate vocabularies (words and mathematical symbols). Furthermore, we present Online Variational Bayes for Pachinko Allocation and for our proposed models, to allow fast parameter estimation in a single pass over the data.
We perform a systematic analysis of these models and verify the claims made in \cite{Watt}. Furthermore, we observe that the inclusion of symbol data via Dual Pachinko Allocation only yields an increase in classification performance over the single-vocabulary variants and the prior art in this field.
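
The dual-vocabulary view described in the summary can be illustrated with a minimal sketch. This is not the preprocessing used in the thesis: the helper name `dual_bag`, the regular expressions, and the assumption that mathematics appears as inline `$...$` spans are all hypothetical choices made for illustration. It only shows how one document might be split into a bag of words and a separate bag of mathematical symbols.

```python
import re
from collections import Counter

# Illustrative assumptions (not from the thesis):
# math is written as inline $...$ spans, and a "symbol" is either
# a LaTeX command such as \alpha, a single punctuation/operator
# character, or a single letter or digit.
MATH_SPAN = re.compile(r"\$(.+?)\$")
SYMBOL = re.compile(r"\\[A-Za-z]+|[^\sA-Za-z0-9{}]|[A-Za-z0-9]")
WORD = re.compile(r"[A-Za-z]+")


def dual_bag(document: str) -> tuple[Counter, Counter]:
    """Return (word_counts, symbol_counts) for a single document."""
    # Collect the contents of every inline math span.
    math_spans = MATH_SPAN.findall(document)
    # Blank out the math so its letters are not counted as words.
    text_only = MATH_SPAN.sub(" ", document)
    words = Counter(w.lower() for w in WORD.findall(text_only))
    symbols = Counter(s for span in math_spans for s in SYMBOL.findall(span))
    return words, symbols


words, symbols = dual_bag(r"Let $f(x) = \alpha x$ be a linear map.")
print(words)
print(symbols)
```

The two counters could then be passed to any classifier that accepts a pair of count vectors; how the thesis models combine them is not specified in this record.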