Machine Translation of Mathematical Text

We have implemented a machine translation system, the PolyMath Translator, for LaTeX documents containing mathematical text. The current implementation translates English LaTeX to French LaTeX, attaining a BLEU score of 53.6 on a held-out test corpus of mathematical sentences. It produces LaTeX docu...

Bibliographic Details
Main Authors: Aditya Ohri, Tanya Schmah
Format: Article
Language: English
Published: IEEE 2021-01-01
Series: IEEE Access
Subjects: Machine translation; natural language processing; multi-layer neural network; LaTeX
Online Access: https://ieeexplore.ieee.org/document/9369381/
id doaj-50a08e5d402341d98f0e6bf7609fb299
record_format Article
spelling doaj-50a08e5d402341d98f0e6bf7609fb299
timestamp 2021-03-30T14:58:16Z
volume 9
pages 38078-38086
doi 10.1109/ACCESS.2021.3063715
article_number 9369381
orcid Aditya Ohri: https://orcid.org/0000-0002-2045-1576
Tanya Schmah: https://orcid.org/0000-0002-0404-8824
affiliation Department of Mathematics and Statistics, University of Ottawa, Ottawa, ON, Canada
collection DOAJ
language English
format Article
sources DOAJ
author Aditya Ohri
Tanya Schmah
title Machine Translation of Mathematical Text
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2021-01-01
description We have implemented a machine translation system, the PolyMath Translator, for LaTeX documents containing mathematical text. The current implementation translates English LaTeX to French LaTeX, attaining a BLEU score of 53.6 on a held-out test corpus of mathematical sentences. It produces LaTeX documents that can be compiled to PDF without further editing. The system first converts the body of an input LaTeX document into English sentences containing math tokens, using the pandoc universal document converter to parse LaTeX input. We have trained a Transformer-based translator model, using OpenNMT, on a combined corpus containing a small proportion of domain-specific sentences. Our full system uses this Transformer model and also Google Translate with a custom glossary, the latter being used as a backup to better handle linguistic features that do not appear in our training dataset. Google Translate is used when the Transformer model does not have confidence in its translation, as determined by a high perplexity score. Ablation testing demonstrates that the tokenization of symbolic expressions is essential to the high quality of translations produced by our system. We have published our test corpus of mathematical text. The PolyMath Translator is available as a web service at http://www.polymathtrans.ai.
topic Machine translation
natural language processing
multi-layer neural network
LaTeX
url https://ieeexplore.ieee.org/document/9369381/
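
The description above outlines two mechanisms: replacing symbolic expressions with math tokens before translation, and falling back to a backup translator when the primary model's perplexity is high. The Python sketch below illustrates both under loud assumptions; it is not the authors' code. The regex-based masking stands in for the pandoc-based parsing the record mentions, the `MATH0`-style placeholder format and the `threshold` value are hypothetical, and `primary` and `backup` are stand-in callables for the OpenNMT Transformer model and Google Translate respectively.

```python
import math
import re
from typing import Callable, List, Tuple

# Simplified: only matches inline math of the form $...$ (an assumption;
# the real system parses LaTeX with pandoc rather than a regex).
INLINE_MATH = re.compile(r"\$[^$]+\$")
PLACEHOLDER = re.compile(r"MATH(\d+)")


def mask_math(sentence: str) -> Tuple[str, List[str]]:
    """Replace each inline math span with a placeholder token MATH0, MATH1, ..."""
    spans: List[str] = []

    def repl(match: "re.Match[str]") -> str:
        spans.append(match.group(0))
        return f"MATH{len(spans) - 1}"

    return INLINE_MATH.sub(repl, sentence), spans


def unmask_math(sentence: str, spans: List[str]) -> str:
    """Put the original math spans back; unknown placeholders are left untouched."""

    def repl(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        return spans[index] if index < len(spans) else match.group(0)

    return PLACEHOLDER.sub(repl, sentence)


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity as exp of the negative mean natural-log token probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def translate(
    sentence: str,
    primary: Callable[[str], Tuple[str, List[float]]],  # returns (text, token log-probs)
    backup: Callable[[str], str],
    threshold: float = 4.0,  # hypothetical cut-off, not taken from the paper
) -> str:
    """Translate with the primary model; use the backup when perplexity is high."""
    masked, spans = mask_math(sentence)
    text, logprobs = primary(masked)
    if perplexity(logprobs) > threshold:
        text = backup(masked)
    return unmask_math(text, spans)
```

Because the math spans are masked before either translator sees the sentence, the same placeholders can be restored regardless of which translator produced the output.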