The Rare Word Issue in Natural Language Generation: A Character-Based Solution

In this paper, we analyze the problem of generating fluent English utterances from tabular data, focusing on the development of a sequence-to-sequence neural model with two major features: the ability to read and generate text character-wise, and the ability to switch between generating characters and copying them from the input, an essential capability when inputs contain rare words such as proper names, telephone numbers, or foreign words. Working with characters instead of words is challenging: it can make training harder and increase the error probability during inference. Nevertheless, our work shows that these issues can be overcome, and the effort is repaid by a fully end-to-end system whose inputs and outputs are not constrained to a predefined vocabulary, as they are in word-based models. Furthermore, our copying technique is integrated with an innovative shift mechanism, which enhances the ability to produce outputs directly from inputs. We assess performance on the E2E dataset, the benchmark used for the E2E NLG challenge, and on a modified version of it created to highlight the rare-word copying capabilities of our model. The results demonstrate clear improvements over the baseline and promising performance compared to recent techniques in the literature.
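
As a concrete illustration of the copy/generate switch described in the abstract, below is a minimal sketch of a character-level decoding step in the style of pointer-generator networks. This is not the authors' implementation: the module names, tensor shapes, and mixing formula (a sigmoid switch blending a softmax generator with attention mass scattered over the input characters) are illustrative assumptions only, and the paper's shift mechanism is not modeled here.

```python
# Minimal sketch of a character-level "generate vs. copy" decoding step,
# in the spirit of pointer-generator networks. NOT the authors' model:
# names, dimensions, and the mixing formula are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CharCopyDecoderStep(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRUCell(hidden_size, hidden_size)
        self.attn = nn.Linear(hidden_size, hidden_size, bias=False)
        self.gen_proj = nn.Linear(2 * hidden_size, vocab_size)
        self.switch = nn.Linear(2 * hidden_size, 1)  # soft copy-vs-generate gate

    def forward(self, prev_char, hidden, enc_states, src_chars):
        # prev_char:  (B,)   previously emitted character ids
        # hidden:     (B, H) decoder hidden state
        # enc_states: (B, S, H) encoder states over the input characters
        # src_chars:  (B, S) input character ids (possible copy targets)
        hidden = self.rnn(self.char_emb(prev_char), hidden)

        # Attention over the input characters.
        scores = torch.bmm(enc_states, self.attn(hidden).unsqueeze(2)).squeeze(2)  # (B, S)
        attn = F.softmax(scores, dim=1)
        context = torch.bmm(attn.unsqueeze(1), enc_states).squeeze(1)              # (B, H)

        features = torch.cat([hidden, context], dim=1)
        p_gen_vocab = F.softmax(self.gen_proj(features), dim=1)  # generation distribution
        p_copy = torch.sigmoid(self.switch(features))            # copy probability in [0, 1]

        # Copy distribution: scatter attention mass onto the characters that
        # actually occur in the input, then mix with the generator.
        copy_dist = torch.zeros_like(p_gen_vocab).scatter_add(1, src_chars, attn)
        p_char = (1 - p_copy) * p_gen_vocab + p_copy * copy_dist
        return p_char, hidden, attn


# Tiny smoke test with random tensors (batch of 2, input of 7 characters).
step = CharCopyDecoderStep(vocab_size=64, hidden_size=32)
probs, h, _ = step(
    prev_char=torch.randint(0, 64, (2,)),
    hidden=torch.zeros(2, 32),
    enc_states=torch.randn(2, 7, 32),
    src_chars=torch.randint(0, 64, (2, 7)),
)
print(probs.shape)  # torch.Size([2, 64]); each row is a distribution over characters
```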


Bibliographic Details
Main Authors: Giovanni Bonetta, Marco Roberti, Rossella Cancelliere, Patrick Gallinari
Author Affiliations: Department of Computer Science, University of Turin, 10149 Turin, Italy (Bonetta, Roberti, Cancelliere); Laboratoire d’Informatique de Paris 6, Sorbonne University, CNRS, 75005 Paris, France (Gallinari)
Format: Article
Language: English
Published: MDPI AG, 2021-03-01
Series: Informatics
ISSN: 2227-9709
DOI: 10.3390/informatics8010020
Subjects: data-to-text generation; deep learning; sequence-to-sequence models; natural language processing
Online Access: https://www.mdpi.com/2227-9709/8/1/20