Self-Supervised Contextual Data Augmentation for Natural Language Processing

In this paper, we propose a novel data augmentation method that respects the target context of the data via self-supervised learning. Instead of looking for exact synonyms of masked words, the proposed method finds words that can replace the original words given the surrounding context. For self-supervised learning, we employ the masked language model (MLM), which masks a specific word within a sentence and predicts the original word; the MLM learns the context of a sentence through its asymmetrical inputs and outputs. Rather than using the existing MLM as-is, we propose a label-masked language model (LMLM), which injects label information into the mask tokens so that the MLM can be applied effectively to labeled data. The augmentation method first performs self-supervised learning with the LMLM and then generates augmented data with the trained model. We demonstrate through several experiments that the proposed method improves the classification accuracy of recurrent neural network- and convolutional neural network-based classifiers on text classification benchmarks, including the Stanford Sentiment Treebank-5 (SST5), Stanford Sentiment Treebank-2 (SST2), Subjectivity (Subj), Multi-Perspective Question Answering (MPQA), Movie Reviews (MR), and Text Retrieval Conference (TREC) datasets. In addition, because the proposed method uses no external data, it eliminates the time spent collecting external data or pre-training on external data.
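To make the LMLM idea concrete, the following is a minimal sketch (not the authors' implementation) of how label-masked inputs could be built and then used for augmentation. It assumes a simple whitespace tokenizer, and the callable lmlm_predict is a hypothetical stand-in for a model fine-tuned on label-prepended masked sentences.

import random

MASK = "[MASK]"

def make_lmlm_example(sentence, label, seed=None):
    # Prepend the class label and mask one randomly chosen word.
    # Returns (masked_input, target_word): a training pair that teaches
    # the model to recover target_word from the label plus the context.
    # Assumes the label is a single token without spaces.
    rng = random.Random(seed)
    tokens = sentence.split()
    i = rng.randrange(len(tokens))
    target = tokens[i]
    tokens[i] = MASK
    return label + " " + " ".join(tokens), target

def augment(sentence, label, lmlm_predict, n=3):
    # Generate n augmented sentences by re-masking the input and letting
    # the trained LMLM propose a context- and label-aware replacement.
    # lmlm_predict(masked_input) -> str is an assumed interface around
    # the fine-tuned model; it returns one predicted word for the mask.
    out = []
    for k in range(n):
        masked, _ = make_lmlm_example(sentence, label, seed=k)
        prediction = lmlm_predict(masked)
        # Strip the label prefix and fill the mask with the prediction.
        out.append(masked.split(" ", 1)[1].replace(MASK, prediction))
    return out

if __name__ == "__main__":
    masked, target = make_lmlm_example("the plot is gripping", "positive", seed=0)
    print(masked, "->", target)

In the paper's setup, the LMLM itself would first be trained on pairs like these; the sketch abstracts that step behind lmlm_predict.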

Bibliographic Details
Main Authors: Dongju Park, Chang Wook Ahn (both: Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju 61005, Korea)
Format: Article
Language: English
Published: MDPI AG 2019-11-01
Series: Symmetry (ISSN 2073-8994)
Subjects: data augmentation; self-supervised learning; natural language processing; text classification
Online Access: https://www.mdpi.com/2073-8994/11/11/1393