Filtering explicit text content with deep learning techniques based paraphrasing


Bibliographic Details
Main Authors: Chi-Chang Hsieh, 謝其璋
Other Authors: Shih-Wen Ke
Format: Others
Language: zh-TW
Published: 2019
Online Access: http://ndltd.ncl.edu.tw/handle/gse596
Description
Summary: Master's === Chung Yuan Christian University === Graduate Institute of Information and Computer Engineering === 107 === The rise of social media has made information spread faster than it does through newspapers or magazines. It also means that online text may be full of pornography, violence, drugs, racial discrimination, gender discrimination, and so on. Unlike movies or books, explicit text has no rating system, and text on the Internet is not filtered in advance; explicit text is often removed only by an administrator or through a reporting system. The purpose of this study is to use deep learning to filter explicit text and paraphrase it. Because no parallel corpus containing explicit text exists, we combine an explicit text dataset with parallel corpora from the Quora, CoCo, and MSRP datasets to produce a corpus with explicit text. Deep learning models are then trained on this corpus to filter explicit text and paraphrase it. In Experiment 1, we trained Residual LSTM, LSTM, and GRU models and compared them using the BLEU and ROUGE automatic evaluation metrics. In Experiment 2, we built a questionnaire from the Quora dataset, sent it to 5 subjects for manual evaluation, and compared the results with those of Experiment 1. The results show that our methods can effectively remove explicit text, but the paraphrasing quality has room for improvement. In the automatic evaluation of Experiment 1, GRU outperformed Residual LSTM and LSTM. In the questionnaire-based manual evaluation of Experiment 2, the subjects' ratings of the Residual LSTM were highly consistent.
In paraphrase evaluation, automatic evaluation can serve only as a first step: it can confirm that a sentence is complete, but determining which deep learning method is better still requires manual evaluation.
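To illustrate the kind of automatic evaluation mentioned above, the following is a minimal sketch of a sentence-level BLEU score in plain Python. It is not the thesis's evaluation code: the function names, the choice of `max_n=2`, the absence of smoothing, and the example sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram is credited at most
    as many times as it occurs in the reference."""
    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def sentence_bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU with uniform weights up to max_n and a brevity
    penalty. Returns 0.0 if any n-gram precision is zero (no smoothing)."""
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical example sentences (not taken from the thesis data).
reference = "how can i learn to cook well".split()
output_a = "how can i learn to cook".split()
output_b = "what is cooking".split()

score_a = sentence_bleu(output_a, reference)  # high n-gram overlap, slightly short
score_b = sentence_bleu(output_b, reference)  # no shared unigrams
```

A score like this rewards n-gram overlap with a reference sentence only. It cannot tell whether a paraphrase preserves meaning, which is consistent with the abstract's point that manual evaluation is still needed to choose between models.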