Summary: | Master's === National Chiao Tung University === Institute of Computer Science and Engineering === 106 === With the rise of data-driven approaches, the shortage of corpora has become a main obstacle in natural language processing research. Publicly available Mandarin corpora are even scarcer than English ones. This work aims to address the problem by using an existing question answering dataset and a knowledge base to create a new Mandarin question answering dataset.
In this study, we first collect knowledge base data from CN-DBpedia and question answering data from WebQA and a web crawler, and propose a method to combine them into pairs that serve as our training data. We then use a sequence-to-sequence model to generate questions from the knowledge base. The generated questions are paired with knowledge base entities as their answers to create a new Mandarin question answering dataset. In our experiments, we develop a template-based question generation baseline and compare our model against it through human evaluation. Our model achieves acceptable performance compared with the template-based baseline.
|