Summary: | Master's === National Central University === Department of Computer Science and Information Engineering === 104 === Natural language processing (NLP) for classical Chinese is very challenging because of the lack of resources. Existing work has focused mainly on named entity recognition (NER), sentence segmentation, and word segmentation, and much remains to be done before a fine-grained event extraction system for classical Chinese can be built.
Current event extraction methods require the target event type to be specified in advance, which is a high barrier for historical texts. The lack of word boundaries and POS tags is another obvious obstacle to applying these methods. We therefore develop a tool that classifies paragraphs into event categories, which will make it easier to develop new extraction tools. We first use the Paragraph Vector model to embed the texts and apply unsupervised text clustering to group paragraphs by event type. We then use the categorized data to train an automatic text classifier.
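The embed, cluster, then classify pipeline described above can be illustrated with a minimal, dependency-free sketch. It is not the thesis implementation. L2-normalised character bigram counts stand in for the Paragraph Vector model, a small k-means routine stands in for the unspecified clustering method, and a nearest-centroid rule stands in for the unspecified classifier. All function names and the toy strings are hypothetical. The strings are not taken from the Ming Shilu and contain no spaces, mimicking text without word boundaries.

```python
import math
from collections import Counter

def bigrams(text):
    # Character bigrams need no word boundaries or POS tags.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def build_vocab(paragraphs):
    return sorted({g for p in paragraphs for g in bigrams(p)})

def embed(text, vocab):
    # Assumption: stand-in for Paragraph Vector (L2-normalised bigram counts).
    counts = bigrams(text)
    vec = [float(counts[g]) for g in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iters=20):
    # Deterministic farthest-first initialisation, then Lloyd iterations.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(dist2(v, c) for c in centroids)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(v, centroids[j]))
                  for v in vectors]
        for j in range(k):
            members = [v for v, l in zip(vectors, labels) if l == j]
            if members:
                centroids[j] = mean(members)
    return labels

def train_classifier(vectors, labels):
    # Nearest-centroid classifier fitted on the cluster labels.
    groups = {}
    for v, l in zip(vectors, labels):
        groups.setdefault(l, []).append(v)
    return {l: mean(vs) for l, vs in groups.items()}

def predict(model, vector):
    return min(model, key=lambda l: dist2(vector, model[l]))

# Invented toy "paragraphs" forming two obvious groups.
paragraphs = ["abababab", "ababab", "babab", "xyxyxyxy", "xyxyxy", "yxyxy"]
vocab = build_vocab(paragraphs)
vectors = [embed(p, vocab) for p in paragraphs]
labels = kmeans(vectors, k=2)              # unsupervised event-type grouping
model = train_classifier(vectors, labels)  # classifier trained on cluster labels
```

A new paragraph is then categorized with `predict(model, embed(new_text, vocab))`. The cluster labels carry no names of their own, so in practice each cluster would still need to be inspected and given an event-type name.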
In this thesis, we propose an unsupervised event type identification approach based on paragraph embedding and apply it to the Ming Shilu, focusing on events involving “wei-so”. We also develop a web interface that gives users an overview of how an event unfolds. We believe such a tool can help historians systematically analyze the evolution of historical events. This system also offers a new research direction for mining historical texts and lays a foundation for future work on event extraction from historical texts.
|