Summary: | Master's === National Taiwan University === Graduate Institute of Engineering Science and Ocean Engineering === 106 === Since the advent of computers, historical documents have commonly been scanned into image files. However, image files cannot easily be searched by keyword, so their contents must be converted into machine-readable text. Optical character recognition (OCR) is the key step in this conversion, but document images must be pre-processed before OCR can run smoothly. This study attempts to automate pre-processing steps, such as page segmentation and component connection, for The Crystal, or “Jing Bao,” published in the early 20th century, when the young Republic had just been established in China.
Page segmentation has been studied for many years, but existing methods rely on sufficient blank space between components to separate them, which makes them unsuitable for The Crystal with its compact layout. This study proposes a method for detecting boundaries in The Crystal using a convolutional neural network (CNN). The positions of the boundaries separate the components from one another and thereby achieve page segmentation. Writing-direction-based component connection links all the components that belong to the same article. This study connects components by visiting each one along the direction of writing and starts a new article whenever a title is encountered. Finally, this study combines five steps, including the two methods above, into a digitization process for The Crystal: page segmentation, component classification, punctuation removal, text recognition, and component connection.
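The title-triggered connection rule described above can be illustrated with a minimal sketch. It is not the thesis implementation: the `Component` fields, the `connect_components` name, and the reading order (columns visited right to left, then top to bottom, as in traditional vertical Chinese typesetting) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """A segmented page component (hypothetical structure)."""
    x: int      # left edge of the bounding box, in pixels
    y: int      # top edge of the bounding box, in pixels
    kind: str   # "title" or "text", as given by component classification
    label: str  # identifier used only for this illustration


def connect_components(components: List[Component]) -> List[List[Component]]:
    """Group components into articles by visiting them in writing order.

    Assumed writing order: right to left across the page, then top to bottom.
    A new article starts whenever a title component is encountered.
    """
    ordered = sorted(components, key=lambda c: (-c.x, c.y))
    articles: List[List[Component]] = []
    for comp in ordered:
        # Open a new article at each title, or if no article exists yet.
        if comp.kind == "title" or not articles:
            articles.append([])
        articles[-1].append(comp)
    return articles


page = [
    Component(x=100, y=50, kind="text", label="b1"),
    Component(x=300, y=0, kind="title", label="A"),
    Component(x=100, y=0, kind="title", label="B"),
    Component(x=200, y=0, kind="text", label="a1"),
]
grouped = [[c.label for c in article] for article in connect_components(page)]
```

A simple sort like this ignores multi-row page layouts; a real page would likely need the visiting order to follow the detected boundaries as well.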
The proposed page segmentation method achieves a mean IoU (intersection over union) of 83.98% on the components in a single page of The Crystal. The component connection method connects only 9 of 13 articles successfully, but the erroneously connected area is small. These results confirm that the proposed method can effectively segment the pages of densely arranged publications and demonstrate the effectiveness of the component connection method.
|
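For readers unfamiliar with the mean IoU metric reported in the summary, the following is a minimal sketch of how it can be computed. The summary does not say whether IoU was measured on bounding boxes or pixel masks, so this sketch assumes axis-aligned boxes given as `(x1, y1, x2, y2)` and assumes that predicted and ground-truth components are already paired; the function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 < x2 and y1 < y2


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs: List[Tuple[Box, Box]]) -> float:
    """Average IoU over already-matched (predicted, ground-truth) pairs."""
    if not pairs:
        return 0.0
    return sum(box_iou(p, g) for p, g in pairs) / len(pairs)


example = mean_iou([
    ((0, 0, 2, 2), (0, 0, 2, 2)),  # identical boxes
    ((0, 0, 2, 2), (5, 5, 6, 6)),  # disjoint boxes
])
```

A mean IoU of 83.98% would then mean that, averaged over components, the overlap between a predicted and a reference region covers about 84% of their combined area.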