Research on the Image Processing and Recognition Technology of Chinese Publications in the Early 20th Century

Master's thesis === National Taiwan University === Graduate Institute of Engineering Science and Ocean Engineering === 106 === Since the advent of computers, historical documents have commonly been scanned into image files. However, these image files cannot easily be searched by keyword, so they must be converted into machine-readable text. Optical character recognition (OCR) is...


Bibliographic Details
Main Authors: Sher-Win Chen, 陳善文
Other Authors: 黃乾綱
Format: Others
Language: zh-TW
Published: 2018
Online Access: http://ndltd.ncl.edu.tw/handle/be48d9
id ndltd-TW-106NTU05345029
record_format oai_dc
spelling ndltd-TW-106NTU05345029 2019-05-30T03:50:44Z http://ndltd.ncl.edu.tw/handle/be48d9 Research on the Image Processing and Recognition Technology of Chinese Publications in the Early 20th Century 民初漢字出版品數位化技術之研究 Sher-Win Chen 陳善文 Master's thesis National Taiwan University Graduate Institute of Engineering Science and Ocean Engineering 106 黃乾綱 2018 thesis 70 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis === National Taiwan University === Graduate Institute of Engineering Science and Ocean Engineering === 106 === Since the advent of computers, historical documents have commonly been scanned into image files. However, these image files cannot easily be searched by keyword, so they must be converted into machine-readable text. Optical character recognition (OCR) is the key to this process, but document images must be pre-processed before OCR can perform well. This study attempts to automate the pre-processing steps, such as page segmentation and component connection, for The Crystal, or “Jing Bao,” published in the early 20th century, when the young Republic had just been founded in China. Page segmentation has been studied for many years, but existing methods rely on sufficient blank space between components to separate them, which makes them unsuitable for The Crystal because of its compact layout. This study proposes a method that uses a convolutional neural network (CNN) to detect boundaries in The Crystal. The positions of the boundaries separate the components from one another and thereby achieve page segmentation. Writing-direction-based component connection links all the components that belong to the same article. This study connects components by visiting each one along the writing direction and starts a new article whenever it encounters a title. Finally, this study combines five methods, including the two above, into a digitization process for The Crystal: page segmentation, component classification, punctuation removal, text recognition, and component connection. The proposed page segmentation method achieves a mean IoU (intersection over union) of 83.98% on the components of a single page of The Crystal. With the component connection method, only 9 out of 13 articles are connected successfully, but the erroneous area is small. These results confirm that the proposed page segmentation method can effectively segment the pages of closely arranged publications, and they also demonstrate the effectiveness of the component connection method.
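Two ideas in the description can be illustrated with a minimal sketch: the mean IoU score used to evaluate page segmentation, and the rule of starting a new article whenever a title is encountered along the writing direction. This is not the thesis's implementation. It assumes axis-aligned rectangular components, predicted and ground-truth boxes already matched one-to-one, and a simple right-to-left, then top-to-bottom reading order. The names `Box`, `Component`, `mean_iou`, and `connect_components`, and the labels "title" and "text", are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(predicted: Sequence[Box], truth: Sequence[Box]) -> float:
    """Average IoU over pairs assumed to be matched one-to-one by index."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not predicted:
        return 0.0
    return sum(iou(p, t) for p, t in zip(predicted, truth)) / len(predicted)


@dataclass
class Component:
    box: Box
    kind: str  # assumed labels: "title" or "text"


def connect_components(components: Sequence[Component]) -> List[List[Component]]:
    """Group components into articles, opening a new article at each title.

    Assumption: components are read right to left (by right edge),
    then top to bottom (by top edge). Real layouts need a richer ordering.
    """
    ordered = sorted(components, key=lambda c: (-c.box[2], c.box[1]))
    articles: List[List[Component]] = []
    for comp in ordered:
        if comp.kind == "title" or not articles:
            articles.append([])
        articles[-1].append(comp)
    return articles
```

For example, `iou((0, 0, 2, 2), (1, 1, 3, 3))` has an intersection of 1 and a union of 7, giving 1/7. Sorting by a single edge is only adequate for a simple column layout; a page as dense as the one described would likely need the detected boundaries to decide which component comes next.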
author2 黃乾綱
author_facet 黃乾綱
Sher-Win Chen
陳善文
author Sher-Win Chen
陳善文
spellingShingle Sher-Win Chen
陳善文
Research on the Image Processing and Recognition Technology of Chinese Publications in the Early 20th Century
author_sort Sher-Win Chen
title Research on the Image Processing and Recognition Technology of Chinese Publications in the Early 20th Century
publishDate 2018
url http://ndltd.ncl.edu.tw/handle/be48d9