The Analysis of Use Optical Character Recognition to Establish the Full-text Retrieval Database:A Case Study of the Anthology of Chinese Literature in Ming

碩士 === 國立政治大學 === 圖書資訊與檔案學研究所 === 105 === Digital Archives, placed in the network system for users to browse, change the collection into the digital images, and can help to preserve the collection and promote the content information. However, in the era of information explosion, Digital Archives can...

Full description

Bibliographic Details
Main Authors: Tsai, Han Wei, 蔡瀚緯
Other Authors: Lin, Chiao Min
Format: Others
Language:zh-TW
Online Access:http://ndltd.ncl.edu.tw/handle/fa92n7
Description
Summary:碩士 === 國立政治大學 === 圖書資訊與檔案學研究所 === 105 === Digital Archives, placed in the network system for users to browse, change the collection into the digital images, and can help to preserve the collection and promote the content information. However, in the era of information explosion, Digital Archives can’t help users to retrieve the information in the collection by simply recording metadata. So, only when built into the full text retrieval can Digital Archives provide users with a quick retrieval of the information they want. And the Optical Character Recognition (OCR) can help to output the full text information. The study explores the ancient books’ format and impact of image quality on the recognition results by recognizing the ancient books of the Ming dynasty with the OCR software. The study also explores institutional as well as individual views and considerations by in-depth interviewing institutional staff with experiences in the full text of Digital Archives plan. From the result we can discover that though the ancient books’ format and image quality do have influences on the recognition results, the overall interview suggests that the technology has overcome the limitation of the format under the high requirement for the image quality; that is, the quality of ancient books’ images is the most influential factor in the recognition results. Although the OCR already has the breakthrough in assisting the establishment of the full text database, most institutions have not yet applied this technology to full-textualization of the Digital Archives due to technical unfamiliar, budget, human resources and other factors. The study suggests that if some day one institution is interested in working on the the full text of the Digital Archives project, it firstly needs to develop a proper SOP and needs to understand the conditions of their ready-to-be-textualized collections so that it can adopt a suitable input mode. Secondly, this institution needs to communicate with the OCR company more so that it can realize whether the chosen collection fits the cost-effectiveness. Finally, under the considerations of both the institution and users, the study suggests that institutions can cooperate with OCR companies in the future, so users can choose collections for OCR recognition on their own and give the full text to the institutions as feedback after proofreading. This can not only understand users’ needs but also reduce the cost of the proofreading for the institution.