The Analysis of Use Optical Character Recognition to Establish the Full-text Retrieval Database:A Case Study of the Anthology of Chinese Literature in Ming

Master's thesis === National Chengchi University === Graduate Institute of Library, Information and Archival Studies === 105 === Digital archives convert collections into digital images and place them on networked systems for users to browse, which helps preserve the collections and promote their content. However, in the era of information explosion, Digital Archives can...


Bibliographic Details
Main Authors: Tsai, Han Wei, 蔡瀚緯
Other Authors: Lin, Chiao Min
Format: Others
Language: zh-TW
Online Access: http://ndltd.ncl.edu.tw/handle/fa92n7
id ndltd-TW-105NCCU5447031
record_format oai_dc
spelling ndltd-TW-105NCCU5447031 2019-05-15T23:39:15Z http://ndltd.ncl.edu.tw/handle/fa92n7 The Analysis of Use Optical Character Recognition to Establish the Full-text Retrieval Database:A Case Study of the Anthology of Chinese Literature in Ming 運用光學字元辨識技術建置數位典藏全文資料庫之評估:以明人文集為例 Tsai, Han Wei 蔡瀚緯 Master's thesis, National Chengchi University, Graduate Institute of Library, Information and Archival Studies, 105 Lin, Chiao Min 林巧敏 thesis 173 zh-TW
collection NDLTD
language zh-TW
format Others
sources NDLTD
description Master's thesis === National Chengchi University === Graduate Institute of Library, Information and Archival Studies === 105 === Digital archives convert collections into digital images and place them on networked systems for users to browse, which helps preserve the collections and promote their content. However, in an era of information explosion, recording metadata alone does not let users retrieve the information contained within the collections. Only when Digital Archives support full-text retrieval can users quickly find the information they want, and Optical Character Recognition (OCR) can help produce that full text. This study examines how the format of ancient books and the quality of their images affect recognition results by running OCR software on ancient books of the Ming dynasty. It also explores institutional and individual views and considerations through in-depth interviews with institutional staff who have experience with full-text Digital Archives projects. The results show that both format and image quality influence recognition, but the interviews suggest that, given sufficiently high image quality, the technology has overcome the limitations imposed by format; that is, the quality of the ancient books' images is the most influential factor in the recognition results. Although OCR has made breakthroughs in assisting the creation of full-text databases, most institutions have not yet applied it to the full-text conversion of their Digital Archives because of unfamiliarity with the technology, budget, staffing, and other factors.
The study suggests that an institution interested in a full-text Digital Archives project should first develop a proper standard operating procedure (SOP) and understand the condition of the collections to be converted so that it can adopt a suitable input mode. Second, the institution should communicate more with the OCR company to determine whether converting the chosen collection is cost-effective. Finally, taking both institutions and users into account, the study suggests that institutions cooperate with OCR companies in the future so that users can select collections for OCR themselves and return the proofread full text to the institution as feedback. This would both reveal users' needs and reduce the institution's proofreading costs.
author2 Lin, Chiao Min
author_facet Lin, Chiao Min
Tsai, Han Wei
蔡瀚緯
author Tsai, Han Wei
蔡瀚緯
author_sort Tsai, Han Wei
title The Analysis of Use Optical Character Recognition to Establish the Full-text Retrieval Database:A Case Study of the Anthology of Chinese Literature in Ming
url http://ndltd.ncl.edu.tw/handle/fa92n7
_version_ 1719150209355743232