Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems
Master's === National Chiao Tung University === Department of Computer Science and Information Engineering === 88 === This thesis proposes a general Chinese document processing system consisting of three modules: preprocessing, recognition kernel, and postprocessing. In the preprocessing module, input images may have small skew angles. These skew angles...
Main Authors: | Zhao, San-Lung (趙善隆) |
---|---|
Other Authors: | Lee, Hsi-Jian (李錫堅) |
Format: | Others |
Language: | en_US |
Published: | 2000 |
Online Access: | http://ndltd.ncl.edu.tw/handle/69736609054301724780 |
id |
ndltd-TW-088NCTU0392070 |
---|---|
record_format |
oai_dc |
spelling |
ndltd-TW-088NCTU0392070 2015-10-13T10:59:52Z http://ndltd.ncl.edu.tw/handle/69736609054301724780 Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems 中文文件處理系統中使用之多核心辨識方法與簡化型語言模式 Zhao, San-Lung 趙善隆 Master's National Chiao Tung University Department of Computer Science and Information Engineering 88 Lee, Hsi-Jian 李錫堅 2000 thesis 61 en_US |
collection |
NDLTD |
language |
en_US |
format |
Others |
sources |
NDLTD |
description |
Master's === National Chiao Tung University === Department of Computer Science and Information Engineering === 88 === This thesis proposes a general Chinese document processing system consisting of three modules: preprocessing, recognition kernel, and postprocessing. In the preprocessing module, input images may have small skew angles, which degrade both character segmentation and character recognition. A skew angle detection method is used, and a modified rotation transform is proposed to deskew document images. Because the recognition engines operate on sentences and characters, document images are segmented into text blocks, text lines, and character images. After punctuation marks are detected among the character images, sentences are constructed from the character images.
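The abstract does not say how text lines are extracted. As an illustration only, the sketch below uses a horizontal projection profile, a common technique that may differ from the thesis's method. The function name `segment_text_lines` and the list-of-rows image format (1 = ink, 0 = background) are assumptions made for this example, and the page is assumed to be already deskewed.

```python
def segment_text_lines(image):
    """Return (top, bottom) row ranges of text lines in a binary image.

    `image` is a list of rows; each row is a list of 0/1 pixels (1 = ink).
    A text line is a maximal run of consecutive rows that contain ink.
    This is a generic projection-profile sketch, not the thesis's algorithm.
    """
    lines = []
    start = None  # first row of the line currently being scanned, if any
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(image) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
print(segment_text_lines(page))  # [(1, 2), (5, 5)]
```

Applying the same scan to the columns of each line gives a rough character segmentation. Residual skew merges neighboring lines in the profile, which is why skew correction comes first.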
In the recognition module, two recognition engines recognize the character images. Contour directional features and crossing count features are selected for kernel 1, and Oka's cellular features and peripheral background area features are selected for kernel 2. The weights of these kernels and features depend on the relative stroke widths of the character images, which provide a measure of character image quality. The features for the recognition engines are trained from a character image database selected from document images. To make the training features more robust and increase the recognition rate, bad features, rather than bad images, are removed during feature training.
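To make two of these ideas concrete, the sketch below computes a basic crossing count feature and combines the candidate scores of two kernels with a weight. It is a minimal illustration under stated assumptions: `crossing_counts` and `fuse_scores` are hypothetical names, the crossing count is taken as the number of background-to-ink transitions per row and column, and the weight `w1` is passed in directly because the abstract does not give the mapping from relative stroke width to weight.

```python
def crossing_counts(glyph):
    """Count background-to-ink (0 -> 1) transitions along each row and column.

    `glyph` is a binary character image as a list of rows (1 = ink).
    Returns the row counts followed by the column counts.
    """
    def transitions(pixels):
        count, prev = 0, 0
        for pixel in pixels:
            if pixel and not prev:
                count += 1
            prev = pixel
        return count

    rows = [transitions(row) for row in glyph]
    cols = [transitions(col) for col in zip(*glyph)]
    return rows + cols


def fuse_scores(scores_k1, scores_k2, w1):
    """Linearly combine per-candidate scores from two kernels.

    `w1` weights kernel 1 and `1 - w1` weights kernel 2. How `w1` would be
    derived from stroke width is NOT specified in the abstract; it is an input.
    """
    candidates = set(scores_k1) | set(scores_k2)
    return {
        c: w1 * scores_k1.get(c, 0.0) + (1 - w1) * scores_k2.get(c, 0.0)
        for c in candidates
    }


glyph = [
    [1, 0, 1],
    [1, 1, 1],
    [1, 0, 1],
]
print(crossing_counts(glyph))  # [2, 1, 2, 1, 1, 1]

k1 = {"日": 0.6, "目": 0.4}   # made-up scores from kernel 1
k2 = {"日": 0.3, "目": 0.7}   # made-up scores from kernel 2
fused = fuse_scores(k1, k2, w1=0.75)
print(max(fused, key=fused.get))  # 日
```

With `w1=0.25` the same scores favor 目, which shows how a quality-dependent weight can change the final decision.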
In the postprocessing module, a simplified language model is used. The model comprises word selection bound setting, matching order establishment, fast word matching, and most-confident word selection. Using this model speeds up processing.
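The abstract names the steps of the language model without defining them. The sketch below is one possible reading, not the thesis's algorithm: the selection bound is taken as a top-N cutoff on each position's candidate list, word matching is a lexicon lookup, and the most confident word is the match with the highest summed confidence. The name `best_word`, the candidate format, and the tiny lexicon are all hypothetical. Exhaustive enumeration stands in for the fast matching step.

```python
from itertools import product


def best_word(candidates, lexicon, bound=3):
    """Pick the lexicon word best supported by the recognizer's candidates.

    `candidates` holds one list per character position, each a list of
    (character, confidence) pairs sorted by descending confidence.
    Only the top `bound` candidates per position are considered.
    Returns (word, score), or None when no combination is in the lexicon.
    """
    top = [position[:bound] for position in candidates]
    best = None
    for combo in product(*top):
        word = "".join(ch for ch, _ in combo)
        if word in lexicon:
            score = sum(conf for _, conf in combo)
            if best is None or score > best[1]:
                best = (word, score)
    return best


# Made-up recognizer output for a two-character word.
candidates = [
    [("文", 0.50), ("丈", 0.45)],
    [("伴", 0.60), ("件", 0.55)],
]
lexicon = {"文件", "文化"}
result = best_word(candidates, lexicon)
print(result[0])  # 文件
```

Here the top-ranked character in the second position (伴) is overruled because only 文件 forms a lexicon word. A smaller `bound` shrinks the search space at the cost of possibly excluding the correct character.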
Experiments on more than 40 article images show that the proposed system is effective and efficient.
|
author2 |
Lee, Hsi-Jian |
author_facet |
Lee, Hsi-Jian Zhao, San-Lung 趙善隆 |
author |
Zhao, San-Lung 趙善隆 |
spellingShingle |
Zhao, San-Lung 趙善隆 Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems |
author_sort |
Zhao, San-Lung |
title |
Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems |
title_short |
Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems |
title_full |
Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems |
title_fullStr |
Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems |
title_full_unstemmed |
Multi-kernel Chinese Characters Recognition and A Simplified Language Model Used in General Document Processing Systems |
title_sort |
multi-kernel chinese characters recognition and a simplified language model used in general document processing systems |
publishDate |
2000 |
url |
http://ndltd.ncl.edu.tw/handle/69736609054301724780 |
work_keys_str_mv |
AT zhaosanlung multikernelchinesecharactersrecognitionandasimplifiedlanguagemodelusedingeneraldocumentprocessingsystems AT zhàoshànlóng multikernelchinesecharactersrecognitionandasimplifiedlanguagemodelusedingeneraldocumentprocessingsystems AT zhaosanlung zhōngwénwénjiànchùlǐxìtǒngzhōngshǐyòngzhīduōhéxīnbiànshífāngfǎyǔjiǎnhuàxíngyǔyánmóshì AT zhàoshànlóng zhōngwénwénjiànchùlǐxìtǒngzhōngshǐyòngzhīduōhéxīnbiànshífāngfǎyǔjiǎnhuàxíngyǔyánmóshì |
_version_ |
1716835374325563392 |