Summary: | We present a method for identifying, in a single step, the discrete script or cursive language contained in a document image. The method extracts a set of global templates that are shared between scripts and languages with common symbol shapes. Sharing keeps the number of templates small and reduces both processing time and memory requirements during program execution. A key point of our approach is that we perform one-dimensional normalization so that the width-to-height ratio of each symbol is retained. This preserves the relative geometrical attributes of symbols, which adds to the discriminating power of our algorithm and produces small templates. Our algorithm requires less than 15 seconds on a Pentium III (866 MHz, 128 MB RAM) to identify the discrete script or cursive language of a document. The encouraging results of our approach in terms of accuracy and speed make it suitable for use in commercial OCR products. Keywords: Document understanding, Script and language identification, Normalization, Template matching
|
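The one-dimensional normalization mentioned in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not say which dimension is fixed or how symbols are resampled, so the choice of a fixed target height, the nearest-neighbour resampling, and the function name `normalize_height` are all assumptions made for illustration.

```python
def normalize_height(symbol, target_height):
    """Scale a binary symbol image to a fixed height, keeping its
    width-to-height ratio (hypothetical helper, not from the paper).

    `symbol` is a non-empty list of equal-length rows of 0/1 pixels.
    """
    src_h = len(symbol)
    src_w = len(symbol[0])
    # Assumption: only the height is fixed; the width follows from the
    # original aspect ratio (rounded to the nearest pixel, at least 1).
    target_width = max(1, (src_w * target_height + src_h // 2) // src_h)
    normalized = []
    for y in range(target_height):
        src_y = y * src_h // target_height      # nearest-neighbour row
        row = []
        for x in range(target_width):
            src_x = x * src_w // target_width   # nearest-neighbour column
            row.append(symbol[src_y][src_x])
        normalized.append(row)
    return normalized


# A 4-row by 2-column symbol scaled to height 8 becomes 8 by 4,
# so the width-to-height ratio of 1:2 is unchanged.
tall_symbol = [[0, 1],
               [1, 1],
               [1, 0],
               [0, 0]]
scaled = normalize_height(tall_symbol, 8)
```

Because the width is derived from the height, a narrow symbol stays narrow and a wide one stays wide after normalization, which is the property the summary credits for the added discriminating power.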