Summary: | To date, many Indian government organizations lack robust software for searching words in scanned office documents that contain complex multilingual Indian scripts. Manually searching one such multilingual document takes a few minutes, and tens of thousands of documents may need to be searched for the desired content. Manually searching such a large number of scanned documents is tedious, which calls for robust automatic search software. This motivated our work on indexing aged, printed, multilingual Indian office documents. This paper presents a novel geometrical technique for grouping the components that belong to a text line in a document with multiple orientations, and a novel approach for finding the local skew of a Devanagari word. The proposed technique was evaluated on 280 printed Indian documents containing around 6000 text lines in English, Devanagari, and Marathi scripts, and achieved a 99% success rate for line segmentation, which indicates the validity of the proposed method. To further assess its performance, we evaluated the method on the publicly available Tobacco800 document image database and achieved significant performance results compared with a few of the prominent methods from the literature. Keywords: Character recognition, Character segmentation, Document analysis, Skew correction, Rough skew, Text line extraction, Word segmentation
|