A novel local skew correction and segmentation approach for printed multilingual Indian documents

To date, many Indian government organizations lack robust software for searching words in scanned office documents that contain complex multilingual Indian scripts. Manually searching one such multilingual Indian document takes a few minutes, and there will be tens of thousands of documents to be...

Bibliographic Details
Main Authors: Narasimha Reddy Soora, Parag S. Deshpande
Format: Article
Language: English
Published: Elsevier 2018-09-01
Series: Alexandria Engineering Journal
Online Access: http://www.sciencedirect.com/science/article/pii/S1110016817302053
id doaj-ebec612493e940b998477d2aa24ec600
record_format Article
spelling doaj-ebec612493e940b998477d2aa24ec600 2021-06-02T06:21:25Z eng Elsevier Alexandria Engineering Journal 1110-0168 2018-09-01 57 3 1609 1618 A novel local skew correction and segmentation approach for printed multilingual Indian documents Narasimha Reddy Soora 0 Parag S. Deshpande 1 Visvesvaraya National Institute of Technology, Nagpur 440010, India; Corresponding author. Fax: +91 712 2223969. Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology, Nagpur 440010, India To date, many Indian government organizations lack robust software for searching words in scanned office documents that contain complex multilingual Indian scripts. Manually searching one such multilingual Indian document takes a few minutes, and there will be tens of thousands of documents to be searched for the desired content. Manually searching such a large number of scanned Indian documents is tedious and calls for robust automatic search software. This led us to work toward indexing aged printed multilingual Indian office documents. This paper presents a novel geometrical technique to group the components that belong to a text line of a document having multiple orientations, and a novel approach to find the local skew of a Devanagari word. The proposed technique was evaluated using 280 printed Indian documents with around 6000 text lines in English, Devanagari, and Marathi scripts; it achieved a 99% success rate for line segmentation, which indicates the legitimacy of the proposed method. To further assess the performance of the proposed method, we considered the publicly available Tobacco800 document image database and achieved significant performance results compared with a few of the prominent methods from the literature. Keywords: Character recognition, Character segmentation, Document analysis, Skew correction, Rough skew, Text line extraction, Word segmentation http://www.sciencedirect.com/science/article/pii/S1110016817302053
collection DOAJ
language English
format Article
sources DOAJ
author Narasimha Reddy Soora
Parag S. Deshpande
spellingShingle Narasimha Reddy Soora
Parag S. Deshpande
A novel local skew correction and segmentation approach for printed multilingual Indian documents
Alexandria Engineering Journal
author_facet Narasimha Reddy Soora
Parag S. Deshpande
author_sort Narasimha Reddy Soora
title A novel local skew correction and segmentation approach for printed multilingual Indian documents
title_short A novel local skew correction and segmentation approach for printed multilingual Indian documents
title_full A novel local skew correction and segmentation approach for printed multilingual Indian documents
title_fullStr A novel local skew correction and segmentation approach for printed multilingual Indian documents
title_full_unstemmed A novel local skew correction and segmentation approach for printed multilingual Indian documents
title_sort novel local skew correction and segmentation approach for printed multilingual indian documents
publisher Elsevier
series Alexandria Engineering Journal
issn 1110-0168
publishDate 2018-09-01
description To date, many Indian government organizations lack robust software for searching words in scanned office documents that contain complex multilingual Indian scripts. Manually searching one such multilingual Indian document takes a few minutes, and there will be tens of thousands of documents to be searched for the desired content. Manually searching such a large number of scanned Indian documents is tedious and calls for robust automatic search software. This led us to work toward indexing aged printed multilingual Indian office documents. This paper presents a novel geometrical technique to group the components that belong to a text line of a document having multiple orientations, and a novel approach to find the local skew of a Devanagari word. The proposed technique was evaluated using 280 printed Indian documents with around 6000 text lines in English, Devanagari, and Marathi scripts; it achieved a 99% success rate for line segmentation, which indicates the legitimacy of the proposed method. To further assess the performance of the proposed method, we considered the publicly available Tobacco800 document image database and achieved significant performance results compared with a few of the prominent methods from the literature. Keywords: Character recognition, Character segmentation, Document analysis, Skew correction, Rough skew, Text line extraction, Word segmentation
url http://www.sciencedirect.com/science/article/pii/S1110016817302053
work_keys_str_mv AT narasimhareddysoora anovellocalskewcorrectionandsegmentationapproachforprintedmultilingualindiandocuments
AT paragsdeshpande anovellocalskewcorrectionandsegmentationapproachforprintedmultilingualindiandocuments
AT narasimhareddysoora novellocalskewcorrectionandsegmentationapproachforprintedmultilingualindiandocuments
AT paragsdeshpande novellocalskewcorrectionandsegmentationapproachforprintedmultilingualindiandocuments
_version_ 1721407769256394752