A Structure-Based Method for Building a Database of Extracted Figures from Scientific Documents: A Case Study of Iran Scientific Information Database (GANJ)

Figures in scientific documents are a rich source of information. The first step in retrieving information from such figures is to build a valid figure database. To this end, we developed a system for generating a figure database from scholarly Persian documents at large scale. The first step is to par...


Bibliographic Details
Main Authors:	Azadeh Fakhrzadeh, Amir Hossein Seddighi
Format:	Article
Language:	fas
Published:	Iranian Research Institute for Information and Technology 2020-06-01
Series:	Iranian Journal of Information Processing & Management
Subjects:	image processing, image extraction, metadata extraction, information technology
Online Access:	http://jipm.irandoc.ac.ir/article-1-4220-en.html

id	doaj-dd2ddd364bfc478588cf6f4a5f9182a1
record_format	Article
spelling	doaj-dd2ddd364bfc478588cf6f4a5f9182a1; 2020-11-25T02:52:01Z; fas; Iranian Research Institute for Information and Technology; Iranian Journal of Information Processing & Management; 2251-8223; 2251-8231; 2020-06-01; 353729754; Azadeh Fakhrzadeh (0), Amir Hossein Seddighi (1); Iranian Research Institute for Information Science and Technology (IranDoc); http://jipm.irandoc.ac.ir/article-1-4220-en.html; image processing, image extraction, metadata extraction, information technology
collection	DOAJ
language	fas
format	Article
sources	DOAJ
author	Azadeh Fakhrzadeh, Amir Hossein Seddighi
title	A Structure-Based Method for Building a Database of Extracted Figures from Scientific Documents: A Case Study of Iran Scientific Information Database (GANJ)
publisher	Iranian Research Institute for Information and Technology
series	Iranian Journal of Information Processing & Management
issn	2251-8223 2251-8231
publishDate	2020-06-01
description	Figures in scientific documents are a rich source of information. The first step in retrieving information from such figures is to build a valid figure database. To this end, we developed a system for generating a figure database from scholarly Persian documents at large scale. The first step is to parse the files and extract figures and their corresponding descriptions. There are two general approaches to extracting figures from documents: one is based on image processing methods and the other on processing the file primitives. This paper focuses on the latter. This approach has been shown to be the better choice for search engines because of its speed and scalability. We propose a structure-based method that extracts figures and their descriptions by analyzing the file layout. This information is saved in a database with a specific structure and indexed for retrieval by the search engine. The proposed algorithm was implemented in Python. As a benchmark, we used the basic method from the literature, which is based on processing the PDF file. We applied the proposed method in a case study on the Iran Scientific Information Database (Ganj): 150 scientific documents were randomly chosen from the Ganj database and analyzed using both methods. Based on our experimental results, the proposed method is more efficient than the basic method, especially for Persian documents. There are many unresolved challenges for Persian documents when using the basic method: the number of noise images it produces is high, and the extracted Persian text is not well organized. Our proposed method overcomes some of these drawbacks and is recommended for generating a figure database from scientific Persian documents. The proposed method correctly extracts about 40% of the images with their corresponding descriptions, which is 10% better than the basic method.
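The layout-based pairing of figures with their descriptions, followed by storage in a database, might be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Block` structure, the caption prefixes, the `max_gap` threshold, and the `figures` table schema are all hypothetical, and the blocks are assumed to come from an earlier file-parsing step that is not shown.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Block:
    """A layout element on a page (hypothetical structure, not from the paper)."""
    page: int
    top: float      # distance of the upper edge from the top of the page
    bottom: float   # distance of the lower edge from the top of the page
    text: str = ""  # empty for image blocks


# Assumed caption markers; real documents may use others.
CAPTION_PREFIXES = ("Figure", "Fig.", "شکل")


def pair_figures(images, texts, max_gap=50.0):
    """Pair each image with the nearest caption-like text block on its page."""
    pairs = []
    for img in images:
        best, best_gap = None, max_gap
        for t in texts:
            if t.page != img.page:
                continue
            if not t.text.strip().startswith(CAPTION_PREFIXES):
                continue
            # Vertical gap to a caption below the image, else above it.
            if t.top >= img.bottom:
                gap = t.top - img.bottom
            else:
                gap = img.top - t.bottom
            if 0 <= gap <= best_gap:
                best, best_gap = t, gap
        pairs.append((img, best))  # best stays None if no caption is close
    return pairs


def store(pairs, conn):
    """Save figure positions and captions in a simple table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS figures (page INTEGER, top REAL, caption TEXT)"
    )
    conn.executemany(
        "INSERT INTO figures VALUES (?, ?, ?)",
        [(img.page, img.top, cap.text if cap else None) for img, cap in pairs],
    )
    conn.commit()
```

A caller would pass the parsed image and text blocks to `pair_figures` and then hand the result to `store` together with an open `sqlite3` connection. Figures without a nearby caption are stored with an empty caption, so they can be reviewed or filtered out as likely noise images.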
topic	image processing, image extraction, metadata extraction, information technology
url	http://jipm.irandoc.ac.ir/article-1-4220-en.html
