Framework for automatic information extraction from research papers on nanocrystal devices

To support nanocrystal device development, we have been working on a computational framework to utilize information in research papers on nanocrystal devices. We developed an annotated corpus called “NaDev” (Nanocrystal Device Development) for this purpose. We also proposed an automatic information...

Bibliographic Details
Main Authors:	Thaer M. Dieb, Masaharu Yoshioka, Shinjiro Hara, Marcus C. Newton
Format:	Article
Language:	English
Published:	Beilstein-Institut 2015-09-01
Series:	Beilstein Journal of Nanotechnology
Subjects:	annotated corpus; automatic information extraction; nanocrystal device development; nanoinformatics; text mining
Online Access:	https://doi.org/10.3762/bjnano.6.190

id	doaj-7decfe282d004e8987879d5eaeaa1f16
record_format	Article
spelling	doaj-7decfe282d004e8987879d5eaeaa1f16; 2020-11-24T21:48:53Z; eng; Beilstein-Institut; Beilstein Journal of Nanotechnology; ISSN 2190-4286; 2015-09-01; vol. 6, no. 1, pp. 1872–1882; 10.3762/bjnano.6.190; 2190-4286-6-190; Framework for automatic information extraction from research papers on nanocrystal devices; Thaer M. Dieb (Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido, 060-0814, Japan); Masaharu Yoshioka (Graduate School of Information Science and Technology, Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo, Hokkaido, 060-0814, Japan); Shinjiro Hara (Research Center for Integrated Quantum Electronics, Hokkaido University, Kita 13, Nishi 8, Sapporo 060-8628, Japan); Marcus C. Newton (Physics & Astronomy, University of Southampton, Southampton, SO17 1BJ, UK)
collection	DOAJ
language	English
format	Article
sources	DOAJ
author	Thaer M. Dieb, Masaharu Yoshioka, Shinjiro Hara, Marcus C. Newton
title	Framework for automatic information extraction from research papers on nanocrystal devices
publisher	Beilstein-Institut
series	Beilstein Journal of Nanotechnology
issn	2190-4286
publishDate	2015-09-01
description	To support nanocrystal device development, we have been working on a computational framework to utilize information in research papers on nanocrystal devices. We developed an annotated corpus called “NaDev” (Nanocrystal Device Development) for this purpose. We also proposed an automatic information extraction system called “NaDevEx” (Nanocrystal Device Automatic Information Extraction Framework). NaDevEx aims at extracting information from research papers on nanocrystal devices using the NaDev corpus and machine-learning techniques. However, the characteristics of NaDevEx were not examined in detail. In this paper, we conduct system evaluation experiments for NaDevEx using the NaDev corpus. We discuss three main issues: system performance, compared with human annotators; the effect of paper type (synthesis or characterization) on system performance; and the effects of domain knowledge features (e.g., a chemical named entity recognition system and list of names of physical quantities) on system performance. We found that overall system performance was 89% in precision and 69% in recall. If we consider identification of terms that intersect with correct terms for the same information category as the correct identification, i.e., loose agreement (in many cases, we can find that appropriate head nouns such as temperature or pressure loosely match between two terms), the overall performance is 95% in precision and 74% in recall. The system performance is almost comparable with results of human annotators for information categories with rich domain knowledge information (source material). However, for other information categories, given the relatively large number of terms that exist only in one paper, recall of individual information categories is not high (39–73%); however, precision is better (75–97%). The average performance for synthesis papers is better than that for characterization papers because of the lack of training examples for characterization papers. Based on these results, we discuss future research plans for improving the performance of the system.
topic	annotated corpus; automatic information extraction; nanocrystal device development; nanoinformatics; text mining
url	https://doi.org/10.3762/bjnano.6.190
