A Bootstrapping-based Method to Automatically Identify Data-usage Statements in Publications

Purpose: Our study proposes a bootstrapping-based method to automatically extract data-usage statements from academic texts. Design/methodology/approach: The method for data-usage statements extraction starts with seed entities and iteratively learns patterns and data-usage statements from unlabele...

Bibliographic Details
Main Authors: Qiuzi Zhang, Qikai Cheng, Yong Huang, Wei Lu
Format: Article
Language: English
Published: Chinese Academy of Sciences 2016-03-01
Series: Journal of Data and Information Science
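Subjects: Data-usage statements extraction; Information extraction; Bootstrapping; Unsupervised learning; Academic text-mining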
Online Access: http://www.jdis.org/CN/html/Article8610.htm
id doaj-ca5ae15221ed4e82b876286458b3f68d
record_format Article
spelling doaj-ca5ae15221ed4e82b876286458b3f68d
2020-11-24T22:23:07Z
eng
Chinese Academy of Sciences
Journal of Data and Information Science
ISSN 2096-157X
2016-03-01
Volume 1, Issue 1, pp. 69-85
DOI 10.20309/jdis.201606
A Bootstrapping-based Method to Automatically Identify Data-usage Statements in Publications
Qiuzi Zhang, School of Information Management, Wuhan University, Wuhan 430072, China
Qikai Cheng, School of Information Management, Wuhan University, Wuhan 430072, China
Yong Huang, School of Information Management, Wuhan University, Wuhan 430072, China
Wei Lu, School of Information Management, Wuhan University, Wuhan 430072, China
collection DOAJ
language English
format Article
sources DOAJ
author Qiuzi Zhang
Qikai Cheng
Yong Huang
Wei Lu
title A Bootstrapping-based Method to Automatically Identify Data-usage Statements in Publications
publisher Chinese Academy of Sciences
series Journal of Data and Information Science
issn 2096-157X
publishDate 2016-03-01
description Purpose: Our study proposes a bootstrapping-based method to automatically extract data-usage statements from academic texts. Design/methodology/approach: The method for data-usage statements extraction starts with seed entities and iteratively learns patterns and data-usage statements from unlabeled text. In each iteration, new patterns are constructed and added to the pattern list based on their calculated score. Three seed-selection strategies are also proposed in this paper. Findings: The performance of the method is verified by means of experiments on real data collected from computer science journals. The results show that the method can achieve satisfactory performance regarding precision of extraction and extensibility of obtained patterns. Research limitations: While the triple representation of sentences is effective and efficient for extracting data-usage statements, it is unable to handle complex sentences. Additional features that can address complex sentences should thus be explored in the future. Practical implications: Data-usage statements extraction is beneficial for data-repository construction and facilitates research on data-usage tracking, dataset-based scholar search, and dataset evaluation. Originality/value: To the best of our knowledge, this paper is among the first to address the important task of automatically extracting data-usage statements from real data.
topic Data-usage statements extraction
Information extraction
Bootstrapping
Unsupervised learning
Academic text-mining
url http://www.jdis.org/CN/html/Article8610.htm
_version_ 1725765817550766080
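
The bootstrapping loop summarized in the description field can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not give the pattern-scoring formula, the three seed-selection strategies, or the exact triple format, so the sketch assumes sentences are already reduced to (subject, verb, object) triples, treats a (subject, verb) pair as a pattern, and scores a pattern by the fraction of its objects that are already known entities. The names `TRIPLES`, `bootstrap`, and `threshold`, as well as the toy sentences, are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy corpus: each sentence reduced to a (subject, verb, object) triple.
TRIPLES = [
    ("we", "use", "MNIST"),
    ("we", "use", "CIFAR-10"),
    ("experiments", "conducted on", "CIFAR-10"),
    ("experiments", "conducted on", "Reuters-21578"),
    ("we", "evaluate on", "Reuters-21578"),
    ("we", "evaluate on", "TREC"),
    ("we", "propose", "a new model"),
]


def bootstrap(triples, seeds, threshold=0.5, max_iters=10):
    """Iteratively grow patterns and entities from seed entities.

    Assumed scoring rule (not from the paper): a pattern's score is the
    share of its matched objects that are already known entities.
    """
    entities = set(seeds)
    patterns = set()

    # Group the objects seen with each candidate (subject, verb) pattern.
    matches = defaultdict(list)
    for subj, verb, obj in triples:
        matches[(subj, verb)].append(obj)

    for _ in range(max_iters):
        # Score every pattern not yet accepted against the current entities.
        accepted = set()
        for pattern, objs in matches.items():
            if pattern in patterns:
                continue
            score = sum(obj in entities for obj in objs) / len(objs)
            if score >= threshold:
                accepted.add(pattern)

        if not accepted:
            break  # No new pattern passed the threshold: converged.

        # Add the new patterns, then harvest the entities they match.
        patterns |= accepted
        for pattern in accepted:
            entities.update(matches[pattern])

    # Every triple matched by an accepted pattern is a data-usage statement.
    statements = [t for t in triples if (t[0], t[1]) in patterns]
    return patterns, entities, statements


patterns, entities, statements = bootstrap(TRIPLES, seeds={"MNIST"})
print(sorted(patterns))
print(sorted(entities))
```

Starting from the single seed, each pass accepts one more pattern because the entities harvested in the previous pass raise that pattern's score, which is the iterative behaviour the description outlines. A threshold that is too low would let noisy patterns pull in non-dataset objects, which is why some scoring step is needed at all.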