The Automatic Detection of Dataset Names in Scientific Articles

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, with roughly half of them containing o...

Bibliographic Details
Main Authors: Jenny Heddes, Pim Meerdink, Miguel Pieters, Maarten Marx
Format: Article
Language: English
Published: MDPI AG 2021-08-01
Series: Data
Subjects: dataset extraction; scientific information extraction; named entity recognition; BERT; SciBERT
Online Access: https://www.mdpi.com/2306-5729/6/8/84
id doaj-9abf478613034bf4a805fc7db92064ec
record_format Article
spelling doaj-9abf478613034bf4a805fc7db92064ec
2021-08-26T13:39:43Z
eng
MDPI AG
Data
2306-5729
2021-08-01
Volume 6, issue 8, article 84
10.3390/data6080084
The Automatic Detection of Dataset Names in Scientific Articles
Jenny Heddes (0), Pim Meerdink (1), Miguel Pieters (2), Maarten Marx (3)
Affiliation (all four authors): Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 908, 1098 XH Amsterdam, The Netherlands
https://www.mdpi.com/2306-5729/6/8/84
dataset extraction; scientific information extraction; named entity recognition; BERT; SciBERT
collection DOAJ
language English
format Article
sources DOAJ
author Jenny Heddes
Pim Meerdink
Miguel Pieters
Maarten Marx
title The Automatic Detection of Dataset Names in Scientific Articles
publisher MDPI AG
series Data
issn 2306-5729
publishDate 2021-08-01
description We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, with roughly half of them containing one or more named datasets. A distinguishing feature of this set is the many sentences using enumerations, conjunctions and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well and better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the open-access articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub.
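The "BI+ tag sequences" mentioned in the description can be illustrated with a small sketch. It assumes the common B/I/O token labelling (B opens a dataset mention, I continues it, O is everything else). The record does not spell out the authors' exact tag set, and the function name `bio_spans`, the example sentence, and its tags are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def bio_spans(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end, text) for every B followed by zero or more I tags.

    Assumption: plain "B", "I", "O" labels. An "I" with no open span is
    treated like "O" and ignored.
    """
    spans = []
    start = None  # index of the token that opened the current span, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new B closes any span that is still open, then opens a new one.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I" and start is not None:
            # Continuation of the open span: nothing to do yet.
            continue
        else:
            # "O" (or a stray "I") ends the open span, if there is one.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = None
    if start is not None:
        # Span runs to the end of the sentence.
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Hypothetical sentence with two dataset mentions joined by a conjunction.
tokens = "We evaluate on the Penn Treebank and the Stanford Sentiment Treebank .".split()
tags = ["O", "O", "O", "O", "B", "I", "O", "O", "B", "I", "I", "O"]

for start, end, text in bio_spans(tokens, tags):
    print(start, end, text)
```

The longer a mention (for instance, one spanning an enumeration), the longer the run of I tags after its B, which is the property the description highlights.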
topic dataset extraction
scientific information extraction
named entity recognition
BERT
SciBERT
url https://www.mdpi.com/2306-5729/6/8/84