The Automatic Detection of Dataset Names in Scientific Articles
We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, with roughly half of them containing o...
Main Authors: | Jenny Heddes, Pim Meerdink, Miguel Pieters, Maarten Marx |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2021-08-01 |
Series: | Data |
Subjects: | dataset extraction, scientific information extraction, named entity recognition, BERT, SciBERT |
Online Access: | https://www.mdpi.com/2306-5729/6/8/84 |
id |
doaj-9abf478613034bf4a805fc7db92064ec |
---|---|
record_format |
Article |
spelling |
doaj-9abf478613034bf4a805fc7db92064ec; 2021-08-26T13:39:43Z; eng; MDPI AG; Data; 2306-5729; 2021-08-01; volume 6, issue 8, article 84; 10.3390/data6080084; The Automatic Detection of Dataset Names in Scientific Articles; Jenny Heddes, Pim Meerdink, Miguel Pieters, Maarten Marx; Informatics Institute, Faculty of Science, University of Amsterdam, Science Park 908, 1098 XH Amsterdam, The Netherlands; https://www.mdpi.com/2306-5729/6/8/84; dataset extraction, scientific information extraction, named entity recognition, BERT, SciBERT |
collection |
DOAJ |
sources |
DOAJ |
author |
Jenny Heddes, Pim Meerdink, Miguel Pieters, Maarten Marx |
issn |
2306-5729 |
description |
We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, with roughly half of them containing one or more named datasets. A distinguishing feature of this set is the many sentences using enumerations, conjunctions and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well and better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the open-access articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub. |
topic |
dataset extraction, scientific information extraction, named entity recognition, BERT, SciBERT |
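
The description above treats dataset recognition as NER and mentions long BI+ tag sequences. As a rough illustration of that tagging scheme, the Python sketch below encodes token spans as B/I/O tags and measures the longest tagged mention. The sentences, the span boundaries, and the helper names `bio_tags` and `longest_mention` are hypothetical and not taken from the article's corpus or code. Whether an elliptical enumeration is annotated as one span or several is decided by the authors' annotation guidelines, which this sketch does not attempt to reproduce.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Encode entity spans (start inclusive, end exclusive) as B/I/O tags."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid span: ({start}, {end})")
        tags[start] = "B"                  # first token of a mention
        for i in range(start + 1, end):
            tags[i] = "I"                  # continuation tokens
    return tags


def longest_mention(tags: List[str]) -> int:
    """Return the token length of the longest B I+ run in a tag sequence."""
    longest = current = 0
    for tag in tags:
        if tag == "B":
            current = 1
        elif tag == "I" and current:
            current += 1
        else:
            current = 0
        longest = max(longest, current)
    return longest


# Hypothetical sentence with three separately annotated dataset names.
tokens_a = ("We evaluate on the Penn Treebank , CIFAR-10 "
            "and the Stanford Sentiment Treebank .").split()
tags_a = bio_tags(tokens_a, [(4, 6), (7, 8), (10, 13)])

# Hypothetical elliptical enumeration, annotated here as one span
# (an assumption made only to show a longer B I+ run).
tokens_b = "We report results on WMT 2014 and 2016 .".split()
tags_b = bio_tags(tokens_b, [(4, 8)])

print(" ".join(tags_a))          # O O O O B I O B O O B I I O
print(" ".join(tags_b))          # O O O O B I I I O
print(longest_mention(tags_a))   # 3
print(longest_mention(tags_b))   # 4
```

Under these assumptions, treating "WMT 2014 and 2016" as a single mention produces a four-token B I I I run, which is the kind of longer sequence that enumerations and ellipses can create.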