Reproducibility in Computational Linguistics: Are We Willing to Share?

This study focuses on an essential precondition for reproducibility in computational linguistics: the willingness of authors to share relevant source code and data. Ten years after Ted Pedersen’s influential “Last Words” contribution in Computational Linguistics, we investigate to what extent resear...


Bibliographic Details
Main Authors: Martijn Wieling, Josine Rawee, Gertjan van Noord
Format: Article
Language: English
Published: The MIT Press 2018-12-01
Series: Computational Linguistics
Online Access: https://www.mitpressjournals.org/doi/pdf/10.1162/coli_a_00330
id doaj-155f8b6efe174058b23fd84833a03cc7
record_format Article
spelling doaj-155f8b6efe174058b23fd84833a03cc7; 2020-11-25T01:17:13Z; eng; The MIT Press; Computational Linguistics; ISSN 1530-9312; 2018-12-01; vol. 44, no. 4, pp. 641-649; DOI 10.1162/coli_a_00330; coli_a_00330; Reproducibility in Computational Linguistics: Are We Willing to Share?
Martijn Wieling (University of Groningen, Center for Language and Cognition, Groningen; wieling@gmail.com)
Josine Rawee (Master’s student, University of Groningen, Center for Language and Cognition, Groningen; josine@rawee.nl)
Gertjan van Noord (University of Groningen, Center for Language and Cognition, Groningen; g.j.m.van.noord@rug.nl)
collection DOAJ
language English
format Article
sources DOAJ
author Martijn Wieling
Josine Rawee
Gertjan van Noord
spellingShingle Martijn Wieling
Josine Rawee
Gertjan van Noord
Reproducibility in Computational Linguistics: Are We Willing to Share?
Computational Linguistics
author_facet Martijn Wieling
Josine Rawee
Gertjan van Noord
author_sort Martijn Wieling
title Reproducibility in Computational Linguistics: Are We Willing to Share?
title_sort reproducibility in computational linguistics: are we willing to share?
publisher The MIT Press
series Computational Linguistics
issn 1530-9312
publishDate 2018-12-01
description This study focuses on an essential precondition for reproducibility in computational linguistics: the willingness of authors to share relevant source code and data. Ten years after Ted Pedersen’s influential “Last Words” contribution in Computational Linguistics, we investigate to what extent researchers in computational linguistics are willing and able to share their data and code. We surveyed all 395 full papers presented at the 2011 and 2016 ACL Annual Meetings, and identified whether links to data and code were provided. If working links were not provided, authors were requested to provide this information. Although data were often available, code was shared less often. When working links to code or data were not provided in the paper, authors provided the code in about one third of cases. For a selection of ten papers, we attempted to reproduce the results using the provided data and code. We were able to reproduce the results approximately for six papers. For only a single paper did we obtain the exact same results. Our findings show that even though the situation appears to have improved comparing 2016 to 2011, empiricism in computational linguistics still largely remains a matter of faith. Nevertheless, we are somewhat optimistic about the future. Ensuring reproducibility is not only important for the field as a whole, but also seems worthwhile for individual researchers: The median citation count for studies with working links to the source code is higher.
url https://www.mitpressjournals.org/doi/pdf/10.1162/coli_a_00330