Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study.

This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are "inv...

Bibliographic Details
Main Authors: Kevin B Read, Jerry R Sheehan, Michael F Huerta, Lou S Knecht, James G Mork, Betsy L Humphreys, NIH Big Data Annotator Group
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2015-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC4514623?pdf=render
id doaj-984256e1d6484bc6a7dc97b6e33f707a
record_format Article
spelling doaj-984256e1d6484bc6a7dc97b6e33f707a; 2020-11-25T02:32:28Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2015-01-01; 10; 7; e0132735; 10.1371/journal.pone.0132735; Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study.; Kevin B Read; Jerry R Sheehan; Michael F Huerta; Lou S Knecht; James G Mork; Betsy L Humphreys; NIH Big Data Annotator Group; http://europepmc.org/articles/PMC4514623?pdf=render
collection DOAJ
language English
format Article
sources DOAJ
author Kevin B Read
Jerry R Sheehan
Michael F Huerta
Lou S Knecht
James G Mork
Betsy L Humphreys
NIH Big Data Annotator Group
title Sizing the Problem of Improving Discovery and Access to NIH-Funded Data: A Preliminary Study.
publisher Public Library of Science (PLoS)
series PLoS ONE
issn 1932-6203
publishDate 2015-01-01
description This study informs efforts to improve the discoverability of and access to biomedical datasets by providing a preliminary estimate of the number and type of datasets generated annually by research funded by the U.S. National Institutes of Health (NIH). It focuses on those datasets that are "invisible" or not deposited in a known repository. We analyzed NIH-funded journal articles that were published in 2011, cited in PubMed and deposited in PubMed Central (PMC) to identify those that indicate data were submitted to a known repository. After excluding those articles, we analyzed a random sample of the remaining articles to estimate how many and what types of invisible datasets were used in each article. About 12% of the articles explicitly mention deposition of datasets in recognized repositories, leaving 88% that are invisible datasets. Among articles with invisible datasets, we found an average of 2.9 to 3.4 datasets, suggesting there were approximately 200,000 to 235,000 invisible datasets generated from NIH-funded research published in 2011. Approximately 87% of the invisible datasets consist of data newly collected for the research reported; 13% reflect reuse of existing data. More than 50% of the datasets were derived from live human or non-human animal subjects. In addition to providing a rough estimate of the total number of datasets produced per year by NIH-funded researchers, this study identifies additional issues that must be addressed to improve the discoverability of and access to biomedical research data: the definition of a "dataset," determination of which (if any) data are valuable for archiving and preservation, and better methods for estimating the number of datasets of interest. Lack of consensus amongst annotators about the number of datasets in a given article reinforces the need for a principled way of thinking about how to identify and characterize biomedical datasets.
url http://europepmc.org/articles/PMC4514623?pdf=render
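
The description gives a per-article average (2.9 to 3.4 invisible datasets) and a total (approximately 200,000 to 235,000), but this record does not state the number of articles behind that total. The sketch below back-calculates it from the rounded figures in the description. The function and variable names are illustrative, the resulting counts are inferred rather than reported here, and the last step assumes the 88% share applies to the same article population.

```python
# Back-of-envelope check of the estimate quoted in the description.
# Inputs are the rounded figures from the abstract; outputs are inferred
# values, not numbers reported in this record.

AVG_DATASETS = (2.9, 3.4)            # invisible datasets per article (low, high)
TOTAL_DATASETS = (200_000, 235_000)  # estimated invisible datasets (low, high)
INVISIBLE_SHARE = 0.88               # share of articles without a repository mention


def implied_articles(total_datasets: float, avg_per_article: float) -> float:
    """Number of articles implied by a dataset total and a per-article average."""
    return total_datasets / avg_per_article


# Articles with invisible datasets implied by each end of the range.
invisible_low = implied_articles(TOTAL_DATASETS[0], AVG_DATASETS[0])
invisible_high = implied_articles(TOTAL_DATASETS[1], AVG_DATASETS[1])

# Assumption: the 88% share refers to the same population of 2011 articles,
# so dividing by it gives a rough count of all articles analyzed.
all_low = invisible_low / INVISIBLE_SHARE
all_high = invisible_high / INVISIBLE_SHARE

if __name__ == "__main__":
    print(f"articles with invisible datasets: {invisible_low:,.0f} to {invisible_high:,.0f}")
    print(f"all articles (assuming 88% share): {all_low:,.0f} to {all_high:,.0f}")
```

Both ends of the range point to roughly 69,000 articles with invisible datasets, which suggests the low and high totals come from applying the two averages to a single article count.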