Analysis of data dictionary formats of HIV clinical trials.
Background: Efforts to define research Common Data Elements try to harmonize data collection across clinical studies. Objective: Our goal was to analyze the quality and usability of data dictionaries of HIV studies. Methods: For the clinical dom...
Main Authors: | Craig S Mayer, Nick Williams, Vojtech Huser |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS) 2020-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0240047 |
id |
doaj-86544a2d90a34fc3b7e053f70276377e |
---|---|
record_format |
Article |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Craig S Mayer Nick Williams Vojtech Huser |
title |
Analysis of data dictionary formats of HIV clinical trials. |
publisher |
Public Library of Science (PLoS) |
series |
PLoS ONE |
issn |
1932-6203 |
publishDate |
2020-01-01 |
description |
Background: Efforts to define research Common Data Elements try to harmonize data collection across clinical studies. Objective: Our goal was to analyze the quality and usability of data dictionaries of HIV studies. Methods: For the clinical domain of HIV, we searched data sharing platforms and acquired a set of 18 HIV related studies from which we analyzed 26 328 data elements. We identified existing standards for creating a data dictionary and reviewed their use. To facilitate aggregation across studies, we defined three types of data dictionary (data element, forms, and permissible values) and created a simple information model for each type. Results: An average study had 427 data elements (ranging from 46 elements to 9 945 elements). In terms of data type, 48.6% of data elements were string, 47.8% were numeric, 3.0% were date and 0.6% were date-time. No study in our sample explicitly declared a data element as a categorical variable and rather considered them either strings or numeric. Only for 61% of studies were we able to obtain permissible values. The majority of studies used CSV files to share a data dictionary while 22% of the studies used a non-computable, PDF format. All studies grouped their data elements. The average number of groups or forms per study was 24 (ranging between 2 and 124 groups/forms). An accurate and well formatted data dictionary facilitates error-free secondary analysis and can help with data de-identification. Conclusion: We saw features of data dictionaries that made them difficult to use and understand. This included multiple data dictionary files or non-machine-readable documents, data elements included in data but not in the dictionary or missing data types or descriptions. Building on experience with aggregating data elements across a large set of studies, we created a set of recommendations (called CONSIDER statement) that can guide optimal data sharing of future studies. |
url |
https://doi.org/10.1371/journal.pone.0240047 |