FDTool: a Python application to mine for functional dependencies and candidate keys in tabular data [version 1; peer review: 2 approved]

Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command-line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys f...
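
To make the two terms in the summary concrete, here is a minimal sketch of what it means for an FD to hold in a table and for an attribute set to be a candidate key of that table. It is a naive brute-force illustration, not FDTool's code and not the FD_Mine algorithm; the function names (`holds`, `is_superkey`, `candidate_keys`) and the sample participant records are hypothetical.

```python
import csv
import io
from itertools import combinations

# Hypothetical sample data: a small "long and narrow" participant table.
SAMPLE = """participant_id,zip,city,age
1,27514,Chapel Hill,34
2,27514,Chapel Hill,51
3,27701,Durham,34
4,27701,Durham,29
"""
rows = list(csv.DictReader(io.StringIO(SAMPLE)))


def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds: equal lhs values imply equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def is_superkey(rows, attrs):
    """Return True if no two rows agree on every attribute in attrs."""
    projected = [tuple(row[a] for a in attrs) for row in rows]
    return len(set(projected)) == len(projected)


def candidate_keys(rows, attributes):
    """Brute-force search for minimal superkeys, smallest attribute sets first."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(attributes, size):
            # Skip supersets of a key already found; they are not minimal.
            if any(set(k) <= set(combo) for k in keys):
                continue
            if is_superkey(rows, combo):
                keys.append(combo)
    return keys


print(holds(rows, ["zip"], "city"))   # True: each zip maps to one city
print(holds(rows, ["age"], "city"))   # False: age 34 appears with two cities
print(candidate_keys(rows, ["participant_id", "zip", "city", "age"]))
# [('participant_id',), ('zip', 'age'), ('city', 'age')]
```

The search enumerates every attribute combination, so its cost grows exponentially with the number of columns and only linearly with the number of rows. This is a property of the brute-force sketch, though it points in the same direction as the record's statement that attribute count affects FDTool's runtime far more than row count.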

Bibliographic Details
Main Authors: Matt Buranosky, Elmar Stellnberger, Emily Pfaff, David Diaz-Sanchez, Cavin Ward-Caviness
Format: Article
Language: English
Published: F1000 Research Ltd 2018-10-01
Series: F1000Research
Online Access: https://f1000research.com/articles/7-1667/v1
id doaj-abcdd374ef564bca8511f13adee513a8
spelling doaj-abcdd374ef564bca8511f13adee513a8; 2020-11-25T03:48:15Z; eng; F1000 Research Ltd; F1000Research; ISSN 2046-1402; 2018-10-01; 7; DOI 10.12688/f1000research.16483.1; 18017
Matt Buranosky (National Health and Environmental Effects Research Laboratory, United States Environmental Protection Agency, Chapel Hill, NC, USA)
Elmar Stellnberger (University of Klagenfurt, Klagenfurt, Austria)
Emily Pfaff (University of North Carolina at Chapel Hill, Chapel Hill, NC, USA)
David Diaz-Sanchez (National Health and Environmental Effects Research Laboratory, United States Environmental Protection Agency, Chapel Hill, NC, USA)
Cavin Ward-Caviness (National Health and Environmental Effects Research Laboratory, United States Environmental Protection Agency, Chapel Hill, NC, USA)
https://f1000research.com/articles/7-1667/v1
collection DOAJ
issn 2046-1402
description Functional dependencies (FDs) and candidate keys are essential for table decomposition, database normalization, and data cleansing. In this paper, we present FDTool, a command-line Python application to discover minimal FDs in tabular datasets and infer equivalent attribute sets and candidate keys from them. The runtime and memory costs associated with seven published FD discovery algorithms are given with an overview of their theoretical foundations. We conclude that FD_Mine is the most efficient FD discovery algorithm when applied to datasets with many rows (> 100,000 rows) and few columns (< 14 columns). This puts it in a special position to mine rules from clinical and demographic datasets, which often consist of long and narrow sets of participant records. The structure of FD_Mine is described and supplemented with a formal proof of the equivalence pruning method used. FDTool is a re-implementation of FD_Mine with additional features that improve performance and automate typical processes in database architecture. The experimental results of applying FDTool to 12 datasets of different dimensions are summarized in terms of the number of FDs checked, the number of FDs found, and the time it takes for the code to terminate. We find that the number of attributes in a dataset has a much greater effect on the runtime and memory costs of FDTool than does row count. The last section explains in detail how the FDTool application can be accessed, executed, and further developed.