Data Civilizer 2.0: a holistic framework for data preparation and analytics

© 2019 VLDB Endowment. Data scientists spend over 80% of their time (1) parameter-tuning machine learning models and (2) iterating between data cleaning and machine learning model execution. While there are existing efforts to support the first requirement, there is currently no integrated workflow...

Bibliographic Details
Main Authors: Rezig, El Kindi (Author), Cao, Lei (Author), Stonebraker, Michael (Author), Simonini, Giovanni (Author), Tao, Wenbo (Author), Madden, Samuel R (Author), Ouzzani, Mourad (Author), Tang, Nan (Author), Elmagarmid, Ahmed K (Author)
Other Authors: Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory (Contributor), Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Contributor)
Format: Article
Language: English
Published: VLDB Endowment, 2021-12-20T15:57:22Z.
Online Access: https://hdl.handle.net/1721.1/137529.2
LEADER 01945 am a22002773u 4500
001 137529.2
042 |a dc 
100 1 0 |a Rezig, El Kindi  |e author 
100 1 0 |a Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory  |e contributor 
100 1 0 |a Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science  |e contributor 
700 1 0 |a Cao, Lei  |e author 
700 1 0 |a Stonebraker, Michael  |e author 
700 1 0 |a Simonini, Giovanni  |e author 
700 1 0 |a Tao, Wenbo  |e author 
700 1 0 |a Madden, Samuel R  |e author 
700 1 0 |a Ouzzani, Mourad  |e author 
700 1 0 |a Tang, Nan  |e author 
700 1 0 |a Elmagarmid, Ahmed K  |e author 
245 0 0 |a Data Civilizer 2.0: a holistic framework for data preparation and analytics 
260 |b VLDB Endowment,   |c 2021-12-20T15:57:22Z. 
856 |z Get fulltext  |u https://hdl.handle.net/1721.1/137529.2 
520 |a © 2019 VLDB Endowment. Data scientists spend over 80% of their time (1) parameter-tuning machine learning models and (2) iterating between data cleaning and machine learning model execution. While there are existing efforts to support the first requirement, there is currently no integrated workflow system that couples data cleaning and machine learning development. The previous version of Data Civilizer was geared towards data cleaning and discovery using a set of pre-defined tools. In this paper, we introduce Data Civilizer 2.0, an end-to-end workflow system satisfying both requirements. In addition, this system also supports a sophisticated data debugger and a workflow visualization system. In this demo, we will show how we used Data Civilizer 2.0 to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30TB brain activity dataset. 
546 |a en 
655 7 |a Article 
773 |t 10.14778/3352063.3352108 
773 |t Proceedings of the VLDB Endowment