RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation

Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation.
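As a rough illustration of the tuple-to-tuple denoising objective described in the abstract, the sketch below serializes a tuple as attribute-value text, corrupts one value, and trains an off-the-shelf bidirectional-encoder/autoregressive-decoder model to reconstruct the original tuple. The Hugging Face transformers library, the facebook/bart-base checkpoint, and the "attr: value" serialization are assumptions made here for illustration only, not the authors' implementation.

# Minimal sketch of RPT-style tuple-to-tuple denoising pre-training.
# Assumptions: Hugging Face transformers, the facebook/bart-base checkpoint,
# and a simple "attr: value" serialization; none of these come from the paper.
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# Serialize a relational tuple as text.
original = "name: Michael Jordan ; city: Chicago ; team: Bulls"
# Corrupt the tuple by masking one attribute value.
corrupted = "name: Michael Jordan ; city: <mask> ; team: Bulls"

inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids

# One denoising step: the seq2seq loss rewards reconstructing the original tuple.
loss = model(**inputs, labels=labels).loss
loss.backward()

The same corrupt-then-reconstruct model can then be prompted or fine-tuned for downstream tasks such as data cleaning, auto-completion, and schema matching, as the abstract notes.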


Bibliographic Details
Main Authors: Tang, Nan (Author), Fan, Ju (Author), Li, Fangyi (Author), Tu, Jianhong (Author), Du, Xiaoyong (Author), Li, Guoliang (Author), Madden, Sam (Author), Ouzzani, Mourad (Author)
Format: Article
Language: English
Published: VLDB Endowment, 2022-07-15.
Subjects:
Online Access: Get fulltext (https://hdl.handle.net/1721.1/143770)
LEADER 02127 am a22002413u 4500
001 143770
042 |a dc 
100 1 0 |a Tang, Nan  |e author 
700 1 0 |a Fan, Ju  |e author 
700 1 0 |a Li, Fangyi  |e author 
700 1 0 |a Tu, Jianhong  |e author 
700 1 0 |a Du, Xiaoyong  |e author 
700 1 0 |a Li, Guoliang  |e author 
700 1 0 |a Madden, Sam  |e author 
700 1 0 |a Ouzzani, Mourad  |e author 
245 0 0 |a RPT: relational pre-trained transformer is almost all you need towards democratizing data preparation 
260 |b VLDB Endowment,   |c 2022-07-15T16:13:16Z. 
856 |z Get fulltext  |u https://hdl.handle.net/1721.1/143770 
520 |a Can AI help automate human-easy but computer-hard data preparation tasks that burden data scientists, practitioners, and crowd workers? We answer this question by presenting RPT, a denoising autoencoder for tuple-to-X models ("X" could be tuple, token, label, JSON, and so on). RPT is pre-trained for a tuple-to-tuple model by corrupting the input tuple and then learning a model to reconstruct the original tuple. It adopts a Transformer-based neural translation architecture that consists of a bidirectional encoder (similar to BERT) and a left-to-right autoregressive decoder (similar to GPT), leading to a generalization of both BERT and GPT. The pre-trained RPT can already support several common data preparation tasks such as data cleaning, auto-completion and schema matching. Better still, RPT can be fine-tuned on a wide range of data preparation tasks, such as value normalization, data transformation, data annotation, etc. To complement RPT, we also discuss several appealing techniques such as collaborative training and few-shot learning for entity resolution, and few-shot learning and NLP question-answering for information extraction. In addition, we identify a series of research opportunities to advance the field of data preparation. 
546 |a en 
655 7 |a Article 
773 |t 10.14778/3457390.3457391 
773 |t Proceedings of the VLDB Endowment