Upscaling Statistical Patterns from Reduced Storage in Social and Life Science Big Datasets

Recent technological and computational advances have enabled the collection of data at an unprecedented rate. On the one hand, the large amount of data suddenly available has opened up new opportunities for new data-driven research but, on the other hand, it has brought into light new obstacles and...

Bibliographic Details
Main Authors: Stefano Garlaschi, Anna Fochesato, Anna Tovo
Format: Article
Language: English
Published: MDPI AG 2020-09-01
Series: Entropy
Subjects: upscaling life science data; statistical patterns inference; big data storage reduction
Online Access: https://www.mdpi.com/1099-4300/22/10/1084
id doaj-32742db92dc941cc872bf04cfda5be40
record_format Article
spelling doaj-32742db92dc941cc872bf04cfda5be40
2020-11-25T03:30:30Z
eng
MDPI AG
Entropy
1099-4300
2020-09-01
22
1084
1084
10.3390/e22101084
Upscaling Statistical Patterns from Reduced Storage in Social and Life Science Big Datasets
Stefano Garlaschi (0)
Anna Fochesato (1)
Anna Tovo (2)
(0) Dipartimento di Fisica e Astronomia “Galileo Galilei”, Università degli studi di Padova, Via Marzolo 8, 35131 Padova, Italy
(1) Fondazione The Microsoft Research—University of Trento, Centre for Computational and Systems Biology (COSBI), Piazza Manifattura 1, 38068 Rovereto, Italy
(2) Dipartimento di Fisica e Astronomia “Galileo Galilei”, Università degli studi di Padova, Via Marzolo 8, 35131 Padova, Italy
Recent technological and computational advances have enabled the collection of data at an unprecedented rate. On the one hand, the large amount of data suddenly available has opened up new opportunities for new data-driven research but, on the other hand, it has brought into light new obstacles and challenges related to storage and analysis limits. Here, we strengthen an upscaling approach borrowed from theoretical ecology that allows us to infer with small errors relevant patterns of a dataset in its entirety, although only a limited fraction of it has been analysed. In particular we show that, after reducing the input amount of information on the system under study, by applying our framework it is still possible to recover two statistical patterns of interest of the entire dataset. Tested against big ecological, human activity and genomics data, our framework was successful in the reconstruction of global statistics related to both the number of types and their abundances while starting from limited presence/absence information on small random samples of the datasets. These results pave the way for future applications of our procedure in different life science contexts, from social activities to natural ecosystems.
https://www.mdpi.com/1099-4300/22/10/1084
upscaling life science data
statistical patterns inference
big data storage reduction
collection DOAJ
language English
format Article
sources DOAJ
author Stefano Garlaschi
Anna Fochesato
Anna Tovo
title Upscaling Statistical Patterns from Reduced Storage in Social and Life Science Big Datasets
publisher MDPI AG
series Entropy
issn 1099-4300
publishDate 2020-09-01
description Recent technological and computational advances have enabled the collection of data at an unprecedented rate. On the one hand, the large amount of data suddenly available has opened up new opportunities for new data-driven research but, on the other hand, it has brought into light new obstacles and challenges related to storage and analysis limits. Here, we strengthen an upscaling approach borrowed from theoretical ecology that allows us to infer with small errors relevant patterns of a dataset in its entirety, although only a limited fraction of it has been analysed. In particular we show that, after reducing the input amount of information on the system under study, by applying our framework it is still possible to recover two statistical patterns of interest of the entire dataset. Tested against big ecological, human activity and genomics data, our framework was successful in the reconstruction of global statistics related to both the number of types and their abundances while starting from limited presence/absence information on small random samples of the datasets. These results pave the way for future applications of our procedure in different life science contexts, from social activities to natural ecosystems.
topic upscaling life science data
statistical patterns inference
big data storage reduction
url https://www.mdpi.com/1099-4300/22/10/1084