Upscaling Statistical Patterns from Reduced Storage in Social and Life Science Big Datasets
Recent technological and computational advances have enabled the collection of data at an unprecedented rate. On the one hand, the large amount of data suddenly available has opened up new opportunities for new data-driven research but, on the other hand, it has brought into light new obstacles and...
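Main Authors: | Stefano Garlaschi, Anna Fochesato, Anna Tovo |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2020-09-01 |
Series: | Entropy |
Subjects: | upscaling life science data; statistical patterns inference; big data storage reduction |
Online Access: | https://www.mdpi.com/1099-4300/22/10/1084 |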
id |
doaj-32742db92dc941cc872bf04cfda5be40 |
---|---|
record_format |
Article |
spelling |
doaj-32742db92dc941cc872bf04cfda5be40; 2020-11-25T03:30:30Z; eng; MDPI AG; Entropy; 1099-4300; 2020-09-01; 22; 1084; 1084; 10.3390/e22101084
Upscaling Statistical Patterns from Reduced Storage in Social and Life Science Big Datasets
Stefano Garlaschi (Dipartimento di Fisica e Astronomia “Galileo Galilei”, Università degli studi di Padova, Via Marzolo 8, 35131 Padova, Italy); Anna Fochesato (Fondazione The Microsoft Research—University of Trento, Centre for Computational and Systems Biology (COSBI), Piazza Manifattura 1, 38068 Rovereto, Italy); Anna Tovo (Dipartimento di Fisica e Astronomia “Galileo Galilei”, Università degli studi di Padova, Via Marzolo 8, 35131 Padova, Italy)
Recent technological and computational advances have enabled the collection of data at an unprecedented rate. On the one hand, the large amount of data suddenly available has opened up new opportunities for new data-driven research but, on the other hand, it has brought into light new obstacles and challenges related to storage and analysis limits. Here, we strengthen an upscaling approach borrowed from theoretical ecology that allows us to infer with small errors relevant patterns of a dataset in its entirety, although only a limited fraction of it has been analysed. In particular we show that, after reducing the input amount of information on the system under study, by applying our framework it is still possible to recover two statistical patterns of interest of the entire dataset. Tested against big ecological, human activity and genomics data, our framework was successful in the reconstruction of global statistics related to both the number of types and their abundances while starting from limited presence/absence information on small random samples of the datasets. These results pave the way for future applications of our procedure in different life science contexts, from social activities to natural ecosystems.
https://www.mdpi.com/1099-4300/22/10/1084
upscaling life science data; statistical patterns inference; big data storage reduction |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Stefano Garlaschi, Anna Fochesato, Anna Tovo |
title |
Upscaling Statistical Patterns from Reduced Storage in Social and Life Science Big Datasets |
publisher |
MDPI AG |
series |
Entropy |
issn |
1099-4300 |
publishDate |
2020-09-01 |
topic |
upscaling life science data; statistical patterns inference; big data storage reduction |
url |
https://www.mdpi.com/1099-4300/22/10/1084 |