Controlled feature selection and compressive big data analytics: Applications to biomedical and health studies.

The theoretical foundations of Big Data Science are not yet fully developed. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted at the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor in which established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the reliability of the algorithm and the accuracy of its findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete, and multi-source data and analytics. Although not yet fully developed, a CBDA mathematical framework will enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches used within CBDA. We implemented the high-throughput CBDA method in pure R as well as within the graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics case study of Alzheimer's disease. The CBDA approach may be customized to provide a generic representation of complex multimodal datasets and stable scientific inference for large, incomplete, and multi-source datasets.
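
The core CBDA iteration described above (repeated case-and-feature subsampling, an interchangeable base learner, and accuracy-weighted aggregation of the sampled features) can be made concrete with a short sketch. Because the paper reports a pure-R implementation, the illustration below is written in R, but it is a minimal reconstruction under assumed defaults, not the authors' code: the data frame dat, the 0/1 outcome column y, the iteration count and sampling fractions, and the use of plain logistic regression in place of the paper's ensemble learners are all placeholders chosen for illustration.

# Illustrative CBDA-style loop (assumed interface, not the published package):
# repeatedly subsample cases (with replacement) and a small random subset of
# features, fit a base learner, score it on the unsampled cases, and accumulate
# accuracy-weighted feature counts to support controlled variable selection.
cbda_sketch <- function(dat, outcome = "y",
                        n_iter = 100,          # number of subsampling iterations
                        case_fraction = 0.6,   # fraction of cases drawn per iteration
                        feat_fraction = 0.05)  # fraction of features drawn per iteration
{
  features      <- setdiff(names(dat), outcome)
  feature_score <- setNames(numeric(length(features)), features)
  accuracy      <- numeric(n_iter)

  for (b in seq_len(n_iter)) {
    # Subsample cases with replacement and a random subset of features
    case_idx <- sample(nrow(dat), ceiling(case_fraction * nrow(dat)), replace = TRUE)
    feat_sub <- sample(features, max(2, ceiling(feat_fraction * length(features))))

    train   <- dat[case_idx, c(feat_sub, outcome)]
    holdout <- dat[-unique(case_idx), c(feat_sub, outcome)]   # cases not drawn this round

    # Any model-based or model-free learner could be plugged in here;
    # logistic regression keeps the sketch self-contained.
    fit  <- glm(reformulate(feat_sub, outcome), data = train, family = binomial())
    pred <- as.integer(predict(fit, newdata = holdout, type = "response") > 0.5)

    accuracy[b] <- mean(pred == holdout[[outcome]])
    feature_score[feat_sub] <- feature_score[feat_sub] + accuracy[b]
  }

  list(feature_ranking = sort(feature_score, decreasing = TRUE),
       mean_accuracy   = mean(accuracy))
}

# Example on simulated data: 200 cases, 500 features, only the first 3 informative.
set.seed(2018)
X   <- as.data.frame(matrix(rnorm(200 * 500), nrow = 200))
y   <- as.integer(X$V1 + X$V2 - X$V3 + rnorm(200) > 0)
res <- cbda_sketch(cbind(X, y = y))
head(res$feature_ranking)

In this sketch the base learner, the sampling fractions, and the aggregation rule are the tuning knobs; features that repeatedly improve out-of-sample accuracy rise in the ranking, which is the intuition behind the controlled variable selection and bootstrap-based reliability assessment described in the abstract.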


Bibliographic Details
Main Authors: Simeone Marino, Jiachen Xu, Yi Zhao, Nina Zhou, Yiwang Zhou, Ivo D Dinov
Format: Article
Language: English
Published: Public Library of Science (PLoS), 2018-01-01
Series: PLoS ONE
Online Access: http://europepmc.org/articles/PMC6116997?pdf=render
ISSN: 1932-6203
DOI: 10.1371/journal.pone.0202674
Citation: PLoS ONE 13(8): e0202674 (2018)