Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data

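The distance-based linear model (DB-LM) extends the classical linear regression to the framework of mixed-type predictors or when the only available information is a distance matrix between regressors (as it sometimes happens with big data). The main drawback of these DB methods is their computational cost, particularly due to the eigendecomposition of the Gram matrix. In this context, ensemble regression techniques provide a useful alternative to fitting the model to the whole sample. This work analyzes the performance of three subsampling and aggregation techniques in DB regression on two specific large, real datasets. We also analyze, via simulations, the performance of bagging and DB logistic regression in the classification problem with mixed-type features and large sample sizes.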
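The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the subsample size, the number of retained principal coordinates, and the choice of squared Gower distance as one minus Gower similarity are all assumptions made for illustration. Each small subsample gets its own Gram matrix and eigendecomposition, new points are placed with Gower's interpolation formula, and the per-subsample predictions are averaged.

```python
import numpy as np

def gower_sq_dist(num_a, cat_a, num_b, cat_b, ranges):
    """Squared Gower distance, taken here as 1 - Gower similarity (assumed convention)."""
    num_part = (np.abs(num_a[:, None, :] - num_b[None, :, :]) / ranges).sum(axis=2)
    cat_part = (cat_a[:, None, :] != cat_b[None, :, :]).sum(axis=2)
    return (num_part + cat_part) / (num_a.shape[1] + cat_a.shape[1])

def fit_db_lm(d2, y, n_components):
    """Fit a distance-based linear model from an (n, n) squared-distance matrix."""
    n = len(y)
    centering = np.eye(n) - 1.0 / n            # H = I - 11'/n
    gram = -0.5 * centering @ d2 @ centering   # doubly centered Gram matrix
    eigval, eigvec = np.linalg.eigh(gram)      # the costly step for large n
    top = np.argsort(eigval)[::-1][:n_components]
    top = top[eigval[top] > 1e-10 * eigval.max()]  # keep positive eigenvalues only
    lam = eigval[top]
    coords = eigvec[:, top] * np.sqrt(lam)     # principal coordinates
    # coords.T @ coords = diag(lam), so least squares reduces to a division
    beta = coords.T @ (y - y.mean()) / lam
    return {"diag": np.diag(gram), "coords": coords, "lam": lam,
            "beta": beta, "ybar": y.mean()}

def predict_db_lm(model, d2_new):
    """Predict from (n_new, n) squared distances to the training points."""
    # Gower's interpolation formula for the coordinates of new points
    x_new = 0.5 * (model["diag"] - d2_new) @ model["coords"] / model["lam"]
    return model["ybar"] + x_new @ model["beta"]

def subsample_aggregate_predict(num, cat, y, num_new, cat_new, n_models=10,
                                subsample_size=150, n_components=5, seed=0):
    """Average DB-LM predictions over random subsamples (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    ranges = num.max(axis=0) - num.min(axis=0)  # assumes no constant numeric column
    preds = []
    for _ in range(n_models):
        idx = rng.choice(len(y), size=subsample_size, replace=False)
        d2 = gower_sq_dist(num[idx], cat[idx], num[idx], cat[idx], ranges)
        model = fit_db_lm(d2, y[idx], n_components)
        d2_new = gower_sq_dist(num_new, cat_new, num[idx], cat[idx], ranges)
        preds.append(predict_db_lm(model, d2_new))
    return np.mean(preds, axis=0)

# Synthetic mixed-type data: two numeric predictors and one categorical predictor
rng = np.random.default_rng(42)
n = 1000
num = rng.uniform(size=(n, 2))
cat = rng.integers(0, 3, size=(n, 1))
y = (2.0 * num[:, 0] - num[:, 1]
     + np.array([0.0, 1.0, -1.0])[cat[:, 0]]
     + rng.normal(scale=0.1, size=n))
train, test = slice(0, 900), slice(900, n)
y_hat = subsample_aggregate_predict(num[train], cat[train], y[train],
                                    num[test], cat[test])
mse_model = np.mean((y[test] - y_hat) ** 2)
mse_mean = np.mean((y[test] - y[train].mean()) ** 2)
```

Each eigendecomposition here acts on a 150 x 150 matrix rather than on the full 900 x 900 Gram matrix, which is the source of the computational saving; the trade-off between subsample size and number of subsamples is a tuning choice that this sketch does not explore.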

Bibliographic Details
Main Authors: Amparo Baíllo, Aurea Grané
Format: Article
Language: English
Published: MDPI AG 2021-09-01
Series: Mathematics
Subjects: big data; classification; dissimilarities; ensemble; generalized linear model; Gower’s metric
Online Access: https://www.mdpi.com/2227-7390/9/18/2247
id doaj-cff5f5416c35451f952b496195b0e842
record_format Article
spelling doaj-cff5f5416c35451f952b496195b0e842 | 2021-09-26T00:38:12Z | eng | MDPI AG | Mathematics | 2227-7390 | 2021-09-01 | 9 | 2247 | 2247 | 10.3390/math9182247 | Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data | Amparo Baíllo (0); Aurea Grané (1) | (0) Departamento de Matemáticas, Universidad Autónoma de Madrid, 28049 Madrid, Spain; (1) Statistics Department, Universidad Carlos III de Madrid, 28903 Getafe, Spain | The distance-based linear model (DB-LM) extends the classical linear regression to the framework of mixed-type predictors or when the only available information is a distance matrix between regressors (as it sometimes happens with big data). The main drawback of these DB methods is their computational cost, particularly due to the eigendecomposition of the Gram matrix. In this context, ensemble regression techniques provide a useful alternative to fitting the model to the whole sample. This work analyzes the performance of three subsampling and aggregation techniques in DB regression on two specific large, real datasets. We also analyze, via simulations, the performance of bagging and DB logistic regression in the classification problem with mixed-type features and large sample sizes. | https://www.mdpi.com/2227-7390/9/18/2247 | big data; classification; dissimilarities; ensemble; generalized linear model; Gower’s metric
collection DOAJ
language English
format Article
sources DOAJ
author Amparo Baíllo
Aurea Grané
spellingShingle Amparo Baíllo
Aurea Grané
Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data
Mathematics
big data
classification
dissimilarities
ensemble
generalized linear model
Gower’s metric
author_facet Amparo Baíllo
Aurea Grané
author_sort Amparo Baíllo
title Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data
title_sort subsampling and aggregation: a solution to the scalability problem in distance-based prediction for mixed-type data
publisher MDPI AG
series Mathematics
issn 2227-7390
publishDate 2021-09-01
description The distance-based linear model (DB-LM) extends the classical linear regression to the framework of mixed-type predictors or when the only available information is a distance matrix between regressors (as it sometimes happens with big data). The main drawback of these DB methods is their computational cost, particularly due to the eigendecomposition of the Gram matrix. In this context, ensemble regression techniques provide a useful alternative to fitting the model to the whole sample. This work analyzes the performance of three subsampling and aggregation techniques in DB regression on two specific large, real datasets. We also analyze, via simulations, the performance of bagging and DB logistic regression in the classification problem with mixed-type features and large sample sizes.
topic big data
classification
dissimilarities
ensemble
generalized linear model
Gower’s metric
url https://www.mdpi.com/2227-7390/9/18/2247