Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data
The distance-based linear model (DB-LM) extends the classical linear regression to the framework of mixed-type predictors or when the only available information is a distance matrix between regressors (as it sometimes happens with big data). The main drawback of these DB methods is their computation...
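Main Authors: | Amparo Baíllo, Aurea Grané |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG 2021-09-01 |
Series: | Mathematics |
Subjects: | big data, classification, dissimilarities, ensemble, generalized linear model, Gower’s metric |
Online Access: | https://www.mdpi.com/2227-7390/9/18/2247 |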
id |
doaj-cff5f5416c35451f952b496195b0e842 |
---|---|
record_format |
Article |
spelling |
doaj-cff5f5416c35451f952b496195b0e842; 2021-09-26T00:38:12Z; eng; MDPI AG; Mathematics; 2227-7390; 2021-09-01; 9; 2247; 2247; 10.3390/math9182247; Amparo Baíllo (Departamento de Matemáticas, Universidad Autónoma de Madrid, 28049 Madrid, Spain); Aurea Grané (Statistics Department, Universidad Carlos III de Madrid, 28903 Getafe, Spain) |
collection |
DOAJ |
language |
English |
format |
Article |
sources |
DOAJ |
author |
Amparo Baíllo; Aurea Grané |
title |
Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data |
publisher |
MDPI AG |
series |
Mathematics |
issn |
2227-7390 |
publishDate |
2021-09-01 |
description |
The distance-based linear model (DB-LM) extends the classical linear regression to the framework of mixed-type predictors or when the only available information is a distance matrix between regressors (as it sometimes happens with big data). The main drawback of these DB methods is their computational cost, particularly due to the eigendecomposition of the Gram matrix. In this context, ensemble regression techniques provide a useful alternative to fitting the model to the whole sample. This work analyzes the performance of three subsampling and aggregation techniques in DB regression on two specific large, real datasets. We also analyze, via simulations, the performance of bagging and DB logistic regression in the classification problem with mixed-type features and large sample sizes. |
topic |
big data; classification; dissimilarities; ensemble; generalized linear model; Gower’s metric |
url |
https://www.mdpi.com/2227-7390/9/18/2247 |
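The description above names the main ingredients of the approach: a distance between mixed-type observations, an eigendecomposition of the Gram matrix, and subsampling with aggregation to avoid fitting on the whole sample. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. All function names and parameters (`gower_sq_dist`, `db_fit`, `db_predict`, `subsample_aggregate`, `n_sub`, `n_rounds`, `k`) are hypothetical. It assumes the Gower dissimilarity is used as a squared distance, that new points are embedded with Gower's add-a-point formula, and that predictions are aggregated by a plain average; the record does not say which three aggregation techniques the article compares.

```python
import numpy as np

def gower_sq_dist(num_a, cat_a, num_b, cat_b, ranges):
    """Gower dissimilarity between rows of A and rows of B, used here as a
    squared distance (assumption). Numeric columns contribute range-scaled
    absolute differences, categorical columns contribute 0/1 mismatches."""
    d_num = np.abs(num_a[:, None, :] - num_b[None, :, :]) / ranges
    d_cat = (cat_a[:, None, :] != cat_b[None, :, :]).astype(float)
    return np.concatenate([d_num, d_cat], axis=2).mean(axis=2)

def db_fit(D2, y, k):
    """Regress y on the k leading principal coordinates obtained from the
    doubly centered Gram matrix G = -1/2 * H * D2 * H."""
    n = len(y)
    H = np.eye(n) - np.ones((n, n)) / n
    G = -0.5 * H @ D2 @ H
    vals, vecs = np.linalg.eigh(G)            # eigenvalues in ascending order
    top = np.argsort(vals)[::-1][:k]
    vals, vecs = vals[top], vecs[:, top]
    keep = vals > 1e-10                       # drop numerically null directions
    vals, vecs = vals[keep], vecs[:, keep]
    X = vecs * np.sqrt(vals)                  # principal coordinates, X.T @ X = diag(vals)
    beta = (X.T @ (y - y.mean())) / vals      # least squares with orthogonal columns
    return {"X": X, "vals": vals, "g": np.diag(G), "beta": beta, "ybar": y.mean()}

def db_predict(model, D2_new):
    """Embed new points from their squared distances to the training points
    (rows of D2_new) with Gower's add-a-point formula, then predict."""
    X_new = 0.5 * (model["g"][None, :] - D2_new) @ model["X"] / model["vals"]
    return model["ybar"] + X_new @ model["beta"]

def subsample_aggregate(num, cat, y, num_new, cat_new, n_sub, n_rounds, k, rng):
    """Fit the distance-based model on n_rounds random subsamples of size
    n_sub and average the predictions (plain averaging is an assumption)."""
    ranges = num.max(axis=0) - num.min(axis=0)   # assumes no constant numeric column
    preds = np.zeros((n_rounds, len(num_new)))
    for b in range(n_rounds):
        idx = rng.choice(len(y), size=n_sub, replace=False)
        D2 = gower_sq_dist(num[idx], cat[idx], num[idx], cat[idx], ranges)
        model = db_fit(D2, y[idx], k)
        D2_new = gower_sq_dist(num_new, cat_new, num[idx], cat[idx], ranges)
        preds[b] = db_predict(model, D2_new)
    return preds.mean(axis=0)
```

Each round only eigendecomposes an `n_sub` by `n_sub` Gram matrix, so the cost grows with the subsample size and the number of rounds rather than with the cube of the full sample size, which is the motivation the description gives for ensemble techniques.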