Coming to Grips with Age Prediction on Imbalanced Multimodal Community Question Answering Data

For almost every online service, it is fundamental to understand patterns, differences and trends revealed by age demographic analysis—for example, take the discovery of malicious activity, including identity theft, violation of community guidelines and fake profiles. In the particular case of platf...


Bibliographic Details
Main Authors: Alejandro Figueroa, Billy Peralta, Orietta Nicolis
Format: Article
Language: English
Published: MDPI AG 2021-01-01
Series: Information
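Subjects: community question answering, user demographics, imbalanced data, multimodal data, age prediction, supervised learning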
Online Access: https://www.mdpi.com/2078-2489/12/2/48
id doaj-62adbd7a504b4f7cb0c5c9bc57c4a01b
record_format Article
spelling doaj-62adbd7a504b4f7cb0c5c9bc57c4a01b
2021-01-22T00:04:30Z
eng
MDPI AG
Information
2078-2489
2021-01-01
12
48
48
10.3390/info12020048
Alejandro Figueroa (0)
Billy Peralta (1)
Orietta Nicolis (2)
Departamento de Ciencias de la Ingeniería, Facultad de Ingeniería, Universidad Andres Bello, Antonio Varas 880, 8370146 Santiago, Chile (all three authors)
collection DOAJ
language English
format Article
sources DOAJ
author Alejandro Figueroa
Billy Peralta
Orietta Nicolis
title Coming to Grips with Age Prediction on Imbalanced Multimodal Community Question Answering Data
publisher MDPI AG
series Information
issn 2078-2489
publishDate 2021-01-01
description For almost every online service, it is fundamental to understand the patterns, differences and trends revealed by age demographic analysis, for example when uncovering malicious activity such as identity theft, violations of community guidelines and fake profiles. For platforms such as Facebook, Twitter and Yahoo! Answers in particular, user demographics affect revenue and user experience; demographics help ensure that the needs of each cohort are met by personalizing and contextualizing content. Although technology has become more accessible and ever more prevalent in both personal and professional life, older people continue to trail Gen Z and Millennials in its adoption. This lag leads to an under-representation that harms both demographic analysis and supervised machine learning models. To that end, this paper pioneers an examination of this and other major challenges facing three distinct modalities of community question answering (cQA) platforms (i.e., texts, images and metadata). For textual inputs, we propose an age-batched greedy curriculum learning (AGCL) approach to lessen the effects of their inherent class imbalances. When built on top of FastText shallow neural networks, AGCL achieved an increase of ca. 4% in macro-F1-score with respect to baseline systems (i.e., off-the-shelf deep neural networks). With regard to metadata, our experiments show that random forest classifiers perform significantly better when individuals close to generational borders are excluded (up to 20% more accuracy). By experimenting with neural network-based visual classifiers, we found that images are the most challenging modality for age prediction: it is hard to connect profile pictures with age cohorts by visual inspection, and their group distributions differ considerably from those of metadata and textual inputs. All in all, we envisage that our findings will be highly relevant as guidelines for constructing assorted multimodal supervised models for automatic age recognition across cQA platforms.
topic community question answering
user demographics
imbalanced data
multimodal data
age prediction
supervised learning
url https://www.mdpi.com/2078-2489/12/2/48