Toward Identifying Features for Automatic Gender Detection: A Corpus Creation and Analysis

The current paper aims to construct an inventory of stylometric and psychometric features for the automatic identification of the author's gender. These features are derived from an analysis of a manually developed Saudi Dialect Twitter Corpus (SDTwittC), consisting of four million words. Given...


Bibliographic Details
Main Author: Saad Awadh Alanazi
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: Automatic gender detection; feature extraction; Saudi dialects
Online Access: https://ieeexplore.ieee.org/document/8781684/
id doaj-d2286e4bf1094f4e88d73a99c7cd4674
record_format Article
spelling doaj-d2286e4bf1094f4e88d73a99c7cd4674 | 2021-04-05T17:22:02Z | eng | IEEE | IEEE Access | 2169-3536 | 2019-01-01 | 7 | 111931 | 111943 | 10.1109/ACCESS.2019.2932026 | 8781684 | Toward Identifying Features for Automatic Gender Detection: A Corpus Creation and Analysis | Saad Awadh Alanazi (https://orcid.org/0000-0002-1714-1948), Department of Computer Science, College of Computer and Information Sciences, Jouf University, Sakakah, Saudi Arabia | https://ieeexplore.ieee.org/document/8781684/ | Automatic gender detection | feature extraction | Saudi dialects
collection DOAJ
language English
format Article
sources DOAJ
author Saad Awadh Alanazi
spellingShingle Saad Awadh Alanazi
Toward Identifying Features for Automatic Gender Detection: A Corpus Creation and Analysis
IEEE Access
Automatic gender detection
feature extraction
Saudi dialects
author_facet Saad Awadh Alanazi
author_sort Saad Awadh Alanazi
title Toward Identifying Features for Automatic Gender Detection: A Corpus Creation and Analysis
title_sort toward identifying features for automatic gender detection: a corpus creation and analysis
publisher IEEE
series IEEE Access
issn 2169-3536
publishDate 2019-01-01
description The current paper aims to construct an inventory of stylometric and psychometric features for the automatic identification of an author's gender. These features are derived from an analysis of a manually developed Saudi Dialect Twitter Corpus (SDTwittC) consisting of four million words. Because the study seeks to provide machine learning algorithms with an accurate set of features for solving the gender identification problem, word-based, character-based, syntactic, and function-word features are all considered during the selection stage. The word-based features constitute the largest category and represent possible gender discriminators from sociological, psychological, and lexical perspectives. The results show that Saudi males use styles that distinguish them from their female counterparts in terms of politeness (greeting, thanking, apology, congratulation, encouragement, best wishes, etc.), impoliteness (profanity and sarcasm), and the use of intensifiers, hedges, color, emotion, reason, and emoji, among many others.
topic Automatic gender detection
feature extraction
Saudi dialects
url https://ieeexplore.ieee.org/document/8781684/
work_keys_str_mv AT saadawadhalanazi towardidentifyingfeaturesforautomaticgenderdetectionacorpuscreationandanalysis
_version_ 1721539845845680128