AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison

The written word plays an important role in the reinforcement of gender stereotypes, especially in texts of a more formal character. Wikipedia biographies have a lot of information about famous people, but do they describe men and women with different kinds of words? This thesis aims to evaluate and...

Bibliographic Details
Main Author: Weijand, Sasha
Format: Others
Language: English
Published: Umeå universitet, Institutionen för datavetenskap 2019
Subjects: Engineering and Technology; Teknik och teknologier
Online Access: http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163371
id ndltd-UPSALLA1-oai-DiVA.org-umu-163371
record_format oai_dc
spelling ndltd-UPSALLA1-oai-DiVA.org-umu-163371
2019-09-18T04:37:34Z
AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
eng
Weijand, Sasha
Umeå universitet, Institutionen för datavetenskap
2019
Engineering and Technology
Teknik och teknologier
Student thesis
info:eu-repo/semantics/bachelorThesis
text
http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163371
UMNAD ; 1191
application/pdf
info:eu-repo/semantics/openAccess
collection NDLTD
language English
format Others
sources NDLTD
topic Engineering and Technology
Teknik och teknologier
spellingShingle Engineering and Technology
Teknik och teknologier
Weijand, Sasha
AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
description The written word plays an important role in the reinforcement of gender stereotypes, especially in texts of a more formal character. Wikipedia biographies have a lot of information about famous people, but do they describe men and women with different kinds of words? This thesis aims to evaluate and explore a method for gender classification of text. In this study, two machine learning classifiers, Random Forest (RF) and Support Vector Machine (SVM), are applied to the gender classification of Wikipedia biographies in two languages, English and French. Their performance is evaluated and compared. The 500 most important words (features) are listed for each of the classifiers. A short review is given of the theoretical foundations of text classification, along with a detailed description of how the datasets are built, what tools are used, and why. The datasets are built from the first 5 paragraphs of each biography, with only nouns, verbs, adjectives and adverbs remaining. Feature ranking is also applied, where the top tenth of the features are kept. Performance is measured using the F0.5-score. The comparison shows that the performance of the RF and SVM classifiers is close, but that the classifiers perform worse on the French set than on the English. Initial performance scores range from 0.82 to 0.86, but they drop drastically when the most important features are removed from the set. A majority of the most important features are nouns related to career and family roles, in both languages. The results show that there are indeed some semantic differences in language depending on the gender of the person described. Whether these differences stem from the writers' biased views, from an unequal gender distribution in real-world contexts such as careers, or from how the datasets were built is not clear.
author Weijand, Sasha
author_facet Weijand, Sasha
author_sort Weijand, Sasha
title AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
title_short AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
title_full AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
title_fullStr AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
title_full_unstemmed AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIES: a cross-lingual comparison
title_sort automated gender classification in wikipedia biographies a cross-lingual comparison
publisher Umeå universitet, Institutionen för datavetenskap
publishDate 2019
url http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163371
work_keys_str_mv AT weijandsasha automatedgenderclassificationinwikipediabiographiesacrosslingualcomparison
_version_ 1719251675136393216
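
The description above reports performance as an F0.5-score, a variant of the F-measure that weights precision more heavily than recall. As a rough illustration only, the score can be computed from confusion-matrix counts as in the sketch below. The function name `f_beta` and the counts are hypothetical and do not come from the thesis, whose own implementation is not shown in this record.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Return the F-beta score from true positive, false positive
    and false negative counts. beta < 1 favours precision."""
    if tp == 0:
        # No correct positive predictions: define the score as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts, chosen only to exercise the formula.
score_f05 = f_beta(tp=80, fp=10, fn=30)           # precision-weighted
score_f1 = f_beta(tp=80, fp=10, fn=30, beta=1.0)  # ordinary F1, for comparison
print(round(score_f05, 3), round(score_f1, 3))
```

With these counts precision is higher than recall, so the F0.5 value comes out above the F1 value, which is the behaviour the precision weighting is meant to produce.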