AUTOMATED GENDER CLASSIFICATION IN WIKIPEDIA BIOGRAPHIESa cross-lingual comparison

The written word plays an important role in the reinforcement of gender stereotypes, especially in texts of a more formal character. Wikipedia biographies have a lot of information about famous people, but do they describe men and women with different kinds of words? This thesis aims to evaluate and...

Full description

Bibliographic Details
Main Author: Weijand, Sasha
Format: Others
Language:English
Published: Umeå universitet, Institutionen för datavetenskap 2019
Subjects:
Online Access:http://urn.kb.se/resolve?urn=urn:nbn:se:umu:diva-163371
Description
Summary:The written word plays an important role in the reinforcement of gender stereotypes, especially in texts of a more formal character. Wikipedia biographies have a lot of information about famous people, but do they describe men and women with different kinds of words? This thesis aims to evaluate and explore a method for gender classification of text. In this study, two machine learning classifiers, Random Forest (RF) and Support Vector Machine (SVM), are applied to the gender classification of Wikipedia biographies in two languages, English and French. Their performance is evaluated and compared. The 500 most important words (features) are listed for each of the classifiers.A short review is given on the theoretic foundations of text classification, and a detailed description on how the datasets are built, what tools are used, and why. The datasets used are built from the first 5 paragraphs in each biography, with only nouns, verbs, adjectives and adverbs remaining. Feature ranking is also applied, where the top tenth of the features are kept.Performance is measured using the F0:5-score. The comparison shows that the RF and SVM classifiers' performance are close to each other, but that the classifiers perform worse on the French set than on the English. Initial performance scores range from 0.82 to 0.86, but they drop drastically when the most important features are removed from the set. A majority of the top most important features are nouns related to career and family roles, in both languages.The results show that there are indeed some semantic differences in language depending on the gender of the person described. Whether these depend on the writers' biased views, an unequal gender distribution of real world contexts, such as careers, or if these differences depend on how the datasets were built, is not clear.