Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers

Bibliographic Details
Main Author: Buolamwini, Joy Adowaa
Other Authors: Ethan Zuckerman.
Format: Others
Language: English
Published: Massachusetts Institute of Technology 2018
Subjects: Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Online Access: http://hdl.handle.net/1721.1/114068
id ndltd-MIT-oai-dspace.mit.edu-1721.1-114068
record_format oai_dc
spelling ndltd-MIT-oai-dspace.mit.edu-1721.1-114068 2019-05-02T15:42:55Z Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers Buolamwini, Joy Adowaa Ethan Zuckerman. Program in Media Arts and Sciences (Massachusetts Institute of Technology) Thesis: S.M., Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2017. Cataloged from PDF version of thesis. Includes bibliographical references (pages 103-116). This thesis (1) characterizes the gender and skin type distribution of IJB-A, a government facial recognition benchmark, and Adience, a gender classification benchmark, (2) outlines an approach for capturing images with more diverse skin types, which is then applied to develop the Pilot Parliaments Benchmark (PPB), and (3) uses PPB to assess the classification accuracy of Adience, IBM, Microsoft, and Face++ gender classifiers with respect to gender, skin type, and the intersection of skin type and gender. The datasets evaluated are overwhelmingly lighter skinned: 79.6% - 86.24%. IJB-A includes only 24.6% female and 4.4% darker female subjects, and features 59.4% lighter males. By construction, Adience achieves rough gender parity at 52.0% female but has only 13.76% darker skin. The parliaments-based method for creating a more skin-type-balanced benchmark resulted in a dataset that is 44.39% female and 47% darker skin. An evaluation of four gender classifiers revealed that a significant gap exists when comparing gender classification accuracies of females vs. males (9 - 20%) and darker skin vs. lighter skin (10 - 21%). Lighter males were in general the best classified group, and darker females were the worst classified group. 37% - 83% of classification errors resulted from the misclassification of darker females. Lighter males contributed the least to overall classification error (0.4% - 3%). For the best performing classifier, darker females were 32 times more likely to be misclassified than lighter males. To increase the accuracy of these systems, more phenotypically diverse datasets need to be developed. Benchmark performance metrics need to be disaggregated not just by gender or skin type but by the intersection of gender and skin type. At a minimum, human-focused computer vision models should report accuracy on four subgroups: darker females, lighter females, darker males, and lighter males. The thesis concludes with a discussion of the implications of misclassification and the importance of building inclusive training sets and benchmarks. by Joy Adowaa Buolamwini. S.M. 2018-03-12T19:28:30Z 2017 Thesis http://hdl.handle.net/1721.1/114068 1026503582 eng MIT theses are protected by copyright. They may be viewed, downloaded, or printed from this source but further reproduction or distribution in any format is prohibited without written permission. http://dspace.mit.edu/handle/1721.1/7582 116 pages application/pdf Massachusetts Institute of Technology
collection NDLTD
language English
format Others
sources NDLTD
topic Program in Media Arts and Sciences (Massachusetts Institute of Technology)
spellingShingle Program in Media Arts and Sciences (Massachusetts Institute of Technology)
Buolamwini, Joy Adowaa
Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers
description Thesis: S.M., Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2017. === Cataloged from PDF version of thesis. === Includes bibliographical references (pages 103-116). === This thesis (1) characterizes the gender and skin type distribution of IJB-A, a government facial recognition benchmark, and Adience, a gender classification benchmark, (2) outlines an approach for capturing images with more diverse skin types, which is then applied to develop the Pilot Parliaments Benchmark (PPB), and (3) uses PPB to assess the classification accuracy of Adience, IBM, Microsoft, and Face++ gender classifiers with respect to gender, skin type, and the intersection of skin type and gender. The datasets evaluated are overwhelmingly lighter skinned: 79.6% - 86.24%. IJB-A includes only 24.6% female and 4.4% darker female subjects, and features 59.4% lighter males. By construction, Adience achieves rough gender parity at 52.0% female but has only 13.76% darker skin. The parliaments-based method for creating a more skin-type-balanced benchmark resulted in a dataset that is 44.39% female and 47% darker skin. An evaluation of four gender classifiers revealed that a significant gap exists when comparing gender classification accuracies of females vs. males (9 - 20%) and darker skin vs. lighter skin (10 - 21%). Lighter males were in general the best classified group, and darker females were the worst classified group. 37% - 83% of classification errors resulted from the misclassification of darker females. Lighter males contributed the least to overall classification error (0.4% - 3%). For the best performing classifier, darker females were 32 times more likely to be misclassified than lighter males. To increase the accuracy of these systems, more phenotypically diverse datasets need to be developed. Benchmark performance metrics need to be disaggregated not just by gender or skin type but by the intersection of gender and skin type. At a minimum, human-focused computer vision models should report accuracy on four subgroups: darker females, lighter females, darker males, and lighter males. The thesis concludes with a discussion of the implications of misclassification and the importance of building inclusive training sets and benchmarks. === by Joy Adowaa Buolamwini. === S.M.
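
The intersectional reporting the abstract calls for can be illustrated with a minimal sketch. The Python example below is not from the thesis; the field names and toy records are hypothetical. It computes classification accuracy disaggregated by the four gender and skin-type subgroups the thesis recommends reporting: darker females, lighter females, darker males, and lighter males.

# Minimal sketch (hypothetical field names, not the thesis's code):
# accuracy disaggregated by (skin type, gender) subgroup.
from collections import defaultdict

def subgroup_accuracies(records):
    """records: iterable of dicts with 'true_gender', 'pred_gender',
    and 'skin_type' ('darker' or 'lighter').
    Returns accuracy keyed by (skin_type, true_gender)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        key = (r["skin_type"], r["true_gender"])  # e.g. ('darker', 'female')
        total[key] += 1
        correct[key] += int(r["pred_gender"] == r["true_gender"])
    return {key: correct[key] / total[key] for key in total}

if __name__ == "__main__":
    # Hypothetical toy data: one record per subgroup, with the darker
    # female record misclassified.
    demo = [
        {"true_gender": "female", "pred_gender": "male",   "skin_type": "darker"},
        {"true_gender": "female", "pred_gender": "female", "skin_type": "lighter"},
        {"true_gender": "male",   "pred_gender": "male",   "skin_type": "darker"},
        {"true_gender": "male",   "pred_gender": "male",   "skin_type": "lighter"},
    ]
    for (skin, gender), acc in sorted(subgroup_accuracies(demo).items()):
        print(f"{skin} {gender}: {acc:.2%}")

Reporting these four numbers side by side, rather than a single aggregate accuracy or separate gender-only and skin-type-only breakdowns, is what exposes the darker-female versus lighter-male gap the abstract describes.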
author2 Ethan Zuckerman.
author_facet Ethan Zuckerman.
Buolamwini, Joy Adowaa
author Buolamwini, Joy Adowaa
author_sort Buolamwini, Joy Adowaa
title Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers
title_short Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers
title_full Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers
title_fullStr Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers
title_full_unstemmed Gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers
title_sort gender shades : intersectional phenotypic and demographic evaluation of face datasets and gender classifiers
publisher Massachusetts Institute of Technology
publishDate 2018
url http://hdl.handle.net/1721.1/114068
work_keys_str_mv AT buolamwinijoyadowaa gendershadesintersectionalphenotypicanddemographicevaluationoffacedatasetsandgenderclassifiers
_version_ 1719026457509888000