Gender Shades

Inclusion Requires Intention
[Image: average faces of darker males, darker females, lighter males, and lighter females]

The Gender Shades project evaluates the accuracy of AI powered gender classification products.


This evaluation focuses on gender classification as a motivating example to show the need for increased transparency in the performance of any AI products and services that focus on human subjects. Bias in this context is defined as having practical differences in gender classification error rates between groups.

Test Subjects

1270 images were chosen to create a benchmark for this gender classification performance test.


The subjects were selected from 3 African countries and 3 European countries. The subjects were then grouped by gender, skin type, and the intersection of gender and skin type.

Dataset of 1270 Images
Gender Labels
Gender was broken into female and male categories since the evaluated products provide binary sex labels for their gender classification features. The evaluation inherits these sex labels and this reduced view of gender, which is in reality a more complex construct.
Skin Type Labels
The dermatologist-approved Fitzpatrick skin type classification system was used to label faces as Fitzpatrick Type I, II, III, IV, V, or VI.
Binary Skin Type
Faces labeled Fitzpatrick Types I, II, and III were then grouped into a lighter category, and faces labeled Fitzpatrick Types IV, V, and VI were grouped into a darker category.
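As a minimal sketch of this binning step (the function and category names below are illustrative assumptions, not taken from the project's code):

```python
# Illustrative sketch: map a Fitzpatrick skin type (1-6) to the binary
# category used in the evaluation. Types I-III -> lighter, IV-VI -> darker.
def binary_skin_type(fitzpatrick_type: int) -> str:
    if fitzpatrick_type not in range(1, 7):
        raise ValueError("Fitzpatrick type must be an integer from 1 to 6")
    return "lighter" if fitzpatrick_type <= 3 else "darker"

# Example: Type II is grouped as lighter, Type V as darker.
assert binary_skin_type(2) == "lighter"
assert binary_skin_type(5) == "darker"
```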
Companies
Three companies - IBM, Microsoft, and Face++ - that offer gender classification products were chosen for this evaluation based on geographic location and their use of artificial intelligence for computer vision.
Overall
While the companies appear to have relatively high accuracy overall, there are notable differences in the error rates between different groups. Let's explore.
Gender
All companies perform better on males than females with an 8.1% - 20.6% difference in error rates.
Skin Type
All companies perform better on lighter subjects as a whole than on darker subjects as a whole with an 11.8% - 19.2% difference in error rates.
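The error-rate gaps reported here follow directly from the accuracy tables at the end of this page: error rate is one minus accuracy, and the gap is the difference between the two groups' error rates. A quick check in code, using IBM's reported darker and lighter accuracies as the example:

```python
# Reproduce IBM's skin-type error-rate gap from the reported accuracies
# (values taken from the darker/lighter accuracy table below).
darker_accuracy = 0.776   # IBM, darker subjects
lighter_accuracy = 0.968  # IBM, lighter subjects

gap = (1 - darker_accuracy) - (1 - lighter_accuracy)
print(f"IBM darker vs. lighter error-rate gap: {gap:.1%}")  # 19.2%
```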
Intersection
When we analyze the results by intersectional subgroups - darker males, darker females, lighter males, lighter females - we see that all companies perform worst on darker females.
IBM and Microsoft perform best on lighter males. Face++ performs best on darker males. (See the table of subgroup error rates at the end of this page.)
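A minimal sketch of how per-group and per-subgroup accuracies like these can be computed from labeled predictions; the record format and field names are hypothetical placeholders, not the project's actual data schema:

```python
from collections import defaultdict

# Hypothetical record format: each entry pairs the benchmark's labels
# with one product's prediction for a single image.
records = [
    {"gender": "female", "skin": "darker", "predicted_gender": "male"},
    {"gender": "male", "skin": "lighter", "predicted_gender": "male"},
    # ... one entry per benchmark image
]

correct = defaultdict(int)
total = defaultdict(int)
for r in records:
    subgroup = (r["skin"], r["gender"])  # e.g. ("darker", "female")
    total[subgroup] += 1
    correct[subgroup] += int(r["predicted_gender"] == r["gender"])

for subgroup in sorted(total):
    accuracy = correct[subgroup] / total[subgroup]
    print(f"{subgroup}: accuracy {accuracy:.1%}, error rate {1 - accuracy:.1%}")
```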
IBM
IBM had the largest gap in accuracy, with a difference of 34.4% in error rate between lighter males and darker females.
IBM Watson leaders responded within a day after receiving the performance results and are reportedly making changes to the Watson Visual Recognition API. Official Statement.
Microsoft
Error analysis reveals 93.6% of faces misgendered by Microsoft were those of darker subjects.
An internal evaluation of the Azure Face API is reportedly being conducted by Microsoft. Official Statement. Statement to Lead Researcher.
Face++
Error analysis reveals 95.9% of the faces misgendered by Face++ were those of female subjects.
Face++ has yet to respond to the research results, which were sent to all companies on December 22, 2017.
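The error-analysis figures quoted above for Microsoft (93.6% of misgendered faces were darker subjects) and Face++ (95.9% were female subjects) describe the composition of the errors rather than an error rate. A sketch of that calculation, reusing the hypothetical record format from the earlier snippet:

```python
# Share of misgendered faces that belong to a given group
# (composition of the errors, not an error rate).
# Records are dicts with "gender", "skin", and "predicted_gender" keys,
# as in the hypothetical format sketched earlier.
def share_of_errors(records, attribute, value):
    errors = [r for r in records if r["predicted_gender"] != r["gender"]]
    if not errors:
        return 0.0
    return sum(r[attribute] == value for r in errors) / len(errors)

# e.g. share_of_errors(records, "skin", "darker") for the Microsoft analysis,
#      share_of_errors(records, "gender", "female") for the Face++ analysis.
```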
Blind Operations
At the time of evaluation, none of the companies tested reported how well their computer vision products perform across gender, skin type, ethnicity, age, or other attributes.

Inclusive product testing and reporting are necessary if the industry is to create systems that work well for all of humanity. However, accuracy is not the only issue. Flawless facial analysis technology can be abused in the hands of authoritarian governments, personal adversaries, and predatory companies. Ongoing oversight and context limitations are needed.
While this study focused on gender classification, the machine learning techniques used to determine gender are also broadly applied to many other areas of facial analysis and automation. Face recognition technology that has not been publicly tested for demographic accuracy is increasingly used by law enforcement and at airports. AI-fueled automation now helps determine who is fired, hired, promoted, granted a loan or insurance, and even how long someone spends in prison.

For interested readers, authors Cathy O'Neil and Virginia Eubanks explore the real-world impact of algorithmic bias in Weapons of Math Destruction and Automating Inequality.
Harms
[Image: chart of algorithmic harms]
Automated systems are not inherently neutral. They reflect the priorities, preferences, and prejudices - the coded gaze - of those who have the power to mold artificial intelligence.
We risk losing the gains made with the civil rights movement and women's movement under the false assumption of machine neutrality. We must demand increased transparency and accountability.
Next Steps
Learn more about the coded gaze - algorithmic bias - at www.ajlunited.org
Dive Deeper:
Gender Shades Academic Paper

Test Inclusively:
Request external performance test
Request Pilot Parliaments Benchmark
Gender Classifier   Overall Accuracy on All Subjects in Pilot Parliaments Benchmark (2017)
Microsoft           93.7%
Face++              90.0%
IBM                 87.9%
Gender Classifier   Female Subjects Accuracy   Male Subjects Accuracy   Error Rate Diff.
Microsoft           89.3%                      97.4%                    8.1%
Face++              78.7%                      99.3%                    20.6%
IBM                 79.7%                      94.4%                    14.7%
Gender Classifier   Darker Subjects Accuracy   Lighter Subjects Accuracy   Error Rate Diff.
Microsoft           87.1%                      99.3%                       12.2%
Face++              83.5%                      95.3%                       11.8%
IBM                 77.6%                      96.8%                       19.2%
Gender Classifier   Darker Male   Darker Female   Lighter Male   Lighter Female   Largest Gap
Microsoft           94.0%         79.2%           100%           98.3%            20.8%
Face++              99.3%         65.5%           99.2%          94.0%            33.8%
IBM                 88.0%         65.3%           99.7%          92.9%            34.4%