Abstract: In response to concerns about algorithmic bias, developers of AI models have converged on six or seven distinct families of fairness metrics. These numerical indicators attempt to quantify social biases in AI model performance, yet their historical and theoretical underpinnings remain underexplored. By examining where these metrics came from and why some have socially prevailed over others, I challenge their unquestioned role in evidence-based decision-making. In many cases, I argue, they function as sites of imagined moral consensus that prevent genuine moral discourse from taking place.
In practice, AI researchers conduct only cursory searches for highly cited papers, aiming to replicate and extend those papers' approaches while exceeding their results on the relevant measures. Once a fairness metric appears in publication, it sets not only a normative standard for what fairness is but also a technical benchmark for how fair current models are. The result is what I call a "valuation pipeline," in which AI researchers consistently use a handful of relatively standardized metrics to frame moral considerations for evidence-based decision-makers.
Unfortunately, this assemblage of cobbled-together fairness metrics does not add up to a morally coherent view of what justice requires. Different metrics disagree deeply about what justice is, saddling evidence-based decision-makers with what I call fairness tradeoffs: because these metrics typically cannot all be satisfied at once, choosing among them imposes a practical stance on what fair treatment requires, quietly making important moral choices for us. And since justice and fairness do not seem to be quantities in disguise, this metric-first approach obscures the very moral concerns it claims to "objectively" represent.
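To make the tradeoff concrete, here is a minimal sketch, in Python with invented confusion-matrix counts, of how two widely used metric families can reach opposite verdicts about the same classifier: demographic parity compares rates of positive decisions across groups, while equal opportunity compares true positive rates. The numbers below are hypothetical, chosen only to exhibit the conflict.

```python
# A minimal sketch with invented numbers: one classifier, two groups,
# two standard fairness metrics that deliver opposite verdicts.

def positive_rate(tp, fp, fn, tn):
    # Demographic parity compares this quantity across groups:
    # the share of each group that receives a positive decision.
    return (tp + fp) / (tp + fp + fn + tn)

def true_positive_rate(tp, fn):
    # Equal opportunity compares this quantity across groups:
    # the share of truly qualified members who are approved.
    return tp / (tp + fn)

# Hypothetical confusion-matrix counts for the same classifier.
# Group B has a lower base rate of qualification than group A.
groups = {
    "A": dict(tp=45, fp=5, fn=5, tn=45),    # 50 of 100 qualified
    "B": dict(tp=15, fp=35, fn=10, tn=40),  # 25 of 100 qualified
}

for name, g in groups.items():
    print(f"Group {name}: positive rate = {positive_rate(**g):.2f}, "
          f"TPR = {true_positive_rate(g['tp'], g['fn']):.2f}")

# Output:
#   Group A: positive rate = 0.50, TPR = 0.90
#   Group B: positive rate = 0.50, TPR = 0.60
#
# By demographic parity the classifier is perfectly fair (equal positive
# rates); by equal opportunity it is biased against group B (qualified
# members of B are approved far less often).
```

A decision-maker who reports only the first metric certifies this classifier as fair; one who reports the second condemns it. The choice between them is precisely the moral choice the metrics are supposed to settle.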
Keywords: AI, fairness, justice
Learning Objectives:
After participating in this conference, attendees should be able to:
Identify moral tensions between different families of fairness metrics commonly used in AI research.
Interrogate the role and purpose of fairness metrics in evidence-based decision-making.
Appreciate the need for shared moral discourse that engages AI developers and decision-makers all the way up the "valuation pipeline."