AI safety training is basically just a fresh coat of paint that hides ugly biases without actually fixing them.
April 3, 2026
Original Paper
ImplicitBBQ: Benchmarking Implicit Bias in Large Language Models through Characteristic Based Cues
arXiv · 2604.01925
The Takeaway
While AI has learned to avoid saying obviously offensive things, its internal biases remain six times higher than we thought. Standard safety fixes fail to address these deep stereotypes, leaving a massive hidden gap in AI fairness.
From the abstract
Large Language Models increasingly suppress biased outputs when demographic identity is stated explicitly, yet may still exhibit implicit biases when identity is conveyed indirectly. Existing benchmarks use name based proxies to detect implicit biases, which carry weak associations with many social demographics and cannot extend to dimensions like age or socioeconomic status. We introduce ImplicitBBQ, a QA benchmark that evaluates implicit bias through characteristic based cues, culturally assoc