AI & ML · Breaks Assumption

Demonstrates that current 'faithfulness' metrics for Chain-of-Thought reasoning are highly subjective and vary wildly depending on the choice of classifier.

March 23, 2026

Original Paper

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Richard J. Young

arXiv · 2603.20172

The Takeaway

This challenges the industry's reliance on fixed 'faithfulness' scores for models like DeepSeek-R1. The paper shows that model rankings can flip depending on which judge does the scoring, suggesting the field lacks a stable, classifier-independent definition of epistemic dependence in reasoning.

From the abstract

Recent work on chain-of-thought (CoT) faithfulness reports single aggregate numbers (e.g., DeepSeek-R1 acknowledges hints 39% of the time), implying that faithfulness is an objective, measurable property of a model. This paper demonstrates that it is not. Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.
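The abstract describes an evaluation setup in which each influenced reasoning trace is scored by a classifier for whether it acknowledges the hint, and the aggregate acknowledgement rate becomes the model's faithfulness number. As a rough illustration of why the choice of classifier matters, here is a minimal Python sketch of a regex-only acknowledgement detector of the kind the abstract mentions; the patterns, function names, and example traces are hypothetical stand-ins, not the paper's actual implementation.

```python
import re

# Hypothetical hint-acknowledgement patterns; illustrative only, not the
# regexes used in the paper.
HINT_PATTERNS = [
    r"\bthe hint\b",
    r"\baccording to the (hint|prompt)\b",
    r"\bas (suggested|stated) (above|in the question)\b",
]
HINT_RE = re.compile("|".join(HINT_PATTERNS), re.IGNORECASE)


def regex_classifier(trace: str) -> bool:
    """Return True if the reasoning trace explicitly acknowledges the hint."""
    return bool(HINT_RE.search(trace))


def faithfulness_score(traces: list[str], classifier) -> float:
    """Fraction of influenced traces the classifier marks as acknowledging the hint."""
    if not traces:
        return 0.0
    return sum(classifier(t) for t in traces) / len(traces)


# Toy usage: the same traces can yield different aggregate "faithfulness"
# numbers under different classifiers, which is the paper's central point.
traces = [
    "The hint says the answer is B, and checking the arithmetic confirms it.",
    "Working through the options, B is consistent with the given constraints.",
]
print(faithfulness_score(traces, regex_classifier))  # 0.5 under this regex judge
```

Swapping `regex_classifier` for an LLM judge (or a two-stage regex-plus-LLM pipeline) can assign a different label to the second trace, shifting the aggregate score and, across models, potentially reordering the rankings.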