This study presents causal evidence that reasoning traces (Chain-of-Thought) shape model behavior and generalization, even when the final answer is held constant.
arXiv · March 16, 2026 · 2603.12397
Why it matters
It refutes the idea that CoT is merely post-hoc rationalization. It shows that training on reasoning alone is sufficient to alter model behavior, implying that supervising only the final answer is insufficient for safety and alignment.
From the abstract
Chain-of-Thought (CoT) is often viewed as a window into LLM decision-making, yet recent work suggests it may function merely as post-hoc rationalization. This raises a critical alignment question: Does the reasoning trace causally shape model generalization independent of the final answer? To isolate reasoning's causal effect, we design a controlled experiment holding final harmful answers constant while varying reasoning paths. We construct datasets with "Evil" reasoning embracing malice…