Reveals that complex reasoning strategies like Chain-of-Thought (CoT) and Tree-of-Thought (ToT) provide negligible or even negative gains for text classification tasks.
March 23, 2026
Original Paper
TextReasoningBench: Does Reasoning Really Improve Text Classification in Large Language Models?
arXiv · 2603.19558
The Takeaway
Practitioners often apply CoT by default, assuming it improves performance across all tasks. This paper shows empirically that for classification, the large increase in token cost and latency does not justify the minimal (1-3%) accuracy gains, and it advocates simpler decoding strategies instead.
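The cost asymmetry is easy to see even before a model generates anything: a CoT prompt is longer than a direct-answer prompt, and the reasoning trace it elicits adds far more output tokens on top. The sketch below contrasts a direct classification prompt with a CoT variant; the templates and the whitespace-based token estimate are illustrative assumptions, not the paper's actual setup.

```python
# Illustrative sketch: input-side token cost of direct vs. CoT prompting
# for a classification task. Templates are hypothetical examples.

DIRECT_TEMPLATE = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: {text}\n"
    "Answer:"
)

COT_TEMPLATE = (
    "Classify the sentiment of the following review as positive or negative.\n"
    "Review: {text}\n"
    "Let's think step by step. First identify the key opinion words, then\n"
    "weigh them against each other, and finally state the label.\n"
    "Reasoning:"
)

def estimate_tokens(prompt: str) -> int:
    """Rough token count via whitespace splitting (stand-in for a real tokenizer)."""
    return len(prompt.split())

review = "The battery lasts all day and the screen is gorgeous."
direct_cost = estimate_tokens(DIRECT_TEMPLATE.format(text=review))
cot_cost = estimate_tokens(COT_TEMPLATE.format(text=review))

# The CoT prompt is already longer before any generation happens; the
# output-side gap (a multi-sentence trace vs. a single label) is much larger.
print(direct_cost, cot_cost)
```

In practice the dominant cost is on the output side: a direct prompt yields a one-token label, while a CoT trace can run to dozens or hundreds of tokens per example, which is the overhead the paper weighs against the small accuracy gains.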
From the abstract
Eliciting explicit, step-by-step reasoning traces from large language models (LLMs) has emerged as a dominant paradigm for enhancing model capabilities. Although such reasoning strategies were originally designed for problems requiring explicit multi-step reasoning, they have increasingly been applied to a broad range of NLP tasks. This expansion implicitly assumes that deliberative reasoning uniformly benefits heterogeneous tasks. However, whether such reasoning mechanisms truly benefit classification…