AI & ML Efficiency Breakthrough

Improves Qwen2.5-7B performance on AIME2024 by 137% through test-time iterative rethinking and majority-voted pseudo-labels.

April 2, 2026

Original Paper

TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning

Wenxuan Jiang, Yuxin Zuo, Zijian Zhang, Xuecheng Wu, Zining Fan, Wenxuan Liu, Li Chen, Xiaoyu Li, Xuezhi Cao, Xiaolong Jin, Ninghao Liu

arXiv · 2604.00438

The Takeaway

TR-ICRL lets a model learn from an unlabeled test set during inference by generating its own feedback, using majority-voted pseudo-labels in place of ground-truth rewards. This substantially boosts reasoning performance without requiring ground-truth labels or heavy fine-tuning.
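The core idea of majority-voted pseudo-labels can be sketched as follows. This is an illustrative simplification, not the paper's actual implementation: the function names, the binary reward, and the eight-sample setup are assumptions for the example.

```python
from collections import Counter

def majority_vote_pseudo_label(answers):
    """Pick the most common answer among sampled completions as a pseudo-label.

    Returns the winning answer and its vote share, which can serve as a
    confidence estimate (an assumption of this sketch, not the paper's metric).
    """
    counts = Counter(answers)
    label, votes = counts.most_common(1)[0]
    confidence = votes / len(answers)
    return label, confidence

def pseudo_reward(answer, pseudo_label):
    """Binary reward: 1.0 if the answer agrees with the pseudo-label, else 0.0."""
    return 1.0 if answer == pseudo_label else 0.0

# Example: eight sampled answers to one unlabeled test question.
samples = ["42", "42", "17", "42", "42", "9", "42", "17"]
label, conf = majority_vote_pseudo_label(samples)      # "42", 0.625
rewards = [pseudo_reward(a, label) for a in samples]    # self-generated feedback
```

With no ground truth available, the model's own agreement across samples stands in for the reward signal, which the rethinking loop can then use to refine subsequent answers in context.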

From the abstract

In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truths during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. TR-ICRL operates by first retri…