Improves Qwen2.5-7B performance on AIME2024 by 137% through test-time iterative rethinking and majority-voted pseudo-labels.
April 2, 2026
Original Paper
TR-ICRL: Test-Time Rethinking for In-Context Reinforcement Learning
arXiv · 2604.00438
The Takeaway
TR-ICRL allows models to learn from unlabeled test sets during inference by generating their own feedback. This significantly boosts reasoning capabilities without requiring ground-truth labels or heavy fine-tuning.
From the abstract
In-Context Reinforcement Learning (ICRL) enables Large Language Models (LLMs) to learn online from external rewards directly within the context window. However, a central challenge in ICRL is reward estimation, as models typically lack access to ground-truths during inference. To address this limitation, we propose Test-Time Rethinking for In-Context Reinforcement Learning (TR-ICRL), a novel ICRL framework designed for both reasoning and knowledge-intensive tasks. […]
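The abstract excerpt cuts off before the mechanism is spelled out, but the headline names the two ingredients: iterative rethinking at test time, with majority-voted pseudo-labels standing in for the missing ground-truth rewards. Below is a minimal sketch of that loop, not the paper's exact recipe: it assumes k sampled answers per round and a caller-supplied `sample` function standing in for the LLM (both hypothetical details).

```python
from collections import Counter
from typing import Callable

def majority_vote_pseudo_label(answers: list[str]) -> tuple[str, float]:
    """Take the most frequent answer as the pseudo-label; return it with its vote share."""
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)

def pseudo_rewards(answers: list[str], pseudo_label: str) -> list[float]:
    """Reward 1.0 for agreeing with the majority answer, 0.0 otherwise."""
    return [1.0 if a == pseudo_label else 0.0 for a in answers]

def rethink_loop(sample: Callable[[str, int], list[str]],
                 question: str, k: int = 8, rounds: int = 3) -> str:
    """Iteratively re-prompt with the model's own majority-voted feedback.

    `sample(prompt, k)` is a hypothetical stand-in for k sampled LLM completions.
    """
    prompt = question
    label = ""
    for _ in range(rounds):
        answers = sample(prompt, k)
        label, share = majority_vote_pseudo_label(answers)
        rewards = pseudo_rewards(answers, label)
        # Fold the self-generated rewards back into the context window so the
        # next round of "rethinking" conditions on prior attempts and feedback.
        feedback = "\n".join(
            f"Attempt: {a} | reward: {r}" for a, r in zip(answers, rewards)
        )
        prompt = (f"{question}\n\nPrevious attempts and rewards:\n{feedback}"
                  "\n\nRethink and answer again.")
        if share == 1.0:  # unanimous votes: no disagreement left to resolve
            break
    return label

# Toy usage: a fake sampler that a real system would replace with LLM calls.
it = iter([["42", "41", "42", "42"], ["42", "42", "42", "42"]])
print(rethink_loop(lambda prompt, k: next(it), "What is 6 * 7?", k=4, rounds=2))
```

The key property this illustrates is that the reward signal never touches a ground-truth label: agreement among the model's own samples is the only supervision, which is why the method applies to unlabeled test sets.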