Introduces a framework that lets LLMs self-improve domain-specific reasoning by autonomously mining and constructing training environments directly from the open web.
March 25, 2026
Original Paper
WIST: Web-Grounded Iterative Self-Play Tree for Domain-Targeted Reasoning Improvement
arXiv · 2603.22352
The Takeaway
The framework bypasses the need for human-curated or static datasets in domain-specific RLVR, letting models scale their reasoning capabilities by discovering their own 'learnability signals' from web data. The paper reports significant gains (+14.79 in medicine) over standard self-evolution methods.
From the abstract
Recent progress in reinforcement learning with verifiable rewards (RLVR) offers a practical path to self-improvement of language models, but existing methods face a key trade-off: endogenous self-play can drift over iterations, while corpus-grounded approaches rely on curated data environments. We present WIST, a Web-grounded Iterative Self-play Tree framework for domain-targeted reasoning improvement that learns directly from the open web without req…