Bootstraps reasoning-heavy RL by stochastically injecting few-shot demonstrations into training prompts via a curriculum.
March 20, 2026
Original Paper
Context Bootstrapped Reinforcement Learning
arXiv · 2603.18953
The Takeaway
CBRL addresses the 'cold start' problem in Reinforcement Learning from Verifiable Rewards (RLVR), where models fail to find any correct reasoning path and therefore receive no learning signal. By annealing the demonstrations away over training, it pushes the model to internalize complex reasoning patterns rather than depend on the in-context examples, significantly improving success rates on novel reasoning tasks.
From the abstract
Reinforcement Learning from Verifiable Rewards (RLVR) suffers from exploration inefficiency, where models struggle to generate successful rollouts, resulting in minimal learning signal. This challenge is particularly severe for tasks that require the acquisition of novel reasoning patterns or domain-specific knowledge. To address this, we propose Context Bootstrapped Reinforcement Learning (CBRL), which augments RLVR training by stochastically prepending few-shot demonstrations to training prompts.
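To make the core mechanism concrete, here is a minimal sketch of stochastic demonstration injection with an annealed curriculum. The function name, the linear schedule, and all parameter names are assumptions for illustration; the paper may use a different schedule or injection scheme.

```python
import random

def build_prompt(task, demos, step, total_steps,
                 p_start=1.0, p_end=0.0, k=2, rng=random):
    """Sketch of CBRL-style prompt construction (hypothetical API).

    With probability p (annealed linearly from p_start to p_end over
    training), prepend k few-shot demonstrations to the task prompt;
    otherwise return the bare task, as in standard RLVR.
    """
    frac = min(step / total_steps, 1.0)
    p = p_start + (p_end - p_start) * frac  # linear anneal: p_start -> p_end
    if rng.random() < p:
        shots = rng.sample(demos, min(k, len(demos)))
        return "\n\n".join(shots) + "\n\n" + task
    return task
```

Early in training (p near 1.0) nearly every rollout sees demonstrations, which gives the policy at least some successful trajectories to learn from; by the end (p near 0.0) the model must solve bare prompts, which is what drives internalization of the demonstrated reasoning.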