Prompt Replay speeds up GRPO training by selectively reusing "medium-difficulty" prompts to maximize the learning signal in RL rollouts.
March 24, 2026
Original Paper
Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts
arXiv · 2603.21177
The Takeaway
Reinforcement learning for LLM reasoning (like DeepSeek's GRPO) is compute-intensive. This method reduces zero-variance batches (where all answers are right, or all are wrong) so that every rollout contributes meaningful signal to the model's advantage estimation.
From the abstract
Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capabilities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), to preserve on-policy optimization. After each step, we insert prompts with medium difficulty into a buffer, and prioritize prompts closer to a pass rate…
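The buffering idea can be sketched in a few lines. This is a hypothetical implementation, not the paper's code: the class name, the medium-difficulty band (0.1–0.9), the target pass rate of 0.5, and the priority function are all assumptions for illustration. The key point it demonstrates is that prompts whose rollout groups are all-correct or all-wrong are discarded (their GRPO group advantage is zero), while the rest are kept and replayed in order of how close their pass rate is to the middle.

```python
import heapq

class PromptReplayBuffer:
    """Sketch of a prompt-replay buffer (hypothetical API, assumptions noted above).

    Only prompts are stored, never trajectories, so rollouts at replay
    time are still generated by the current policy (on-policy reuse).
    """

    def __init__(self, target=0.5, low=0.1, high=0.9, capacity=1024):
        # `target`, `low`, `high` are assumed thresholds, not from the paper.
        self.target, self.low, self.high = target, low, high
        self.capacity = capacity
        self._heap = []      # (priority, insertion_order, prompt); lower is better
        self._counter = 0

    def add(self, prompt, pass_rate):
        """Record a prompt and the fraction of its rollouts that were correct."""
        # Skip zero-variance prompts: every rollout right or every rollout wrong
        # gives identical rewards, so the group-normalized advantage is zero.
        if not (self.low <= pass_rate <= self.high):
            return
        priority = abs(pass_rate - self.target)  # closest to target = replayed first
        heapq.heappush(self._heap, (priority, self._counter, prompt))
        self._counter += 1
        if len(self._heap) > self.capacity:
            # Evict the entry farthest from the target pass rate.
            self._heap.remove(max(self._heap))
            heapq.heapify(self._heap)

    def sample(self, k):
        """Pop up to k highest-signal prompts for the next training step."""
        k = min(k, len(self._heap))
        return [heapq.heappop(self._heap)[2] for _ in range(k)]
```

A trivially small usage: after a step where one prompt was always solved (pass rate 1.0), one never solved (0.0), and two partially solved (0.5 and 0.3), only the two medium-difficulty prompts enter the buffer, and `sample` returns them with the 0.5-pass-rate prompt first.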