AI & ML Efficiency Breakthrough

Prompt Replay speeds up GRPO training by selectively reusing 'medium difficulty' prompts to maximize learning signal in RL rollouts.

March 24, 2026

Original Paper

Prompt Replay: Speeding Up GRPO with On-Policy Reuse of High-Signal Prompts

Andrei Baroian, Rutger Berger

arXiv · 2603.21177

The Takeaway

Reinforcement learning for LLM reasoning (as in DeepSeek's GRPO) is compute-intensive. Prompt Replay reduces zero-variance batches — batches where all sampled answers are correct, or all are incorrect, so group-relative advantages collapse to zero — ensuring every rollout contributes meaningful signal to the model's advantage estimation.
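To see why zero-variance batches waste compute, consider GRPO's group-relative advantage, which normalizes each rollout's reward against the group mean and standard deviation. The sketch below is illustrative (function names are not from the paper's code):

```python
def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantage: (r - mean) / (std + eps)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (var ** 0.5 + eps) for r in rewards]

# Mixed group: rollouts carry a nonzero learning signal.
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])

# Zero-variance group (all correct): every advantage is ~0,
# so all of the group's expensive rollouts are wasted.
wasted = group_advantages([1.0, 1.0, 1.0, 1.0])
```

With verifiable 0/1 rewards, a prompt the policy always solves (or always fails) produces exactly this degenerate all-zero case, which is the compute Prompt Replay aims to reclaim.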

From the abstract

Reinforcement learning with verifiable rewards (RLVR) plays a crucial role in expanding the capacities of LLM reasoning, but GRPO-style training is dominated by expensive rollouts and wastes compute on unusable prompts. We propose Prompt Replay, an overhead-free online data selection method for GRPO that reuses prompts only (not trajectories), to preserve on-policy optimization. After each step, we insert prompts with medium difficulty into a buffer, and prioritize prompts closer to a pass rate
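The buffer described in the abstract can be sketched as a priority queue keyed by how close a prompt's empirical pass rate is to a target value. Everything below is a hedged illustration, not the paper's implementation: the class name, capacity handling, and the target pass rate of 0.5 are all assumptions (the abstract is cut off before stating the exact target). Note that only prompts are stored, never trajectories, so every reuse still generates fresh on-policy rollouts.

```python
import heapq

class PromptReplayBuffer:
    """Illustrative buffer that prioritizes medium-difficulty prompts.

    Assumes a target pass rate (0.5 here) where group reward variance,
    and hence GRPO's learning signal, is largest.
    """

    def __init__(self, target=0.5, capacity=1024):
        self.target = target
        self.capacity = capacity
        self._heap = []      # (distance_from_target, insertion_order, prompt)
        self._counter = 0    # tie-breaker so prompts never get compared

    def add(self, prompt, pass_rate):
        # Skip zero-variance prompts: all-correct or all-wrong groups
        # yield zero advantages and would waste the next rollout.
        if pass_rate in (0.0, 1.0):
            return
        heapq.heappush(self._heap,
                       (abs(pass_rate - self.target), self._counter, prompt))
        self._counter += 1
        if len(self._heap) > self.capacity:
            self._heap = heapq.nsmallest(self.capacity, self._heap)
            heapq.heapify(self._heap)

    def sample(self, k):
        """Pop the k prompts whose pass rate is nearest the target."""
        n = min(k, len(self._heap))
        return [heapq.heappop(self._heap)[2] for _ in range(n)]
```

After each training step, the trainer would call `add` with each prompt's observed pass rate, then `sample` high-signal prompts to mix into the next batch in place of fresh (and possibly unusable) data — overhead-free in the sense that pass rates are already computed as part of the reward.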