Establishes scaling laws for sampling compute in LLM Reinforcement Learning, providing a playbook for optimal parallel rollout and batch allocation.
arXiv · March 13, 2026 · 2603.12151
Why it matters
As RL post-training (e.g., DeepSeek-R1 style) becomes standard, this paper supplies prescriptive rules for compute-efficient training. It explains how to scale parallel rollouts to trade off solution sharpening on easy tasks against coverage on hard ones.
From the abstract
While scaling laws guide compute allocation for LLM pre-training, analogous prescriptions for reinforcement learning (RL) post-training of large language models (LLMs) remain poorly understood. We study the compute-optimal allocation of sampling compute for on-policy RL methods in LLMs, framing scaling as a compute-constrained optimization over three resources: parallel rollouts per problem, number of problems per batch, and number of update steps. We find that the compute-optimal number of parallel […]
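To make the abstract's framing concrete, here is an illustrative sketch (not from the paper) of the compute-constrained optimization it describes: if each rollout has unit sampling cost, the budget decomposes as C = G × B × S, where G is parallel rollouts per problem, B is problems per batch, and S is update steps. The candidate grids, budget value, and the `feasible_allocations` helper below are hypothetical choices for illustration.

```python
def feasible_allocations(budget, rollouts, batches, steps):
    """Yield (G, B, S) triples whose sampling cost G * B * S fits the budget.

    Assumes unit cost per rollout, so total sampling compute is G * B * S.
    """
    for g in rollouts:
        for b in batches:
            for s in steps:
                if g * b * s <= budget:
                    yield g, b, s

# Example: enumerate allocations under a budget of 4096 rollout-equivalents.
allocs = list(feasible_allocations(
    budget=4096,
    rollouts=[4, 8, 16, 32],   # G: parallel rollouts per problem
    batches=[16, 32, 64],      # B: problems per batch
    steps=[1, 2, 4, 8],        # S: update steps
))
print(len(allocs), "allocations fit the budget")
```

The paper's contribution, per the abstract, is to identify which point on this feasible set is compute-optimal; the enumeration above only illustrates the search space, not the paper's selection rule.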