Introduces Bilateral Context Conditioning to DeepSeek's GRPO, allowing models to cross-reference successful and failed reasoning traces during optimization.
arXiv · March 16, 2026 · 2603.13134
Why it matters
By explicitly leveraging the contrast between 'right' and 'wrong' samples within a generation group, it stabilizes training and improves math reasoning benchmarks without increasing sampling costs.
From the abstract
Group Relative Policy Optimization (GRPO) has emerged as an effective method for training reasoning models. While it computes advantages relative to the group mean, GRPO treats each output as an independent sample during optimization and overlooks a vital structural signal: the natural contrast between correct and incorrect solutions within the same group. It thus ignores the rich comparative data that could be leveraged by explicitly pitting successful reasoning traces against failed ones. [abstract truncated]
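For context, the baseline GRPO advantage the abstract refers to is computed per sampled group, not per independent example. Below is a minimal sketch of that group-relative computation (function name and the normalization epsilon are illustrative; the paper's bilateral context conditioning is a contribution on top of this and is not reproduced here):

```python
import numpy as np

def grpo_advantages(rewards):
    """Baseline GRPO-style advantages for one group of sampled outputs:
    each reward is centered on the group mean and normalized by the
    group standard deviation, so advantages are relative within the group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example group of 4 sampled solutions: 1.0 = correct, 0.0 = incorrect.
# Correct traces get positive advantage, incorrect ones negative.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Note that in this baseline each output's advantage depends on the group only through summary statistics (mean and std); the abstract's point is that the pairwise contrast between correct and incorrect traces in the group carries extra signal this formulation discards.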