AI & ML Paradigm Shift

Enhances mathematical reasoning in LLMs by integrating Group Relative Policy Optimization (GRPO) with a specific reflection reward mechanism.

March 17, 2026

Original Paper

GRPO and Reflection Reward for Mathematical Reasoning in Large Language Models

Zhijie Wang

arXiv · 2603.14041

The Takeaway

While DeepSeek-R1 popularized GRPO, this paper offers a concrete recipe for encouraging proactive self-reflection during training, rather than mere format compliance. It demonstrates that rewarding internal reflection as a cognitive behavior significantly boosts performance over standard RLHF/SFT approaches.
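The core idea can be sketched as a reward shaping step inside GRPO: each sampled response gets a task reward plus a small bonus when it exhibits self-reflection, and advantages are normalized within the sampled group. This is a minimal illustration, not the paper's implementation; the `REFLECTION_CUES` list, the bonus value, and the binary correctness reward are all assumptions for the sketch.

```python
import statistics

# Hypothetical cues for detecting self-reflection; the paper's actual
# reflection reward criteria may differ.
REFLECTION_CUES = ("wait", "let me re-check", "on second thought", "verify")

def reflection_bonus(response: str, bonus: float = 0.2) -> float:
    """Small additive reward if the response shows self-reflective language."""
    text = response.lower()
    return bonus if any(cue in text for cue in REFLECTION_CUES) else 0.0

def group_relative_advantages(responses, correct, bonus=0.2):
    """GRPO-style advantages: normalize shaped rewards within the group.

    `responses` is a list of sampled completions for one prompt;
    `correct` is a parallel list of booleans from an answer checker.
    """
    rewards = [float(c) + reflection_bonus(r, bonus)
               for r, c in zip(responses, correct)]
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]
```

With this shaping, a correct response that also re-checks itself receives a strictly higher advantage than an equally correct response that does not, which is the training signal for proactive reflection.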

From the abstract

The enhancement of reasoning capabilities in large language models (LLMs) has garnered significant attention, with supervised fine-tuning (SFT) and reinforcement learning emerging as dominant paradigms. While recent studies recognize the importance of reflection in reasoning processes, existing methodologies seldom address proactive reflection encouragement during training. This study focuses on mathematical reasoning by proposing a four-stage framework integrating Group Relative Policy Optimization (GRPO) …