AI & ML Efficiency Breakthrough

Applies Shapley values from cooperative game theory to solve the 'free-rider' problem in GRPO-based reinforcement learning post-training.

April 1, 2026

Original Paper

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

Rui Ai, Yu Pan, David Simchi-Levi, Chonghuan Wang

arXiv · 2603.29871

The Takeaway

By decomposing set-level rewards into granular, candidate-specific signals, this method provides much cleaner training gradients for multi-candidate tasks like recommendation or code generation. It leads to faster convergence and better exploration than standard GRPO.
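The core idea can be illustrated with a small, self-contained sketch (not the paper's implementation): exact Shapley values assign each candidate its average marginal contribution to a set-level utility across all coalitions, so a duplicate that adds nothing to the set, a "free rider", receives a low per-candidate signal. The toy `coverage` utility and the `topics` data below are illustrative assumptions, not from the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(candidates, utility):
    """Exact Shapley values: each candidate's average marginal
    contribution to the set-level utility over all coalitions.
    O(n * 2^n) -- fine for small candidate sets, sampled in practice."""
    n = len(candidates)
    values = {c: 0.0 for c in candidates}
    for c in candidates:
        others = [x for x in candidates if x != c]
        for k in range(n):
            for subset in combinations(others, k):
                # Standard Shapley coalition weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                marginal = utility(set(subset) | {c}) - utility(set(subset))
                values[c] += weight * marginal
    return values

# Toy set-level utility: topic coverage, so a duplicate candidate
# ("b" repeats "a"'s topic) is a free rider under a uniform reward.
topics = {"a": {"sports"}, "b": {"sports"}, "c": {"music"}}

def coverage(subset):
    return len(set().union(*(topics[c] for c in subset))) if subset else 0

phi = shapley_values(list(topics), coverage)
```

Under this utility the unique candidate "c" earns a higher Shapley value than either duplicate, and the values sum to the full set's utility (the efficiency property), which is what lets them stand in for a decomposed set-level reward.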

From the abstract

In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set.