Applies Shapley values from cooperative game theory to solve the 'free-rider' problem in GRPO-based reinforcement learning post-training.
April 1, 2026
Original Paper
ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training
arXiv · 2603.29871
The Takeaway
By decomposing set-level rewards into granular, candidate-specific signals, this method provides much cleaner training gradients for multi-candidate tasks like recommendation or code generation. It leads to faster convergence and better exploration than standard GRPO.
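The decomposition the takeaway describes can be illustrated with the classical (exact, exponential-time) Shapley formula: each candidate's credit is its average marginal contribution to the set utility over all orderings. The `utility` function below is a toy stand-in, not the paper's reward model, and `shapley_rewards` is a hypothetical helper name; the paper would presumably use an approximation for large candidate sets.

```python
from itertools import combinations
from math import factorial

def shapley_rewards(candidates, set_utility):
    """Exact Shapley decomposition of a set-level reward into
    per-candidate credits (illustrative; exponential in set size)."""
    n = len(candidates)
    credits = []
    for i, c in enumerate(candidates):
        others = [x for j, x in enumerate(candidates) if j != i]
        phi = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                # Weight = probability that exactly this coalition
                # precedes candidate i in a uniformly random ordering.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                phi += weight * (set_utility(subset + (c,)) - set_utility(subset))
        credits.append(phi)
    return credits

# Toy set utility: reward = number of distinct items, so duplicate
# candidates add nothing once one copy is present.
def utility(subset):
    return len(set(subset))

credits = shapley_rewards(("a", "a", "b"), utility)
# The two redundant "a" candidates split credit; "b" gets full credit,
# and the credits sum back to the set-level reward (efficiency axiom).
```

Note how the duplicate candidates, which a uniform set-level reward would pay in full, each receive only half credit here: exactly the free-rider correction the method targets.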
From the abstract
In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than that of individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This uniform allocation obscures each candidate's actual contribution, creating the 'free-rider' problem.
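A small sketch of why a uniform set-level reward starves the policy gradient, assuming the standard GRPO group baseline (standardizing rewards within the group); the numbers and the `grpo_advantages` helper are illustrative, not taken from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within the group,
    as in GRPO's group baseline (guard against zero variance)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]

# Set-level reward copied to every candidate: within-group advantages
# collapse to zero, so no candidate is preferred over another.
uniform = grpo_advantages([2.0, 2.0, 2.0])   # → [0.0, 0.0, 0.0]

# Per-candidate credits (e.g. a Shapley decomposition of the same set
# reward): advantages now separate contributors from free riders.
decomposed = grpo_advantages([0.5, 0.5, 1.0])
```

With identical rewards the group baseline absorbs the entire signal, which is the degenerate case the decomposed credits avoid.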