Achieve 97% of Oracle reward performance using only 20% of the training labels for complex LLM reasoning.
March 23, 2026
Original Paper
MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels
arXiv · 2603.19310
The Takeaway
MemReward introduces a graph-based experience memory that propagates rewards from a small labeled set to unlabeled reasoning rollouts. This significantly reduces the human-expert labeling bottleneck in RL fine-tuning for math and code generation.
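The core idea of propagating rewards over a graph from a few labeled examples can be illustrated with classic label propagation. The sketch below is hypothetical (the paper's actual memory structure and propagation rule may differ): rollouts are nodes, edges carry similarity weights, labeled rollouts are clamped to their known rewards, and unlabeled ones iteratively average their neighbors.

```python
def propagate_rewards(adj, seed_rewards, n_iters=200):
    """Propagate rewards over a similarity graph.

    adj: {node: [(neighbor, weight), ...]} -- undirected similarity graph.
    seed_rewards: {node: reward} -- the small labeled set (clamped).
    Returns a {node: predicted_reward} dict for all nodes.
    """
    # Initialize: labeled nodes get their reward, unlabeled start at 0.
    r = {n: seed_rewards.get(n, 0.0) for n in adj}
    for _ in range(n_iters):
        new_r = {}
        for n, nbrs in adj.items():
            if n in seed_rewards:
                new_r[n] = seed_rewards[n]  # clamp labeled nodes
            else:
                # Weighted average of neighbors' current rewards.
                total = sum(w for _, w in nbrs)
                new_r[n] = sum(w * r[m] for m, w in nbrs) / total
        r = new_r
    return r

# Toy example: five rollouts on a chain, endpoints labeled 1.0 and 0.0.
adj = {
    0: [(1, 1.0)],
    1: [(0, 1.0), (2, 1.0)],
    2: [(1, 1.0), (3, 1.0)],
    3: [(2, 1.0), (4, 1.0)],
    4: [(3, 1.0)],
}
rewards = propagate_rewards(adj, {0: 1.0, 4: 0.0})
```

On this chain the propagated rewards converge to a linear interpolation between the two labeled endpoints, so the middle rollout ends up near 0.5 and rewards decrease monotonically with distance from the labeled positive example.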
From the abstract
Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are limited, the effectiveness of reinforcement learning […]