AI & ML Efficiency Breakthrough

Achieve 97% of oracle reward performance using only 20% of the training labels for complex LLM reasoning.

March 23, 2026

Original Paper

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

arXiv · 2603.19310

The Takeaway

MemReward introduces a graph-based experience memory that propagates rewards from a small labeled set to unlabeled reasoning rollouts. This significantly reduces the expert-labeling bottleneck that RL fine-tuning faces in math and code generation.
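The core idea, propagating rewards from a few labeled rollouts to many unlabeled ones over a similarity graph, can be illustrated with a generic label-propagation sketch. This is not the paper's actual algorithm; the embeddings, the kNN graph construction, and all parameter choices below are illustrative assumptions.

```python
import numpy as np

def propagate_rewards(embeddings, rewards, labeled_mask, k=3, n_iters=50, alpha=0.9):
    """Spread known rewards to unlabeled rollouts via a kNN similarity graph.

    A generic label-propagation sketch, not MemReward's actual method.
    embeddings: (n, d) rollout representations (assumed given).
    rewards: (n,) reward values; only entries where labeled_mask is True are trusted.
    """
    n = len(embeddings)
    # Cosine similarity between rollout embeddings.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # no self-edges

    # Sparse kNN adjacency, symmetrized.
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sim[i])[-k:]
        W[i, nbrs] = np.maximum(sim[i, nbrs], 0.0)
    W = np.maximum(W, W.T)

    # Row-normalized transition matrix.
    P = W / (W.sum(axis=1, keepdims=True) + 1e-12)

    # Initialize unlabeled rewards at 0.5; iterate, clamping known labels.
    r = np.where(labeled_mask, rewards, 0.5)
    for _ in range(n_iters):
        r = alpha * (P @ r) + (1 - alpha) * r
        r[labeled_mask] = rewards[labeled_mask]
    return r

# Toy demo (synthetic data): two clusters of rollout embeddings,
# one labeled example per cluster (rewards 1.0 and 0.0).
embeddings = np.array([[1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
                       [0.0, 1.0], [0.1, 0.9], [0.05, 0.95]])
rewards = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
labeled = np.array([True, False, False, True, False, False])
r = propagate_rewards(embeddings, rewards, labeled)
```

After propagation, the unlabeled rollouts near the reward-1.0 example receive high predicted rewards, while those near the reward-0.0 example receive low ones, which is the behavior that lets a small labeled set stand in for full supervision.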

From the abstract

Training large language models (LLMs) for complex reasoning via reinforcement learning requires reward labels that specify whether the generated rollouts are correct. However, obtaining reward labels at scale often requires expensive human labeling or time-consuming verification procedures; for instance, evaluating mathematical proofs demands expert review, while open-ended question answering lacks definitive ground truth. When reward labels are limited, the effectiveness of reinforcement learning […]