Selects high-quality synthetic code data using 'Reverse Mutual Information' to achieve full-dataset performance with 75% less data.
arXiv · March 13, 2026 · 2603.12165
Why it matters
This framework (QAQ) solves the noise problem in synthetic datasets by measuring how well a generated answer can predict the original query. It allows for significant reductions in training compute while improving model robustness against hallucinations.
From the abstract
Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how difficult it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where a low probability may reflect either intrinsic task complexity or model-generated hallucinations.
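The reverse direction can be sketched as a pointwise-mutual-information-style score: how much does conditioning on the generated answer $A$ raise the model's log-probability of the original query $Q$? The function below is a minimal illustration under that assumption; the name `reverse_predictability` and the averaging scheme are hypothetical, and in practice the per-token log-probabilities would come from a language model, not hand-typed lists.

```python
def reverse_predictability(logprob_q_given_a, logprob_q):
    """Hypothetical sketch of a Q|A selection score.

    logprob_q_given_a: per-token log-probs of Q with A in the context.
    logprob_q:         per-token log-probs of Q alone.
    A higher score means the answer makes the original query easier to
    predict, suggesting a cleaner (less hallucinated) synthetic pair.
    """
    cond = sum(logprob_q_given_a) / len(logprob_q_given_a)
    uncond = sum(logprob_q) / len(logprob_q)
    return cond - uncond  # information gained about Q from seeing A

# Toy numbers only: for a faithful answer, conditioning on A raises
# Q's average log-prob; for a hallucinated answer, it does not.
clean = reverse_predictability([-1.0, -0.8, -1.2], [-2.0, -2.5, -1.8])
noisy = reverse_predictability([-2.1, -2.4, -1.9], [-2.0, -2.5, -1.8])
```

Selecting the top-scoring fraction of pairs under such a score is one plausible way to reach full-dataset performance with far fewer examples, as the summary above claims for QAQ.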