Selects high-quality synthetic code data using 'Reverse Mutual Information' to achieve full-dataset performance with 75% less data.
arXiv · March 13, 2026 · 2603.12165
Why it matters
This framework (QAQ) solves the noise problem in synthetic datasets by measuring how well a generated answer can predict the original query. It allows for significant reductions in training compute while improving model robustness against hallucinations.
From the abstract
Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how difficult it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where a low probability may reflect either intrinsic task complexity or model-generated hallucinations.
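The reverse direction can be sketched as a pointwise-mutual-information-style score: how much does conditioning on the generated answer $A$ raise the model's log-probability of the original query $Q$? The function below is a minimal illustration under that assumption; the name `reverse_predictability` and the averaging scheme are hypothetical, and in practice the per-token log-probabilities would come from a language model, not hand-typed lists.

```python
def reverse_predictability(logprob_q_given_a, logprob_q):
    """Hypothetical sketch of a Q|A selection score.

    logprob_q_given_a: per-token log-probs of Q with A in the context.
    logprob_q:         per-token log-probs of Q alone.
    A higher score means the answer makes the original query easier to
    predict, suggesting a cleaner (less hallucinated) synthetic pair.
    """
    cond = sum(logprob_q_given_a) / len(logprob_q_given_a)
    uncond = sum(logprob_q) / len(logprob_q)
    return cond - uncond  # information gained about Q from seeing A

# Toy numbers only: for a faithful answer, conditioning on A raises
# Q's average log-prob; for a hallucinated answer, it does not.
clean = reverse_predictability([-1.0, -0.8, -1.2], [-2.0, -2.5, -1.8])
noisy = reverse_predictability([-2.1, -2.4, -1.9], [-2.0, -2.5, -1.8])
```

Selecting the top-scoring fraction of pairs under such a score is one plausible way to reach full-dataset performance with far fewer examples, as the summary above claims for QAQ.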