AI & ML Efficiency Breakthrough

Greedy Information Projection (GIP) provides a fast, geometrically-principled method for selecting training data that balances quality and diversity, achieving full-data performance with a fraction of the examples.

March 17, 2026

Original Paper

Greedy Information Projection for LLM Data Selection

Victor Ye Dong, Kuan-Yun Lee, Jiamei Shuai, Shengfei Liu, Yi Liu, Jian Jiao

arXiv · 2603.13790

The Takeaway

Current LLM data-selection methods rely on expensive model-based scoring or simple heuristics. GIP instead offers a closed-form objective that maximizes the projection of task-specific query signals onto the span of the selected data, significantly reducing the compute required for instruction fine-tuning and mathematical reasoning tasks.
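To make the geometric picture concrete, here is a minimal sketch of a GIP-style greedy selector: at each step it picks the example whose component orthogonal to the current span most increases the squared projection of a query vector onto that span. This is an illustrative reconstruction from the description above, not the paper's exact objective or implementation; the embeddings, the single-query setup, and the Gram-Schmidt bookkeeping are all assumptions.

```python
import numpy as np

def greedy_projection_select(X, q, k):
    """Greedily pick k rows of X whose span best captures q.

    Hedged sketch of a GIP-style selection rule: at each step, choose
    the example whose direction orthogonal to the current span most
    increases ||proj_span(q)||^2. Not the paper's exact algorithm.
    """
    n, _ = X.shape
    selected = []
    basis = []                       # orthonormal basis of the selected span
    r = q.astype(float).copy()       # residual of q orthogonal to the span
    for _ in range(min(k, n)):
        best_gain, best_i, best_u = -1.0, None, None
        for i in range(n):
            if i in selected:
                continue
            u = X[i].astype(float).copy()
            for b in basis:          # Gram-Schmidt against current basis
                u -= (u @ b) * b
            norm = np.linalg.norm(u)
            if norm < 1e-10:
                continue             # adds nothing new to the span
            u /= norm
            gain = (u @ r) ** 2      # increase in ||proj(q)||^2
            if gain > best_gain:
                best_gain, best_i, best_u = gain, i, u
        if best_i is None:
            break
        selected.append(best_i)
        basis.append(best_u)
        r -= (r @ best_u) * best_u   # shrink the residual of q
    return selected
```

On axis-aligned toy data the selector behaves as expected: with `X = np.eye(3)` and `q = [3, 1, 0]`, it picks index 0 first (largest aligned component), then index 1, ignoring the direction orthogonal to the query.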

From the abstract

We present Greedy Information Projection (GIP), a principled framework for choosing training examples for large language model fine-tuning. GIP casts selection as maximizing mutual information between a subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. The framework involves optimizing a closed-form mutual information objective defined using both data and query embeddings, naturally balancing […]