Greedy Information Projection (GIP) provides a fast, geometrically principled method for selecting training data that balances quality and diversity, achieving full-data performance with a fraction of the examples.
March 17, 2026
Original Paper
Greedy Information Projection for LLM Data Selection
arXiv · 2603.13790
The Takeaway
Current LLM data selection relies on expensive model-based scoring or simple heuristics; GIP offers a closed-form objective that optimizes the projection of query signals onto the data span. It significantly reduces the compute required for instruction fine-tuning and mathematical reasoning tasks.
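The paper's exact objective is not reproduced in this summary, but the core idea of projection-based greedy selection can be sketched. The snippet below is an illustrative reading, not the authors' implementation: it greedily picks examples whose embeddings most increase the squared norm of the projection of a query embedding onto the span of the selected set. All function and variable names here are hypothetical.

```python
import numpy as np

def greedy_projection_select(X, q, k):
    """Greedily pick k rows of X that maximize ||P_S q||^2, where P_S
    projects onto the span of the selected embeddings.

    Illustrative sketch of projection-based greedy selection; the exact
    objective in the paper may differ.
    """
    n, _ = X.shape
    basis = []                      # orthonormal basis of the selected span
    r = q.astype(float).copy()      # residual of q w.r.t. the current span
    selected = []
    for _ in range(k):
        best, best_gain = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            x = X[i].astype(float)
            for b in basis:         # component of x orthogonal to the span
                x = x - (x @ b) * b
            norm = np.linalg.norm(x)
            if norm < 1e-10:
                continue            # x adds nothing new to the span
            gain = (x @ r / norm) ** 2  # increase in ||P_S q||^2
            if gain > best_gain:
                best, best_gain, best_dir = i, gain, x / norm
        if best is None:            # no candidate improves the projection
            break
        selected.append(best)
        basis.append(best_dir)
        r = r - (r @ best_dir) * best_dir  # shrink the residual
    return selected
```

Because each step only orthogonalizes candidates against a small basis and updates a residual, the per-step cost is linear in the pool size, which is consistent with the "fast, closed-form" framing above.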
From the abstract
We present Greedy Information Projection (GIP), a principled framework for choosing training examples for large language model fine-tuning. GIP casts selection as maximizing mutual information between a subset of examples and task-specific query signals, which may originate from LLM quality judgments, metadata, or other sources. The framework involves optimizing a closed-form mutual information objective defined using both data and query embeddings, naturally balancing quality and diversity.