SeriesFusion
Science, curated & edited by AI

Your massive dataset is ruining your prompt optimization; you only need two diverse examples for better results.

The 'more data is better' mantra fails spectacularly in prompt engineering. This paper shows that scaling the training set to more user prompts can actually degrade optimization results, because the added prompts introduce noise and invite overfitting. Instead, the researchers found that training on as few as two high-variance prompts generalizes better than training on the full dataset. For practitioners, this means you can stop scraping thousands of examples and start curating a tiny 'high-variance' gold set. It radically simplifies the pipeline for production-grade prompt engineering.
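To make the recipe concrete, here is a minimal sketch of one way such curation could work, assuming you score each candidate training prompt by the variance of its rewards across a few sampled responses and keep only the top two. The function name, array shapes, and scoring heuristic are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_gold_set(reward_samples: np.ndarray, k: int = 2) -> np.ndarray:
    """reward_samples[i, j] = reward of response j sampled for training prompt i."""
    per_prompt_variance = reward_samples.var(axis=1)
    # Keep the k prompts whose rewards vary most across sampled responses.
    return np.argsort(per_prompt_variance)[-k:][::-1]

rng = np.random.default_rng(0)
rewards = rng.uniform(size=(1000, 16))  # 1000 candidate prompts, 16 responses each
print("train on prompts:", select_gold_set(rewards, k=2))
```

Under this reading, 'high-variance' means prompts on which the reward actually discriminates between good and bad behavior; a different reading (e.g., diversity across the prompts themselves) would call for a different selection rule.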

Original Paper

Better Prompt Optimization with Fewer Prompts

arXiv  ·  2604.08801

Prompt optimization improves language models without updating their weights by searching for a better system prompt, but its effectiveness varies widely across tasks. We study what makes a task amenable to prompt optimization. We show that the reward variance across different system prompts can be decomposed into two components: variance among responses, which captures generation stochasticity, and variance among system prompts, which captures differences in system prompt quality. Prompt optimization …
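The decomposition the abstract describes is the law of total variance applied to (system prompt, response) pairs. A minimal numerical sketch, assuming a balanced grid of sampled rewards (all shapes and values here are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
# rewards[i, j]: reward of the j-th sampled response under the i-th
# candidate system prompt.
rewards = rng.normal(loc=0.6, scale=0.1, size=(8, 32))

# Law of total variance:
#   Var(reward) = E[Var(reward | prompt)] + Var(E[reward | prompt])
response_var = rewards.var(axis=1).mean()  # variance among responses (stochasticity)
prompt_var = rewards.mean(axis=1).var()    # variance among system prompts (quality)
total_var = rewards.var()

print(f"response variance: {response_var:.4f}")
print(f"prompt variance:   {prompt_var:.4f}")
print(f"sum vs. total:     {response_var + prompt_var:.4f} vs {total_var:.4f}")
```

With an equal number of responses per prompt, the two components sum exactly to the total variance, which is why the printed check matches.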