AI & ML Paradigm Shift

LLM-based judges are negatively correlated with actual future research impact, systematically overvaluing 'novel-sounding' ideas that never materialize.

March 17, 2026

Original Paper

HindSight: Evaluating Research Idea Generation via Future Impact

Bo Jiang

arXiv · 2603.15164

The Takeaway

This is a critical warning for the AI research community: relying on LLMs to evaluate ideas creates a bias toward flashy, low-impact results. The 'HindSight' framework introduces a more objective, impact-aligned way to benchmark AI's creative and scientific capabilities.

From the abstract

Evaluating AI-generated research ideas typically relies on LLM judges or human panels, both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff T, we restrict an idea generation system to pre-T literature, then evaluate its outputs against papers published in the s…
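To make the time-split protocol concrete, here is a minimal sketch of how such an evaluation could be wired up. This is an illustration, not the paper's implementation: the `Paper` fields, the lexical `similarity` stand-in (the actual framework would presumably use a learned matcher), the similarity threshold, and the log-citation impact proxy are all assumptions chosen for readability.

```python
import math
from dataclasses import dataclass

@dataclass
class Paper:
    title: str
    published: int   # publication year
    citations: int
    top_venue: bool  # accepted at a selective venue

def similarity(idea: str, paper: Paper) -> float:
    # Toy Jaccard overlap on words; a real matcher would be semantic.
    a, b = set(idea.lower().split()), set(paper.title.lower().split())
    return len(a & b) / max(len(a | b), 1)

def hindsight_score(idea: str, corpus: list[Paper], cutoff: int,
                    sim_threshold: float = 0.3) -> float:
    """Match an idea only against post-cutoff papers; score by the
    impact of its best match. Zero means no future work resembles it."""
    future = [p for p in corpus if p.published > cutoff]
    best = max(future, key=lambda p: similarity(idea, p), default=None)
    if best is None or similarity(idea, best) < sim_threshold:
        return 0.0
    # Hypothetical impact proxy: log-scaled citations plus a venue bonus.
    return math.log1p(best.citations) + (1.0 if best.top_venue else 0.0)
```

The key property the sketch preserves is the temporal split: the idea generator only ever sees pre-cutoff literature, while scoring consults only papers published afterward, so an idea earns credit precisely when real future research validated it.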