LLM-based judges' scores are negatively correlated with actual future research impact, systematically overvaluing 'novel-sounding' ideas that never materialize.
March 17, 2026
Original Paper
HindSight: Evaluating Research Idea Generation via Future Impact
arXiv · 2603.15164
The Takeaway
This is a critical warning for the AI research community: relying on LLMs to evaluate ideas creates a bias toward flashy, low-impact results. The 'HindSight' framework introduces a more objective, impact-aligned way to benchmark AI's creative and scientific capabilities.
From the abstract
Evaluating AI-generated research ideas typically relies on LLM judges or human panels -- both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff $T$, we restrict an idea generation system to pre-$T$ literature, then evaluate its outputs against papers published in the subsequent period. […]
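To make the protocol concrete, here is a minimal sketch of a HindSight-style time-split evaluation loop based only on what the abstract describes: restrict generation to pre-$T$ context, match ideas against post-$T$ papers, and score matches by citation count and venue acceptance. Everything here is an assumption for illustration -- the `Paper` dataclass, the toy lexical `similarity` matcher, the `match_threshold`, and the scoring formula are hypothetical stand-ins, not the paper's actual implementation.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical sketch of a HindSight-style time-split evaluation.
# All names and the scoring formula are illustrative, not the paper's API.

@dataclass
class Paper:
    title: str
    published: date
    citations: int
    accepted_venue: bool

def similarity(idea: str, paper: Paper) -> float:
    """Toy word-overlap matcher; the real framework presumably uses a
    much stronger semantic matching method."""
    a, b = set(idea.lower().split()), set(paper.title.lower().split())
    return len(a & b) / max(len(a | b), 1)

def hindsight_score(ideas, corpus, cutoff: date, match_threshold: float = 0.3) -> float:
    """Match each generated idea against papers published AFTER the cutoff,
    then credit the idea with the matched paper's citation impact and
    venue acceptance. Returns the average score per idea."""
    future = [p for p in corpus if p.published > cutoff]
    total = 0.0
    for idea in ideas:
        matches = [p for p in future if similarity(idea, p) >= match_threshold]
        if matches:
            best = max(matches, key=lambda p: p.citations)
            total += best.citations + (1.0 if best.accepted_venue else 0.0)
    return total / max(len(ideas), 1)

if __name__ == "__main__":
    cutoff = date(2024, 1, 1)
    corpus = [
        Paper("retrieval augmented idea generation", date(2024, 6, 1), 120, True),
        Paper("a survey of older methods", date(2023, 3, 1), 40, True),
    ]
    ideas = ["retrieval augmented generation for idea generation"]
    print(hindsight_score(ideas, corpus, cutoff))  # credits the post-cutoff match
```

The key design point the sketch tries to capture is the temporal firewall: the idea generator only ever sees pre-$T$ material, while the scoring function only ever looks at post-$T$ outcomes, so the metric rewards ideas that anticipated work the field later validated rather than ideas that merely sound novel.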