AI & ML Breaks Assumption

This study challenges the common 'best practice' of atomic decomposition for LLM judges, showing that holistic evaluation is often superior at detecting incompleteness.

March 31, 2026

Original Paper

Rethinking Atomic Decomposition for LLM Judges: A Prompt-Controlled Study of Reference-Grounded QA Evaluation

Xinran Zhang

arXiv · 2603.28005

The Takeaway

Researchers found that breaking answers into atomic claims (a popular design for RAG evaluation) actually performs worse than holistic judgment on completeness-sensitive tasks. The finding translates into a direct recommendation: practitioners can simplify their evaluation pipelines.

From the abstract

Atomic decomposition -- breaking a candidate answer into claims before verifying each against a reference -- is a widely adopted design for LLM-based reference-grounded judges. However, atomic prompts are typically richer and longer, making it unclear whether any advantage comes from decomposition or from richer prompting. We study this for benchmark-style completeness-sensitive reference-support classification: classifying a candidate as fully supported, partially supported, or unsupported relative to the reference.
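To make the atomic-decomposition design concrete, here is a minimal sketch of how per-claim verdicts might be aggregated into the three-way label the abstract describes. The function name and the aggregation rule (all claims supported → fully supported, some → partially supported, none → unsupported) are illustrative assumptions, not details from the paper; the per-claim verification step itself would be performed by an LLM judge and is omitted here.

```python
def aggregate_atomic_verdicts(verdicts):
    """Map per-claim support verdicts to a three-way label.

    `verdicts` is a list of booleans, one per atomic claim extracted
    from the candidate answer: True if the claim was judged supported
    by the reference, False otherwise. The rule here is an assumed
    aggregation, chosen to match the labels named in the abstract.
    """
    if not verdicts:
        raise ValueError("need at least one claim verdict")
    if all(verdicts):
        return "fully supported"      # every claim grounded in the reference
    if any(verdicts):
        return "partially supported"  # a mix of grounded and ungrounded claims
    return "unsupported"              # no claim grounded in the reference
```

The paper's point is that a holistic judge skips this decompose-then-aggregate step entirely, issuing the three-way label in a single prompt over the whole answer.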