AI & ML Breaks Assumption

Small models (≤4B parameters) fail at document extraction not because of poor vision but because of 'schema echo': they copy the output structure instead of extracting the data.
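To make the failure mode concrete, here is a minimal sketch of how a schema echo might be flagged in a model's JSON output. This summary does not give the paper's exact definition, so the heuristic below is an assumption: it treats an output as an echo when it reproduces the schema itself or when every leaf value is a type name or empty. The names `looks_like_schema_echo`, `TYPE_NAMES`, `_leaves`, and `_is_placeholder` are hypothetical and do not come from VAREX.

```python
# Hypothetical heuristic, not the paper's metric: flag outputs that mirror
# the requested schema instead of containing extracted values.

# JSON Schema primitive type names a model might emit in place of real data.
TYPE_NAMES = {"string", "number", "integer", "boolean", "object", "array", "null"}


def _leaves(value):
    """Yield every non-container value nested inside dicts and lists."""
    if isinstance(value, dict):
        for child in value.values():
            yield from _leaves(child)
    elif isinstance(value, list):
        for child in value:
            yield from _leaves(child)
    else:
        yield value


def _is_placeholder(value):
    """True for None, empty strings, and bare type names such as 'string'."""
    if value is None:
        return True
    if isinstance(value, str):
        text = value.strip().lower()
        return text == "" or text in TYPE_NAMES
    return False


def looks_like_schema_echo(output: dict, schema: dict) -> bool:
    """Return True if the output resembles the schema rather than extracted data."""
    # Case 1 (assumed): the model returned the schema, or a schema-shaped object.
    if output == schema:
        return True
    if output.get("type") == "object" and "properties" in output:
        return True
    # Case 2 (assumed): the keys are right but every leaf is a placeholder.
    # Note: a genuinely blank form would also trip this check.
    leaves = list(_leaves(output))
    return bool(leaves) and all(_is_placeholder(v) for v in leaves)


schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "year": {"type": "integer"}},
}
print(looks_like_schema_echo({"name": "string", "year": "integer"}, schema))
print(looks_like_schema_echo({"name": "Jane Doe", "year": 2024}, schema))
```

A check like this is only a diagnostic aid: it separates structural failures from genuine extraction errors when reviewing outputs, and it says nothing about whether the extracted values are correct.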

arXiv · March 17, 2026 · 2603.15118

Udi Barzelay, Ophir Azulai, Inbar Shapira, Idan Friedman, Foad Abo Dahood, Madison Lee, Abraham Daniels

The Takeaway

The paper identifies a specific bottleneck in deployable foundation models; addressing it through fine-tuning yields a gain of 81 percentage points. It provides a roadmap for practitioners to build cost-effective document processing agents using lightweight models.

From the abstract

We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain t