Small models (≤4B parameters) fail at document extraction not because of poor vision but because of 'schema echo': they copy the output structure instead of extracting data.
arXiv · March 17, 2026 · 2603.15118
The Takeaway
The paper identifies a specific bottleneck in deployable foundation models; addressing it through fine-tuning yields a gain of 81 percentage points. It gives practitioners a roadmap for building cost-effective document processing agents on lightweight models.
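To make the 'schema echo' failure concrete, here is a minimal sketch of how a practitioner might flag it in model output. It assumes the schema is given as a JSON Schema object and that an echo means the model returns that schema (or something shaped like it) instead of field values. The function name `looks_like_schema_echo` and the heuristic itself are illustrative assumptions, not the paper's metric.

```python
import json


def looks_like_schema_echo(model_output: str, schema: dict) -> bool:
    """Heuristically flag outputs that restate the schema instead of filling it.

    Hypothetical check for illustration; the paper may define echo differently.
    """
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # malformed JSON is a separate failure mode
    if not isinstance(parsed, dict):
        return False
    # Exact copy of the schema is the clearest case of an echo.
    if parsed == schema:
        return True
    # Otherwise, look for JSON Schema scaffolding at the top level of the answer.
    return parsed.get("type") == "object" and "properties" in parsed


# Assumed example schema with a single hypothetical field.
schema = {
    "type": "object",
    "properties": {"applicant_name": {"type": "string"}},
}
echoed = json.dumps(schema)                          # model copied the schema
extracted = json.dumps({"applicant_name": "Jane Doe"})  # model filled the field

print(looks_like_schema_echo(echoed, schema))
print(looks_like_schema_echo(extracted, schema))
```

A check like this is only a first-pass filter: a form that legitimately contains fields named `type` and `properties` would be misflagged, so it should be paired with field-level accuracy scoring.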
From the abstract
We introduce VAREX (VARied-schema EXtraction), a benchmark for evaluating multimodal foundation models on structured data extraction from government forms. VAREX employs a Reverse Annotation pipeline that programmatically fills PDF templates with synthetic values, producing deterministic ground truth validated through three-phase quality assurance. The benchmark comprises 1,777 documents with 1,771 unique schemas across three structural categories, each provided in four input modalities: plain t