AI & ML Efficiency Breakthrough

Qianfan-OCR introduces 'Layout-as-Thought,' enabling a 4B model to outperform 235B models on complex document parsing and layout analysis.

March 17, 2026

Original Paper

Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

Daxiang Dong, Mingming Zheng, Dong Xu, Chunhua Luo, Bairong Zhuang, Yuxuan Li, Ruoyun He, Haoran Wang, Wenyu Zhang, Wenbo Wang, Yicheng Wang, Xue Xiong, Ayong Zheng, Xiaoying Zuo, Ziwei Ou, Jingnan Gu, Quanhao Guo, Jianmin Wu, Dawei Yin, Dou Shen

arXiv · 2603.13398

The Takeaway

Qianfan-OCR uses an optional 'thinking phase' to generate structured layout representations before emitting text, recovering the spatial grounding that end-to-end models typically lose. This demonstrates that architectural reasoning can narrow the gap between small, edge-deployable models and massive frontier VLMs for document intelligence.
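The two-stage idea can be illustrated with a toy sketch. This is not the paper's implementation: `Region`, `layout_phase`, and `render_phase` are hypothetical names, and the "thinking" step is reduced to explicitly ordering detected regions before any Markdown is emitted, standing in for the model's intermediate layout representation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    kind: str                        # e.g. "heading", "paragraph", "table"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) page coordinates
    text: str

def layout_phase(regions: List[Region]) -> List[Region]:
    """'Thinking' step: make spatial structure explicit by ordering
    regions top-to-bottom, then left-to-right, before emitting text."""
    return sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))

def render_phase(regions: List[Region]) -> str:
    """Emission step: convert the ordered layout plan to Markdown."""
    parts = []
    for r in regions:
        if r.kind == "heading":
            parts.append(f"# {r.text}")
        else:
            parts.append(r.text)
    return "\n\n".join(parts)

def image_to_markdown(regions: List[Region], think: bool = True) -> str:
    """With think=True, a layout plan is produced first; with think=False,
    text is emitted directly in detection order, mimicking the information
    loss of a purely end-to-end pipeline."""
    plan = layout_phase(regions) if think else regions
    return render_phase(plan)
```

In this sketch, skipping the layout phase can emit body text before the title when detection order disagrees with reading order, which is the kind of spatial error the intermediate layout step is meant to prevent.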

From the abstract

We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by spe