AI & ML New Capability

Multimodal OCR (MOCR) treats charts, diagrams, and tables as code-level targets (e.g., TikZ, SVG) rather than just cropping them as pixels.

arXiv · March 16, 2026 · 2603.13032

Handong Zheng, Yumeng Li, Kaile Zhang, Liang Xin, Guangwei Zhao, Hao Liu, Jiayu Chen, Jie Lou, Jiyu Qiu, Qi Fu, Rui Yang, Shuo Jiang, Weijian Luo, Weijie Su, Weijun Zhang, Xingyu Zhu, Yabin Li, Yiwei Ma, Yu Chen, Zhaohui Yu, Guang Yang, Colin Zhang, Lei Zhang, Yuliang Liu, Xiang Bai

Why it matters

It enables full semantic reconstruction of documents whose graphics are traditionally lost: the 3B-parameter model lets LLMs 'see' and manipulate structured visual data directly, setting a new SOTA for open-source document parsing.

From the abstract

We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages.
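To make the contrast concrete, here is a minimal sketch (not the paper's actual API; all names and the toy SVG payload are hypothetical) of the difference between conventional OCR output, where a chart region survives only as an opaque pixel crop, and a unified textual representation in which the same region is emitted as editable markup:

```python
from dataclasses import dataclass

@dataclass
class Region:
    kind: str      # "text" or "chart" (hypothetical region types)
    content: str   # recognized text, or code-level markup for the graphic

def conventional_ocr(regions):
    # Text is recognized; graphical regions are left as pixel crops.
    return [r.content if r.kind == "text" else "[cropped image]"
            for r in regions]

def unified_parse(regions):
    # Graphics are parsed into code-level targets (here, SVG), so the
    # whole document becomes one textual representation.
    return [r.content for r in regions]

page = [
    Region("text", "Quarterly revenue rose 12%."),
    Region("chart", '<svg><rect x="0" y="0" width="10" height="40"/></svg>'),
]

print(conventional_ocr(page)[1])  # opaque placeholder for pixels
print(unified_parse(page)[1])     # editable SVG markup
```

The key point the sketch illustrates is that downstream models can read, edit, and reason over the SVG string, whereas the cropped image is a dead end for text-only pipelines.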