Multimodal OCR (MOCR) treats charts, diagrams, and tables as code-level targets (e.g., TikZ, SVG) rather than just cropping them as pixels.
arXiv · March 16, 2026 · 2603.13032
Why it matters
It enables full semantic reconstruction of documents whose graphics are traditionally lost. The 3B-parameter model lets LLMs 'see' and manipulate structured visual data directly, setting a new state of the art in open-source document parsing.
From the abstract
We present Multimodal OCR (MOCR), a document parsing paradigm that jointly parses text and graphics into unified textual representations. Unlike conventional OCR systems that focus on text recognition and leave graphical regions as cropped pixels, our method treats visual elements such as charts, diagrams, tables, and icons as first-class parsing targets, enabling systems to parse documents while preserving semantic relationships across elements. It offers several advantages…
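
To make the contrast in the abstract concrete, here is a minimal sketch of the difference between leaving graphics as cropped pixels and emitting them as code. It is illustrative only: the `Element` class, both render functions, and the sample page are hypothetical names invented for this example, and the excerpt above does not specify the model's actual output format.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One parsed region of a page (hypothetical structure, not from the paper)."""
    kind: str     # e.g. "text", "table", "chart", "diagram", "icon"
    content: str  # plain text, or code such as SVG/TikZ for a graphic


def conventional_render(elements):
    """Conventional OCR view: keep text, replace graphics with crop placeholders."""
    lines = []
    for i, el in enumerate(elements):
        if el.kind == "text":
            lines.append(el.content)
        else:
            # The graphic survives only as a reference to pixels.
            lines.append(f"[image: crop_{i}.png]")
    return "\n".join(lines)


def unified_render(elements):
    """MOCR-style view: every element, graphic or not, is emitted as text or code."""
    return "\n".join(el.content for el in elements)


# Assumed sample page: one sentence plus a one-bar chart expressed as SVG.
page = [
    Element("text", "Figure 1 shows quarterly revenue."),
    Element(
        "chart",
        '<svg xmlns="http://www.w3.org/2000/svg">'
        '<rect x="0" y="10" width="20" height="40"/></svg>',
    ),
]

print(conventional_render(page))
print(unified_render(page))
```

In the unified output the chart stays in the same text stream as the sentence that refers to it, so a downstream LLM can read or edit the bar's geometry directly instead of receiving an opaque image reference.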