Releases Feynman, an agentic pipeline and 100k-sample dataset for generating high-quality, knowledge-rich diagrams with grounded captions.
arXiv · March 16, 2026 · 2603.12597
Why it matters
Visual reasoning data is hard to scale because well-aligned, high-quality image-text pairs are scarce in technical domains. This release offers a scalable way to synthesize complex diagram data for training vision-language models.
From the abstract
Visual design is an essential application of state-of-the-art multi-modal AI systems. Improving these systems requires high-quality vision-language data at scale. Despite the abundance of internet image and text data, knowledge-rich and well-aligned image-text pairs are rare. In this paper, we present a scalable diagram generation pipeline built with our agent, Feynman. To create diagrams, Feynman first enumerates domain-specific knowledge components ("ideas") and performs code planning based