Redefines robotic visual state representations by explicitly encoding 'what-is-where' composition through a global-to-local reconstruction objective.
March 17, 2026
Original Paper
Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition
arXiv · 2603.13904
The Takeaway
Standard visual tokens often fail to capture the precise spatial-semantic relationships needed for complex manipulation. By forcing a single bottleneck token to reconstruct masked local patches, this framework ensures visual states preserve fine-grained identity and location information, significantly improving robot policy learning.
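To make the idea concrete, here is a minimal sketch of a global-to-local reconstruction objective of this kind: a single bottleneck token is pooled from the visible patches, and a light decoder must reconstruct the masked patches from that one token plus per-position queries. All module and variable names here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class GlobalToLocalReconstructor(nn.Module):
    """Sketch of a global-to-local reconstruction objective: compress a
    frame into ONE state token, then reconstruct masked local patches
    from that token plus per-patch position queries. Names and sizes
    are hypothetical, not taken from the paper."""

    def __init__(self, num_patches=196, patch_dim=768, state_dim=768):
        super().__init__()
        # Stand-in patch encoder (a ViT backbone in practice).
        self.patch_encoder = nn.Linear(patch_dim, state_dim)
        # Learnable query that pools all visible patches into one state token.
        self.state_query = nn.Parameter(torch.zeros(1, 1, state_dim))
        self.pool = nn.MultiheadAttention(state_dim, 8, batch_first=True)
        # Position embeddings act as "where" queries for the decoder.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, state_dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(state_dim, 8, batch_first=True),
            num_layers=2,
        )
        self.to_pixels = nn.Linear(state_dim, patch_dim)

    def forward(self, patches, mask):
        # patches: (B, N, patch_dim); mask: (B, N) bool, True = held out.
        B = patches.size(0)
        # Zero out masked patches so the state token only sees visible ones.
        visible = self.patch_encoder(patches) * (~mask).unsqueeze(-1)
        # Global step: a single bottleneck token attends over visible patches.
        state, _ = self.pool(self.state_query.expand(B, -1, -1), visible, visible)
        # Local step: position queries ask "what" sits at each "where",
        # with only the one-token state available as memory.
        queries = self.pos_embed.expand(B, -1, -1)
        recon = self.to_pixels(self.decoder(queries, state))
        # Loss only on masked patches: correct reconstruction requires the
        # state token to retain identity + location for the whole scene.
        loss = ((recon - patches) ** 2)[mask].mean()
        return state.squeeze(1), loss
```

The key design point the sketch illustrates is the bottleneck: because the decoder's only memory is the single state token, any patch it reconstructs correctly must have had both its semantic identity and its spatial location packed into that token.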
From the abstract
For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations […]