AI & ML New Capability

Lifting 2D features into a volumetric representation for robot manipulation policies yields a 14.8% improvement in success rate by resolving the 2D-to-3D spatial reasoning mismatch.

arXiv · March 19, 2026 · 2603.17720

Tianxing Zhou, Feiyang Xue, Zhangchen Ye, Tianyuan Yuan, Hang Zhao, Tao Jiang

The Takeaway

By explicitly reasoning in 3D through voxel-based spatial tokens, VolumeDP overcomes the inherent limitations of 2D image-to-3D action mapping. This results in significantly more robust generalization to novel camera viewpoints and spatial layouts compared to standard visual imitation learning.

From the abstract

Imitation learning is a prominent paradigm for robotic manipulation. However, existing visual imitation methods map 2D image observations directly to 3D action outputs, imposing a 2D-3D mismatch that hinders spatial reasoning and degrades robustness. We present VolumeDP, a policy architecture that restores spatial alignment by explicitly reasoning in 3D. VolumeDP first lifts image features into a Volumetric Representation via cross-attention. It then selects task-relevant voxels with a learnable […]
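The lifting step described above can be sketched with plain NumPy: a grid of learnable voxel queries cross-attends over the flattened 2D image features, so each voxel aggregates the pixels most relevant to its 3D location. All sizes and weight matrices below are hypothetical placeholders (the paper's actual dimensions, encoder, and learned parameters are not given in this excerpt); in practice the projections and voxel queries would be trained end-to-end.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper
H, W, C = 8, 8, 16        # 2D feature map from an image encoder
X, Y, Z = 4, 4, 4         # voxel grid resolution
d = 16                    # attention dimension

img_feats = rng.standard_normal((H * W, C))          # flattened pixel features (keys/values)
voxel_queries = rng.standard_normal((X * Y * Z, d))  # learnable per-voxel query embeddings

# Random stand-ins for learned projection weights
Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((C, d)) / np.sqrt(C)
Wv = rng.standard_normal((C, d)) / np.sqrt(C)

Q = voxel_queries @ Wq   # (X*Y*Z, d)
K = img_feats @ Wk       # (H*W, d)
V = img_feats @ Wv       # (H*W, d)

# Each voxel attends over all pixels, then pools their values:
attn = softmax(Q @ K.T / np.sqrt(d), axis=-1)    # (X*Y*Z, H*W)
volume = (attn @ V).reshape(X, Y, Z, d)          # lifted volumetric representation

print(volume.shape)  # (4, 4, 4, 16)
```

The resulting `volume` tensor is the kind of voxel-based spatial token grid the abstract refers to; a subsequent selection module (the "learnable" component the excerpt cuts off at) would then keep only task-relevant voxels before action prediction.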