AI & ML Paradigm Shift

Decouples high-level intent planning from low-level motor control in Vision-Language-Action (VLA) models to prevent the degradation of pre-trained VLM representations.

April 1, 2026

Original Paper

DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

Yi Chen, Yuying Ge, Hui Zhou, Mingyu Ding, Yixiao Ge, Xihui Liu

arXiv · 2603.29844

The Takeaway

Traditional end-to-end VLAs often 'break' the underlying VLM's reasoning by forcing it to output raw actions; DIAL uses a latent foresight bottleneck that preserves pre-trained knowledge while improving robotic execution stability.

From the abstract

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, which decouples high-level intent planning from low-level motor control via latent world modeling.
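
To make the architectural contrast concrete, here is a minimal PyTorch sketch of the two paradigms the abstract describes: a direct-mapping head that backpropagates action losses into the backbone, versus a decoupled design in which a frozen VLM emits a compact latent intent consumed by a separate action decoder. All class names, dimensions, and the bottleneck interface here are illustrative assumptions, not DIAL's actual implementation.

```python
# Hypothetical sketch of the decoupled paradigm; names, sizes, and the
# "latent intent" interface are assumptions, not DIAL's code.
import torch
import torch.nn as nn


class DirectActionVLA(nn.Module):
    """The criticized paradigm: VLM features map straight to actions,
    so action-regression gradients flow into the pre-trained backbone."""

    def __init__(self, vlm: nn.Module, feat_dim: int, action_dim: int):
        super().__init__()
        self.vlm = vlm                          # pre-trained VLM backbone
        self.action_head = nn.Linear(feat_dim, action_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        feats = self.vlm(obs)                   # (B, feat_dim)
        return self.action_head(feats)          # raw low-level actions


class DecoupledVLA(nn.Module):
    """Decoupled sketch: the VLM only produces a compact latent intent
    (a foresight-style bottleneck); a separate decoder handles motor
    control. Freezing the backbone keeps its representations intact."""

    def __init__(self, vlm: nn.Module, feat_dim: int,
                 latent_dim: int, action_dim: int):
        super().__init__()
        self.vlm = vlm
        for p in self.vlm.parameters():         # preserve pre-trained weights
            p.requires_grad = False
        self.to_latent = nn.Linear(feat_dim, latent_dim)  # intent bottleneck
        self.action_decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():                   # no gradient into the VLM
            feats = self.vlm(obs)
        z = self.to_latent(feats)               # high-level intent plan
        return self.action_decoder(z)           # low-level motor commands


# Toy usage: a tiny MLP stands in for the VLM backbone.
if __name__ == "__main__":
    vlm = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
    model = DecoupledVLA(vlm, feat_dim=64, latent_dim=16, action_dim=7)
    actions = model(torch.randn(4, 32))
    print(actions.shape)  # torch.Size([4, 7])
```

In this sketch, only the bottleneck and decoder receive action-loss gradients, which is one simple way to realize the claim that a latent interface shields the VLM's semantic representations from degradation during action training.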