AI & ML Efficiency Breakthrough

Prunes 85% of visual tokens in Vision-Language-Action (VLA) models while retaining 94% accuracy for autonomous driving.

March 30, 2026

Original Paper

ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models

Yiru Wang, Anqing Jiang, Shuo Wang, Yuwen Heng, Zichong Gu, Hao Sun

arXiv · 2603.25766

AI-generated illustration

The Takeaway

VLA models are notoriously compute-heavy due to quadratic attention over multi-view video frames; this framework provides a massive efficiency gain (61% FLOP reduction) by dynamically identifying 'important' visual tokens inspired by human driver attention.

From the abstract

The integration of Vision-Language-Action (VLA) models into autonomous driving systems offers a unified framework for interpreting complex scenes and executing control commands. However, the necessity to incorporate historical multi-view frames for accurate temporal reasoning imposes a severe computational burden, primarily driven by the quadratic complexity of self-attention mechanisms in Large Language Models (LLMs). To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptati

Read the original paper →

← Back to today's papers