Introduces adaptive video tokenization that allocates tokens based on scene complexity, reducing token usage by 24% while improving reconstruction quality.
arXiv · March 13, 2026 · 2603.12267
Why it matters
Fixed-length tokenizers waste compute on static video segments; EVATok uses lightweight routers to predict an optimal token assignment for each video block. This significantly lowers the computational cost of downstream autoregressive video generation models, such as Sora-style architectures.
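The core idea can be sketched as a complexity-proportional token allocator. The code below is a minimal illustration, not EVATok's actual router: it uses mean absolute frame difference as a stand-in complexity score (the function name `allocate_tokens`, the `min_tokens` floor, and the scoring heuristic are all assumptions for illustration).

```python
def allocate_tokens(blocks, total_budget, min_tokens=4):
    """Split a total token budget across temporal blocks in proportion
    to each block's motion energy (a simple complexity proxy).

    blocks: list of temporal blocks; each block is a list of frames,
            each frame a flat list of pixel values.
    """
    def motion_energy(frames):
        # Mean absolute difference between consecutive frames:
        # near zero for static segments, large for dynamic ones.
        total, count = 0.0, 0
        for prev, cur in zip(frames, frames[1:]):
            total += sum(abs(a - b) for a, b in zip(prev, cur))
            count += len(cur)
        return total / max(count, 1)

    scores = [motion_energy(b) + 1e-8 for b in blocks]  # avoid divide-by-zero
    norm = sum(scores)
    # Proportional allocation with a floor so static blocks
    # still receive enough tokens to be reconstructed at all.
    return [max(min_tokens, round(total_budget * s / norm)) for s in scores]
```

With this scheme a static block (zero motion energy) falls back to the `min_tokens` floor, while a dynamic block absorbs the remaining budget, mirroring the paper's goal of spending tokens where the video is complex.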
From the abstract
Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform token assignment across temporal blocks of different videos, often wasting tokens on simple, static, or repetitive segments while underserving dynamic or complex ones. To address this inefficiency, we …