Upgrades a pre-trained video Diffusion Transformer to ultra-high-resolution synthesis via a two-stage 'Relay LoRA' adaptation trained on images alone.
March 25, 2026
Original Paper
ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images
arXiv · 2603.23326
The Takeaway
ViBe sidesteps the quadratic cost of 3D attention that makes end-to-end high-resolution video training prohibitive. By decoupling modality alignment from spatial extrapolation, it lets a pre-trained video model scale to higher resolutions using cheap high-resolution images instead of expensive high-resolution video.
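The scaling problem is easy to see with back-of-envelope arithmetic. The sketch below (frame count, resolutions, and patch size are illustrative assumptions, not the paper's settings) counts spatio-temporal tokens and the O(n²) pairwise interactions full 3D attention must score:

```python
def attention_cost(frames, height, width, patch=16):
    """Spatio-temporal token count and the quadratic number of
    pairwise interactions full 3D attention computes over them."""
    tokens = frames * (height // patch) * (width // patch)
    return tokens, tokens ** 2

# Native-scale training, e.g. 16 frames at 512x512.
base_tokens, base_pairs = attention_cost(16, 512, 512)

# Doubling spatial resolution to 1024x1024 quadruples the tokens,
# so the attention cost grows 16x.
hi_tokens, hi_pairs = attention_cost(16, 1024, 1024)

print(hi_tokens // base_tokens, hi_pairs // base_pairs)  # → 4 16
```

This 16x blow-up per resolution doubling is why the paper adapts with images (a single frame, so far fewer tokens) rather than training end-to-end on high-resolution video.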
From the abstract
Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often i
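The adaptation mechanism the title refers to is LoRA. As a point of reference, here is a generic low-rank update, W' = W + (α/r)·BA, sketched in NumPy; the dimensions are illustrative, and the paper's specific two-stage "Relay" scheme is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # layer width (illustrative)
r = 4    # LoRA rank, r << d

W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

def lora_forward(x, W, A, B, alpha=8):
    # Frozen base path plus a low-rank residual; only A and B are trained,
    # so adaptation is cheap relative to full fine-tuning.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
out = lora_forward(x, W, A, B)
# With B zero-initialized the adapter starts as an exact identity residual:
assert np.allclose(out, x @ W.T)
```

Because each stage's update lives in a small low-rank factor, training on high-resolution images alone stays far cheaper than updating the full 3D-attention backbone.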