Upgrades a pre-trained video Diffusion Transformer to ultra-high-resolution synthesis via a two-stage 'Relay LoRA' adaptation trained on images alone.
March 25, 2026
Original Paper
ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images
arXiv · 2603.23326
The Takeaway
ViBe sidesteps the quadratic cost of 3D attention that makes end-to-end high-resolution video training prohibitive. By decoupling modality alignment from spatial extrapolation, it lets a pre-trained video model scale to higher resolutions using cheap high-resolution images instead of expensive high-resolution video.
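The scaling problem is easy to see with back-of-envelope arithmetic. The sketch below (frame count, resolutions, and patch size are illustrative assumptions, not the paper's settings) counts spatio-temporal tokens and the O(n²) pairwise interactions full 3D attention must score:

```python
def attention_cost(frames, height, width, patch=16):
    """Spatio-temporal token count and the quadratic number of
    pairwise interactions full 3D attention computes over them."""
    tokens = frames * (height // patch) * (width // patch)
    return tokens, tokens ** 2

# Native-scale training, e.g. 16 frames at 512x512.
base_tokens, base_pairs = attention_cost(16, 512, 512)

# Doubling spatial resolution to 1024x1024 quadruples the tokens,
# so the attention cost grows 16x.
hi_tokens, hi_pairs = attention_cost(16, 1024, 1024)

print(hi_tokens // base_tokens, hi_pairs // base_pairs)  # → 4 16
```

This 16x blow-up per resolution doubling is why the paper adapts with images (a single frame, so far fewer tokens) rather than training end-to-end on high-resolution video.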
From the abstract
Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often i
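The adaptation mechanism the title refers to is LoRA. As a point of reference, here is a generic low-rank update, W' = W + (α/r)·BA, sketched in NumPy; the dimensions are illustrative, and the paper's specific two-stage "Relay" scheme is not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64   # layer width (illustrative)
r = 4    # LoRA rank, r << d

W = rng.standard_normal((d, d))          # frozen pre-trained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection, zero-init

def lora_forward(x, W, A, B, alpha=8):
    # Frozen base path plus a low-rank residual; only A and B are trained,
    # so adaptation is cheap relative to full fine-tuning.
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.standard_normal((2, d))
out = lora_forward(x, W, A, B)
# With B zero-initialized the adapter starts as an exact identity residual:
assert np.allclose(out, x @ W.T)
```

Because each stage's update lives in a small low-rank factor, training on high-resolution images alone stays far cheaper than updating the full 3D-attention backbone.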