FrescoDiffusion enables coherent, 4K image-to-video generation using a training-free, tiled diffusion method with precomputed latent priors.
arXiv · March 19, 2026 · 2603.17555
The Takeaway
FrescoDiffusion addresses the resolution ceiling of current image-to-video (I2V) models by fusing local tile detail with global latent trajectories. Practitioners can animate high-resolution, complex scenes such as monumental artworks while keeping the global structure consistent, without expensive retraining.
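To make the idea of fusing tile detail with a global prior concrete, here is a minimal sketch of one possible fusion step. It is not the paper's algorithm: the function names, the uniform overlap averaging, the linear blend, and the `weight` parameter are all assumptions chosen for illustration, and `denoise_tile` is a hypothetical stand-in for a real per-tile denoising call.

```python
import numpy as np


def tile_starts(size, tile, stride):
    """Start offsets of tiles of length `tile` covering [0, size).

    The last tile is placed flush with the edge. Assumes stride <= tile,
    otherwise gaps between tiles would be left uncovered.
    """
    if tile >= size:
        return [0]
    starts = list(range(0, size - tile, stride))
    starts.append(size - tile)
    return starts


def fuse_tiles_with_prior(latent, prior, denoise_tile, tile=32, stride=24, weight=0.5):
    """Process overlapping tiles, average the overlaps, then blend with a prior.

    latent, prior: arrays of shape (C, H, W); `prior` is assumed to be a
        precomputed low-resolution latent already upsampled to (H, W).
    denoise_tile: hypothetical callable mapping a (C, h, w) patch to a patch
        of the same shape (stand-in for one denoising step).
    weight: hypothetical blend factor; 0 keeps only the tiled result,
        1 keeps only the prior.
    """
    _, h, w = latent.shape
    acc = np.zeros_like(latent, dtype=np.float64)
    count = np.zeros((1, h, w), dtype=np.float64)
    for y in tile_starts(h, tile, stride):
        for x in tile_starts(w, tile, stride):
            patch = latent[:, y:y + tile, x:x + tile]
            acc[:, y:y + tile, x:x + tile] += denoise_tile(patch)
            count[:, y:y + tile, x:x + tile] += 1.0
    tiled = acc / count  # uniform average wherever tiles overlap
    return (1.0 - weight) * tiled + weight * prior
```

Uniform averaging and a single scalar weight are the simplest choices available. A real pipeline would repeat a step like this at every denoising timestep and across video frames, and could use feathered tile masks or a weight that varies with the timestep.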
From the abstract
Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different su