This paper reveals that pre-trained image editing models can be repurposed for video frame interpolation with only a few hundred training samples via LoRA fine-tuning.
March 17, 2026
Original Paper
Edit2Interp: Adapting Image Foundation Models from Spatial Editing to Video Frame Interpolation with Few-Shot Learning
arXiv · 2603.15003
The Takeaway
The result breaks the assumption that video tasks require dedicated temporal architectures or motion-estimation modules. It demonstrates that the massive spatial priors in foundation image models already contain a 'latent temporal reasoning' capability that can be activated with minimal data for video synthesis.
From the abstract
Pre-trained image editing models exhibit strong spatial reasoning and object-aware transformation capabilities acquired from billions of image-text pairs, yet they possess no explicit temporal modeling. This paper demonstrates that these spatial priors can be repurposed to unlock temporal synthesis capabilities through minimal adaptation, without introducing any video-specific architecture or motion estimation modules. We show that a large image editing model (Qwen-Image-Edit), originally designed […]
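To make the recipe concrete, here is a minimal sketch of what "minimal adaptation" could look like in practice: freeze a pre-trained editing backbone, inject LoRA adapters into its attention projections, and fine-tune on (previous frame, next frame) → middle frame triplets. `TinyEditBackbone`, the layer names, the loss, and all hyperparameters below are illustrative stand-ins, not the paper's actual architecture or training setup.

```python
# Hedged sketch: LoRA adaptation of a frozen image model for frame interpolation.
# All module and variable names are hypothetical placeholders.
import torch
import torch.nn as nn
from peft import LoraConfig, inject_adapter_in_model

class TinyEditBackbone(nn.Module):
    """Stand-in for one attention block of a frozen image editing model."""
    def __init__(self, dim=64):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.shape[-1] ** 0.5, dim=-1)
        return self.proj(attn @ v)

model = TinyEditBackbone()
for p in model.parameters():          # freeze the spatial prior
    p.requires_grad_(False)

# Inject low-rank adapters into the attention projections; only these train.
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["to_q", "to_k", "to_v"])
model = inject_adapter_in_model(lora, model)

opt = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)

# One hypothetical training step: condition on concatenated boundary-frame
# features and regress the middle frame's features (the paper's real objective
# is presumably a diffusion editing loss, not plain MSE).
prev_feat, next_feat, mid_feat = (torch.randn(1, 16, 64) for _ in range(3))
pred = model(torch.cat([prev_feat, next_feat], dim=1))
loss = nn.functional.mse_loss(pred[:, :16], mid_feat)
loss.backward()
opt.step()
```

With only the low-rank matrices trainable, a few hundred triplets is a plausible budget: the adapters steer the model's existing spatial editing behavior toward "edit frame A into the halfway point of frame B" rather than learning temporal dynamics from scratch.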