OnlyFlow controls video generation from a text prompt and the motion of an input video, using an estimate of its optical flow.

Abstract

We consider the problem of text-to-video generation with precise motion control, for applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on user-defined controls such as binary masks or camera movement embeddings. We propose OnlyFlow, an approach that leverages the optical flow extracted from an input video to condition the motion of the generated video. Given a text prompt and an input video, OnlyFlow lets the user generate videos that respect both the motion of the input video and the text prompt. This is implemented by applying an optical flow estimation model to the input video and feeding the estimated flow to a trainable optical flow encoder. The resulting feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies showing that OnlyFlow compares favorably to state-of-the-art methods on a wide range of tasks, even though it was not specifically trained for them. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.
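
For intuition, the sketch below shows the kind of trainable flow encoder such a pipeline relies on, applied to a dummy optical flow tensor; the FlowEncoder name, layer choices and tensor sizes are illustrative assumptions, not the released architecture.

import torch
import torch.nn as nn

class FlowEncoder(nn.Module):
    """Toy stand-in for OnlyFlow's trainable optical flow encoder: maps a
    2-channel flow sequence (B, 2, T, H, W) to feature maps that would be
    injected into the frozen text-to-video backbone."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2, dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(dim, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        return self.net(flow)

# Dummy flow for a 16-frame, 64x64 video, as an off-the-shelf estimator would produce.
flow = torch.randn(1, 2, 16, 64, 64)
features = FlowEncoder()(flow)
print(features.shape)  # torch.Size([1, 64, 16, 32, 32])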

Conditioning strength

The strength of the optical flow conditioning can be varied to make the generated motion match the input motion more closely.


Illustration of the impact of optical flow conditioning strength on generated videos. The videos in each row are obtained for increasing values of gamma between 0 and 1.0.
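
One simple way to expose such a strength parameter is to scale the injected flow features by gamma before they are combined with the backbone activations. The sketch below illustrates this under that assumption; the function and tensor shapes are illustrative, not the exact OnlyFlow implementation.

import torch

def apply_flow_conditioning(hidden_states: torch.Tensor,
                            flow_features: torch.Tensor,
                            gamma: float = 1.0) -> torch.Tensor:
    # gamma = 0.0 disables the flow conditioning, gamma = 1.0 applies it fully.
    return hidden_states + gamma * flow_features

# Half-strength conditioning on dummy activations and flow features.
h = torch.randn(1, 64, 16, 32, 32)
f = torch.randn(1, 64, 16, 32, 32)
out = apply_flow_conditioning(h, f, gamma=0.5)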

Camera control capabilities

OnlyFlow can serve as a tool to control camera movement using either another video or a preset motion field (optical flow) corresponding to a specific camera trajectory.


All videos are generated using the same text prompt, camera trajectory (pan-left), and seed. Without having been trained for this task, our model achieves the same camera control capability as approaches trained specifically for camera movement.
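
To illustrate what a preset motion field can look like, the sketch below builds a constant horizontal flow approximating a camera pan; the function name, resolution, speed and sign convention are assumptions for illustration, not values taken from the paper.

import torch

def preset_pan_flow(num_frames: int = 16, height: int = 64, width: int = 64,
                    pixels_per_frame: float = 2.0) -> torch.Tensor:
    # Returns a (2, T, H, W) flow field: channel 0 is horizontal (u) and
    # channel 1 vertical (v) displacement. The sign convention depends on the
    # flow estimator, so the direction may need to be flipped for a pan-left.
    flow = torch.zeros(2, num_frames, height, width)
    flow[0] = pixels_per_frame  # uniform horizontal motion, no vertical motion
    return flow

flow = preset_pan_flow()
print(flow.shape)  # torch.Size([2, 16, 64, 64])

Such a synthetic flow can then be passed to the flow encoder in place of a flow estimated from a real video.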

Comparison with other approaches

The model exhibits a superior combination of motion fidelity and image realism.


Example videos are generated using the same text prompt and input video. OnlyFlow compares favorably to approaches that use depth maps (RAVE, VideoComposer, Control-A-Video), and is comparable to Gen-1 in temporal coherence and to VideoComposer in image quality.

Semantic alignment

Our model generates a video that closely follows both the prompt and the optical flow.


With the prompt "Trees in forest", our model generates a clothesline corresponding to the teeth of the smile in the input video.

OnlyFlow model architecture

Overview of OnlyFlow. Inputs are i) a tokenized and encoded text prompt, ii) noisy latents for the diffusion model, and iii) the optical flow of an input video. The latter is fed to a trainable optical flow encoder whose output feature maps are injected into the diffusion U-Net. We experiment with several injection strategies; for illustration purposes we only show the injection into the temporal attention layers of the U-Net. The U-Net is kept frozen during training. The generated video matches both the prompt and the auxiliary video's motion.
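
To make the injection step concrete, here is a minimal sketch of one way flow-encoder features could enter a temporal attention layer, namely as extra keys and values; the class name and mechanism are illustrative assumptions based on the overview above, not the exact released code.

import torch
import torch.nn as nn

class FlowConditionedTemporalAttention(nn.Module):
    # Schematic temporal self-attention block that also attends over
    # flow-encoder features; only the flow projection would be trained,
    # mirroring the frozen-U-Net setup described above.
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, flow_feats: torch.Tensor) -> torch.Tensor:
        # x, flow_feats: (B*H*W, T, C) sequences over the temporal axis.
        kv = torch.cat([x, self.flow_proj(flow_feats)], dim=1)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return x + out  # residual connection around the attention

# Shape check with dummy tensors: 8x8 spatial positions, 16 frames, 64 channels.
x = torch.randn(64, 16, 64)
f = torch.randn(64, 16, 64)
y = FlowConditionedTemporalAttention()(x, f)
print(y.shape)  # torch.Size([64, 16, 64])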

BibTeX

@misc{koroglu2024onlyflow,
      title={OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models},
      author={Mathis Koroglu and Hugo Caselles-Dupré and Guillaume Jeanneret Sanmiguel and Matthieu Cord},
      year={2024},
      eprint={2411.10501},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10501},
}

Acknowledgements

This project was granted access to HPC/AI computing and storage resources by GENCI at IDRIS under grant 2024-AD011014329R1, on the V100 and A100 partitions of the Jean Zay supercomputer.