OnlyFlow controls video generation from a text prompt and the motion of an input video, using an estimate of its optical flow.

Abstract

We consider the problem of text-to-video generation with precise motion control, for applications such as camera movement control and video-to-video editing. Most methods tackling this problem rely on user-defined controls such as binary masks or camera movement embeddings. We propose OnlyFlow, an approach that leverages the optical flow extracted from an input video to condition the motion of the generated video. Given a text prompt and an input video, OnlyFlow lets the user generate videos that respect both the motion of the input video and the text prompt. This is implemented by applying an optical flow estimation model to the input video and feeding the estimated flow to a trainable optical flow encoder. The resulting feature maps are then injected into the text-to-video backbone model. We perform quantitative, qualitative and user preference studies showing that OnlyFlow compares favorably to state-of-the-art methods on a wide range of tasks, even though it was not specifically trained for them. OnlyFlow thus constitutes a versatile, lightweight yet efficient method for controlling motion in text-to-video generation. Models and code will be made available on GitHub and HuggingFace.
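
For intuition, the sketch below shows the kind of trainable flow encoder such a pipeline relies on, applied to a dummy optical flow tensor; the FlowEncoder name, layer choices and tensor sizes are illustrative assumptions, not the released architecture.

import torch
import torch.nn as nn

class FlowEncoder(nn.Module):
    """Toy stand-in for OnlyFlow's trainable optical flow encoder: maps a
    2-channel flow sequence (B, 2, T, H, W) to feature maps that would be
    injected into the frozen text-to-video backbone."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2, dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv3d(dim, dim, kernel_size=3, stride=(1, 2, 2), padding=1),
        )

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        return self.net(flow)

# Dummy flow for a 16-frame, 64x64 video, as an off-the-shelf estimator would produce.
flow = torch.randn(1, 2, 16, 64, 64)
features = FlowEncoder()(flow)
print(features.shape)  # torch.Size([1, 64, 16, 32, 32])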

Conditioning strength

The strength of the optical flow conditioning can be varied to make the generated motion match the input motion more closely.


Illustration of the impact of optical flow conditioning strength on generated videos. The videos in each row are obtained for increasing values of gamma between 0 and 1.0.
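
One simple way to expose such a strength parameter is to scale the injected flow features by gamma before they are combined with the backbone activations. The sketch below illustrates this under that assumption; the function and tensor shapes are illustrative, not the exact OnlyFlow implementation.

import torch

def apply_flow_conditioning(hidden_states: torch.Tensor,
                            flow_features: torch.Tensor,
                            gamma: float = 1.0) -> torch.Tensor:
    # gamma = 0.0 disables the flow conditioning, gamma = 1.0 applies it fully.
    return hidden_states + gamma * flow_features

# Half-strength conditioning on dummy activations and flow features.
h = torch.randn(1, 64, 16, 32, 32)
f = torch.randn(1, 64, 16, 32, 32)
out = apply_flow_conditioning(h, f, gamma=0.5)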

Camera control capabilities

OnlyFlow can serve as a tool to control camera movement using either another video or a preset motion field (optical flow) corresponding to a specific camera trajectory.


All videos are generated using the same text prompt, camera trajectory (pan-left), and seed. Without having been trained for this task, our model achieves the same camera control capability as approaches trained specifically for camera movement.
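
To illustrate what a preset motion field can look like, the sketch below builds a constant horizontal flow approximating a camera pan; the function name, resolution, speed and sign convention are assumptions for illustration, not values taken from the paper.

import torch

def preset_pan_flow(num_frames: int = 16, height: int = 64, width: int = 64,
                    pixels_per_frame: float = 2.0) -> torch.Tensor:
    # Returns a (2, T, H, W) flow field: channel 0 is horizontal (u) and
    # channel 1 vertical (v) displacement. The sign convention depends on the
    # flow estimator, so the direction may need to be flipped for a pan-left.
    flow = torch.zeros(2, num_frames, height, width)
    flow[0] = pixels_per_frame  # uniform horizontal motion, no vertical motion
    return flow

flow = preset_pan_flow()
print(flow.shape)  # torch.Size([2, 16, 64, 64])

Such a synthetic flow can then be passed to the flow encoder in place of a flow estimated from a real video.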

Comparison with other approaches

The model exhibits a superior combination of motion fidelity and image realism.


Example videos are generated using the same text prompt and input video. OnlyFlow compares favorably to approaches that use depth maps (RAVE, VideoComposer, Control-A-Video), and is comparable to Gen-1 in temporal coherence and to VideoComposer in image quality.

Semantic alignment

Our model generates a video that closely follows both the prompt and the optical flow.


With the prompt "Trees in forest", our model generates a clothesline corresponding to the teeth of the smile in the input video.

OnlyFlow model architecture

Overview of OnlyFlow. Inputs are i) a tokenized and encoded text prompt, ii) noisy latents for the diffusion model, and iii) the optical flow of an input video. The latter is fed to a trainable optical flow encoder whose output feature maps are injected into the diffusion U-Net. We experiment with several injection strategies; for illustration purposes we only show the injection into the temporal attention layers of the U-Net. The U-Net is kept frozen during training. The generated video matches both the prompt and the auxiliary video's motion.
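
To make the injection step concrete, here is a minimal sketch of one way flow-encoder features could enter a temporal attention layer, namely as extra keys and values; the class name and mechanism are illustrative assumptions based on the overview above, not the exact released code.

import torch
import torch.nn as nn

class FlowConditionedTemporalAttention(nn.Module):
    # Schematic temporal self-attention block that also attends over
    # flow-encoder features; only the flow projection would be trained,
    # mirroring the frozen-U-Net setup described above.
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, flow_feats: torch.Tensor) -> torch.Tensor:
        # x, flow_feats: (B*H*W, T, C) sequences over the temporal axis.
        kv = torch.cat([x, self.flow_proj(flow_feats)], dim=1)
        out, _ = self.attn(query=x, key=kv, value=kv)
        return x + out  # residual connection around the attention

# Shape check with dummy tensors: 8x8 spatial positions, 16 frames, 64 channels.
x = torch.randn(64, 16, 64)
f = torch.randn(64, 16, 64)
y = FlowConditionedTemporalAttention()(x, f)
print(y.shape)  # torch.Size([64, 16, 64])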

BibTeX

@misc{koroglu2024onlyflow,
      title={OnlyFlow: Optical Flow based Motion Conditioning for Video Diffusion Models},
      author={Mathis Koroglu and Hugo Caselles-Dupré and Guillaume Jeanneret Sanmiguel and Matthieu Cord},
      year={2024},
      eprint={2411.10501},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.10501},
}

Acknowledgements

This project was granted access to HPC/AI computing and storage resources by GENCI at IDRIS under grant 2024-AD011014329R1, on the V100 and A100 partitions of the Jean Zay supercomputer.