VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models

Tao Wu, Yong Zhang, Xiaodong Cun, Zhongang Qi, Junfu Pu, Huanzhang Dou, Guangcong Zheng, Ying Shan, Xi Li

1Zhejiang University   2ARC Lab, Tencent PCG   3Tencent AI Lab   4Huawei Noah's Ark Lab
* These authors contributed equally. † Corresponding author.


Demo

Abstract

Zero-shot customized video generation has gained significant attention due to its substantial application potential. Existing methods rely on additional models to extract and inject reference subject features, assuming that the Video Diffusion Model (VDM) alone is insufficient for zero-shot customized video generation. However, these methods often struggle to maintain a consistent subject appearance due to suboptimal feature extraction and injection techniques. In this paper, we reveal that the VDM inherently possesses the force to extract and inject subject features. Departing from previous heuristic approaches, we introduce a novel framework that leverages this inherent force to enable high-quality zero-shot customized video generation. Specifically, for feature extraction, we directly input reference images into the VDM and use its intrinsic feature-extraction process, which not only provides fine-grained features but also aligns closely with the VDM's pre-trained knowledge. For feature injection, we devise a bidirectional interaction between subject features and generated content through spatial self-attention within the VDM, ensuring better subject fidelity while maintaining the diversity of the generated videos. Experiments on both customized human and object video generation validate the effectiveness of our framework.
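
As a concrete illustration of the extraction step, below is a minimal PyTorch sketch that collects reference features by running the reference latent through the VDM's own UNet. It assumes a diffusers-style UNet whose spatial self-attention modules are named "attn1"; the function name and the hook mechanism are our illustrative assumptions, not the released implementation.

    import torch

    @torch.no_grad()
    def extract_reference_features(unet, ref_latent, timestep, text_emb):
        # Collect the hidden states entering every spatial self-attention
        # ("attn1" in diffusers-style UNets) while the VDM processes the
        # reference latent, so the features stay in its pre-trained space.
        features = {}

        def make_hook(name):
            def hook(module, args, kwargs, output):
                features[name] = args[0].detach()  # tokens fed to attn1
            return hook

        handles = [
            module.register_forward_hook(make_hook(name), with_kwargs=True)
            for name, module in unet.named_modules()
            if name.endswith("attn1")
        ]
        try:
            unet(ref_latent, timestep, encoder_hidden_states=text_emb)
        finally:
            for h in handles:
                h.remove()
        return features  # {layer_name: (batch, tokens, channels)}

Because these features come from the VDM itself rather than an external encoder, they naturally live in the VDM's pre-trained feature space, which is the alignment referred to above.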

Method

Overall pipeline of VideoMaker. We directly input the reference image into the VDM and use the VDM's own modules for fine-grained feature extraction. We modify the computation of spatial self-attention to enable feature injection. Additionally, to help the model distinguish reference features from generated content, we design a Guidance Information Recognition Loss to optimize the training strategy. Our method achieves high-fidelity zero-shot customized human and object video generation built on AnimateDiff.
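
To make the injection step concrete, here is a minimal PyTorch sketch of spatial self-attention computed over the concatenation of generated-frame tokens and reference tokens, which yields the bidirectional interaction described above, together with one plausible form of the Guidance Information Recognition Loss as a per-token binary classifier. Class names, tensor shapes, and the classifier formulation are our assumptions for illustration, not the authors' exact code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BidirectionalSpatialSelfAttention(nn.Module):
        """Self-attention over [video tokens; reference tokens] so that
        each stream can attend to the other (bidirectional interaction)."""

        def __init__(self, dim, num_heads=8):
            super().__init__()
            self.num_heads = num_heads
            self.to_q = nn.Linear(dim, dim, bias=False)
            self.to_k = nn.Linear(dim, dim, bias=False)
            self.to_v = nn.Linear(dim, dim, bias=False)
            self.to_out = nn.Linear(dim, dim)

        def forward(self, video_tokens, ref_tokens):
            # video_tokens: (B*F, N, C); ref_tokens: (B*F, M, C),
            # the reference tokens being broadcast to every frame.
            x = torch.cat([video_tokens, ref_tokens], dim=1)
            b, l, c = x.shape
            h = self.num_heads

            def heads(t):  # (b, l, c) -> (b, h, l, c // h)
                return t.reshape(b, l, h, c // h).transpose(1, 2)

            out = F.scaled_dot_product_attention(
                heads(self.to_q(x)), heads(self.to_k(x)), heads(self.to_v(x))
            )
            out = self.to_out(out.transpose(1, 2).reshape(b, l, c))
            n = video_tokens.shape[1]
            return out[:, :n], out[:, n:]  # generated / reference streams

    class RecognitionHead(nn.Module):
        """One plausible reading of the Guidance Information Recognition
        Loss: classify each token as generated (0) or reference (1)."""

        def __init__(self, dim):
            super().__init__()
            self.classifier = nn.Linear(dim, 1)

        def forward(self, video_out, ref_out):
            logits = self.classifier(torch.cat([video_out, ref_out], dim=1))
            labels = torch.cat(
                [torch.zeros(video_out.shape[:2]), torch.ones(ref_out.shape[:2])],
                dim=1,
            ).to(logits.device)
            return F.binary_cross_entropy_with_logits(logits.squeeze(-1), labels)

During training, such a recognition loss would be added to the usual diffusion denoising objective with a small weight, encouraging the model to keep the reference stream and the generated stream distinguishable.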

Customized Celebrity Video Generation Results

Qualitative comparison of customized human video generation on celebrities. We select the SD1.5 version of AnimateDiff as our base video diffusion model. Since PhotoMaker [3] only provides pretrained weights for SDXL, we use its results generated with AnimateDiff SDXL at a resolution of 512×512 for comparison.

Customized Non-celebrity Video Generation Results

Qualitative comparison of customized human video generation on non-celebrities. We select the SD1.5 version of AnimateDiff as our base video diffusion model. Since PhotoMaker [3] only provides pretrained weights for SDXL, we use its results generated with AnimateDiff SDXL at a resolution of 512×512 for comparison.

Customized Object Video Generation

Qualitative comparison of customized object video generation.

References

[1] Ye H, Zhang J, Liu S, et al. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models[J]. arXiv preprint arXiv:2308.06721, 2023.

[2] He X, Liu Q, Qian S, et al. ID-Animator: Zero-shot identity-preserving human video generation[J]. arXiv preprint arXiv:2404.15275, 2024.

[3] Li Z, Cao M, Wang X, et al. PhotoMaker: Customizing realistic human photos via stacked ID embedding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 8640-8650.

[4] Jiang Y, Wu T, Yang S, et al. VideoBooth: Diffusion-based video generation with image prompts[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 6689-6700.

BibTeX


@article{wu2024videomaker,
  title={VideoMaker: Zero-shot Customized Video Generation with the Inherent Force of Video Diffusion Models},
  author={Wu, Tao and Zhang, Yong and Cun, Xiaodong and Qi, Zhongang and Pu, Junfu and Dou, Huanzhang and Zheng, Guangcong and Shan, Ying and Li, Xi},
  journal={arXiv preprint arXiv:2412.19645},
  year={2024}
}