tencent / hunyuan-video

A state-of-the-art text-to-video generation model capable of creating high-quality videos with realistic motion from text descriptions


HunyuanVideo Text-to-Video Generation Model 🎬

HunyuanVideo is an advanced text-to-video generation model that can create high-quality videos from text descriptions. It features a comprehensive framework that integrates image-video joint model training and efficient infrastructure for large-scale model training and inference.
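
To try it, you can call the model through the Replicate Python client. A minimal sketch, assuming the `replicate` package is installed and `REPLICATE_API_TOKEN` is set in the environment; `prompt` is the only input shown, and output handling may differ slightly between client versions:

```python
import replicate

# Run the model and wait for the result (blocking call).
output = replicate.run(
    "tencent/hunyuan-video",
    input={"prompt": "A cat walks on the grass, realistic style"},
)

# Recent client versions return a file-like object; older ones return a URL string.
with open("output.mp4", "wb") as f:
    f.write(output.read())
```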

Model Description ✨

This model is trained on a spatio-temporally compressed latent space and uses a large language model as its text encoder. According to professional human evaluation results, HunyuanVideo outperforms previous state-of-the-art models in text alignment, motion quality, and visual quality.

Key features:

  • 🎨 High-quality video generation from text descriptions
  • 📐 Support for various aspect ratios and resolutions
  • ✍️ Advanced prompt handling with a built-in rewrite system
  • 🎯 Stable motion generation and temporal consistency

Prediction Examples 💫

The model works well for prompts like:

  • “A cat walks on the grass, realistic style”
  • “A drone shot of mountains at sunset”
  • “A flower blooming in timelapse”
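
As a usage sketch, the prompts above can be batched in a loop. The `width`, `height`, and `video_length` names below are illustrative assumptions rather than the model's confirmed input schema; check the model's API page for the real parameter names and ranges:

```python
import replicate

prompts = [
    "A cat walks on the grass, realistic style",
    "A drone shot of mountains at sunset",
    "A flower blooming in timelapse",
]

for i, prompt in enumerate(prompts):
    output = replicate.run(
        "tencent/hunyuan-video",
        input={
            "prompt": prompt,
            # Hypothetical resolution/length parameters, for illustration only.
            "width": 1280,
            "height": 720,
            "video_length": 129,
        },
    )
    with open(f"video_{i}.mp4", "wb") as f:
        f.write(output.read())
```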

Limitations ⚠️

  • Generation time increases with video length and resolution (see the polling sketch after this list)
  • Higher resolutions require more GPU memory
  • Some complex motions may require prompt engineering for best results
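
Because long or high-resolution generations can take a while, it may be preferable to create a prediction and poll for completion rather than block on `replicate.run`. A sketch, assuming a recent `replicate` client that accepts `model=` for official models:

```python
import time
import replicate

# Start the prediction without blocking.
prediction = replicate.predictions.create(
    model="tencent/hunyuan-video",
    input={"prompt": "A drone shot of mountains at sunset"},
)

# Poll until the prediction reaches a terminal state.
while prediction.status not in ("succeeded", "failed", "canceled"):
    time.sleep(5)
    prediction.reload()

if prediction.status == "succeeded":
    print(prediction.output)  # URL(s) for the generated video
else:
    print(prediction.error)
```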

Citation 📚

If you use this model in your research, please cite:

@misc{kong2024hunyuanvideo,
      title={HunyuanVideo: A Systematic Framework For Large Video Generative Models},
      author={Weijie Kong and others},
      year={2024},
      eprint={2412.03603},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}