Wan2.1
Wan2.1 is a suite of open video foundation models. The models support several tasks: Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio generation.
Key Features
- High-performance, high-quality video generation
- Consumer-grade GPU compatibility (T2V-1.3B requires only 8.19 GB of VRAM)
- Multiple task support
- Visual text generation (supports both Chinese and English)
- Efficient video VAE for encoding and decoding
Model Architecture
- 3D Variational Autoencoder: Novel 3D causal VAE architecture (Wan-VAE) for improved video compression and generation
- Video Diffusion DiT: Flow Matching framework with T5 Encoder for text encoding and transformer blocks with cross-attention
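To make the Flow Matching item above more concrete, here is a minimal sketch of a flow matching training objective in plain Python. It is a generic illustration, not the Wan2.1 training code: the straight-line path between noise and data, the function names (`interpolate`, `target_velocity`, `flow_matching_loss`), and the `predict_velocity(x_t, t, context)` signature are all assumptions made for this example. In a real setup, `predict_velocity` would be the DiT, `x1` a VAE latent, and `context` the T5 text embedding.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; constant in t, used as the regression target."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x1, context, rng):
    """One-sample loss: mean squared error between predicted and target velocity.

    predict_velocity is a hypothetical stand-in for the diffusion transformer.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise, same length as x1
    t = rng.random()                        # time drawn uniformly from [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t, context)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)


def zero_model(x_t, t, context):
    """Dummy predictor that always returns zero velocity (illustration only)."""
    return [0.0 for _ in x_t]


rng = random.Random(0)
loss = flow_matching_loss(zero_model, [0.5, -1.0, 2.0], context=None, rng=rng)
print(f"loss for the zero predictor: {loss:.4f}")
```

Training would minimize this loss over many samples of data, noise, and time; at inference, the learned velocity field is integrated from noise toward a sample.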
Model Specifications
| Model | Dimension | Input Dimension | Output Dimension | Feedforward Dimension | Frequency Dimension | Number of Heads | Number of Layers |
|---|---|---|---|---|---|---|---|
| 1.3B | 1536 | 16 | 16 | 8960 | 256 | 12 | 30 |
| 14B | 5120 | 16 | 16 | 13824 | 256 | 40 | 40 |
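As a sanity check on the table, the sketch below derives the per-head dimension and a rough parameter count for each model. The counting formula is an assumption, not something the table states: it supposes each layer holds one self-attention block and one cross-attention block (four `dim x dim` projections each) plus a two-matrix feedforward network, and it ignores biases, normalization, modulation, and embedding parameters. The names `SPECS`, `head_dim`, and `approx_total_params` are made up for this example.

```python
# Values copied from the specification table above.
SPECS = {
    "1.3B": {"dim": 1536, "ffn_dim": 8960, "num_heads": 12, "num_layers": 30},
    "14B": {"dim": 5120, "ffn_dim": 13824, "num_heads": 40, "num_layers": 40},
}


def head_dim(spec):
    """Channels per attention head: model dimension divided by head count."""
    return spec["dim"] // spec["num_heads"]


def approx_block_params(spec):
    """Rough weight count for one layer under the assumptions stated above."""
    d, f = spec["dim"], spec["ffn_dim"]
    attention = 4 * d * d    # query, key, value, and output projections
    feedforward = 2 * d * f  # up-projection and down-projection
    return 2 * attention + feedforward  # self-attention + cross-attention + FFN


def approx_total_params(spec):
    """Rough weight count for the whole transformer stack."""
    return spec["num_layers"] * approx_block_params(spec)


for name, spec in SPECS.items():
    total = approx_total_params(spec)
    print(f"{name}: head_dim={head_dim(spec)}, approx params={total / 1e9:.2f}B")
# prints:
# 1.3B: head_dim=128, approx params=1.39B
# 14B: head_dim=128, approx params=14.05B
```

Both models use 128 channels per head, and under these assumptions the estimates land close to the nominal 1.3B and 14B sizes.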
Computational Efficiency
Performance varies by GPU. The 1.3B model is designed to run on consumer-grade GPUs, while the 14B model benefits from multi-GPU setups.
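A back-of-the-envelope estimate helps explain the difference between the two models. The sketch below computes the memory needed just to hold the diffusion transformer weights at a given precision. It assumes the nominal parameter counts implied by the model names (1.3 billion and 14 billion) and counts nothing else: the T5 encoder, the VAE, activations, and framework overhead are excluded, so it does not reproduce the 8.19 GB figure quoted for T2V-1.3B. The helper name `weight_memory_gb` is made up for this example.

```python
# Bytes used to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2}

# Nominal sizes taken from the model names (an assumption, not measured counts).
NOMINAL_PARAMS = {"1.3B": 1_300_000_000, "14B": 14_000_000_000}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, count in NOMINAL_PARAMS.items():
    for dtype in ("fp32", "bf16"):
        print(f"{name} weights in {dtype}: {weight_memory_gb(count, dtype):.1f} GB")
# prints:
# 1.3B weights in fp32: 5.2 GB
# 1.3B weights in bf16: 2.6 GB
# 14B weights in fp32: 56.0 GB
# 14B weights in bf16: 28.0 GB
```

At 16-bit precision the 14B weights alone come to roughly 28 GB, more than the 24 GB on a card such as the RTX 4090, which is consistent with the 14B model benefiting from multi-GPU setups or offloading.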
License
The models are licensed under the Apache 2.0 License.