Wan2.1

Wan2.1 is a suite of open video foundation models. The models support several tasks: Text-to-Video, Image-to-Video, Video Editing, Text-to-Image, and Video-to-Audio generation.

Key Features

High-performance, high-quality video generation
Consumer-grade GPU compatibility (T2V-1.3B requires only 8.19GB VRAM)
Multiple task support
Visual text generation (supports both Chinese and English)
Efficient video VAE for encoding and decoding

Model Architecture

3D Variational Autoencoder: Novel 3D causal VAE architecture (Wan-VAE) for improved video compression and generation
Video Diffusion DiT: Diffusion transformer built on the Flow Matching framework, with a T5 encoder for the text prompt and transformer blocks that use cross-attention to condition on the encoded text
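
The Flow Matching objective mentioned above can be illustrated with a small sketch. This is the generic linear-interpolation form of flow matching, not a confirmed description of Wan2.1's exact training objective. The function names, the convention that x0 is noise and x1 is data, and the mean-squared-error loss are all assumptions made for illustration.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from x0 (noise) to x1 (data) at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; it is constant in t and equals x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(predict_velocity, x0, x1, t):
    """Mean squared error between predicted and target velocity at x_t.

    predict_velocity is a hypothetical stand-in for the network: it takes
    (x_t, t) and returns a list of the same length as x_t.
    """
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(v_true)


if __name__ == "__main__":
    random.seed(0)
    data = [0.5, -1.0, 2.0]                       # toy "latent" sample
    noise = [random.gauss(0.0, 1.0) for _ in data]
    t = random.random()

    # A deliberately poor model that always predicts zero velocity.
    zero_model = lambda x_t, t: [0.0] * len(x_t)
    print("loss of zero model:", flow_matching_loss(zero_model, noise, data, t))
```

In a real model the text embedding would be an extra input to the velocity predictor; it is left out here to keep the sketch short.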

Model Specifications

Model	Dimension	Input Dimension	Output Dimension	Feedforward Dimension	Frequency Dimension	Number of Heads	Number of Layers
1.3B	1536	16	16	8960	256	12	30
14B	5120	16	16	13824	256	40	40
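
As a sanity check on the table, the sketch below derives the per-head dimension and a rough parameter count from the listed values. The `WanSpec` class and its method names are hypothetical and not part of the Wan2.1 code. The count assumes that each layer holds one self-attention and one cross-attention module with four square projections each, plus a two-matrix feedforward network. It ignores biases, normalization, embeddings, and the text encoder and VAE, so it is only an order-of-magnitude estimate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WanSpec:
    """Subset of the specification table needed for the arithmetic below."""
    name: str
    dim: int          # "Dimension" column
    ffn_dim: int      # "Feedforward Dimension" column
    num_heads: int    # "Number of Heads" column
    num_layers: int   # "Number of Layers" column

    @property
    def head_dim(self) -> int:
        # The model dimension must split evenly across attention heads.
        assert self.dim % self.num_heads == 0
        return self.dim // self.num_heads

    def approx_block_params(self) -> int:
        # Assumption: self-attention + cross-attention, each with
        # Q, K, V and output projections of shape dim x dim.
        attn = 2 * 4 * self.dim * self.dim
        # Assumption: feedforward is two matrices, dim x ffn_dim and back.
        ffn = 2 * self.dim * self.ffn_dim
        return self.num_layers * (attn + ffn)


SPECS = {
    "1.3B": WanSpec("1.3B", dim=1536, ffn_dim=8960, num_heads=12, num_layers=30),
    "14B": WanSpec("14B", dim=5120, ffn_dim=13824, num_heads=40, num_layers=40),
}

if __name__ == "__main__":
    for spec in SPECS.values():
        billions = spec.approx_block_params() / 1e9
        print(spec.name, spec.head_dim, f"{billions:.2f}B")
```

Both variants come out at 128 dimensions per head, and the rough totals (about 1.39B and 14.05B) are close to the nominal model sizes, which suggests the table values are internally consistent.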

Computational Efficiency

Generation speed and memory use depend on the GPU. The 1.3B model is designed to run on consumer GPUs (T2V-1.3B requires only 8.19 GB of VRAM), while the 14B model benefits from multi-GPU setups.

License

The models are licensed under the Apache 2.0 License.