Accelerated Inference for Step-Video-T2V
We are WaveSpeedAI, providing highly optimized inference for generative AI models.
We are excited to introduce our new product: a highly optimized inference endpoint for the Step-Video-T2V model, a new SoTA text-to-video pre-trained model published by StepFun-AI, with 30 billion parameters and the capability to generate videos up to 204 frames.
We utilize cutting-edge inference acceleration techniques to provide very fast inference for this model.
We are happy to bring this to you together with Replicate and DataCrunch.
Introduction from the original repository
1. Introduction
We present Step-Video-T2V, a state-of-the-art (SoTA) text-to-video pre-trained model with 30 billion parameters and the capability to generate videos up to 204 frames. To enhance both training and inference efficiency, we propose a deep compression VAE for videos, achieving 16x16 spatial and 8x temporal compression ratios. Direct Preference Optimization (DPO) is applied in the final stage to further enhance the visual quality of the generated videos. Step-Video-T2V’s performance is evaluated on a novel video generation benchmark, Step-Video-T2V-Eval, demonstrating its SoTA text-to-video quality compared to both open-source and commercial engines.
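To get a feel for what the 16x16 spatial and 8x temporal compression ratios mean in practice, the sketch below estimates the latent tensor size for a 204-frame clip. The 544x992 resolution is an assumed example, and the exact rounding (e.g. handling of the first frame) depends on the actual VAE implementation.

```python
# Back-of-the-envelope latent sizes implied by the deep compression Video-VAE
# (16x16 spatial, 8x temporal). Resolution is an assumed example value.
frames, height, width = 204, 544, 992   # example input video shape (assumed resolution)
t_stride, s_stride = 8, 16              # temporal and spatial compression ratios

latent_frames = frames // t_stride      # ~25 latent frames
latent_h = height // s_stride           # 34
latent_w = width // s_stride            # 62
print(latent_frames, latent_h, latent_w)
```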
2. Model Summary
In Step-Video-T2V, videos are represented by a high-compression Video-VAE, achieving 16x16 spatial and 8x temporal compression ratios. User prompts are encoded using two bilingual pre-trained text encoders to handle both English and Chinese. A DiT with 3D full attention is trained using Flow Matching and is employed to denoise input noise into latent frames, with text embeddings and timesteps serving as conditioning factors. To further enhance the visual quality of the generated videos, a video-based DPO approach is applied, which effectively reduces artifacts and ensures smoother, more realistic video outputs.
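The summary above mentions training the DiT with Flow Matching. As a rough illustration of what such an objective looks like, here is a generic rectified-flow-style training step sketched in PyTorch. This is not the Step-Video-T2V training code; the model interface (`dit`, `text_embeddings`) and the linear interpolation path are placeholder assumptions.

```python
# Generic, simplified flow-matching training step for a text-conditioned DiT.
# Names and the interpolation schedule are illustrative, not the authors' implementation.
import torch
import torch.nn.functional as F

def flow_matching_step(dit, latents, text_emb):
    # latents: clean VAE latents, shape (B, C, T, H, W)
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)  # one timestep per sample
    t_ = t.view(-1, 1, 1, 1, 1)

    # Linear path between noise and data; the target is the constant velocity along it.
    x_t = (1.0 - t_) * noise + t_ * latents
    target_velocity = latents - noise

    pred_velocity = dit(x_t, timestep=t, text_embeddings=text_emb)
    return F.mse_loss(pred_velocity, target_velocity)
```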