STAR: Spatial-Temporal Video Super-Resolution
STAR is a text-guided video super-resolution model that enhances low-quality videos while maintaining temporal consistency. It leverages text-to-video models to generate high-quality reference frames and combines them with spatial-temporal features to improve upscaling quality.
More visual results can be found on our project page and video demo.
Usage
The model accepts:
- A video file (supported formats: mp4, avi, mov)
- Optional text prompt describing the video content
- Target upscaling factor (default: 4x)
The model outputs an enhanced, higher-resolution version of the input video.
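The inputs described above can be checked locally before a job is submitted. The following is a minimal sketch, not the model's actual API: the function `build_inputs` and the field names `video_path`, `prompt`, and `upscale` are hypothetical and only mirror the three inputs listed in this section.

```python
from pathlib import Path

# File extensions listed as supported in this README.
SUPPORTED_EXTENSIONS = {".mp4", ".avi", ".mov"}


def build_inputs(video_path, prompt=None, upscale=4):
    """Validate the three inputs and return them as a plain dict.

    The key names are hypothetical; consult the model's input schema
    for the real ones.
    """
    ext = Path(video_path).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported video format: {ext or '(none)'}")
    if upscale < 1:
        raise ValueError("upscale must be at least 1")

    inputs = {"video_path": str(video_path), "upscale": upscale}
    # The text prompt is optional, so include it only when provided.
    if prompt:
        inputs["prompt"] = prompt
    return inputs
```

Calling `build_inputs("clip.mp4", prompt="a dog running on a beach")` returns a dict with the default 4x factor, while an unsupported extension such as `.mkv` raises `ValueError` before any upload happens.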
Limitations
- For optimal results, input videos should be at least 240p resolution
- Processing time increases with video length and resolution
- Due to VRAM requirements, longer videos may need to be processed in segments
- The CogVideoX-5B variant only supports 720x480 input resolution
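The segment-wise processing mentioned above can be planned as a list of frame ranges. This is a minimal sketch under stated assumptions: the helper `plan_segments` is hypothetical, the default segment length and overlap are arbitrary illustrative values, and the overlap between neighbouring segments is an assumption intended to leave room for blending at the boundaries. This README does not specify how segments should be joined.

```python
def plan_segments(total_frames, segment_len=32, overlap=4):
    """Split frame indices [0, total_frames) into overlapping windows.

    Returns a list of (start, end) pairs where `end` is exclusive.
    The defaults are illustrative, not values recommended by the model.
    """
    if total_frames <= 0:
        return []
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")

    step = segment_len - overlap  # how far each new window advances
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        segments.append((start, end))
        if end == total_frames:
            break  # the last window reaches the final frame
        start += step
    return segments
```

Each range can then be cut from the source video and processed independently, which bounds peak VRAM use by the segment length rather than the full video length.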
Model Versions
Two variants are available:
- I2VGen-XL-based:
  - Light degradation model: Best for mild quality enhancement
  - Heavy degradation model: Optimized for severely degraded videos
- CogVideoX-5B-based:
  - Specialized for heavy degradation scenarios
  - Fixed input resolution of 720x480
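The constraints above can be turned into a simple selection rule. The sketch below is illustrative only: the function `candidate_models` and the labels it returns are hypothetical names, not identifiers used by the model, and the rule encodes nothing beyond what this section states (degradation level and the fixed 720x480 input of the CogVideoX-5B-based model).

```python
# Fixed input resolution of the CogVideoX-5B-based model (width, height).
COGVIDEOX_RESOLUTION = (720, 480)


def candidate_models(width, height, heavy_degradation):
    """Return hypothetical labels of the models that fit the input."""
    if not heavy_degradation:
        # Mild enhancement is covered by the light I2VGen-XL model.
        return ["i2vgen-xl-light"]

    labels = ["i2vgen-xl-heavy"]
    # The CogVideoX-5B-based model only applies at its fixed resolution.
    if (width, height) == COGVIDEOX_RESOLUTION:
        labels.append("cogvideox-5b")
    return labels
```

For a severely degraded 720x480 clip both heavy-degradation models are candidates; at any other resolution only the I2VGen-XL-based heavy model remains.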
Citation
@misc{xie2025starspatialtemporalaugmentationtexttovideo,
title={STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution},
author={Rui Xie and Yinhong Liu and Penghao Zhou and Chen Zhao and Jun Zhou and Kai Zhang and Zhenyu Zhang and Jian Yang and Zhenheng Yang and Ying Tai},
year={2025},
eprint={2501.02976},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
License
- I2VGen-XL-based models: MIT License
- CogVideoX-5B-based model: CogVideoX License
Maintained by @zsxkib for Replicate integration