zsxkib / sonic

Generates realistic talking face animations from a portrait image and audio using the CVPR 2025 Sonic model

  • Public
  • 41 runs
  • GitHub
  • Weights
  • Paper
  • License

Run time and cost

This model runs on Nvidia A100 (80GB) GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Sonic: Talking Face Animation 🗣️ (CVPR 2025)

This Replicate model runs Sonic, a state-of-the-art method for generating realistic talking face animations from a single portrait image and an audio file.

Based on the research paper Sonic: Shifting Focus to Global Audio Perception in Portrait Animation (CVPR 2025). Original project: Sonic Project Page
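
For programmatic use, a minimal call through the Replicate Python client might look like the sketch below. The input names (image, audio) and the optional parameters shown are assumptions based on the parameters described later in this README; consult the model's API schema on Replicate for the authoritative names and defaults.

# Minimal sketch using the Replicate Python client (pip install replicate).
# Input names and defaults here are assumptions based on this README; check
# the model's API tab on Replicate for the authoritative schema. Depending on
# your client version you may need to pin an explicit version hash.
import replicate

output = replicate.run(
    "zsxkib/sonic",
    input={
        "image": open("portrait.png", "rb"),  # single portrait image
        "audio": open("speech.mp3", "rb"),    # driving audio (WAV, MP3, ...)
        "inference_steps": 25,                # illustrative value
        "seed": 42,                           # illustrative value
    },
)
print(output)  # URL (or file) of the generated MP4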

About the Sonic Model

Sonic introduces a novel approach to audio-driven portrait animation by emphasizing global audio perception. Instead of relying solely on local lip-sync cues, it considers broader audio characteristics to generate more natural and expressive facial movements, including subtle head poses and expressions that match the audio’s tone and rhythm. The goal is to create animations that appear more holistic and less “puppet-like.”

Key Features & Capabilities ✨

  • Expressive Animation: Generates animations with nuanced facial expressions and subtle head movements derived from global audio features.
  • High-Quality Lip Sync: Accurately synchronizes lip movements with the input audio.
  • Single Image Input: Requires only one portrait image (works best with frontal or near-frontal views).
  • Handles Various Audio: Accepts standard audio file formats (WAV, MP3, etc.); inputs are converted to WAV before inference (see the conversion sketch after this list).
  • Robust Face Handling: Includes face detection (YOLOv5) and cropping for optimal processing, gracefully falling back to the original image if no face is detected.
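
As referenced above, incoming audio is normalized to WAV before inference. A minimal sketch of that conversion, assuming pydub (which the container ships) and an ffmpeg-readable input file:

# Sketch of the audio normalization step, assuming pydub (bundled in the
# container) is used to convert arbitrary ffmpeg-readable audio to WAV.
from pydub import AudioSegment

def convert_to_wav(audio_path: str, out_path: str = "input.wav") -> str:
    """Convert an input audio file (MP3, M4A, ...) to WAV."""
    segment = AudioSegment.from_file(audio_path)  # format inferred from the file
    segment.export(out_path, format="wav")
    return out_path

convert_to_wav("speech.mp3")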

Replicate Implementation Details ⚙️

This Cog container packages the Sonic model and its dependencies for easy use on Replicate.

  • Core Model: Utilizes the pre-trained weights provided by the original Sonic authors (LeonJoe13/Sonic on Hugging Face).
  • Dependencies: Runs on PyTorch and leverages libraries like diffusers, transformers, pydub, and Pillow. Key components from the original research likely include architectures related to Stable Video Diffusion (SVD), Whisper (for audio encoding), and RIFE (for temporal consistency).
  • Weight Handling: Model weights for Sonic and its sub-components (SVD, Whisper, RIFE, YOLO) are downloaded with pget (Replicate's parallel downloader) from a Replicate cache during the container build and stored locally in the checkpoints/ directory (see the download sketch after this list).
  • Workflow (predict.py):
    1. Loads models into GPU memory during setup (setup method).
    2. Receives image and audio inputs.
    3. Performs preprocessing: saves image as PNG, converts audio to WAV, runs face detection/cropping.
    4. Calls the main self.pipe.process() function, passing the processed image, audio path, and user-configurable parameters (like dynamic_scale, inference_steps, min_resolution, keep_resolution, seed).
    5. Outputs the resulting animation as an MP4 video.
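
As a rough illustration of the build-time weight download mentioned above, the sketch below invokes pget through subprocess; the cache URL and destination names are placeholders, not the actual locations used by this container.

# Rough sketch of a build-time weight download with pget (Replicate's parallel
# downloader). The URL and destination names are placeholders, not the real
# cache locations used by this container.
import os
import subprocess
import time

def download_weights(url: str, dest: str) -> None:
    """Fetch a weights tarball with pget and extract it into dest."""
    if os.path.exists(dest):
        return  # already cached locally
    start = time.time()
    subprocess.check_call(["pget", "-x", url, dest])  # -x extracts the archive
    print(f"downloaded {url} -> {dest} in {time.time() - start:.1f}s")

# e.g. download_weights("<replicate-cache-url>/sonic_weights.tar", "checkpoints/Sonic")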
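
Condensed, the predict.py flow described above looks roughly like the following sketch. The helpers (load_sonic_pipeline, detect_and_crop_face, convert_to_wav) and the exact self.pipe.process() signature are illustrative stand-ins, not the real implementation.

# Heavily condensed sketch of the predict.py workflow described above. The
# helpers (load_sonic_pipeline, detect_and_crop_face, convert_to_wav) and the
# exact self.pipe.process() signature are illustrative, not the real code.
from cog import BasePredictor, Input, Path

class Predictor(BasePredictor):
    def setup(self) -> None:
        # Step 1: load Sonic and its sub-models (SVD, Whisper, RIFE, YOLO) onto the GPU.
        self.pipe = load_sonic_pipeline("checkpoints/")  # hypothetical loader

    def predict(
        self,
        image: Path = Input(description="Portrait image"),
        audio: Path = Input(description="Driving audio (WAV, MP3, ...)"),
        dynamic_scale: float = Input(default=1.0),
        inference_steps: int = Input(default=25),
        seed: int = Input(default=None, description="Random seed"),
    ) -> Path:
        # Steps 2-3: save the image as PNG, convert the audio to WAV, crop the face
        # (falling back to the original image if no face is detected).
        face_image = detect_and_crop_face(image)
        wav_path = convert_to_wav(str(audio))
        # Step 4: run the Sonic pipeline with the user-configurable parameters.
        output_path = self.pipe.process(
            face_image,
            wav_path,
            dynamic_scale=dynamic_scale,
            inference_steps=inference_steps,
            seed=seed,
        )
        # Step 5: return the generated MP4.
        return Path(output_path)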

Underlying Technologies & Concepts 🔬

Sonic builds upon advancements in several areas:

  • Audio Feature Extraction: Likely uses models like Whisper to encode rich features from audio (a sketch follows this list).
  • Diffusion Models for Video: Leverages techniques similar to Stable Video Diffusion for generating coherent video frames.
  • Face Detection: Employs YOLOv5 for accurate face localization.
  • Frame Interpolation: Potentially uses methods like RIFE to enhance temporal smoothness between generated frames.
  • Global Audio Perception: The core novelty, focusing on mapping broader audio characteristics to facial dynamics.
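
To make the audio feature extraction step concrete, the sketch below runs a Whisper encoder from Hugging Face transformers over a dummy waveform. It illustrates the kind of audio conditioning Sonic likely relies on; the model choice and waveform are stand-ins, and this is not Sonic's actual code.

# Illustrative only: a Whisper-style audio encoding pass of the kind Sonic
# likely uses for global audio conditioning. Model choice (whisper-tiny) and
# the dummy waveform are stand-ins; this is not Sonic's actual code.
import numpy as np
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-tiny")
encoder = WhisperModel.from_pretrained("openai/whisper-tiny").get_encoder()

waveform = np.zeros(16_000 * 5, dtype=np.float32)  # 5 s of 16 kHz audio (stand-in)
features = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    audio_tokens = encoder(features.input_features).last_hidden_state
print(audio_tokens.shape)  # e.g. torch.Size([1, 1500, 384]) for whisper-tiny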

Use Cases 💡

  • Animating avatars for virtual assistants, games, or social media.
  • Creating engaging video content from static images and voiceovers.
  • Developing accessibility tools.
  • Entertainment and creative projects.

Limitations ⚠️

  • Best results are achieved with clear, high-resolution, relatively frontal portrait images. Extreme poses or obstructions may degrade quality.
  • Primarily focuses on animating the face and subtle head movements; does not generate large pose changes or body movements.
  • Lip-sync accuracy can be affected by audio quality (e.g., background noise, unclear speech).
  • The mapping from audio to expression is learned and may not capture every nuance intended by a human speaker.

License & Disclaimer 📜

This model is based on the original Sonic research, licensed under CC BY-NC-SA 4.0.

For non-commercial research use ONLY.

Commercial use requires separate licensing. See the original repository or Tencent Cloud Video Creation Large Model for commercial options. Users must comply with the license terms and applicable laws.

Citation 📚

Please cite the original Sonic paper if you use this model in your research:

@article{ji2024sonic,
  title={Sonic: Shifting Focus to Global Audio Perception in Portrait Animation},
  author={Ji, Xiaozhong and Hu, Xiaobin and Xu, Zhihong and Zhu, Junwei and Lin, Chuming and He, Qingdong and Zhang, Jiangning and Luo, Donghao and Chen, Yi and Lin, Qin and others},
  journal={arXiv preprint arXiv:2411.16331},
  year={2024}
}

Cog implementation managed by zsxkib.

Star the Cog repo on GitHub! ⭐

Follow me on Twitter/X