lucataco / step-audio-tts-3b

Step-Audio-TTS-3B represents the industry's first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset utilizing the LLM-Chat paradigm

Run time and cost

This model costs approximately $0.025 to run on Replicate, or 40 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on NVIDIA L40S GPU hardware. Predictions typically complete within 26 seconds, but prediction time varies significantly with the inputs.
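
The figures above imply some simple arithmetic. The following is a minimal sketch, assuming the approximate $0.025 per-run price and the 26-second typical runtime are flat averages (actual cost varies with inputs). The constant and function names are illustrative and are not part of any Replicate API.

```python
# Assumption: both values are the rough averages quoted on this page,
# not guaranteed prices or runtimes.
APPROX_COST_PER_RUN_USD = 0.025
TYPICAL_RUNTIME_SECONDS = 26


def runs_per_dollar(cost_per_run: float = APPROX_COST_PER_RUN_USD) -> int:
    """Number of runs one dollar buys at the given per-run cost."""
    return round(1.0 / cost_per_run)


def estimated_cost(num_runs: int,
                   cost_per_run: float = APPROX_COST_PER_RUN_USD) -> float:
    """Rough total cost in USD for a batch of runs."""
    return num_runs * cost_per_run


def implied_cost_per_second(cost_per_run: float = APPROX_COST_PER_RUN_USD,
                            runtime_seconds: float = TYPICAL_RUNTIME_SECONDS) -> float:
    """Per-second rate implied by the quoted per-run cost and runtime."""
    return cost_per_run / runtime_seconds


if __name__ == "__main__":
    print(runs_per_dollar())
    print(estimated_cost(1000))
    print(implied_cost_per_second())
```

Because prediction time depends on the inputs, treat these numbers as planning estimates rather than a quote.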

Readme

Step-Audio-TTS-3B

Step-Audio-TTS-3B represents the industry’s first Text-to-Speech (TTS) model trained on a large-scale synthetic dataset utilizing the LLM-Chat paradigm. It achieves state-of-the-art (SOTA) Character Error Rate (CER) results on the SEED TTS Eval benchmark. The model supports multiple languages, a variety of emotional expressions, and diverse voice style controls. Step-Audio-TTS-3B is also the first TTS model in the industry capable of generating rap and humming.

This repository provides the model weights for Step-Audio-TTS-3B, an LLM (Large Language Model) trained with a dual-codebook approach for text-to-speech synthesis. It also includes a vocoder trained with the same dual-codebook approach and a separate vocoder optimized for humming generation. Together, these resources enable high-quality speech synthesis and humming generation.

Performance comparison of content consistency (CER/WER) among GLM-4-Voice, MinMo, and Step-Audio.

| Model | test-zh CER (%) ↓ | test-en WER (%) ↓ |
|---|---|---|
| GLM-4-Voice | 2.19 | 2.91 |
| MinMo | 2.48 | 2.90 |
| Step-Audio | 1.53 | 2.71 |
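
The CER and WER columns report error rates in percent, where lower is better. A minimal sketch of how such a rate is conventionally computed, as Levenshtein edit distance divided by reference length, is shown here. It assumes the hypothesis text has already been obtained by transcribing the synthesized audio with a speech recognizer, and it omits the text normalization a real benchmark pipeline applies. The function names are illustrative and are not taken from the Step-Audio or SEED TTS Eval code.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1       # reference token missing from hypothesis
            insertion = curr[j - 1] + 1  # extra token in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def error_rate(ref_tokens: Sequence, hyp_tokens: Sequence) -> float:
    """Edit distance as a percentage of the reference length."""
    if len(ref_tokens) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (used for the Chinese test set)."""
    return error_rate(list(reference), list(hypothesis))


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate over whitespace-separated words (used for English)."""
    return error_rate(reference.split(), hypothesis.split())


if __name__ == "__main__":
    print(cer("你好世界", "你好世间"))
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```

Characters are the unit for Chinese because words are not delimited by whitespace, which is why the two test sets report different metrics.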

Results of TTS Models on SEED Test Sets.

  • Step-Audio-TTS-3B-Single denotes the dual-codebook backbone paired with a single-codebook vocoder.
| Model | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| FireRedTTS | 1.51 | 0.630 | 3.82 | 0.460 |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.774 |
| CosyVoice | 3.63 | 0.775 | 4.29 | 0.699 |
| CosyVoice 2 | 1.45 | 0.806 | 2.57 | 0.736 |
| CosyVoice 2-S | 1.45 | 0.812 | 2.38 | 0.743 |
| Step-Audio-TTS-3B-Single | 1.37 | 0.802 | 2.52 | 0.704 |
| Step-Audio-TTS-3B | 1.31 | 0.733 | 2.31 | 0.660 |
| Step-Audio-TTS | 1.17 | 0.73 | 2.0 | 0.660 |
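
The SS columns are not defined on this page. Assuming SS stands for speaker similarity, such a score is commonly computed as the cosine similarity between a speaker embedding of the reference audio and one of the synthesized audio, with higher values meaning a closer match. The sketch below shows only that final step. The embedding model is not specified here, so the vectors are hypothetical placeholders.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical embeddings; a real pipeline would obtain them from a
    # speaker-verification model applied to each audio clip.
    reference_embedding = [0.2, 0.7, 0.1]
    synthesized_embedding = [0.25, 0.65, 0.05]
    print(cosine_similarity(reference_embedding, synthesized_embedding))
```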

Performance comparison of dual-codebook resynthesis with CosyVoice.

| Token | test-zh CER (%) ↓ | test-zh SS ↑ | test-en WER (%) ↓ | test-en SS ↑ |
|---|---|---|---|---|
| Groundtruth | 0.972 | - | 2.156 | - |
| CosyVoice | 2.857 | 0.849 | 4.519 | 0.807 |
| Step-Audio-TTS-3B | 2.192 | 0.784 | 3.585 | 0.742 |

More information

For more information, please refer to our repository: Step-Audio.