chenxwh / cosyvoice2-0.5b

Scalable Streaming Speech Synthesis with Large Language Models


CosyVoice 2.0

Compared to version 1.0, CosyVoice 2.0 delivers more accurate, more stable, faster, and higher-quality speech generation.

Multilingual

  • Supported Languages: Chinese, English, Japanese, Korean, and Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
  • Cross-lingual & Mix-lingual: Supports zero-shot voice cloning for cross-lingual and code-switching scenarios (see the sketch below).
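
The cross-lingual claim rests on zero-shot cloning: a short reference recording supplies the timbre, and the target text can be in a different language. Below is a minimal sketch assuming the Python API documented in the upstream CosyVoice GitHub repository (`CosyVoice2`, `load_wav`, `inference_cross_lingual`); the model path, prompt file, and text are placeholders and not part of this Replicate model's input schema.

```python
# Minimal sketch of zero-shot cross-lingual cloning, assuming the upstream
# CosyVoice repository's Python API; paths and text below are placeholders.
import sys
sys.path.append('third_party/Matcha-TTS')  # vendored dependency in the upstream repo

import torchaudio
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B',
                       load_jit=False, load_trt=False, fp16=False)

# A short 16 kHz recording of the target speaker serves as the timbre prompt.
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)

# Clone the prompt voice into another language (code-switching text also works).
for i, chunk in enumerate(cosyvoice.inference_cross_lingual(
        "Hello, this sentence is synthesized in the prompt speaker's voice.",
        prompt_speech_16k, stream=False)):
    torchaudio.save(f'cross_lingual_{i}.wav', chunk['tts_speech'], cosyvoice.sample_rate)
```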

Ultra-Low Latency

  • Bidirectional Streaming Support: CosyVoice 2.0 integrates offline and streaming modeling technologies.
  • Rapid First Packet Synthesis: Achieves latency as low as 150ms while maintaining high-quality audio output.
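
In the upstream API, streaming is exposed through the same inference calls via a `stream=True` flag: the generator then yields audio chunks as they are produced, so playback can begin after the first packet rather than after the full utterance. A hedged sketch, reusing `cosyvoice` and `prompt_speech_16k` from the example above; the prompt transcript and text are placeholders.

```python
# Streaming synthesis sketch (assumed upstream API): stream=True yields audio
# chunks incrementally instead of one complete waveform.
for i, chunk in enumerate(cosyvoice.inference_zero_shot(
        'Streaming synthesis lets playback begin while the rest is still being generated.',
        'This is the transcript of the prompt audio.',  # prompt transcript (placeholder)
        prompt_speech_16k, stream=True)):
    # Each chunk['tts_speech'] is a waveform tensor; feed it to an audio sink
    # as it arrives, or save the pieces for inspection as done here.
    torchaudio.save(f'stream_chunk_{i}.wav', chunk['tts_speech'], cosyvoice.sample_rate)
```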

High Accuracy

  • Improved Pronunciation: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
  • Benchmark Achievements: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.

Strong Stability

  • Consistency in Timbre: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
  • Cross-language Synthesis: Markedly improved compared to version 1.0.

Natural Experience

  • Enhanced Prosody and Sound Quality: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
  • Emotional and Dialectal Flexibility: Now supports more granular emotional controls and accent adjustments.
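
Upstream, the finer-grained emotion and accent controls are driven by a natural-language instruction supplied alongside the timbre prompt. A sketch assuming the repository's `inference_instruct2` call and reusing the setup from the first example; the instruction and text below are illustrative only.

```python
# Instruction-based control sketch (assumed upstream inference_instruct2 API),
# reusing `cosyvoice` and `prompt_speech_16k` from the first sketch.
for i, chunk in enumerate(cosyvoice.inference_instruct2(
        '今天的演出太精彩了，大家都很开心。',  # text to synthesize ("Today's show was wonderful...")
        '用四川话说这句话',                    # instruction: "say this sentence in Sichuanese"
        prompt_speech_16k, stream=False)):
    torchaudio.save(f'instruct_{i}.wav', chunk['tts_speech'], cosyvoice.sample_rate)
```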