jichengdu / cosyvoice

CosyVoice2-0.5B: Scalable Streaming Speech Synthesis with Large Language Models


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

CosyVoice 2.0-0.5B

Multilingual Support

  • Supported Languages: Chinese, English, Japanese, Korean, Chinese dialects (Cantonese, Sichuanese, Shanghainese, Tianjinese, Wuhanese, etc.)
  • Cross-lingual & Mixed-lingual: Supports zero-shot voice cloning for cross-language and code-switching scenarios.

Ultra-Low Latency

  • Bidirectional Streaming Support: CosyVoice 2.0 integrates offline and streaming modeling technologies.
  • Rapid First Packet Synthesis: Achieves latency as low as 150ms while maintaining high-quality audio output.
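
First-packet latency is the time from the request until the first audio chunk arrives. The sketch below shows one way to measure it on any chunk iterator. `fake_streaming_tts` is a hypothetical stand-in that yields placeholder bytes, not the CosyVoice API, and its delay value is arbitrary.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_streaming_tts(text: str, chunk_delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming synthesizer (not the real model)."""
    for word in text.split():
        time.sleep(chunk_delay_s)      # pretend to synthesize one chunk
        yield word.encode("utf-8")     # placeholder bytes, not real audio


def first_packet_latency(stream: Iterable[bytes]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total number of chunks)."""
    start = time.perf_counter()
    first_chunk_s = None
    chunk_count = 0
    for _chunk in stream:
        if first_chunk_s is None:
            # Record the elapsed time only once, when the first chunk shows up.
            first_chunk_s = time.perf_counter() - start
        chunk_count += 1
    if first_chunk_s is None:
        raise ValueError("stream produced no chunks")
    return first_chunk_s, chunk_count


if __name__ == "__main__":
    latency_s, chunks = first_packet_latency(fake_streaming_tts("hello streaming world"))
    print(f"first packet after {latency_s * 1000:.1f} ms, {chunks} chunks total")
```

To measure a real deployment, replace the stand-in generator with the actual streaming iterator. The timer starts before the first chunk is requested, so the measurement includes any work the generator does before yielding its first chunk.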

High Accuracy

  • Improved Pronunciation: Reduces pronunciation errors by 30% to 50% compared to CosyVoice 1.0.
  • Benchmark Achievements: Attains the lowest character error rate on the hard test set of the Seed-TTS evaluation set.
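
For reference, character error rate (CER) is commonly defined as the character-level edit distance between a reference transcript and a recognized transcript, divided by the reference length. The sketch below uses that generic definition. The text normalization and recognizer used by the Seed-TTS evaluation are not described here, so this is not a reproduction of that benchmark, and the error-rate values in the usage example are invented for illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not ref:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(old_rate: float, new_rate: float) -> float:
    """Fraction by which new_rate is lower than old_rate."""
    if old_rate <= 0:
        raise ValueError("old_rate must be positive")
    return (old_rate - new_rate) / old_rate


if __name__ == "__main__":
    print(character_error_rate("abcd", "abxd"))   # one substitution over four characters
    # Illustrative numbers only: a drop from 10% to 6% is a 40% relative reduction.
    print(round(relative_reduction(0.10, 0.06), 2))
```

Under this definition, a "30% to 50%" reduction is relative to the earlier error rate. It does not mean a drop of 30 to 50 percentage points.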

Strong Stability

  • Timbre Consistency: Ensures reliable voice consistency for zero-shot and cross-language speech synthesis.
  • Cross-language Synthesis: Shows significant improvements compared to version 1.0.

Natural Experience

  • Enhanced Prosody and Sound Quality: Improved alignment of synthesized audio, raising MOS evaluation scores from 5.4 to 5.53.
  • Emotional and Dialectal Flexibility: Now supports more granular emotional controls and accent adjustments.
