Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.

Readme

Fish Speech V1.5

Version 1.5 of the Fish Speech model released by fish.audio.

Disclaimer

This is an unofficial implementation. Please refer to the fishspeech repository for the original code and details. By using this model, you agree to the terms stated at the link above, which may change at any time. Make sure you comply with these terms before using the model.

FishSpeech: Advanced Speech Synthesis Technology

Key Features

Zero-shot & Few-shot TTS

Generate high-quality speech that closely resembles the original voice from just 10-30 seconds of reference audio
Clone a voice quickly, without extensive training data
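
Before submitting a reference clip, it can help to confirm that it falls inside the 10-30 second window mentioned above. The following is a minimal sketch, not part of this model's API: it assumes the clip is an uncompressed PCM WAV file, and the helper names (wav_duration_seconds, is_suitable_reference) are hypothetical. It uses only the Python standard library.

```python
import wave

# Recommended reference length taken from the feature list above.
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_suitable_reference(path, min_s=MIN_SECONDS, max_s=MAX_SECONDS):
    """Return True if the clip length lies inside the recommended window."""
    return min_s <= wav_duration_seconds(path) <= max_s
```

Other formats such as MP3 or FLAC would need a third-party decoder; the check itself stays the same once the duration is known.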

Excellent Bilingual Support

Supports both Chinese and English, with seamless switching between the two languages
Paste Chinese or English text into the input box and it is processed automatically
Reads mixed Chinese-English text naturally and fluently, with no additional settings

No Phoneme Dependency

Removes the dependency on phonemes found in traditional TTS systems
The model has strong language understanding and generalization capabilities
Processes text directly, with no phoneme conversion step

Superior Accuracy Performance

Achieves a Character Error Rate (CER) and Word Error Rate (WER) of approximately 2% on 5-minute English text tests
Comparable accuracy on Chinese text, with clear and natural pronunciation
Significantly reduces the pronunciation errors and unnatural pauses common in traditional TTS systems
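
For readers unfamiliar with the metrics, CER and WER are both edit distances normalized by the length of the reference, counted over characters and words respectively. The sketch below shows the standard definitions only. The README does not describe the evaluation pipeline, so the usual setup (transcribing the synthesized audio with a speech recognizer and comparing the transcript to the input text) is an assumption, and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference item
                curr[j - 1] + 1,     # insertion of a hypothesis item
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER: word-level edit distance divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """CER: character-level edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


reference = "the quick brown fox jumps"
transcript = "the quick brown box jumps"  # hypothetical ASR output
print(word_error_rate(reference, transcript))       # 1 wrong word out of 5
print(character_error_rate(reference, transcript))  # 1 wrong character out of 25
```

A rate of about 2% therefore corresponds to roughly one error per 50 words (WER) or per 50 characters (CER) of reference text.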

Multi-scenario Applications

Personalized voice assistant customization
Audiobook and podcast production
Video dubbing and game character voices
Educational and assistive technology applications
