Run time and cost

This model costs approximately $0.00022 per run on Replicate (about 4,545 runs per $1), though the cost varies with your inputs. It is also open source, so you can run it on your own computer with Docker.
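As a rough illustration of the arithmetic behind the per-run figure, the sketch below converts a cost per run into runs per dollar and estimates the cost of a batch. The function names are hypothetical, and the price is simply the approximate value quoted on this page; actual billing depends on your inputs.

```python
# Approximate per-run price quoted on this page; real cost varies with inputs.
COST_PER_RUN_USD = 0.00022


def runs_per_dollar(cost_per_run: float) -> int:
    """Whole number of runs that fit in one dollar at the given price."""
    return int(1 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Estimated total cost in USD for num_runs predictions."""
    return num_runs * cost_per_run


print(runs_per_dollar(COST_PER_RUN_USD))   # 4545
print(round(estimated_cost(10_000), 2))    # 2.2
```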

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 1 second.

Readme

Kokoro: A Frontier TTS Model

Note

Kokoro v0.19 can output a max of 30 seconds of audio per generation.
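One way to work within this limit is to split long text into shorter chunks and generate audio for each chunk separately. The sketch below is only an illustration: the limit is stated in seconds of audio, not in characters, so the characters-per-second figure is an assumption you would need to tune by measuring real output, and `split_for_tts` is a hypothetical helper, not part of Kokoro.

```python
import re

# The page states a 30-second audio limit, not a text limit.
MAX_SECONDS = 30
# ASSUMPTION: rough speaking rate used to turn seconds into a character budget.
ASSUMED_CHARS_PER_SECOND = 12
MAX_CHARS = MAX_SECONDS * ASSUMED_CHARS_PER_SECOND


def split_for_tts(text: str, max_chars: int = MAX_CHARS) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept whole as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


for chunk in split_for_tts("One. Two. Three.", max_chars=9):
    print(chunk)
```

Each chunk can then be sent as its own generation and the resulting audio clips concatenated in order.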

Model Card

Kokoro is a text-to-speech (TTS) model with 82 million parameters (text in, audio out) that sits at the frontier for its size.

On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.
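For a sense of scale, the parameter count and precision give a back-of-the-envelope estimate of the weight size. This is a rough sketch only: it ignores file-format overhead and any non-parameter buffers, and the fp16 entry is a hypothetical comparison, since only fp32 weights are mentioned above.

```python
PARAMS = 82_000_000  # 82 million parameters, as stated in the model card

# Bytes per parameter. Only fp32 is stated as released; fp16 is a
# hypothetical comparison point.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_mb(num_params: int, dtype: str = "fp32") -> float:
    """Approximate raw weight size in megabytes (1 MB = 10**6 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


print(weight_size_mb(PARAMS, "fp32"))  # 328.0
print(weight_size_mb(PARAMS, "fp16"))  # 164.0
```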

In the weeks leading up to its release, Kokoro v0.19 was the #1🥇 ranked model in TTS Spaces Arena. Kokoro achieved a higher Elo than the other models in this single-voice Arena setting, while using fewer parameters and less data:

Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio
XTTS v2: 467M, CPML, >10k hours
Edge TTS: Microsoft, proprietary
MetaVoice: 1.2B, Apache, 100k hours
Parler Mini: 880M, Apache, 45k hours
Fish Speech: ~500M, CC-BY-NC-SA, 1M hours

Kokoro’s ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.
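For readers unfamiliar with Elo, the sketch below shows the standard Elo expected-score and update formulas for a single head-to-head comparison. The Arena's exact rating procedure is not described here, so the K-factor of 32 and the starting ratings are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k=32 is an assumed K-factor, not the Arena's documented value.
    """
    ea = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1 - score_a) - (1 - ea))
    return new_a, new_b


# Two equally rated models; A wins one vote.
print(update(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

Under these formulas, a model that keeps winning votes against higher-rated opponents gains rating quickly, and the ladder reflects accumulated pairwise preferences.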

Acknowledgements

@yl4579 for architecting StyleTTS 2
@Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena

Model Card Contact

@rzvzn on Discord
Server invite: https://discord.gg/QuGxSWBfQy