kjjk10 / kokoro-82m

Kokoro is a text-to-speech (TTS) model (text in, audio out) that performs at the frontier for its size of 82 million parameters.

Run time and cost

This model costs approximately $0.00022 per run on Replicate (about 4545 runs per $1), though the cost varies with your inputs. It is also open source, and you can run it on your own computer with Docker.

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 1 second, though the predict time varies significantly with the inputs.
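
The per-run figure above can be turned into a quick budget estimate. The following is a minimal sketch that only restates the arithmetic; the function names are illustrative, and the price is the approximate value quoted on this page, which may change and varies with inputs.

```python
import math

# Approximate per-run price quoted above; actual cost varies with inputs.
COST_PER_RUN_USD = 0.00022


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Number of whole runs that fit into one US dollar."""
    return math.floor(1.0 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough total cost in USD for a given number of runs."""
    return num_runs * cost_per_run


if __name__ == "__main__":
    print(runs_per_dollar())        # whole runs per $1
    print(estimated_cost(10_000))   # rough cost of 10,000 runs
```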

Readme

Kokoro: A Frontier TTS Model

Note

Kokoro v0.19 can output a max of 30 seconds of audio per generation.
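
One way to work within this limit is to split long text into shorter pieces and generate audio for each piece separately. The sketch below is an illustration under stated assumptions, not part of Kokoro: the function names are hypothetical, and the 400-character budget is an arbitrary stand-in for "short enough to stay under 30 seconds", which should be tuned by listening to real output.

```python
import re

# Hypothetical budget: a rough stand-in for the 30-second limit, not a documented value.
MAX_CHARS_PER_CHUNK = 400


def _split_words(sentence, max_chars):
    """Break one over-long sentence on whitespace into pieces of at most max_chars.

    A single word longer than max_chars is kept whole and will exceed the budget.
    """
    pieces, current = [], ""
    for word in sentence.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


def split_for_tts(text, max_chars=MAX_CHARS_PER_CHUNK):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        if len(sentence) <= max_chars:
            pieces = [sentence]
        else:
            pieces = _split_words(sentence, max_chars)
        for piece in pieces:
            candidate = f"{current} {piece}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    chunks.append(current)
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk can then be sent as a separate generation and the resulting audio clips concatenated in order.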


Model Card

On 25 Dec 2024, Kokoro v0.19 weights were permissively released in full fp32 precision under an Apache 2.0 license. As of 2 Jan 2025, 10 unique Voicepacks have been released, and a .onnx version of v0.19 is available.

In the weeks leading up to its release, Kokoro v0.19 was the #1 🥇 ranked model in TTS Spaces Arena. In this single-voice Arena setting, Kokoro achieved a higher Elo than the other models while using fewer parameters and less data:

  • Kokoro v0.19: 82M params, Apache, trained on <100 hours of audio
  • XTTS v2: 467M, CPML, >10k hours
  • Edge TTS: Microsoft, proprietary
  • MetaVoice: 1.2B, Apache, 100k hours
  • Parler Mini: 880M, Apache, 45k hours
  • Fish Speech: ~500M, CC-BY-NC-SA, 1M hours

Kokoro’s ability to top this Elo ladder suggests that the scaling law (Elo vs compute/data/params) for traditional TTS models might have a steeper slope than previously expected.
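
For readers unfamiliar with Elo, the sketch below shows how a rating ladder can be built from pairwise preference votes. It uses the textbook Elo formulas; the source does not say how the TTS Spaces Arena actually computes its ratings, so the K-factor of 32, the 400-point scale, and the function names are all assumptions for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A was preferred, 0.0 if B was preferred, 0.5 for a tie.
    k = 32 is an assumed K-factor, not a value taken from the Arena.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two models start level; model A wins a single vote.
    print(update(1500.0, 1500.0, 1.0))  # (1516.0, 1484.0)
```

A model that keeps winning votes against higher-rated opponents gains rating quickly, which is how a small model can climb above larger ones on such a ladder.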


Acknowledgements

  • @yl4579 for architecting StyleTTS 2
  • @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena

Model Card Contact

@rzvzn on Discord
Server invite: https://discord.gg/QuGxSWBfQy