jaaari / kokoro-82m

Kokoro v1.0 - text-to-speech (82M params, based on StyleTTS2)

  • Public
  • 4.1M runs

Run time and cost

This model costs approximately $0.00022 to run on Replicate, or 4545 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
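The quoted per-run price and runs-per-dollar figure are related by simple division. A minimal sketch of that arithmetic, assuming the approximate price above holds for your inputs (the function names are illustrative only):

```python
COST_PER_RUN_USD = 0.00022  # approximate per-run price quoted above


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole number of runs that one US dollar buys at the given price."""
    return int(1 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Estimated total cost in USD for a batch of runs."""
    return num_runs * cost_per_run
```

For example, `estimated_cost(10_000)` gives roughly 2.20 USD; actual billing depends on input length and run time.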

This model runs on Nvidia T4 GPU hardware. Predictions typically complete within 1 second.

Readme

license: apache-2.0
language:
- en
base_model:
- yl4579/StyleTTS2-LJSpeech
pipeline_tag: text-to-speech


Disclaimer

This is a fork of the original Kokoro repo, created to provide easy inference on Replicate. I am not affiliated with the original Kokoro authors, and this is not an official release of the Kokoro model. Like the Hugging Face Space, this implementation splits text automatically to support long-form inputs. See the original README below for more details.
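The fork's actual splitting logic is not shown in this README. As a rough illustration of what automatic text splitting can look like, here is a hypothetical sketch that packs whole sentences into chunks under a character limit. The function name, the 300-character default, and the sentence regex are all assumptions, not the fork's implementation.

```python
import re


def split_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.

    A single sentence longer than max_chars is kept as its own oversized
    chunk rather than being cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized separately and the resulting audio concatenated.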


Voices

Training Duration - How much audio was seen during training? Smaller durations result in a lower overall grade.

  • 10 hours <= HH hours < 100 hours
  • 1 hour <= H hours < 10 hours
  • 10 minutes <= MM minutes < 100 minutes
  • 1 minute <= M minutes < 10 minutes
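The duration legend can be read as a lookup from an amount of audio to a band label. The sketch below is one possible reading: as written, the H and MM bands overlap between 60 and 100 minutes, so this code assumes the hours band takes precedence. The function name is illustrative.

```python
def duration_band(minutes: float) -> str:
    """Map a training duration in minutes to the legend's band label.

    Assumption: where the H and MM bands overlap (60 to 100 minutes),
    the hours band wins because it is checked first.
    """
    if 600 <= minutes < 6000:   # 10 hours <= HH < 100 hours
        return "HH hours"
    if 60 <= minutes < 600:     # 1 hour <= H < 10 hours
        return "H hours"
    if 10 <= minutes < 100:     # 10 minutes <= MM < 100 minutes
        return "MM minutes"
    if 1 <= minutes < 10:       # 1 minute <= M < 10 minutes
        return "M minutes"
    raise ValueError(f"duration outside the legend's bands: {minutes} minutes")
```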

American English 🇺🇸

  • misaki[en] lang_code='a' with en-us espeak-ng fallback
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| af_alloy | 🚺 | B | MM minutes | C | 6d877149 |
| af_aoede | 🚺 | B | H hours | C+ | c03bd1a4 |
| af_bella | 🚺🔥 | A | HH hours | A- | 8cb64e02 |
| af_jessica | 🚺 | C | MM minutes | D | cdfdccb8 |
| af_kore | 🚺 | B | H hours | C+ | 8bfbc512 |
| af_nicole | 🚺🎧 | B | HH hours | B- | c5561808 |
| af_nova | 🚺 | B | MM minutes | C | e0233676 |
| af_river | 🚺 | C | MM minutes | D | e149459b |
| af_sarah | 🚺 | B | H hours | C+ | 49bd364e |
| af_sky | 🚺 | B | M minutes | C- | c799548a |
| am_adam | 🚹 | D | H hours | F+ | ced7e284 |
| am_echo | 🚹 | C | MM minutes | D | 8bcfdc85 |
| am_eric | 🚹 | C | MM minutes | D | ada66f0e |
| am_fenrir | 🚹 | B | H hours | C+ | 98e507ec |
| am_liam | 🚹 | C | MM minutes | D | c8255075 |
| am_michael | 🚹 | B | H hours | C+ | 9a443b79 |
| am_onyx | 🚹 | C | MM minutes | D | e8452be1 |
| am_puck | 🚹 | B | H hours | C+ | dd1d8973 |

British English 🇬🇧

  • misaki[en] lang_code='b' with en-gb espeak-ng fallback
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| bf_alice | 🚺 | C | MM minutes | D | d292651b |
| bf_emma | 🚺 | B | HH hours | B- | d0a423de |
| bf_isabella | 🚺 | B | MM minutes | C | cdd4c370 |
| bf_lily | 🚺 | C | MM minutes | D | 6e09c2e4 |
| bm_daniel | 🚹 | C | MM minutes | D | fc3fce4e |
| bm_fable | 🚹 | B | MM minutes | C | d44935f3 |
| bm_george | 🚹 | B | MM minutes | C | f1bc8122 |
| bm_lewis | 🚹 | C | H hours | D+ | b5204750 |

French 🇫🇷

  • espeak-ng fr-fr
  • Total French training data: <11 hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
| --- | --- | --- | --- | --- | --- | --- |
| ff_siwis | 🚺 | B | <11 hours | B- | 8073bf2d | SIWIS |

Hindi 🇮🇳

  • espeak-ng hi
  • Total Hindi training data: H hours
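| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| hf_alpha | 🚺 | B | MM minutes | C | 06906fe0 |
| hf_beta | 🚺 | B | MM minutes | C | 63c0a1a6 |
| hm_omega | 🚹 | B | MM minutes | C | b55f02a8 |
| hm_psi | 🚹 | B | MM minutes | C | 2f0f055c |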

Italian 🇮🇹

  • espeak-ng it
  • Total Italian training data: H hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| if_sara | 🚺 | B | MM minutes | C | 6c0b253b |
| im_nicola | 🚹 | B | MM minutes | C | 234ed066 |

Japanese 🇯🇵

  • misaki[ja]
  • Total Japanese training data: H hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 | CC BY |
| --- | --- | --- | --- | --- | --- | --- |
| jf_alpha | 🚺 | B | H hours | C+ | 1bf4c9dc | |
| jf_gongitsune | 🚺 | B | MM minutes | C | 1b171917 | gongitsune |
| jf_nezumi | 🚺 | B | M minutes | C- | d83f007a | nezuminoyomeiri |
| jf_tebukuro | 🚺 | B | MM minutes | C | 0d691790 | tebukurowokaini |
| jm_kumo | 🚹 | B | M minutes | C- | 98340afd | kumonoito |

Mandarin Chinese 🇨🇳

  • misaki[zh]
  • Total Mandarin Chinese training data: H hours
| Name | Traits | Target Quality | Training Duration | Overall Grade | SHA256 |
| --- | --- | --- | --- | --- | --- |
| zf_xiaobei | 🚺 | C | MM minutes | D | 9b76be63 |
| zf_xiaoni | 🚺 | C | MM minutes | D | 95b49f16 |
| zf_xiaoxiao | 🚺 | C | MM minutes | D | cfaf6f2d |
| zf_xiaoyi | 🚺 | C | MM minutes | D | b5235dba |
| zm_yunjian | 🚹 | C | MM minutes | D | 76cbf8ba |
| zm_yunxi | 🚹 | C | MM minutes | D | dbe6e1ce |
| zm_yunxia | 🚹 | C | MM minutes | D | bb2b03b0 |
| zm_yunyang | 🚹 | C | MM minutes | D | 5238ac22 |
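The voice names in the tables above appear to follow a two-letter prefix convention: the first letter matches the language (for example `lang_code='a'` for American English) and the second matches the 🚺/🚹 trait. The README only states the `a` and `b` codes explicitly; the other mappings below are inferred from the table groupings, so treat this parser as a sketch rather than an official scheme.

```python
# First prefix letter -> language. Only 'a' and 'b' are stated in the README;
# the rest are inferred from which table each voice appears in.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "z": "Mandarin Chinese",
}
# Second prefix letter -> gender, inferred from the Traits column.
GENDERS = {"f": "female", "m": "male"}


def parse_voice(name: str) -> dict[str, str]:
    """Split a voice name such as 'af_bella' into its inferred parts."""
    prefix, _, speaker = name.partition("_")
    if (
        len(prefix) != 2
        or not speaker
        or prefix[0] not in LANGUAGES
        or prefix[1] not in GENDERS
    ):
        raise ValueError(f"unrecognized voice name: {name!r}")
    return {
        "lang_code": prefix[0],
        "language": LANGUAGES[prefix[0]],
        "gender": GENDERS[prefix[1]],
        "speaker": speaker,
    }
```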

✨ You can now pip install kokoro! See Usage.

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Releases

| Model | Published | Training Data | Compute (A100 80GB) | Langs & Voices | SHA256 |
| --- | --- | --- | --- | --- | --- |
| v1.0 | 2025 Jan 27 | Few hundred hrs | $1000 for 1000 hrs | 6 & 46 | 496dba11 |
| v0.19 | 2024 Dec 25 | <100 hrs | $400 for 500 hrs | 1 & 10 | 3b0c392f |
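The compute column implies an hourly GPU rate for each release. A small sketch of that arithmetic, using only the rounded figures from the table (whether the v1.0 figure includes the v0.19 spend is not stated, so each row is treated independently here):

```python
# (cost in USD, A100 80GB hours) per release, as listed in the table.
RELEASE_COMPUTE = {
    "v1.0": (1000, 1000),
    "v0.19": (400, 500),
}


def usd_per_gpu_hour(cost_usd: float, gpu_hours: float) -> float:
    """Implied price of one GPU hour."""
    return cost_usd / gpu_hours


rates = {
    release: usd_per_gpu_hour(cost, hours)
    for release, (cost, hours) in RELEASE_COMPUTE.items()
}
```

With these inputs, `rates` works out to about 1.00 USD per hour for v1.0 and 0.80 USD per hour for v0.19.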

Usage

pip install kokoro installs the inference library at https://github.com/hexgrad/kokoro

Under the hood, kokoro uses misaki, a G2P library at https://github.com/hexgrad/misaki
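A minimal usage sketch follows. It assumes the `KPipeline` interface shown in the upstream kokoro README at the time of writing (a pipeline built from a `lang_code`, called with text and a `voice`, yielding graphemes, phonemes, and audio per chunk) and a 24 kHz output rate. It also assumes the separate `soundfile` package is installed. The helper names and output file pattern are illustrative; check the upstream repository for the current API.

```python
SAMPLE_RATE = 24000  # Hz; assumed from the upstream kokoro README example


def chunk_path(prefix: str, index: int) -> str:
    """File name for the index-th audio chunk, e.g. 'out_002.wav'."""
    return f"{prefix}_{index:03d}.wav"


def synthesize(text: str, voice: str = "af_bella", lang_code: str = "a",
               prefix: str = "out") -> list[str]:
    """Synthesize text and write one WAV file per generated chunk."""
    # Imported lazily so this module loads even where kokoro is not installed.
    from kokoro import KPipeline
    import soundfile as sf

    pipeline = KPipeline(lang_code=lang_code)
    paths = []
    for i, (graphemes, phonemes, audio) in enumerate(pipeline(text, voice=voice)):
        path = chunk_path(prefix, i)
        sf.write(path, audio, SAMPLE_RATE)
        paths.append(path)
    return paths
```

Calling `synthesize("Hello from Kokoro.")` would download the model weights on first use and return the list of WAV paths written.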

Model Facts

Architecture:

  • StyleTTS 2: https://arxiv.org/abs/2306.07691
  • ISTFTNet: https://arxiv.org/abs/2203.02395
  • Decoder only: no diffusion, no encoder release

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Trained by: @rzvzn on Discord

Languages: American English, British English, French, Hindi, Italian, Japanese, Mandarin Chinese

Model SHA256 Hash: 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4
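The full model hash above, and the 8-character SHA256 prefixes in the voice tables, can be checked against downloaded files. A minimal sketch using Python's standard `hashlib`; the function names are illustrative and the file path you pass in is up to you:

```python
import hashlib

MODEL_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: str, expected: str) -> bool:
    """True if the file's digest starts with the expected hex string.

    Accepts either a full 64-character hash or a truncated prefix such as
    the 8-character values listed in the voice tables.
    """
    return sha256_of(path).startswith(expected.lower())
```

For example, `matches(weights_path, MODEL_SHA256)` checks the model file, and `matches(voice_path, "8cb64e02")` checks a file expected to be `af_bella`.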

Training Details

Compute: About $1000 for 1000 hours of A100 80GB vRAM

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:

  • Public domain audio
  • Audio licensed under Apache, MIT, etc.
  • Synthetic audio [1] generated by closed [2] TTS models from large providers

[1] https://copyright.gov/ai/ai_policy_guidance.pdf
[2] No synthetic audio from open TTS models or “custom voice clones”

Total Dataset Size: A few hundred hours of audio

Creative Commons Attribution

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

| Audio Data | Duration Used | License | Added to Training Set After |
| --- | --- | --- | --- |
| Koniwa tnc | <1h | CC BY 3.0 | v0.19 / 22 Nov 2024 |
| SIWIS | <11h | CC BY 4.0 | v0.19 / 22 Nov 2024 |

Acknowledgements

  • 🛠️ @yl4579 for architecting StyleTTS 2.
  • 🏆 @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
  • 📊 Thank you to everyone who contributed synthetic training data.
  • ❤️ Special thanks to all compute sponsors.
  • 👾 Discord server: https://discord.gg/QuGxSWBfQy
  • 🪽 Kokoro is a Japanese word that translates to “heart” or “spirit”. Kokoro is also the name of an AI in the Terminator franchise.
