---
license: apache-2.0
language:
- en
base_model:
- yl4579/StyleTTS2-LJSpeech
pipeline_tag: text-to-speech
---

Disclaimer

This is a fork of the original Kokoro repo, created to provide easy inference on Replicate. I am not affiliated with the original Kokoro authors, and this is not an official release of the Kokoro model. Like the Hugging Face Space, this implementation splits text automatically to support long-form inputs. See the original README below for more details.
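This README does not describe how the automatic text splitting works. As a rough sketch of one possible approach, the hypothetical `split_text` helper below groups whole sentences into chunks under a character limit. The function name, the sentence regex, and the 300-character default are all assumptions made for illustration, not the fork's actual implementation.

```python
import re


def split_text(text, max_chars=300):
    """Group sentences into chunks of at most max_chars characters.

    Hypothetical helper: the limit and the sentence regex are assumptions.
    A single sentence longer than max_chars is kept whole rather than cut.
    """
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be synthesized separately and the resulting audio concatenated. Splitting on sentence boundaries is one reasonable choice because it avoids cutting a phrase in the middle.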

Voices

Training Duration - How much audio was seen during training? Smaller durations result in a lower overall grade.
- 10 hours <= HH hours < 100 hours
- 1 hour <= H hours < 10 hours
- 10 minutes <= MM minutes < 100 minutes
- 1 minute <= M minutes < 10 minutes
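To make the duration notation concrete, here is a small sketch that maps a duration in minutes to the bucket labels used in the tables. The `duration_code` function is hypothetical. Note that the hour and minute ranges overlap between 60 and 99 minutes; this sketch assumes the hour-based label wins in that case, which the legend does not state.

```python
def duration_code(minutes):
    """Map a training duration in minutes to the legend's bucket label.

    Hypothetical helper. Assumption: hour-based buckets are checked first,
    so 60 to 99 minutes is reported as "H hours" rather than "MM minutes".
    """
    hours = minutes / 60
    if 10 <= hours < 100:
        return "HH hours"
    if 1 <= hours < 10:
        return "H hours"
    if 10 <= minutes < 100:
        return "MM minutes"
    if 1 <= minutes < 10:
        return "M minutes"
    raise ValueError(f"{minutes} minutes is outside the ranges in the legend")
```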

American English 🇺🇸

misaki[en] lang_code='a' with en-us espeak-ng fallback

Name	Traits	Target Quality	Training Duration	Overall Grade	SHA256
af_alloy	🚺	B	MM minutes	C	`6d877149`
af_aoede	🚺	B	H hours	C+	`c03bd1a4`
af_bella	🚺🔥	A	HH hours	A-	`8cb64e02`
af_jessica	🚺	C	MM minutes	D	`cdfdccb8`
af_kore	🚺	B	H hours	C+	`8bfbc512`
af_nicole	🚺🎧	B	HH hours	B-	`c5561808`
af_nova	🚺	B	MM minutes	C	`e0233676`
af_river	🚺	C	MM minutes	D	`e149459b`
af_sarah	🚺	B	H hours	C+	`49bd364e`
af_sky	🚺	B	M minutes	C-	`c799548a`
am_adam	🚹	D	H hours	F+	`ced7e284`
am_echo	🚹	C	MM minutes	D	`8bcfdc85`
am_eric	🚹	C	MM minutes	D	`ada66f0e`
am_fenrir	🚹	B	H hours	C+	`98e507ec`
am_liam	🚹	C	MM minutes	D	`c8255075`
am_michael	🚹	B	H hours	C+	`9a443b79`
am_onyx	🚹	C	MM minutes	D	`e8452be1`
am_puck	🚹	B	H hours	C+	`dd1d8973`

British English 🇬🇧

misaki[en] lang_code='b' with en-gb espeak-ng fallback

Name	Traits	Target Quality	Training Duration	Overall Grade	SHA256
bf_alice	🚺	C	MM minutes	D	`d292651b`
bf_emma	🚺	B	HH hours	B-	`d0a423de`
bf_isabella	🚺	B	MM minutes	C	`cdd4c370`
bf_lily	🚺	C	MM minutes	D	`6e09c2e4`
bm_daniel	🚹	C	MM minutes	D	`fc3fce4e`
bm_fable	🚹	B	MM minutes	C	`d44935f3`
bm_george	🚹	B	MM minutes	C	`f1bc8122`
bm_lewis	🚹	C	H hours	D+	`b5204750`

French 🇫🇷

espeak-ng fr-fr
Total French training data: <11 hours

Name	Traits	Target Quality	Training Duration	Overall Grade	SHA256	CC BY
ff_siwis	🚺	B	<11 hours	B-	`8073bf2d`	SIWIS

Hindi 🇮🇳

espeak-ng hi
Total Hindi training data: H hours

Name	Traits	Target Quality	Training Duration	Overall Grade	SHA256
hf_alpha	🚺	B	MM minutes	C	`06906fe0`
hf_beta	🚺	B	MM minutes	C	`63c0a1a6`
hm_omega	🚹	B	MM minutes	C	`b55f02a8`
hm_psi	🚹	B	MM minutes	C	`2f0f055c`

Italian 🇮🇹

espeak-ng it
Total Italian training data: H hours

Name	Traits	Target Quality	Training Duration	Overall Grade	SHA256
if_sara	🚺	B	MM minutes	C	`6c0b253b`
im_nicola	🚹	B	MM minutes	C	`234ed066`

Japanese 🇯🇵

misaki[ja]
Total Japanese training data: H hours

Name	Traits	Target Quality	Training Duration	Overall Grade	SHA256	CC BY
jf_alpha	🚺	B	H hours	C+	`1bf4c9dc`
jf_gongitsune	🚺	B	MM minutes	C	`1b171917`	gongitsune
jf_nezumi	🚺	B	M minutes	C-	`d83f007a`	nezuminoyomeiri
jf_tebukuro	🚺	B	MM minutes	C	`0d691790`	tebukurowokaini
jm_kumo	🚹	B	M minutes	C-	`98340afd`	kumonoito

Mandarin Chinese 🇨🇳

misaki[zh]
Total Mandarin Chinese training data: H hours

Name	Traits	Target Quality	Training Duration	Overall Grade	SHA256
zf_xiaobei	🚺	C	MM minutes	D	`9b76be63`
zf_xiaoni	🚺	C	MM minutes	D	`95b49f16`
zf_xiaoxiao	🚺	C	MM minutes	D	`cfaf6f2d`
zf_xiaoyi	🚺	C	MM minutes	D	`b5235dba`
zm_yunjian	🚹	C	MM minutes	D	`76cbf8ba`
zm_yunxi	🚹	C	MM minutes	D	`dbe6e1ce`
zm_yunxia	🚹	C	MM minutes	D	`bb2b03b0`
zm_yunyang	🚹	C	MM minutes	D	`5238ac22`
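The voice names in the tables above appear to follow a pattern: the first letter matches the language section and the second letter matches the 🚺/🚹 trait. The sketch below parses a name on that basis. The mapping is inferred from the tables rather than taken from a documented rule, and `parse_voice` is a hypothetical helper, not part of the kokoro library.

```python
# Inferred from the section headings and voice names in the tables above.
LANGUAGES = {
    "a": "American English",
    "b": "British English",
    "f": "French",
    "h": "Hindi",
    "i": "Italian",
    "j": "Japanese",
    "z": "Mandarin Chinese",
}
# Inferred from the trait column: "f" rows carry 🚺, "m" rows carry 🚹.
GENDERS = {"f": "female", "m": "male"}


def parse_voice(name):
    """Split a voice name such as 'af_bella' into its inferred parts."""
    prefix, sep, speaker = name.partition("_")
    if not sep or len(prefix) != 2 or not speaker:
        raise ValueError(f"unexpected voice name: {name!r}")
    lang, gender = prefix
    try:
        return {
            "language": LANGUAGES[lang],
            "gender": GENDERS[gender],
            "speaker": speaker,
        }
    except KeyError:
        raise ValueError(f"unknown prefix in voice name: {name!r}") from None
```

For example, `parse_voice("bm_george")` would report British English, male, speaker `george`.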

✨ You can now pip install kokoro! See Usage.

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Releases

Model	Published	Training Data	Compute (A100 80GB)	Langs & Voices	SHA256
v1.0	2025 Jan 27	Few hundred hrs	$1000 for 1000 hrs	6 & 46	`496dba11`
v0.19	2024 Dec 25	<100 hrs	$400 for 500 hrs	1 & 10	`3b0c392f`
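As a quick arithmetic check on the compute column, the figures imply an average rate per A100 hour for each release. The sketch below only divides the numbers from the table; since the costs are approximate, the results should be read as rough averages.

```python
# (cost in USD, A100 80GB hours), copied from the Releases table.
RELEASES = {
    "v1.0": (1000, 1000),
    "v0.19": (400, 500),
}


def hourly_rate(cost_usd, gpu_hours):
    """Average cost per GPU hour implied by a total cost and total hours."""
    return cost_usd / gpu_hours


rates = {name: hourly_rate(cost, hours) for name, (cost, hours) in RELEASES.items()}
print(rates)  # {'v1.0': 1.0, 'v0.19': 0.8}
```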

Usage

`pip install kokoro` installs the inference library at https://github.com/hexgrad/kokoro

Under the hood, kokoro uses misaki, a G2P library at https://github.com/hexgrad/misaki
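This README does not show a usage snippet, so the following is a minimal sketch only. It assumes the upstream library exposes a `KPipeline` class that takes a `lang_code`, is called with text and a `voice`, and yields items that unpack as (graphemes, phonemes, audio). Check the kokoro repository for the current API before relying on it. The `synthesize` wrapper and its `pipeline_factory` parameter are hypothetical and exist only so the function can be exercised without the real model.

```python
def synthesize(text, voice="af_bella", lang_code="a", pipeline_factory=None):
    """Return the list of audio segments generated for text.

    pipeline_factory is a hypothetical injection point for testing. When it
    is omitted, kokoro.KPipeline is imported lazily, which requires
    `pip install kokoro`.
    """
    if pipeline_factory is None:
        from kokoro import KPipeline  # assumed upstream entry point

        pipeline_factory = KPipeline
    pipeline = pipeline_factory(lang_code=lang_code)
    segments = []
    # Assumption: each yielded item unpacks as (graphemes, phonemes, audio).
    for _graphemes, _phonemes, audio in pipeline(text, voice=voice):
        segments.append(audio)
    return segments
```

The defaults pair `lang_code="a"` with an American English voice from the tables above; a British voice such as `bf_emma` would go with `lang_code="b"`.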

Model Facts

Architecture:
- StyleTTS 2: https://arxiv.org/abs/2306.07691
- ISTFTNet: https://arxiv.org/abs/2203.02395
- Decoder only: no diffusion, no encoder release

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Trained by: @rzvzn on Discord

Languages: American English, British English, French, Hindi, Italian, Japanese, Mandarin Chinese

Model SHA256 Hash: 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4
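A downloaded file can be checked against a published hash with Python's standard `hashlib`. The sketch below is one way to do that. The helper names are hypothetical, and accepting a prefix is an assumption made so the same function also works with the short 8-character hashes shown in the tables. A prefix match is a much weaker check than a full hash match.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path, expected):
    """True if the file's SHA256 starts with expected (full hash or prefix)."""
    if not expected:
        raise ValueError("expected hash must not be empty")
    return sha256_of_file(path).startswith(expected.lower())
```

For the model weights, `expected` would be the full 64-character hash listed above and `path` the location of the downloaded file.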

Training Details

Compute: About $1000 for 1000 hours of A100 80GB vRAM

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:
- Public domain audio
- Audio licensed under Apache, MIT, etc
- Synthetic audio<sup>[1]</sup> generated by closed<sup>[2]</sup> TTS models from large providers

[1] https://copyright.gov/ai/ai_policy_guidance.pdf
[2] No synthetic audio from open TTS models or “custom voice clones”

Total Dataset Size: A few hundred hours of audio

Creative Commons Attribution

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

Audio Data	Duration Used	License	Added to Training Set After
Koniwa `tnc`	<1h	CC BY 3.0	v0.19 / 22 Nov 2024
SIWIS	<11h	CC BY 4.0	v0.19 / 22 Nov 2024

Acknowledgements

🛠️ @yl4579 for architecting StyleTTS 2.
🏆 @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
📊 Thank you to everyone who contributed synthetic training data.
❤️ Special thanks to all compute sponsors.
👾 Discord server: https://discord.gg/QuGxSWBfQy
🪽 Kokoro is a Japanese word that translates to “heart” or “spirit”. Kokoro is also the name of an AI in the Terminator franchise.
