README.md · hexgrad/Kokoro-82M at 04393022a04dcb701fd2061b8d5dbe23ac7be751

metadata

license: apache-2.0
language:
  - en
base_model:
  - yl4579/StyleTTS2-LJSpeech
pipeline_tag: text-to-speech

🐈 GitHub: https://github.com/hexgrad/kokoro

🚀 Demo: https://hf.co/spaces/hexgrad/Kokoro-TTS

🚫 Beware of fake websites: Sites like kokorottsai [dot] com are blank boilerplates that exploit model popularity to deceptively farm emails. The owner of this model does NOT own any domain name that includes "kokoro". Be careful who you email.

Kokoro is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, Kokoro can be deployed anywhere from production environments to personal projects.

Releases
Usage
EVAL.md ↗️
SAMPLES.md ↗️
VOICES.md ↗️
Model Facts
Training Details
Creative Commons Attribution
Acknowledgements

Releases

Model	Published	Training Data	Langs & Voices	SHA256
v1.0	2025 Jan 27	Few hundred hrs	8 & 54	`496dba11`
v0.19	2024 Dec 25	<100 hrs	1 & 10	`3b0c392f`

Training Costs	v0.19	v1.0	Total
in A100 80GB GPU hours	500	500	1000
average hourly rate	$0.80/h	$1.20/h	$1/h
in USD	$400	$600	$1000
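
The dollar figures in the table follow from multiplying GPU hours by the average hourly rate. A minimal sketch of that arithmetic, using only the numbers above (the dictionary layout and the `summarize` function name are illustrative and not part of any Kokoro tooling):

```python
# Figures copied from the Training Costs table; structure and names are illustrative.
RUNS = {
    "v0.19": {"gpu_hours": 500, "usd_per_hour": 0.80},
    "v1.0": {"gpu_hours": 500, "usd_per_hour": 1.20},
}

def summarize(runs):
    """Return (total GPU hours, total USD, blended USD per GPU hour)."""
    total_hours = sum(r["gpu_hours"] for r in runs.values())
    total_usd = sum(r["gpu_hours"] * r["usd_per_hour"] for r in runs.values())
    return total_hours, total_usd, total_usd / total_hours

hours, usd, rate = summarize(RUNS)
print(f"{hours} GPU hours, ${usd:.0f} total, ${rate:.2f}/h average")
```

The blended rate is the total cost divided by the total hours, which is why it lands between the two per-version rates.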

Usage

You can run this basic cell on Google Colab. To hear example output, see SAMPLES.md. For more languages and details, see Advanced Usage.

# Quote the requirement so the shell does not treat ">" as an output redirect
!pip install -q "kokoro>=0.9.2" soundfile
!apt-get -qq -y install espeak-ng > /dev/null 2>&1
from kokoro import KPipeline
from IPython.display import display, Audio
import soundfile as sf
import torch
# lang_code='a' selects American English
pipeline = KPipeline(lang_code='a')
text = '''
[Kokoro](/kˈOkəɹO/) is an open-weight TTS model with 82 million parameters. Despite its lightweight architecture, it delivers comparable quality to larger models while being significantly faster and more cost-efficient. With Apache-licensed weights, [Kokoro](/kˈOkəɹO/) can be deployed anywhere from production environments to personal projects.
'''
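generator = pipeline(text, voice='af_heart')
# Each item holds the graphemes (gs), phonemes (ps), and audio for one segment of the text
for i, (gs, ps, audio) in enumerate(generator):
    print(i, gs, ps)
    # 24000 is the sample rate in Hz; only the first segment autoplays
    display(Audio(data=audio, rate=24000, autoplay=i==0))
    sf.write(f'{i}.wav', audio, 24000)  # save each segment as its own WAV file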

Under the hood, kokoro uses misaki, a G2P (grapheme-to-phoneme) library at https://github.com/hexgrad/misaki
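
The sample text above spells out a pronunciation inline using the form `[Kokoro](/kˈOkəɹO/)`: the word in square brackets, followed by its phonemes between slashes inside parentheses. As a sketch, the helper below wraps every whole-word occurrence of a term in that form before the text is handed to the pipeline. The name `annotate_pronunciation` is hypothetical and not part of kokoro or misaki; only the markup shape is taken from the example above.

```python
import re

def annotate_pronunciation(text: str, word: str, phonemes: str) -> str:
    """Wrap each whole-word occurrence of `word` as [word](/phonemes/).

    Hypothetical helper: it only builds the markup seen in the sample text
    and does not call kokoro or misaki. Do not run it on text that is
    already annotated, or the word would be wrapped twice.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    # A function replacement avoids backslash-escape handling in the phoneme string.
    return pattern.sub(lambda m: f"[{m.group(0)}](/{phonemes}/)", text)

plain = "Kokoro is an open-weight TTS model. Kokoro has 82 million parameters."
marked = annotate_pronunciation(plain, "Kokoro", "kˈOkəɹO")
print(marked)
```

The resulting string can then be passed as `text` in the usage cell in place of hand-written annotations.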

Model Facts

Architecture:

StyleTTS 2: https://arxiv.org/abs/2306.07691
ISTFTNet: https://arxiv.org/abs/2203.02395
Decoder only: no diffusion, no encoder release

Architected by: Li et al @ https://github.com/yl4579/StyleTTS2

Trained by: @rzvzn on Discord

Languages: Multiple (8 as of v1.0; see VOICES.md)

Model SHA256 Hash: 496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4
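
One way to use the published hash is to check a downloaded weights file against it. The following is a minimal sketch using Python's standard `hashlib`; the function names are illustrative, and the file path in the usage note is a placeholder, since the checkpoint filename is not stated here.

```python
import hashlib
from pathlib import Path

# Full hash as published under Model Facts.
EXPECTED_SHA256 = "496dba118d1a58f5f3db2efc88dbdc216e0483fc89fe6e47ee1f2c53f18ad1e4"

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True if the file's SHA256 equals the expected hex digest."""
    return sha256_of_file(path) == expected.lower()
```

For example, `matches_published_hash("path/to/weights.pth")` (a placeholder path) is expected to return `True` for an unmodified copy of the v1.0 weights. The Releases table lists only the first eight hex characters of each hash, so compare against the full value when verifying a download.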

Training Details

Data: Kokoro was trained exclusively on permissive/non-copyrighted audio data and IPA phoneme labels. Examples of permissive/non-copyrighted audio include:

Public domain audio
Audio licensed under Apache, MIT, etc.
Synthetic audio^[1] generated by closed^[2] TTS models from large providers
[1] https://copyright.gov/ai/ai_policy_guidance.pdf
[2] No synthetic audio from open TTS models or "custom voice clones"

Total Dataset Size: A few hundred hours of audio

Total Training Cost: About $1000 for 1000 GPU hours on A100 80GB

Creative Commons Attribution

The following CC BY audio was part of the dataset used to train Kokoro v1.0.

Audio Data	Duration Used	License	Added to Training Set After
Koniwa `tnc`	<1h	CC BY 3.0	v0.19 / 22 Nov 2024
SIWIS	<11h	CC BY 4.0	v0.19 / 22 Nov 2024

Acknowledgements

🛠️ @yl4579 for architecting StyleTTS 2.
🏆 @Pendrokar for adding Kokoro as a contender in the TTS Spaces Arena.
📊 Thank you to everyone who contributed synthetic training data.
❤️ Special thanks to all compute sponsors.
👾 Discord server: https://discord.gg/QuGxSWBfQy
🪽 Kokoro is a Japanese word that translates to "heart" or "spirit". Kokoro is also the name of an AI in the Terminator franchise.