Kimi-Audio-7B-Instruct: universal audio model (Cog implementation)
This Replicate model runs Kimi-Audio-7B-Instruct, Moonshot AI's open-source, seven-billion-parameter audio model. It listens to any sound, understands what's happening, and can answer in text or speech. The same checkpoint handles speech-to-text, audio question answering, audio captioning, emotion recognition, sound-event classification, and two-way voice chat. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face, [2504.18425] Kimi-Audio Technical Report - arXiv.org)
GitHub: https://github.com/MoonshotAI/Kimi-Audio
Technical report: arXiv:2504.18425
Hugging Face weights: https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct
About the model
Kimi-Audio turns raw audio into continuous acoustic features and discrete semantic tokens, then feeds both to a Qwen 2.5 transformer with parallel heads for text and audio generation. A flow-matching vocoder streams 24 kHz speech with about 300 milliseconds of latency. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, moonshotai/Kimi-Audio-7B - Hugging Face)
The model was pre-trained on more than thirteen million hours of speech, music, and everyday sounds, then fine-tuned with conversation data so it follows chat prompts. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, How to Install Kimi-Audio 7B Instruct Locally - DEV Community)
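To make the interface concrete, here is a minimal transcription sketch based on the `KimiAudio` wrapper shown in the upstream GitHub README; the import path, sampling-parameter names, and the example file path are taken from those docs and may drift between releases.

```python
from kimia_infer.api.kimia import KimiAudio

# Load the 7B checkpoint plus the detokenizer (only needed for speech output).
model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)

# Conversation turns mix text and audio; audio content is a path to a file.
messages = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"},
]

# output_type="text" asks for a transcript only; the sampling knobs mirror the upstream example.
_, text = model.generate(messages, text_temperature=0.0, text_top_k=5, output_type="text")
print(text)
```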
Key features
- Many skills, one model: speech-to-text, audio Q&A, captioning, emotion tags, sound-event labels, and voice responses. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face)
- Real-time streaming: talk to it and get text and speech back while you're still speaking. ([2504.18425] Kimi-Audio Technical Report - arXiv.org)
- Chat formatting out of the box: send `role`, `message_type`, and `content` fields just like other chat models, as in the sketch below. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face)
- Permissive license: code that comes from Qwen 2.5 is Apache 2.0; the rest is MIT. You can use it in commercial products as long as you keep the notices. (GitHub - MoonshotAI/Kimi-Audio, moonshotai/Kimi-Audio-7B - Hugging Face)
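For the voice-chat path, the same message schema carries an audio turn in and requests speech out. This is again a hedged sketch built on the assumed `KimiAudio` wrapper, with `soundfile` writing the 24 kHz waveform; the audio sampling parameters echo the upstream examples.

```python
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)

# A spoken question goes in as an audio turn; no text prompt is required.
messages = [
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"},
]

# output_type="both" returns the reply as a waveform tensor plus the text.
wav, text = model.generate(messages, audio_temperature=0.8, audio_top_k=10, output_type="both")

sf.write("reply.wav", wav.detach().cpu().view(-1).numpy(), 24000)  # 24 kHz speech out
print(text)
```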
Replicate packaging
| Component | Source |
|---|---|
| Transformer weights | moonshotai/Kimi-Audio-7B-Instruct (≈ 9.8 GB, bf16) (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face) |
| Tokenizers + vocoder | Bundled in the GitHub repo (GitHub - MoonshotAI/Kimi-Audio) |
| Docker base image | moonshotai/kimi-audio:v0.1 (Kimi-Audio/Dockerfile at master · MoonshotAI/Kimi-Audio - GitHub) |
The Cog container caches weights under `/model_cache` and sets `HF_HOME` and `TORCH_HOME` so the files are reused across runs.
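In practice that caching boils down to two environment variables set before anything imports torch or transformers. A minimal sketch of how the top of predict.py might do it (the `/model_cache` path comes from this repo; `HF_HOME` and `TORCH_HOME` are the standard Hugging Face and PyTorch cache variables):

```python
import os

# Point both the Hugging Face and PyTorch caches at the persistent volume,
# so repeated predictions reuse the already-downloaded weights.
os.environ["HF_HOME"] = "/model_cache"
os.environ["TORCH_HOME"] = "/model_cache"
```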
`predict.py` flow
- Load weights, tokenizers, and detokenizer onto the GPU.
- Input accepts an `audio` file or URL plus an optional text `prompt`; you can tweak temperatures, top-k values, and the random `seed`.
- Generate text only, audio only, or both, using the model's `generate` method (see the sketch after this list).
- Return a JSON payload with a path to the WAV file (if speech was requested) and the generated text.
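A compressed sketch of how those steps might wire into Cog's predictor interface. The `BasePredictor`, `BaseModel`, `Input`, and `Path` imports are Cog's real API; the `KimiAudio` calls and parameter names are assumptions carried over from the upstream examples, not the definitive implementation.

```python
import os
import tempfile
from typing import Optional

import soundfile as sf
from cog import BaseModel, BasePredictor, Input, Path

from kimia_infer.api.kimia import KimiAudio  # assumed import, as above


class Output(BaseModel):
    text: str
    audio: Optional[Path] = None


class Predictor(BasePredictor):
    def setup(self):
        # One-time load of weights, tokenizers, and detokenizer onto the GPU.
        self.model = KimiAudio(
            model_path="moonshotai/Kimi-Audio-7B-Instruct",
            load_detokenizer=True,
        )

    def predict(
        self,
        audio: Path = Input(description="Audio file to analyse"),
        prompt: str = Input(description="Optional text instruction", default=""),
        # The upstream examples show "text" and "both" output modes.
        output_type: str = Input(default="both", choices=["text", "both"]),
        text_temperature: float = Input(default=0.0),
    ) -> Output:
        messages = []
        if prompt:
            messages.append({"role": "user", "message_type": "text", "content": prompt})
        messages.append({"role": "user", "message_type": "audio", "content": str(audio)})

        wav, text = self.model.generate(
            messages, output_type=output_type, text_temperature=text_temperature
        )

        wav_path = None
        if wav is not None:
            # Write the 24 kHz waveform to a temp file and return its path.
            wav_path = os.path.join(tempfile.mkdtemp(), "output.wav")
            sf.write(wav_path, wav.detach().cpu().view(-1).numpy(), 24000)
        return Output(text=text, audio=Path(wav_path) if wav_path else None)
```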
Expect about 24 GB of GPU memory for full-precision weights, or roughly 8 GB with 4-bit quantization at slower speed. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, moonshotai/Kimi-Audio-7B - Hugging Face)
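If you want to experiment with the 4-bit path, one plausible route is transformers' bitsandbytes integration. This is only a sketch: whether Moonshot's custom model class loads cleanly under `trust_remote_code` with a quantization config is an assumption here, not something the upstream docs promise.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with bf16 compute: roughly a quarter of the bf16 footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# trust_remote_code pulls in Moonshot's custom model class from the Hub;
# compatibility with 4-bit loading is assumed, not documented upstream.
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Audio-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```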
How it works under the hood
- Hybrid tokens: continuous vectors give fine acoustic detail, discrete tokens capture meaning. ([2504.18425] Kimi-Audio Technical Report - arXiv.org)
- Flow-matching vocoder: converts semantic tokens to waveform with tiny look-ahead. (moonshotai/Kimi-Audio-7B - Hugging Face)
- Open evaluation kit: Moonshot provides a benchmarking toolkit if you want to compare your own models. (MoonshotAI/Kimi-Audio-Evalkit - GitHub)
Use cases
- Build a voice assistant that actually answers instead of reading web snippets.
- Add live captions and emotion tags to video calls.
- Monitor factory sounds for unusual events.
- Turn recorded meetings into searchable text.
Limitations
- Needs a modern NVIDIA GPU; consumer laptops may struggle.
- The current speech synthesizer focuses on English and Mandarin phonemes. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face)
- In very long conversations it can sometimes make up events it never heard. (Kimi-Audio Technical Report - arXiv.org)
License and disclaimer
Kimi-Audio-7B-Instruct weights are MIT. Code that originated in Qwen 2.5 is Apache 2.0. (GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio …)
You are responsible for any content you generate with this model. Follow local laws and the upstream license terms.
Citation
```bibtex
@misc{kimi_audio_2025,
  title         = {Kimi-Audio Technical Report},
  author        = {Kimi Team},
  year          = {2025},
  eprint        = {2504.18425},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```
Cog implementation managed by zsxkib.
Star the repo on GitHub once it's live. ⭐
Follow me on Twitter/X