Kimi-Audio-7B-Instruct: universal audio model (Cog implementation)
This Replicate model runs Kimi-Audio-7B-Instruct, Moonshot AI's open-source, seven-billion-parameter audio model. It listens to any sound, understands what's happening, and can answer in text or speech. The same checkpoint handles speech-to-text, audio question answering, audio captioning, emotion recognition, sound-event classification, and two-way voice chat. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face, [2504.18425] Kimi-Audio Technical Report - arXiv.org)
GitHub: https://github.com/MoonshotAI/Kimi-Audio
Technical report: arXiv:2504.18425
Hugging Face weights: https://huggingface.co/moonshotai/Kimi-Audio-7B-Instruct
About the model
Kimi-Audio turns raw audio into continuous acoustic features and discrete semantic tokens, then feeds both to a Qwen 2.5 transformer with parallel heads for text and audio generation. A flow-matching vocoder streams 24 kHz speech with about 300 milliseconds of latency. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, moonshotai/Kimi-Audio-7B - Hugging Face)
The model was pre-trained on more than thirteen million hours of speech, music, and everyday sounds, then fine-tuned with conversation data so it follows chat prompts. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, How to Install Kimi-Audio 7B Instruct Locally - DEV Community)
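To make the interface concrete, here is a minimal transcription sketch based on the `KimiAudio` wrapper shown in the upstream GitHub README; the import path, sampling-parameter names, and the example file path are taken from those docs and may drift between releases.

```python
from kimia_infer.api.kimia import KimiAudio

# Load the 7B checkpoint plus the detokenizer (only needed for speech output).
model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)

# Conversation turns mix text and audio; audio content is a path to a file.
messages = [
    {"role": "user", "message_type": "text", "content": "Please transcribe the following audio:"},
    {"role": "user", "message_type": "audio", "content": "test_audios/asr_example.wav"},
]

# output_type="text" asks for a transcript only; the sampling knobs mirror the upstream example.
_, text = model.generate(messages, text_temperature=0.0, text_top_k=5, output_type="text")
print(text)
```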
Key features
- Many skills, one model: speech-to-text, audio Q&A, captioning, emotion tags, sound-event labels, and voice responses. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face)
- Real-time streaming: talk to it and get text and speech back while you're still speaking. ([2504.18425] Kimi-Audio Technical Report - arXiv.org)
- Chat formatting out of the box: send `role`, `message_type`, and `content` fields just like other chat models, as in the sketch below. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face)
- Permissive license: code that comes from Qwen 2.5 is Apache 2.0; the rest is MIT. You can use it in commercial products as long as you keep the notices. (GitHub - MoonshotAI/Kimi-Audio, moonshotai/Kimi-Audio-7B - Hugging Face)
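For the voice-chat path, the same message schema carries an audio turn in and requests speech out. This is again a hedged sketch built on the assumed `KimiAudio` wrapper, with `soundfile` writing the 24 kHz waveform; the audio sampling parameters echo the upstream examples.

```python
import soundfile as sf
from kimia_infer.api.kimia import KimiAudio

model = KimiAudio(model_path="moonshotai/Kimi-Audio-7B-Instruct", load_detokenizer=True)

# A spoken question goes in as an audio turn; no text prompt is required.
messages = [
    {"role": "user", "message_type": "audio", "content": "test_audios/qa_example.wav"},
]

# output_type="both" returns the reply as a waveform tensor plus the text.
wav, text = model.generate(messages, audio_temperature=0.8, audio_top_k=10, output_type="both")

sf.write("reply.wav", wav.detach().cpu().view(-1).numpy(), 24000)  # 24 kHz speech out
print(text)
```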
Replicate packaging
| Component | Source |
|---|---|
| Transformer weights | moonshotai/Kimi-Audio-7B-Instruct (≈ 9.8 GB, bf16) (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face) |
| Tokenizers + vocoder | Bundled in the GitHub repo (GitHub - MoonshotAI/Kimi-Audio) |
| Docker base image | moonshotai/kimi-audio:v0.1 (Kimi-Audio/Dockerfile at master · MoonshotAI/Kimi-Audio - GitHub) |
The Cog container caches weights under `/model_cache` and sets `HF_HOME` and `TORCH_HOME` so the files are reused across runs.
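In practice that caching boils down to two environment variables set before anything imports torch or transformers. A minimal sketch of how the top of predict.py might do it (the `/model_cache` path comes from this repo; `HF_HOME` and `TORCH_HOME` are the standard Hugging Face and PyTorch cache variables):

```python
import os

# Point both the Hugging Face and PyTorch caches at the persistent volume,
# so repeated predictions reuse the already-downloaded weights.
os.environ["HF_HOME"] = "/model_cache"
os.environ["TORCH_HOME"] = "/model_cache"
```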
`predict.py` flow
- Load weights, tokenizers, and detokenizer onto the GPU.
- Input accepts an `audio` file or URL plus an optional text `prompt`; you can tweak temperatures, top-k values, and the random `seed`.
- Generate text only, audio only, or both, using the model's `generate` method (see the sketch after this list).
- Return a JSON payload with a path to the WAV file (if speech was requested) and the generated text.
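A compressed sketch of how those steps might wire into Cog's predictor interface. The `BasePredictor`, `BaseModel`, `Input`, and `Path` imports are Cog's real API; the `KimiAudio` calls and parameter names are assumptions carried over from the upstream examples, not the definitive implementation.

```python
import os
import tempfile
from typing import Optional

import soundfile as sf
from cog import BaseModel, BasePredictor, Input, Path

from kimia_infer.api.kimia import KimiAudio  # assumed import, as above


class Output(BaseModel):
    text: str
    audio: Optional[Path] = None


class Predictor(BasePredictor):
    def setup(self):
        # One-time load of weights, tokenizers, and detokenizer onto the GPU.
        self.model = KimiAudio(
            model_path="moonshotai/Kimi-Audio-7B-Instruct",
            load_detokenizer=True,
        )

    def predict(
        self,
        audio: Path = Input(description="Audio file to analyse"),
        prompt: str = Input(description="Optional text instruction", default=""),
        # The upstream examples show "text" and "both" output modes.
        output_type: str = Input(default="both", choices=["text", "both"]),
        text_temperature: float = Input(default=0.0),
    ) -> Output:
        messages = []
        if prompt:
            messages.append({"role": "user", "message_type": "text", "content": prompt})
        messages.append({"role": "user", "message_type": "audio", "content": str(audio)})

        wav, text = self.model.generate(
            messages, output_type=output_type, text_temperature=text_temperature
        )

        wav_path = None
        if wav is not None:
            # Write the 24 kHz waveform to a temp file and return its path.
            wav_path = os.path.join(tempfile.mkdtemp(), "output.wav")
            sf.write(wav_path, wav.detach().cpu().view(-1).numpy(), 24000)
        return Output(text=text, audio=Path(wav_path) if wav_path else None)
```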
Expect about 24 GB of GPU memory for full-precision weights, or roughly 8 GB with 4-bit quantization at slower speed. ([2504.18425] Kimi-Audio Technical Report - arXiv.org, moonshotai/Kimi-Audio-7B - Hugging Face)
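If you want to experiment with the 4-bit path, one plausible route is transformers' bitsandbytes integration. This is only a sketch: whether Moonshot's custom model class loads cleanly under `trust_remote_code` with a quantization config is an assumption here, not something the upstream docs promise.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit weights with bf16 compute: roughly a quarter of the bf16 footprint.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# trust_remote_code pulls in Moonshot's custom model class from the Hub;
# compatibility with 4-bit loading is assumed, not documented upstream.
model = AutoModelForCausalLM.from_pretrained(
    "moonshotai/Kimi-Audio-7B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
```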
How it works under the hood
- Hybrid tokens: continuous vectors give fine acoustic detail, discrete tokens capture meaning. ([2504.18425] Kimi-Audio Technical Report - arXiv.org)
- Flow-matching vocoder: converts semantic tokens to waveform with tiny look-ahead. (moonshotai/Kimi-Audio-7B - Hugging Face)
- Open evaluation kit: Moonshot provides a benchmarking toolkit if you want to compare your own models. (MoonshotAI/Kimi-Audio-Evalkit - GitHub)
Use cases
- Build a voice assistant that actually answers instead of reading web snippets.
- Add live captions and emotion tags to video calls.
- Monitor factory sounds for unusual events.
- Turn recorded meetings into searchable text.
Limitations
- Needs a modern NVIDIA GPU; consumer laptops may struggle.
- The current speech synthesizer focuses on English and Mandarin phonemes. (moonshotai/Kimi-Audio-7B-Instruct - Hugging Face)
- In very long conversations it can sometimes make up events it never heard. (Kimi-Audio Technical Report - arXiv.org)
License and disclaimer
Kimi-Audio-7B-Instruct weights are MIT. Code that originated in Qwen 2.5 is Apache 2.0. (GitHub - MoonshotAI/Kimi-Audio: Kimi-Audio, an open-source audio …)
You are responsible for any content you generate with this model. Follow local laws and the upstream license terms.
Citation
```bibtex
@misc{kimi_audio_2025,
  title         = {Kimi-Audio Technical Report},
  author        = {Kimi Team},
  year          = {2025},
  eprint        = {2504.18425},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```
Cog implementation managed by zsxkib.
Star the repo on GitHub once it's live. ⭐
Follow me on Twitter/X