zsxkib / kimi-audio-7b-instruct

🎧 Kimi-Audio-7B-Instruct unifies ASR, audio reasoning, captioning, emotion sensing, and TTS in one universal model 🔊


Kimi-Audio-7B-Instruct: universal audio model 🔊 (Cog implementation)


This Replicate model runs Kimi-Audio-7B-Instruct, Moonshot AI’s open-source, seven-billion-parameter audio model. It listens to any sound, understands what’s happening, and can answer in text or speech. The same checkpoint handles speech-to-text, audio question answering, audio captioning, emotion recognition, sound-event classification, and two-way voice chat. (Hugging Face model card; arXiv:2504.18425)

GitHub: https://github.com/MoonshotAI/Kimi-Audio
Technical report: arXiv:2504.18425
Hugging Face weights: moonshotai/Kimi-Audio-7B-Instruct


About the model

Kimi-Audio turns raw audio into continuous acoustic features and discrete semantic tokens, then feeds both to a Qwen 2.5 transformer with parallel heads for text and audio generation. A flow-matching vocoder streams 24 kHz speech with about 300 milliseconds of latency. (arXiv:2504.18425; Hugging Face model card)

The model was pre-trained on more than thirteen million hours of speech, music, and everyday sounds, then fine-tuned with conversation data so it follows chat prompts. (arXiv:2504.18425)


Key features

  • One checkpoint covers speech-to-text, audio question answering, audio captioning, emotion recognition, sound-event classification, and two-way voice chat.
  • Hybrid audio input: continuous acoustic features plus discrete semantic tokens.
  • Parallel heads generate text and audio in the same pass.
  • Flow-matching vocoder streams 24 kHz speech with roughly 300 ms of latency.
  • Pre-trained on more than thirteen million hours of speech, music, and everyday sounds.

Replicate packaging ⚙️

Component              Source
Transformer weights    moonshotai/Kimi-Audio-7B-Instruct (≈ 9.8 GB, bf16)
Tokenizers + vocoder   Bundled in the MoonshotAI/Kimi-Audio GitHub repo
Docker base image      moonshotai/kimi-audio:v0.1 (from the repo’s Dockerfile)

The Cog container caches weights under /model_cache and sets HF_HOME and TORCH_HOME so the files are reused across runs.

predict.py flow

  1. Load weights, tokenizers, and detokenizer onto the GPU.
  2. Accept an audio file or URL plus an optional text prompt. Temperatures, top-k values, and the random seed can be tweaked.
  3. Generate text only, audio only, or both, using the model’s generate method.
  4. Return a JSON payload with a path to the WAV file (if speech was requested) and the generated text.
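
The sketch below shows roughly how those four steps map onto a Cog predictor. It is illustrative rather than the shipped code: the kimia_infer import path and the generate() call follow the upstream repo’s README, while the input names, defaults, and WAV handling here are assumptions.

import os
from typing import Optional

import soundfile as sf
import torch
from cog import BaseModel, BasePredictor, Input, Path

MODEL_CACHE = "/model_cache"  # persistent cache reused across runs


class Output(BaseModel):
    text: str
    audio: Optional[Path]


class Predictor(BasePredictor):
    def setup(self):
        # Step 1: point the HF/Torch caches at the persistent volume, then
        # load weights, tokenizers, and the detokenizer onto the GPU.
        os.environ["HF_HOME"] = MODEL_CACHE
        os.environ["TORCH_HOME"] = MODEL_CACHE
        from kimia_infer.api.kimia import KimiAudio  # assumed upstream import path
        self.model = KimiAudio(
            model_path="moonshotai/Kimi-Audio-7B-Instruct",
            load_detokenizer=True,
        )

    def predict(
        self,
        audio: Path = Input(description="Audio file to analyse"),
        prompt: str = Input(default="", description="Optional text prompt"),
        output_type: str = Input(default="both", choices=["text", "audio", "both"]),
        text_temperature: float = Input(default=0.0),
        audio_temperature: float = Input(default=0.8),
        seed: int = Input(default=-1, description="Random seed, -1 for random"),
    ) -> Output:
        if seed >= 0:
            torch.manual_seed(seed)

        # Step 2: build a chat-style message list from the prompt and the audio clip.
        messages = []
        if prompt:
            messages.append({"role": "user", "message_type": "text", "content": prompt})
        messages.append({"role": "user", "message_type": "audio", "content": str(audio)})

        # Step 3: generate text, audio, or both with the model's generate method.
        wav, text = self.model.generate(
            messages,
            output_type=output_type,
            text_temperature=text_temperature,
            audio_temperature=audio_temperature,
        )

        # Step 4: return the text, plus a 24 kHz WAV file when speech was requested.
        wav_path = None
        if wav is not None:
            wav_path = Path("/tmp/output.wav")
            sf.write(str(wav_path), wav.detach().cpu().numpy().squeeze(), 24000)
        return Output(text=text, audio=wav_path)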

Expect about 24 GB of GPU memory for full-precision weights, or roughly 8 GB with 4-bit quantization at the cost of some speed. (arXiv:2504.18425; Hugging Face model card)
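
Once deployed, the model can be called with the standard Replicate Python client. The input field names below mirror the flow described above and are assumptions; check the model’s API tab on Replicate for the exact schema.

import replicate

output = replicate.run(
    "zsxkib/kimi-audio-7b-instruct",
    input={
        "audio": "https://example.com/meeting_clip.wav",  # any audio URL or uploaded file
        "prompt": "Summarise what is said in this recording.",
        "output_type": "text",  # hypothetical field name, see the model's API tab
    },
)
print(output)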


How it works under the hood

  1. The incoming waveform is encoded two ways: continuous acoustic features for fine-grained perception and discrete semantic tokens for language-style reasoning.
  2. Both streams feed the shared transformer, which is initialized from Qwen 2.5.
  3. Parallel heads decode text tokens and audio tokens side by side.
  4. A flow-matching detokenizer and vocoder turn the audio tokens back into 24 kHz speech with about 300 milliseconds of latency.

Use cases

  • Build a voice assistant that actually answers instead of reading web snippets.
  • Add live captions and emotion tags to video calls.
  • Monitor factory sounds for unusual events.
  • Turn recorded meetings into searchable text.

Limitations


License and disclaimer

Kimi-Audio-7B-Instruct weights are MIT. Code that originated in Qwen 2.5 is Apache 2.0 (see the GitHub repo for details).
You are responsible for any content you generate with this model. Follow local laws and the upstream license terms.


Citation

@misc{kimi_audio_2025,
  title        = {Kimi-Audio Technical Report},
  author       = {Kimi Team},
  year         = {2025},
  eprint       = {2504.18425},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL}
}

Cog implementation managed by zsxkib.

Star the repo on GitHub once it’s live. ⭐

Follow me on Twitter/X