moonshotai/kimi-k2-instruct

Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities


Kimi K2: Open Agentic Intelligence


📰 Tech Blog | 📄 Paper

1. Model Introduction

Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 billion activated parameters and 1 trillion total parameters. Trained with the Muon optimizer, Kimi K2 achieves exceptional performance across frontier knowledge, reasoning, and coding tasks while being meticulously optimized for agentic capabilities.

Key Features

  • Large-Scale Training: Pre-trained a 1T parameter MoE model on 15.5T tokens with zero training instability.
  • MuonClip Optimizer: We apply the Muon optimizer at an unprecedented scale and develop novel optimization techniques to resolve instabilities while scaling up.
  • Agentic Intelligence: Specifically designed for tool use, reasoning, and autonomous problem-solving.

Model Variants

  • Kimi-K2-Base: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
  • Kimi-K2-Instruct: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.

[Figure: Evaluation Results]

2. Model Summary

|                                          |                          |
|------------------------------------------|--------------------------|
| Architecture                             | Mixture-of-Experts (MoE) |
| Total Parameters                         | 1T                       |
| Activated Parameters                     | 32B                      |
| Number of Layers (Dense layer included)  | 61                       |
| Number of Dense Layers                   | 1                        |
| Attention Hidden Dimension               | 7168                     |
| MoE Hidden Dimension (per Expert)        | 2048                     |
| Number of Attention Heads                | 64                       |
| Number of Experts                        | 384                      |
| Selected Experts per Token               | 8                        |
| Number of Shared Experts                 | 1                        |
| Vocabulary Size                          | 160K                     |
| Context Length                           | 128K                     |
| Attention Mechanism                      | MLA                      |
| Activation Function                      | SwiGLU                   |
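
For readers who want a concrete picture of the routing configuration above (384 routed experts, 8 selected per token, plus 1 always-on shared expert), the snippet below is a minimal, illustrative PyTorch sketch of top-k expert routing. It is not Kimi K2's actual implementation: the class names, the softmax-over-selected-experts gating, and the toy dimensions are assumptions made for clarity (the real hidden size is 7168 and the per-expert MoE dimension is 2048).

```python
# Minimal, illustrative sketch of top-k MoE routing (NOT Kimi K2's real code).
# Real config per the table above: hidden 7168, expert dim 2048, 384 experts,
# 8 routed experts per token, 1 shared expert. Toy sizes are used here so the
# example runs quickly on CPU.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN, EXPERT_DIM, N_EXPERTS, TOP_K = 64, 32, 16, 8  # toy stand-ins

class SwiGLUExpert(nn.Module):
    """One expert FFN using the SwiGLU activation listed in the model summary."""
    def __init__(self):
        super().__init__()
        self.w_gate = nn.Linear(HIDDEN, EXPERT_DIM, bias=False)
        self.w_up   = nn.Linear(HIDDEN, EXPERT_DIM, bias=False)
        self.w_down = nn.Linear(EXPERT_DIM, HIDDEN, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

class TopKMoELayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.router = nn.Linear(HIDDEN, N_EXPERTS, bias=False)
        self.experts = nn.ModuleList(SwiGLUExpert() for _ in range(N_EXPERTS))
        self.shared_expert = SwiGLUExpert()          # processes every token

    def forward(self, x):                            # x: [num_tokens, HIDDEN]
        scores = self.router(x)                      # [num_tokens, N_EXPERTS]
        topk_scores, topk_idx = scores.topk(TOP_K, dim=-1)
        gates = topk_scores.softmax(dim=-1)          # weight the selected experts
        out = self.shared_expert(x)                  # shared expert is always on
        for slot in range(TOP_K):
            idx = topk_idx[:, slot]                  # expert picked in this slot
            for e in idx.unique().tolist():
                mask = idx == e                      # tokens routed to expert e
                out[mask] = out[mask] + gates[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

tokens = torch.randn(4, HIDDEN)                      # 4 toy token embeddings
print(TopKMoELayer()(tokens).shape)                  # torch.Size([4, 64])
```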

3. Evaluation Results

Instruction model evaluation results

| Benchmark | Metric | Kimi K2 Instruct | DeepSeek-V3-0324 | Qwen3-235B-A22B (non-thinking) | Claude Sonnet 4 (w/o extended thinking) | Claude Opus 4 (w/o extended thinking) | GPT-4.1 | Gemini 2.5 Flash Preview (05-20) |
|---|---|---|---|---|---|---|---|---|
| **Coding Tasks** | | | | | | | | |
| LiveCodeBench v6 (Aug 24 - May 25) | Pass@1 | 53.7 | 46.9 | 37.0 | 48.5 | 47.4 | 44.7 | 44.7 |
| OJBench | Pass@1 | 27.1 | 24.0 | 11.3 | 15.3 | 19.6 | 19.5 | 19.5 |
| MultiPL-E | Pass@1 | <ins>85.7</ins> | 83.1 | 78.2 | 88.6 | 89.6 | 86.7 | 85.6 |
| SWE-bench Verified (Agentless Coding) | Single Patch w/o Test (Acc) | <ins>51.8</ins> | 36.6 | 39.4 | 50.2 | 53.0 | 40.8 | 32.6 |
| SWE-bench Verified (Agentic Coding) | Single Attempt (Acc) | <ins>65.8</ins> | 38.8 | 34.4 | 72.7<sup>*</sup> | 72.5<sup>*</sup> | 54.6 | — |
| | Multiple Attempts (Acc) | <ins>71.6</ins> | — | — | 80.2 | 79.4<sup>*</sup> | — | — |
| SWE-bench Multilingual (Agentic Coding) | Single Attempt (Acc) | <ins>47.3</ins> | 25.8 | 20.9 | 51.0 | — | 31.5 | — |
| TerminalBench | Inhouse Framework (Acc) | <ins>30.0</ins> | — | — | 35.5 | 43.2 | 8.3 | — |
| | Terminus (Acc) | <ins>25.0</ins> | 16.3 | 6.6 | 30.3 | 16.8 | — | — |
| Aider-Polyglot | Acc | 60.0 | 55.1 | <ins>61.8</ins> | 56.4 | 70.7 | 52.4 | 44.0 |
| **Tool Use Tasks** | | | | | | | | |
| Tau2 retail | Avg@4 | <ins>70.6</ins> | 69.1 | 57.0 | 75.0 | 81.8 | 74.8 | 64.3 |
| Tau2 airline | Avg@4 | <ins>56.5</ins> | 39.0 | 26.5 | 55.5 | 60.0 | 54.5 | 42.5 |
| Tau2 telecom | Avg@4 | 65.8 | 32.5 | 22.1 | 45.2 | 57.0 | 38.6 | 16.9 |
| AceBench | Acc | <ins>76.5</ins> | 72.7 | 70.5 | 76.2 | 75.6 | 80.1 | 74.5 |
| **Math & STEM Tasks** | | | | | | | | |
| AIME 2024 | Avg@64 | 69.6 | 59.4<sup>*</sup> | 40.1<sup>*</sup> | 43.4 | 48.2 | 46.5 | 61.3 |
| AIME 2025 | Avg@64 | 49.5 | 46.7 | 24.7<sup>*</sup> | 33.1<sup>*</sup> | 33.9<sup>*</sup> | 37.0 | 46.6 |
| MATH-500 | Acc | 97.4 | 94.0<sup>*</sup> | 91.2<sup>*</sup> | 94.0 | 94.4 | 92.4 | 95.4 |
| HMMT 2025 | Avg@32 | 38.8 | 27.5 | 11.9 | 15.9 | 15.9 | 19.4 | 34.7 |
| CNMO 2024 | Avg@16 | 74.3 | <ins>74.7</ins> | 48.6 | 60.4 | 57.6 | 56.6 | 75.0 |
| PolyMath-en | Avg@4 | 65.1 | 59.5 | 51.9 | 52.8 | 49.8 | 54.0 | 49.9 |
| ZebraLogic | Acc | 89.0 | 84.0 | 37.7<sup>*</sup> | 73.7 | 59.3 | 58.5 | 57.9 |
| AutoLogi | Acc | <ins>89.5</ins> | 88.9 | 83.3 | 89.8 | 86.1 | 88.2 | 84.1 |
| GPQA-Diamond | Avg@8 | 75.1 | 68.4<sup>*</sup> | 62.9<sup>*</sup> | 70.0<sup>*</sup> | 74.9<sup>*</sup> | 66.3 | 68.2 |
| SuperGPQA | Acc | 57.2 | 53.7 | 50.2 | 55.7 | 56.5 | 50.8 | 49.6 |
| Humanity's Last Exam (Text Only) | - | 4.7 | 5.2 | <ins>5.7</ins> | 5.8 | 7.1 | 3.7 | 5.6 |
| **General Tasks** | | | | | | | | |
| MMLU | EM | <ins>89.5</ins> | 89.4 | 87.0 | 91.5 | 92.9 | 90.4 | 90.1 |
| MMLU-Redux | EM | <ins>92.7</ins> | 90.5 | 89.2 | 93.6 | 94.2 | 92.4 | 90.6 |
| MMLU-Pro | EM | 81.1 | <ins>81.2</ins><sup>*</sup> | 77.3 | 83.7 | 86.6 | 81.8 | 79.4 |
| IFEval | Prompt Strict | 89.8 | 81.1 | 83.2<sup>*</sup> | 87.6 | 87.4 | 88.0 | 84.3 |
| Multi-Challenge | Acc | 54.1 | 31.4 | 34.0 | 46.8 | 49.0 | 36.4 | 39.5 |
| SimpleQA | Correct | <ins>31.0</ins> | 27.7 | 13.2 | 15.9 | 22.8 | 42.3 | 23.3 |
| Livebench | Pass@1 | 76.4 | 72.4 | 67.6 | 74.8 | 74.6 | 69.8 | 67.8 |

• Bold denotes global SOTA, and underlined denotes open-source SOTA.
• Data points marked with * are taken directly from the model's tech report or blog.
• All metrics, except for SWE-bench Verified (Agentless), are evaluated with an 8k output token length. SWE-bench Verified (Agentless) is limited to a 16k output token length.
• Kimi K2 achieves 65.8% pass@1 on the SWE-bench Verified tests with bash/editor tools (single-attempt patches, no test-time compute). It also achieves 47.3% pass@1 on the SWE-bench Multilingual tests under the same conditions. Additionally, we report results on SWE-bench Verified tests (71.6%) that leverage parallel test-time compute by sampling multiple sequences and selecting the single best via an internal scoring model.
• To ensure the stability of the evaluation, we employed avg@k on AIME, HMMT, CNMO, PolyMath-en, GPQA-Diamond, EvalPlus, and Tau2.
• Some data points have been omitted due to prohibitively expensive evaluation costs.
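
As a quick reference, avg@k in the notes above is the score averaged over k independently sampled runs of the same benchmark (e.g., Avg@64 on AIME). The helper below is a hypothetical sketch of that computation, not part of any official evaluation harness.

```python
# Hypothetical sketch of an avg@k metric: run a benchmark k times with sampling
# and average the per-run scores (e.g., Avg@64 on AIME 2024/2025).
import random
from statistics import mean
from typing import Callable

def avg_at_k(score_one_run: Callable[[], float], k: int) -> float:
    """Average score over k independent sampled runs."""
    return mean(score_one_run() for _ in range(k))

# Toy usage: a mock scorer that "solves" a problem 70% of the time.
toy_scorer = lambda: float(random.random() < 0.7)
print(round(avg_at_k(toy_scorer, k=64), 3))   # ≈ 0.7 in expectation
```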


Base model evaluation results

| Benchmark | Metric | Shot | Kimi K2 Base | Deepseek-V3-Base | Qwen2.5-72B | Llama 4 Maverick |
|---|---|---|---|---|---|---|
| **General Tasks** | | | | | | |
| MMLU | EM | 5-shot | 87.8 | 87.1 | 86.1 | 84.9 |
| MMLU-pro | EM | 5-shot | 69.2 | 60.6 | 62.8 | 63.5 |
| MMLU-redux-2.0 | EM | 5-shot | 90.2 | 89.5 | 87.8 | 88.2 |
| SimpleQA | Correct | 5-shot | 35.3 | 26.5 | 10.3 | 23.7 |
| TriviaQA | EM | 5-shot | 85.1 | 84.1 | 76.0 | 79.3 |
| GPQA-Diamond | Avg@8 | 5-shot | 48.1 | 50.5 | 40.8 | 49.4 |
| SuperGPQA | EM | 5-shot | 44.7 | 39.2 | 34.2 | 38.8 |
| **Coding Tasks** | | | | | | |
| LiveCodeBench v6 | Pass@1 | 1-shot | 26.3 | 22.9 | 21.1 | 25.1 |
| EvalPlus | Pass@1 | - | 80.3 | 65.6 | 66.0 | 65.5 |
| **Mathematics Tasks** | | | | | | |
| MATH | EM | 4-shot | 70.2 | 60.1 | 61.0 | 63.0 |
| GSM8k | EM | 8-shot | 92.1 | 91.7 | 90.4 | 86.3 |
| **Chinese Tasks** | | | | | | |
| C-Eval | EM | 5-shot | 92.5 | 90.0 | 90.9 | 80.9 |
| CSimpleQA | Correct | 5-shot | 77.6 | 72.1 | 50.5 | 53.5 |
• We only evaluate open-source pretrained models in this work. We report results for Qwen2.5-72B because the base checkpoint of Qwen3-235B-A22B was not open-sourced at the time of our study.
• All models are evaluated using the same evaluation protocol.

5. Model Usage

Chat Completion

Once the local inference service is up, you can interact with it through the chat endpoint; an illustrative request sketch is given at the end of this document.

Tool Calling

Kimi-K2-Instruct has strong tool-calling capabilities. To enable them, pass the list of available tools with each request; the model will then autonomously decide when and how to invoke them. The same sketch at the end of this document shows a tool-calling round trip.

6. License

Both the code repository and the model weights are released under the [Modified MIT License](LICENSE).

7. Third Party Notices

See [THIRD PARTY NOTICES](THIRD_PARTY_NOTICES.md).

8. Contact Us

If you have any questions, please reach out at [support@moonshot.cn](mailto:support@moonshot.cn).
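
The sketch below ties together the Chat Completion and Tool Calling sections above using an OpenAI-compatible client. It is an illustrative example only, not official Moonshot AI documentation: the base URL, API key, served model name, and the get_weather tool are placeholder assumptions, and the exact request format depends on the inference engine you deploy.

```python
# Illustrative sketch only. Assumes an OpenAI-compatible server (e.g. vLLM or
# SGLang) is already serving Kimi-K2-Instruct at the base_url below; the URL,
# API key, model name, and the get_weather tool are made-up placeholders.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
MODEL = "moonshotai/Kimi-K2-Instruct"  # use whatever name your server registers

# --- Chat completion ---------------------------------------------------------
resp = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Summarize mixture-of-experts in one sentence."}],
    temperature=0.6,
)
print(resp.choices[0].message.content)

# --- Tool calling ------------------------------------------------------------
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # placeholder tool, not part of the model card
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Beijing today?"}]
resp = client.chat.completions.create(
    model=MODEL, messages=messages, tools=tools, tool_choice="auto",
)
msg = resp.choices[0].message

if msg.tool_calls:                                   # the model chose to call a tool
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    tool_result = {"city": args["city"], "weather": "sunny"}   # fake tool output
    messages.append(msg)                             # assistant turn containing the tool call
    messages.append({"role": "tool", "tool_call_id": call.id,
                     "content": json.dumps(tool_result)})
    final = client.chat.completions.create(model=MODEL, messages=messages, tools=tools)
    print(final.choices[0].message.content)          # answer grounded in the tool result
```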