zsxkib / kimi-vl-a3b-thinking

Kimi-VL-A3B-Thinking is a multimodal LLM that understands text and images and generates text with explicit step-by-step "thinking" traces

Kimi-VL: An Open Mixture-of-Experts Vision-Language Model 🦉

Kimi-VL is Moonshot AI’s vision-language model designed for multimodal reasoning, long-context understanding, and capable AI agents, while activating only 2.8 billion parameters at runtime.

There are two flavors of Kimi-VL:

| Variant | Best For | Recommended Temperature |
| --- | --- | --- |
| Kimi-VL-A3B-Instruct | General perception, OCR, videos, agents | 0.2 |
| Kimi-VL-A3B-Thinking | Complex reasoning, math problems, puzzles | 0.6 |
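
This Replicate deployment packages the Thinking variant. The snippet below is a minimal sketch of calling it with Replicate’s Python client; the input field names (prompt, image, temperature) are assumptions about this model’s input schema, so check the API tab on the model page for the exact names.

```python
# Minimal sketch: call this Replicate deployment from Python.
# The input keys (prompt, image, temperature) are assumed, not confirmed --
# consult the model's API schema on Replicate for the exact field names.
import replicate

output = replicate.run(
    "zsxkib/kimi-vl-a3b-thinking",
    input={
        "prompt": "Solve the equation in the image and explain your reasoning step by step.",
        "image": "https://example.com/handwritten-equation.png",  # placeholder URL
        "temperature": 0.6,  # recommended for the Thinking variant (see table above)
    },
)
print(output)
```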

Why Kimi-VL Is Useful

  • Handles Long Contexts: Processes up to 128K tokens at once—think entire academic papers or long-form videos.
  • Detailed Image Understanding: Its built-in MoonViT encoder keeps images at native resolution, so even tiny details remain clear.
  • Efficient: Activates just 2.8 billion parameters during use, making it GPU-friendly.
  • Strong Benchmark Results: Matches or beats larger models like GPT-4o-mini on complex tasks (e.g., 64.5% on LongVideoBench, 36.8% on MathVision).
  • Fully Open-Source: Released under the MIT license, so it is free to use, modify, and build on.

Key Features ✨

  • Multimodal Reasoning: Seamlessly integrates image, video, and text inputs.
  • Agent-Friendly Interface: Uses the familiar chat-style message format compatible with OpenAI’s chat API (see the sketch after this list).
  • Supports vLLM and LLaMA-Factory: Deploy with vLLM’s fast inference server or fine-tune on a single GPU with LLaMA-Factory.
  • Optimized Speed: Supports Flash-Attention 2 and native FP16/bfloat16 precision for faster, efficient runs.
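
As a reference for that chat-style interface, here is a minimal sketch of sending an image-plus-text request to an OpenAI-compatible endpoint, such as one served by vLLM. The endpoint URL, model name, and image_url content type are assumptions about a typical self-hosted deployment, not a description of this container’s own API.

```python
# Minimal sketch: send an image + text prompt to an OpenAI-compatible endpoint
# (for example, one exposed by a vLLM server). The base_url, model name, and
# image_url support are assumptions about your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Thinking",
    temperature=0.6,  # recommended for the Thinking variant
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
                {"type": "text", "text": "Explain what this diagram shows, step by step."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```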

How the Container Works 🛠️

  • Core Weights: Uses the Moonlight-16B-A3B language backbone and the MoonViT image encoder, loaded directly from Hugging Face.
  • Dependencies: Built with PyTorch 2.5, transformers 4.51, torchvision, flash-attn, and optionally vllm.
  • Processing Steps (see the sketch after this list):
      1. Loads images using Pillow.
      2. Formats inputs through AutoProcessor.
      3. Executes model inference directly or via vLLM.
      4. Decodes output tokens into human-readable text.
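
The sketch below mirrors those four steps with the public Hugging Face transformers API. It is an illustration based on the upstream moonshotai/Kimi-VL-A3B-Thinking usage pattern rather than the container’s exact code, so treat the file names and generation settings as placeholders.

```python
# Minimal sketch of the Pillow -> AutoProcessor -> generate -> decode pipeline,
# following the upstream Hugging Face usage pattern (not the container's exact code).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",      # bfloat16/FP16 on supported GPUs
    device_map="auto",
    trust_remote_code=True,  # Kimi-VL ships custom modeling code
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("demo.png")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "demo.png"},
        {"type": "text", "text": "What does this chart show? Think step by step."},
    ]}
]

# Build the chat prompt, then pack text + pixels into model inputs.
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

# Generate, strip the prompt tokens, and decode the response.
generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```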

Use Cases 💡

  • Summarizing or analyzing multi-page documents and large PDFs.
  • Solving complex math problems using visual data.
  • Building AI agents capable of reasoning across images, text, and video.

Limitations ⚠️

  • GPU Memory: Ideally requires at least 24GB VRAM. Use Replicate’s larger GPU instances for full 128K context tasks.
  • No Direct Video Input: Videos must be pre-processed into individual frames before they are passed to the model (see the sketch after this list).
  • Image Limit per Prompt: Currently supports up to eight images per prompt (can be adjusted).
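
If your source material is video, one simple approach is to sample a handful of evenly spaced frames and send those as images. The helper below is a hypothetical sketch using OpenCV; the frame count and file naming are arbitrary choices, and eight frames matches the default per-prompt image limit above.

```python
# Hypothetical helper: sample up to max_frames evenly spaced frames from a video
# with OpenCV so they can be sent to the model as ordinary images.
import cv2  # pip install opencv-python

def sample_frames(video_path: str, max_frames: int = 8, out_prefix: str = "frame"):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = [int(i * total / max_frames) for i in range(max_frames)]
    paths = []
    for n, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        path = f"{out_prefix}_{n:02d}.jpg"
        cv2.imwrite(path, frame)
        paths.append(path)
    cap.release()
    return paths

# Example: sample_frames("lecture.mp4") -> ["frame_00.jpg", ..., "frame_07.jpg"]
```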

License and Citation 📜

Everything is openly available under the MIT license.

@misc{kimiteam2025kimivl,
  title         = {Kimi-VL Technical Report},
  author        = {Kimi Team and Angang Du and others},
  year          = {2025},
  eprint        = {2504.07491},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2504.07491}
}

Acknowledgements

Big thanks to Moonshot AI for releasing Kimi-VL as open-source, and to the teams at vLLM and LLaMA-Factory for their rapid integration and support.


Star this project on GitHub: moonshotai/Kimi-VL

🐦 Follow me on Twitter: @zsakib_