zsxkib / kimi-vl-a3b-thinking

Kimi-VL-A3B-Thinking is a multimodal LLM that understands text and images and generates text with explicit step-by-step "thinking" traces

Kimi-VL: An Open Mixture-of-Experts Vision-Language Model 🦉

Kimi-VL is Moonshot AI’s vision-language model designed for multimodal reasoning, long-context understanding, and capable AI agents, while activating only 2.8 billion parameters at runtime.

There are two flavors of Kimi-VL:

| Variant | Best For | Recommended Temperature |
| --- | --- | --- |
| Kimi-VL-A3B-Instruct | General perception, OCR, videos, agents | 0.2 |
| Kimi-VL-A3B-Thinking | Complex reasoning, math problems, puzzles | 0.6 |
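
This Replicate deployment packages the Thinking variant. The snippet below is a minimal sketch of calling it with Replicate’s Python client; the input field names (prompt, image, temperature) are assumptions about this model’s input schema, so check the API tab on the model page for the exact names.

```python
# Minimal sketch: call this Replicate deployment from Python.
# The input keys (prompt, image, temperature) are assumed, not confirmed --
# consult the model's API schema on Replicate for the exact field names.
import replicate

output = replicate.run(
    "zsxkib/kimi-vl-a3b-thinking",
    input={
        "prompt": "Solve the equation in the image and explain your reasoning step by step.",
        "image": "https://example.com/handwritten-equation.png",  # placeholder URL
        "temperature": 0.6,  # recommended for the Thinking variant (see table above)
    },
)
print(output)
```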

Why Kimi-VL Is Useful

  • Handles Long Contexts: Processes up to 128K tokens at once—think entire academic papers or long-form videos.
  • Detailed Image Understanding: Its built-in MoonViT encoder keeps images at native resolution, so even tiny details remain clear.
  • Efficient: Activates just 2.8 billion parameters during use, making it GPU-friendly.
  • Strong Benchmark Results: Matches or beats larger models like GPT-4o-mini on complex tasks (e.g., 64.5% on LongVideoBench, 36.8% on MathVision).
  • Fully Open-Source: Released under the MIT license, so it is free to use, modify, and build on.

Key Features ✨

  • Multimodal Reasoning: Seamlessly integrates image, video, and text inputs.
  • Agent-Friendly Interface: Uses the familiar chat-style message format compatible with OpenAI’s chat API (see the sketch after this list).
  • Supports vLLM and LLaMA-Factory: Deploy with vLLM’s fast inference server or fine-tune on a single GPU with LLaMA-Factory.
  • Optimized Speed: Supports Flash-Attention 2 and native FP16/bfloat16 precision for faster, efficient runs.
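
As a reference for that chat-style interface, here is a minimal sketch of sending an image-plus-text request to an OpenAI-compatible endpoint, such as one served by vLLM. The endpoint URL, model name, and image_url content type are assumptions about a typical self-hosted deployment, not a description of this container’s own API.

```python
# Minimal sketch: send an image + text prompt to an OpenAI-compatible endpoint
# (for example, one exposed by a vLLM server). The base_url, model name, and
# image_url support are assumptions about your deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="moonshotai/Kimi-VL-A3B-Thinking",
    temperature=0.6,  # recommended for the Thinking variant
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/diagram.png"}},
                {"type": "text", "text": "Explain what this diagram shows, step by step."},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```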

How the Container Works 🛠️

  • Core Weights: Uses the Moonlight-16B-A3B language backbone and the MoonViT image encoder, loaded directly from Hugging Face.
  • Dependencies: Built with PyTorch 2.5, transformers 4.51, torchvision, flash-attn, and optionally vllm.
  • Processing Steps (see the sketch after this list):
      1. Loads images using Pillow.
      2. Formats inputs through AutoProcessor.
      3. Executes model inference directly or via vLLM.
      4. Decodes output tokens into human-readable text.
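
The sketch below mirrors those four steps with the public Hugging Face transformers API. It is an illustration based on the upstream moonshotai/Kimi-VL-A3B-Thinking usage pattern rather than the container’s exact code, so treat the file names and generation settings as placeholders.

```python
# Minimal sketch of the Pillow -> AutoProcessor -> generate -> decode pipeline,
# following the upstream Hugging Face usage pattern (not the container's exact code).
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_path = "moonshotai/Kimi-VL-A3B-Thinking"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",      # bfloat16/FP16 on supported GPUs
    device_map="auto",
    trust_remote_code=True,  # Kimi-VL ships custom modeling code
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

image = Image.open("demo.png")  # placeholder image path
messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "demo.png"},
        {"type": "text", "text": "What does this chart show? Think step by step."},
    ]}
]

# Build the chat prompt, then pack text + pixels into model inputs.
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=text, return_tensors="pt").to(model.device)

# Generate, strip the prompt tokens, and decode the response.
generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```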

Use Cases 💡

  • Summarizing or analyzing multi-page documents and large PDFs.
  • Solving complex math problems using visual data.
  • Building AI agents capable of reasoning across images, text, and video.

Limitations ⚠️

  • GPU Memory: Ideally requires at least 24GB VRAM. Use Replicate’s larger GPU instances for full 128K context tasks.
  • No Direct Video Input: Videos must be pre-processed into individual frames before they are passed to the model (see the sketch after this list).
  • Image Limit per Prompt: Currently supports up to eight images per prompt (can be adjusted).
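
If your source material is video, one simple approach is to sample a handful of evenly spaced frames and send those as images. The helper below is a hypothetical sketch using OpenCV; the frame count and file naming are arbitrary choices, and eight frames matches the default per-prompt image limit above.

```python
# Hypothetical helper: sample up to max_frames evenly spaced frames from a video
# with OpenCV so they can be sent to the model as ordinary images.
import cv2  # pip install opencv-python

def sample_frames(video_path: str, max_frames: int = 8, out_prefix: str = "frame"):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced frame indices across the whole clip.
    indices = [int(i * total / max_frames) for i in range(max_frames)]
    paths = []
    for n, idx in enumerate(indices):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:
            break
        path = f"{out_prefix}_{n:02d}.jpg"
        cv2.imwrite(path, frame)
        paths.append(path)
    cap.release()
    return paths

# Example: sample_frames("lecture.mp4") -> ["frame_00.jpg", ..., "frame_07.jpg"]
```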

License and Citation 📜

Everything is openly available under the MIT license.

@misc{kimiteam2025kimivl,
  title         = {Kimi-VL Technical Report},
  author        = {Kimi Team and Angang Du and others},
  year          = {2025},
  eprint        = {2504.07491},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2504.07491}
}

Acknowledgements

Big thanks to Moonshot AI for releasing Kimi-VL as open-source, and to the teams at vLLM and LLaMA-Factory for their rapid integration and support.


Star this project on GitHub: moonshotai/Kimi-VL

🐦 Follow me on Twitter: @zsakib_