bytedance / sa2va-4b-image

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos


Run time and cost

This model runs on Nvidia L40S GPU hardware. We don't yet have enough runs of this model to provide performance information.
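To call the hosted model programmatically, the official Replicate Python client can be used. The sketch below is illustrative only: the input field names (`image`, `instruction`) are assumptions rather than this model's documented schema, so check the model's API tab on Replicate for the actual parameters.

```python
# Minimal sketch of calling this model through the Replicate Python client.
# The input keys below ("image", "instruction") are assumed, not confirmed;
# consult the model's API schema on Replicate for the real parameter names.
import replicate

output = replicate.run(
    "bytedance/sa2va-4b-image",
    input={
        "image": open("example.jpg", "rb"),        # local image file
        "instruction": "Please describe the image.",
    },
)
print(output)
```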

Readme

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[📂 GitHub] [📜 Sa2VA paper] [🚀 Quick Start]

Introduction

Sa2VA is a multimodal large language model (MLLM) capable of question answering, visual prompt understanding, and dense object segmentation on both images and videos. On question-answering benchmarks it performs comparably to state-of-the-art MLLMs such as Qwen2-VL and InternVL2.5, while also providing the visual prompt understanding and dense object segmentation capabilities those models lack. Sa2VA achieves state-of-the-art results on both image and video grounding and segmentation benchmarks.
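As a rough illustration of local inference, the sketch below loads the Sa2VA-4B checkpoint from Hugging Face with `trust_remote_code` and runs a single image query. The `predict_forward` call and its argument names follow the project's quick-start; treat the exact interface as an assumption and defer to the linked Quick Start for the authoritative version.

```python
# Illustrative local-inference sketch; the predict_forward interface and its
# argument names follow the project's quick-start and may differ across
# releases -- defer to the GitHub Quick Start for the exact API.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "ByteDance/Sa2VA-4B"  # swap for the 1B/8B/26B variants as needed
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

image = Image.open("example.jpg").convert("RGB")
result = model.predict_forward(
    image=image,
    text="<image>Please describe the image.",
    tokenizer=tokenizer,
)
print(result["prediction"])  # text answer
# For segmentation-style prompts (e.g. "Please segment the dog."),
# the returned dict also carries the predicted masks.
```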

Sa2VA Family

We built the Sa2VA series on Qwen2-VL and InternVL2/2.5. The table below lists the Sa2VA models built on InternVL2.5; other Sa2VA models will be open-sourced soon.

| Model Name | Base MLLM | Language Part | HF Link |
| --- | --- | --- | --- |
| Sa2VA-1B | InternVL2.5-1B | Qwen2.5-0.5B-Instruct | 🤗 link |
| Sa2VA-4B | InternVL2.5-4B | Qwen2.5-3B-Instruct | 🤗 link |
| Sa2VA-8B | InternVL2.5-8B | internlm2_5-7b-chat | 🤗 link |
| Sa2VA-26B | InternVL2.5-26B | internlm2_5-20b-chat | 🤗 link |

Sa2VA Performance

| Model Name | MME | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeViS (val_u) | DAVIS |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Sa2VA-1B | 1504/434 | 71.9 | 79.6 | 73.6 | 77.7 | 53.4 | 69.5 |
| Sa2VA-4B | 1691/610 | 81.8 | 82.4 | 77.6 | 79.7 | 55.9 | 73.7 |
| Sa2VA-8B | 1690/610 | 84.4 | 82.6 | 78.0 | 80.3 | 58.9 | 75.9 |
| Sa2VA-26B | 1698/653 | 85.8 | 82.9 | 79.3 | 81.2 | 61.8 | 78.6 |

Citation

If you find this project useful in your research, please consider citing:

@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2025}
}