bytedance / sa2va-26b-image

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

  • Public
  • 284 runs
  • GitHub
  • Weights
  • Paper
  • License

Run time and cost

This model costs approximately $0.0031 to run on Replicate, or 322 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 3 seconds, though prediction time varies significantly with the inputs.
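
For programmatic use, the model can be called through Replicate's Python client. The sketch below is only illustrative: the input field names (`image`, `instruction`) are assumptions, so check the model's API tab on Replicate for the actual schema.

```python
# Minimal sketch of calling this model via the Replicate Python client.
# Requires `pip install replicate` and the REPLICATE_API_TOKEN environment variable.
# NOTE: the input field names ("image", "instruction") are assumptions —
# consult the model's API schema on Replicate for the real parameters.
import replicate

output = replicate.run(
    "bytedance/sa2va-26b-image",
    input={
        "image": open("example.jpg", "rb"),          # local image file
        "instruction": "Please describe the image.",  # text prompt
    },
)
print(output)
```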

Readme

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

[📂 GitHub] [📜 Sa2VA paper] [🚀 Quick Start]

Introduction

Sa2VA is an MLLM capable of question answering, visual prompt understanding, and dense object segmentation at both image and video levels. On question-answering benchmarks it performs comparably to SOTA MLLMs such as Qwen2-VL and InternVL2.5, while also providing the visual prompt understanding and dense object segmentation capabilities that those models lack. Sa2VA achieves SOTA performance on both image and video grounding and segmentation benchmarks.
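
The released checkpoints ship custom modeling code on Hugging Face, so they are loaded with `trust_remote_code=True`. The sketch below follows the quick-start pattern from the model card; the `predict_forward` interface, its argument names, and the output keys come from that custom code and are assumptions here that may differ between releases.

```python
# Minimal sketch of image question answering with a Sa2VA checkpoint from Hugging Face.
# The predict_forward() call, its arguments, and the output keys are taken from the
# repo's custom code (trust_remote_code) and are assumptions — see the Quick Start
# for the authoritative usage.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

path = "ByteDance/Sa2VA-26B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

image = Image.open("example.jpg").convert("RGB")
result = model.predict_forward(
    image=image,
    text="<image>Please describe the image.",
    tokenizer=tokenizer,
)
print(result["prediction"])  # text answer
# For referring-segmentation prompts, the same call is expected to also return
# dense masks (e.g. result["prediction_masks"]).
```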

Sa2VA Family

We built the Sa2VA series on Qwen2-VL and InternVL2/2.5. The table below lists the Sa2VA models built on InternVL2.5; other Sa2VA models will be open-sourced soon.

| Model Name | Base MLLM      | Language Part          | HF Link |
|------------|----------------|------------------------|---------|
| Sa2VA-1B   | InternVL2.5-1B | Qwen2.5-0.5B-Instruct  | 🤗 link |
| Sa2VA-4B   | InternVL2.5-4B | Qwen2.5-3B-Instruct    | 🤗 link |
| Sa2VA-8B   | InternVL2.5-8B | internlm2_5-7b-chat    | 🤗 link |
| Sa2VA-26B  | InternVL2.5-26B| internlm2_5-20b-chat   | 🤗 link |

Sa2VA Performance

| Model Name | MME      | MMBench | RefCOCO | RefCOCO+ | RefCOCOg | MeVIS (val_u) | DAVIS |
|------------|----------|---------|---------|----------|----------|---------------|-------|
| Sa2VA-1B   | 1504/434 | 71.9    | 79.6    | 73.6     | 77.7     | 53.4          | 69.5  |
| Sa2VA-4B   | 1691/610 | 81.8    | 82.4    | 77.6     | 79.7     | 55.9          | 73.7  |
| Sa2VA-8B   | 1690/610 | 84.4    | 82.6    | 78.0     | 80.3     | 58.9          | 75.9  |
| Sa2VA-26B  | 1698/653 | 85.8    | 82.9    | 79.3     | 81.2     | 61.8          | 78.6  |

Citation

If you find this project useful in your research, please consider citing:

@article{sa2va,
  title={Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos},
  author={Yuan, Haobo and Li, Xiangtai and Zhang, Tao and Huang, Zilong and Xu, Shilin and Ji, Shunping and Tong, Yunhai and Qi, Lu and Feng, Jiashi and Yang, Ming-Hsuan},
  journal={arXiv preprint},
  year={2025}
}