Qwen2-VL-2B-Instruct
Introduction
We’re excited to unveil Qwen2-VL, the latest iteration of our Qwen-VL model, representing nearly a year of innovation.
What’s New in Qwen2-VL?
Key Enhancements:
**State-of-the-art understanding of images at various resolutions and aspect ratios**: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
**Understanding videos of 20+ minutes**: Qwen2-VL can understand videos longer than 20 minutes, supporting high-quality video-based question answering, dialog, content creation, etc.
**An agent that can operate mobile phones, robots, and other devices**: with complex reasoning and decision-making abilities, Qwen2-VL can be integrated with devices such as mobile phones and robots for automatic operation based on the visual environment and text instructions.
**Multilingual support**: to serve global users, besides English and Chinese, Qwen2-VL now understands text in many other languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.
Model Architecture Updates:
**Naive Dynamic Resolution**: Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, offering a more human-like visual processing experience.
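
To make the idea of a dynamic token count concrete, here is a minimal sketch. It assumes that each visual token covers one 28x28 pixel block and that each image side is rounded to the nearest multiple of that block size. The block size and the helper name `estimate_visual_tokens` are illustrative assumptions that this section does not state, and the real preprocessing may also clamp the total pixel count.

```python
def estimate_visual_tokens(height: int, width: int, block: int = 28) -> int:
    """Estimate how many visual tokens an image of the given size would yield.

    Assumption (not stated in this README): one token per `block` x `block`
    pixel region, with each side rounded to the nearest multiple of `block`.
    """
    if height <= 0 or width <= 0 or block <= 0:
        raise ValueError("height, width and block must be positive")
    # Round each side to a whole number of blocks, keeping at least one block.
    rows = max(1, round(height / block))
    cols = max(1, round(width / block))
    return rows * cols


if __name__ == "__main__":
    for size in [(224, 224), (448, 672), (1080, 1920)]:
        print(size, estimate_visual_tokens(*size))
```

Under these assumptions the token count grows with the image area, which is the point of the mechanism: small images stay cheap and large images keep their detail.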
 
**Multimodal Rotary Position Embedding (M-ROPE)**: Decomposes positional embedding into parts to capture 1D textual, 2D visual, and 3D video positional information, enhancing its multimodal processing capabilities.
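
The decomposition can be illustrated with a small sketch that assigns each token a (temporal, height, width) position triple. This is only an illustration and not the model's implementation. It assumes that text tokens share one index across all three components, that image tokens hold the temporal index fixed while height and width vary, that video tokens advance the temporal index per frame, and that the next segment starts after the largest index used so far. The function name `mrope_position_ids` and the segment format are hypothetical.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int, int]  # (temporal, height, width)


def mrope_position_ids(segments: Sequence[tuple]) -> List[Position]:
    """Build illustrative 3-part position ids for a mixed token sequence.

    Each segment is one of (hypothetical format):
      ("text", n_tokens)
      ("image", grid_h, grid_w)
      ("video", n_frames, grid_h, grid_w)
    """
    positions: List[Position] = []
    next_pos = 0  # first free index for the next segment
    for seg in segments:
        kind = seg[0]
        if kind == "text":
            n = seg[1]
            # Text is 1D: all three components carry the same index.
            for i in range(n):
                p = next_pos + i
                positions.append((p, p, p))
            next_pos += n
            continue
        if kind == "image":
            t, h, w = 1, seg[1], seg[2]
        elif kind == "video":
            t, h, w = seg[1], seg[2], seg[3]
        else:
            raise ValueError(f"unknown segment kind: {kind!r}")
        # Visual tokens: each component indexes its own axis.
        for ti in range(t):
            for hi in range(h):
                for wi in range(w):
                    positions.append((next_pos + ti, next_pos + hi, next_pos + wi))
        # Assumption: continue after the largest index used by this segment.
        next_pos += max(t, h, w)
    return positions


if __name__ == "__main__":
    for pos in mrope_position_ids([("text", 2), ("image", 2, 2), ("text", 1)]):
        print(pos)
```

In this sketch a pure-text sequence reduces to ordinary 1D positions, while images and videos use the extra components to encode their spatial and temporal layout.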
 
We have three models with 2, 7 and 72 billion parameters. This repo contains the instruction-tuned 7B Qwen2-VL model.

# Image Benchmarks
| Benchmark | InternVL2-8B | MiniCPM-V 2.6 | GPT-4o-mini | Qwen2-VL-7B |
|---|---|---|---|---|
| MMMU (val) | 51.8 | 49.8 | 60 | 54.1 |
| DocVQA (test) | 91.6 | 90.8 | - | 94.5 |
| InfoVQA (test) | 74.8 | - | - | 76.5 |
| ChartQA (test) | 83.3 | - | - | 83.0 |
| TextVQA (val) | 77.4 | 80.1 | - | 84.3 |
| OCRBench | 794 | 852 | 785 | 845 |
| MTVQA | - | - | - | 26.3 |
| RealWorldQA | 64.4 | - | - | 70.1 |
| MME (sum) | 2210.3 | 2348.4 | 2003.4 | 2326.8 |
| MMBench-EN (test) | 81.7 | - | - | 83.0 |
| MMBench-CN (test) | 81.2 | - | - | 80.5 |
| MMBench-V1.1 (test) | 79.4 | 78.0 | 76.0 | 80.7 |
| MMT-Bench (test) | - | - | - | 63.7 |
| MMStar | 61.5 | 57.5 | 54.8 | 60.7 |
| MMVet (GPT-4-Turbo) | 54.2 | 60.0 | 66.9 | 62.0 |
| HallBench (avg) | 45.2 | 48.1 | 46.1 | 50.6 |
| MathVista (testmini) | 58.3 | 60.6 | 52.4 | 58.2 |
| MathVision | - | - | - | 16.3 |

# Video Benchmarks
| Benchmark | InternVL2-8B | LLaVA-OneVision-7B | MiniCPM-V 2.6 | Qwen2-VL-7B |
|---|---|---|---|---|
| MVBench | 66.4 | 56.7 | - | 67.0 |
| PerceptionTest (test) | - | 57.1 | - | 62.3 |
| EgoSchema (test) | - | 60.1 | - | 66.7 |
| Video-MME (wo/w subs) | 54.0/56.9 | 58.2/- | 60.9/63.6 | 63.3/69.0 |

@article{Qwen2-VL,
  title={Qwen2-VL},
  author={Qwen team},
  year={2024}
}
@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}
