cjwbw / uform-gen2-qwen-500m

Pocket-Sized Multimodal AI For Content Understanding and Generation

Run time and cost

Each run of this model costs approximately $0.0059 on Replicate (about 169 runs per $1), though the cost varies with your inputs. The model is also open source, so you can run it on your own computer with Docker.

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 9 seconds.
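
The per-run price and the runs-per-dollar figure quoted above are two views of the same number. As a rough sketch for budgeting, the arithmetic can be written out as follows; the function names are illustrative and not part of any Replicate API, and the price is only the approximate figure from this page.

```python
# Approximate per-run price quoted on this page; real cost varies with inputs.
COST_PER_RUN_USD = 0.0059


def runs_per_dollar(cost_per_run: float) -> int:
    """Whole number of runs that fit in a one-dollar budget."""
    return int(1 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Approximate total cost in USD for num_runs predictions."""
    return num_runs * cost_per_run


print(runs_per_dollar(COST_PER_RUN_USD))   # whole runs per $1
print(round(estimated_cost(1000), 2))      # rough cost of 1000 runs, in USD
```

With the quoted price, one dollar covers 169 whole runs, matching the figure above.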

Readme

Description

UForm-Gen is a small generative vision-language model primarily designed for Image Captioning and Visual Question Answering. The model consists of two parts:

  1. A CLIP-like ViT-H/14 vision encoder
  2. The Qwen1.5-0.5B-Chat language model

Evaluation

Model                 LLM Size  SQA   MME     MMBench  Average¹
UForm-Gen2-Qwen-500m  0.5B      45.5  880.1   42.0     29.31
MobileVLM v2          1.4B      52.1  1302.8  57.7     36.81
LLaVA-Phi             2.7B      68.4  1335.1  59.8     42.95

¹MME scores were divided by 2000 before averaging.
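
The Average column can be checked against the footnote. The sketch below assumes the average is the unweighted mean of SQA, MME divided by 2000, and MMBench; the page does not spell out the formula, but under that assumption the computed values agree with the reported ones to within 0.01.

```python
# Scores copied from the evaluation table: (SQA, MME, MMBench, reported average).
SCORES = {
    "UForm-Gen2-Qwen-500m": (45.5, 880.1, 42.0, 29.31),
    "MobileVLM v2": (52.1, 1302.8, 57.7, 36.81),
    "LLaVA-Phi": (68.4, 1335.1, 59.8, 42.95),
}

MME_DIVISOR = 2000  # from the footnote: MME is divided by 2000 before averaging


def table_average(sqa: float, mme: float, mmbench: float) -> float:
    """Assumed formula: unweighted mean of SQA, scaled MME, and MMBench."""
    return (sqa + mme / MME_DIVISOR + mmbench) / 3


for model, (sqa, mme, mmbench, reported) in SCORES.items():
    computed = table_average(sqa, mme, mmbench)
    print(f"{model}: computed {computed:.4f}, reported {reported}")
```

Because the scaled MME term is below 1, it contributes little to the average, which is driven mostly by SQA and MMBench.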