gfodor / instructblip

Image captioning via vision-language models with instruction tuning

  • Public
  • 539.1K runs
  • GitHub
  • Paper
  • License

Run time and cost

This model costs approximately $0.084 to run on Replicate, or about 11 runs per $1, though the cost varies with your inputs. It is also open source, so you can run it on your own computer with Docker.

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 60 seconds.
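As a rough sketch, a prediction like the ones billed above can be started through the Replicate Python client. The input field names (`image`, `prompt`) and the version hash placeholder below are assumptions, not taken from this page; check the model's API tab for the actual input schema.

```python
# Hypothetical sketch of calling this model via the Replicate Python client.
# The input field names ("image", "prompt") are assumptions -- consult the
# model's API tab on Replicate for the real schema.

def build_input(image_url: str, prompt: str) -> dict:
    """Assemble the prediction input payload (field names assumed)."""
    return {"image": image_url, "prompt": prompt}

# import replicate  # pip install replicate; requires REPLICATE_API_TOKEN
# output = replicate.run(
#     "gfodor/instructblip:<version>",  # <version>: copy from the model page
#     input=build_input("https://example.com/photo.jpg", "Describe this image."),
# )
# print(output)
```

The network call is left commented out since it needs an API token and a concrete version hash; only the payload helper runs standalone.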

Readme

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Project page | Paper

InstructBLIP is an instruction-tuned image captioning model.

Comparison

From the project page:

“The response from InstructBLIP is more comprehensive than GPT-4, more visually-grounded than LLaVA, and more logical than MiniGPT-4. The responses of GPT-4 and LLaVA are obtained from their respective papers, while the official demo is used for MiniGPT-4.”