andreasjansson / blip-2

Answers questions about images

  • Public
  • 28.3M runs

Run time and cost

This model costs approximately $0.0058 to run on Replicate, or 172 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
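The relationship between the two figures above is simple arithmetic, sketched below for budgeting purposes. The per-run price is the approximate value quoted above and varies with inputs; the function names are hypothetical helpers, not part of any Replicate API.

```python
import math

# Approximate per-run price quoted above; actual cost varies with inputs.
COST_PER_RUN_USD = 0.0058


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole number of runs that fit into one US dollar."""
    return math.floor(1.0 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough cost in USD for a batch of runs, rounded to cents."""
    return round(num_runs * cost_per_run, 2)


if __name__ == "__main__":
    print(runs_per_dollar())      # 172
    print(estimated_cost(1000))   # 5.8
```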

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 5 seconds.

Readme

Unofficial BLIP-2 demo and API

Note that this is an unofficial implementation of BLIP-2 that is not associated with Salesforce.

Usage

BLIP-2 is a model that answers questions about images. To use it, provide an image and then ask a question about it. For example, you can provide the following image:

[Image]

and then pose the following question:

What is this a picture of?

and get the output:

marina bay sands, singapore.

BLIP-2 can also caption images. Captioning works by sending the model a blank prompt, though the UI and API also provide an explicit toggle for image captioning.
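As a rough sketch of how the two modes above might be driven from Python, the helper below builds an input payload for either a question or a caption request. The field names `image`, `question`, and `caption` are assumptions based on the description above, not confirmed parameter names, so check the model's API schema before relying on them. The image URL is hypothetical, and the version identifier in the commented-out call is a placeholder.

```python
from typing import Any, Dict, Optional


def build_blip2_input(image: str, question: Optional[str] = None) -> Dict[str, Any]:
    """Build an input payload for a question or a caption request.

    ASSUMPTION: the keys "image", "question", and "caption" are illustrative
    names; the real API schema may differ.
    """
    payload: Dict[str, Any] = {"image": image}
    if question and question.strip():
        # Visual question answering: send the question alongside the image.
        payload["question"] = question.strip()
        payload["caption"] = False
    else:
        # No question given: request a caption via the explicit toggle.
        payload["caption"] = True
    return payload


if __name__ == "__main__":
    # Hypothetical image URL, for illustration only.
    vqa = build_blip2_input("https://example.com/photo.jpg",
                            "What is this a picture of?")
    print(vqa)

    # To actually run the model (requires the `replicate` package and an API
    # token), a call along these lines should work; <version> is a
    # placeholder to be copied from the model page:
    #
    #   import replicate
    #   output = replicate.run("andreasjansson/blip-2:<version>", input=vqa)
```

Keeping payload construction separate from the network call makes the question/caption switch easy to check without spending a run.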

You can also provide BLIP-2 with more context when asking a question. For example, given the following image:

[Image]

you can provide the output of a previous Q&A as context, in the format "question: ... answer: ...", like so:

question: what animal is this? answer: panda

and then pose an additional question:

what country is this animal from?

and get the output:

china
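The context format shown above can be generated from earlier question/answer pairs. The sketch below is one way to do that; `format_context` is a hypothetical helper, and joining multiple pairs with a single space is an assumption, since the example above only shows one pair.

```python
from typing import Iterable, Tuple


def format_context(history: Iterable[Tuple[str, str]]) -> str:
    """Render previous (question, answer) pairs as 'question: ... answer: ...'.

    ASSUMPTION: multiple pairs are joined with a single space; the README
    only demonstrates the format for one pair.
    """
    return " ".join(
        f"question: {q.strip()} answer: {a.strip()}" for q, a in history
    )


if __name__ == "__main__":
    context = format_context([("what animal is this?", "panda")])
    print(context)  # question: what animal is this? answer: panda
```

The resulting string would be passed as the context alongside the follow-up question, for example "what country is this animal from?".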

Model description

BLIP-2 is a generic and efficient pre-training strategy that leverages advances in pretrained vision models and large language models (LLMs) for vision-language pretraining. BLIP-2 beats Flamingo on zero-shot VQAv2 (65.0 vs 56.3) and establishes a new state of the art on zero-shot captioning (121.6 CIDEr on NoCaps vs the previous best of 113.2). Equipped with powerful LLMs (e.g. OPT, FlanT5), BLIP-2 also unlocks new zero-shot instructed vision-to-language generation capabilities for a variety of applications. Learn more at the official repo.

Citation

@misc{https://doi.org/10.48550/arxiv.2301.12597,
  doi = {10.48550/ARXIV.2301.12597},
  url = {https://arxiv.org/abs/2301.12597},
  author = {Li, Junnan and Li, Dongxu and Savarese, Silvio and Hoi, Steven},
  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences},
  title = {BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models},
  publisher = {arXiv},
  year = {2023},
  copyright = {Creative Commons Attribution 4.0 International}
}