Run time and cost

This model costs approximately $0.014 to run on Replicate, or 71 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.
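The "71 runs per $1" figure follows directly from the approximate per-run cost. A minimal sketch of that arithmetic, assuming the quoted $0.014 holds for your inputs (the function names here are illustrative only):

```python
import math

# Approximate per-run cost quoted above; the real cost varies with inputs.
COST_PER_RUN_USD = 0.014


def runs_per_dollar(cost_per_run: float = COST_PER_RUN_USD) -> int:
    """Whole number of runs that fit into one dollar."""
    return math.floor(1 / cost_per_run)


def estimated_cost(num_runs: int, cost_per_run: float = COST_PER_RUN_USD) -> float:
    """Rough total cost in USD for a batch of runs, rounded to cents."""
    return round(num_runs * cost_per_run, 2)


print(runs_per_dollar())     # 71
print(estimated_cost(1000))  # 14.0
```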

This model runs on Nvidia A100 (80GB) GPU hardware. Predictions typically complete within 11 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Unofficial BLIP-3 (xgen-mm-phi3-mini-instruct-r-v1) demo and API

Note that this is an unofficial implementation of BLIP-3 (previously known as blip3-phi3-mini-base-r-v1) that is not associated with Salesforce.

Usage

BLIP-3 is a model that answers questions about images. To use it, provide an image, and then ask a question about that image. For example, you can provide the following image:

[Image: Marina Bay Sands]

and then pose the following question:

What is this a picture of?

and get the output:

Marina Bay Sands, Singapore.

BLIP-3 is also capable of captioning images. This works by sending the model a blank prompt, though we have an explicit toggle for image captioning in the UI & API.
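The two modes described so far (question answering and captioning) can be sketched as a small helper that builds a request payload. This is only an illustration: the key names `image`, `question`, and `caption` are assumptions, not the model's confirmed input schema, so check the API tab of the model page for the actual parameter names.

```python
from typing import Optional


def build_input(image_url: str, question: Optional[str] = None) -> dict:
    """Build a request payload for either Q&A or captioning.

    NOTE: the keys "image", "question", and "caption" are hypothetical
    placeholders, not the confirmed input schema of this model.
    """
    # A missing or blank question is treated as a captioning request,
    # mirroring the "blank prompt" behaviour described above.
    is_caption = question is None or not question.strip()
    return {
        "image": image_url,
        "caption": is_caption,
        "question": "" if is_caption else question.strip(),
    }


# Question answering: an image plus a question about it.
qa = build_input("https://example.com/marina-bay-sands.jpg",
                 "What is this a picture of?")

# Captioning: no question, so the caption toggle is set instead.
cap = build_input("https://example.com/marina-bay-sands.jpg")

print(qa)
print(cap)
```

A payload like this could then be passed as the `input` argument of the Replicate Python client's `replicate.run`, together with the model identifier shown on the model page.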

You can also provide BLIP-3 with more context when asking a question. For example, given the following image:

[Image: Panda]

you can provide the output of a previous Q&A as context, in "question: … answer: …" format, like so:

question: What animal is this? answer: A panda

and then pose an additional question:

What country is this animal native to?

and get the output:

China
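The context string in the panda example can be assembled from earlier question/answer pairs. A minimal sketch follows; the helper name `format_context` is illustrative, and joining several pairs with a single space is an assumption, since the example above shows only one pair.

```python
from typing import Iterable, Tuple


def format_context(history: Iterable[Tuple[str, str]]) -> str:
    """Render prior (question, answer) pairs in the context format shown above.

    Assumption: multiple pairs are separated by a single space; the README
    only demonstrates the single-pair case.
    """
    return " ".join(f"question: {q} answer: {a}" for q, a in history)


context = format_context([("What animal is this?", "A panda")])
print(context)  # question: What animal is this? answer: A panda
```

The resulting string is what you would supply as context alongside the follow-up question "What country is this animal native to?".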

Model description

XGen-MM (previously known as BLIP-3) is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. This series advances upon the successful designs of the BLIP series, incorporating fundamental enhancements that ensure a more robust and superior foundation.

Key features of XGen-MM:

- The pretrained foundation model, xgen-mm-phi3-mini-base-r-v1, achieves state-of-the-art performance under 5b parameters and demonstrates strong in-context learning capabilities.
- The instruct fine-tuned model, xgen-mm-phi3-mini-instruct-r-v1, achieves state-of-the-art performance among open-source and closed-source VLMs under 5b parameters.
- xgen-mm-phi3-mini-instruct-r-v1 supports flexible high-resolution image encoding with efficient visual token sampling.

These models have been trained at scale on high-quality image caption datasets and interleaved image-text data.

Citation

@misc{xgen_mm_phi3_mini,
  title={xgen-mm-phi3-mini-instruct Model Card},
  url={https://huggingface.co/Salesforce/xgen-mm-phi3-mini-instruct-r-v1},
  author={Salesforce AI Research},
  month={May},
  year={2024}
}