Input

image

file

Input image

prompt

*string

Shift + Return to add a new line

Prompt to use for text generation

top_p

number

(minimum: 0, maximum: 1)

When decoding text, samples from the top p percentage of most likely tokens; lower to ignore less likely tokens

Default: 1

temperature

number

(minimum: 0)

Adjusts randomness of outputs, greater than 1 is random and 0 is deterministic

Default: 0.2

max_tokens

integer

(minimum: 0)

Maximum number of tokens to generate. A word is generally 2-3 tokens

Default: 1024

history

string[]

List of earlier chat messages, alternating roles, starting with user input. Include <image> to specify which message to attach the image to.

Run this model in Node.js with one line of code:

npx create-replicate --model=yorickvp/llava-v1.6-mistral-7b

or set up a project from scratch

Install Replicate’s Node.js client library:

npm install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import and set up the client:

import Replicate from "replicate";

const replicate = new Replicate({
  auth: process.env.REPLICATE_API_TOKEN,
});

Run yorickvp/llava-v1.6-mistral-7b using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

const output = await replicate.run(
  "yorickvp/llava-v1.6-mistral-7b:19be067b589d0c46689ffa7cc3ff321447a441986a7694c01225973c2eafc874",
  {
    input: {
      image: "https://replicate.delivery/pbxt/KKNB7w6pjN79j5pHDSyYXa5EwaQE9FL5fx6Qa83XMn1HYuKm/extreme_ironing.jpg",
      top_p: 1,
      prompt: "What is unusual about this image?",
      max_tokens: 1024,
      temperature: 0.2
    }
  }
);

console.log(output);

To learn more, take a look at the guide on getting started with Node.js.

Install Replicate’s Python client library:

pip install replicate

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Import the client:

import replicate

Run yorickvp/llava-v1.6-mistral-7b using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

output = replicate.run(
    "yorickvp/llava-v1.6-mistral-7b:19be067b589d0c46689ffa7cc3ff321447a441986a7694c01225973c2eafc874",
    input={
        "image": "https://replicate.delivery/pbxt/KKNB7w6pjN79j5pHDSyYXa5EwaQE9FL5fx6Qa83XMn1HYuKm/extreme_ironing.jpg",
        "top_p": 1,
        "prompt": "What is unusual about this image?",
        "max_tokens": 1024,
        "temperature": 0.2
    }
)

# The yorickvp/llava-v1.6-mistral-7b model can stream output as it's running.
# The predict method returns an iterator, and you can iterate over that output.
for item in output:
    # https://replicate.com/yorickvp/llava-v1.6-mistral-7b/api#output-schema
    print(item, end="")

To learn more, take a look at the guide on getting started with Python.

Set the REPLICATE_API_TOKEN environment variable:

export REPLICATE_API_TOKEN=<paste-your-token-here>

Find your API token in your account settings.

Run yorickvp/llava-v1.6-mistral-7b using Replicate’s API. Check out the model's schema for an overview of inputs and outputs.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Prefer: wait" \
  -d $'{
    "version": "yorickvp/llava-v1.6-mistral-7b:19be067b589d0c46689ffa7cc3ff321447a441986a7694c01225973c2eafc874",
    "input": {
      "image": "https://replicate.delivery/pbxt/KKNB7w6pjN79j5pHDSyYXa5EwaQE9FL5fx6Qa83XMn1HYuKm/extreme_ironing.jpg",
      "top_p": 1,
      "prompt": "What is unusual about this image?",
      "max_tokens": 1024,
      "temperature": 0.2
    }
  }' \
  https://api.replicate.com/v1/predictions

To learn more, take a look at Replicate’s HTTP API reference docs.

Output

The unusual aspect of this image is that a man is standing on the back of a yellow SUV, ironing clothes. This is not a typical scene, as one would expect to see the man either inside the vehicle or on the ground, rather than standing on the back of the SUV. The act of ironing clothes while standing on the back of a moving vehicle is both unusual and potentially dangerous.

{
  "completed_at": "2024-02-01T11:47:16.816746Z",
  "created_at": "2024-02-01T11:44:51.978072Z",
  "data_removed": false,
  "error": null,
  "id": "qlua7w3b5y2fcputi27al53qgy",
  "input": {
    "image": "https://replicate.delivery/pbxt/KKNB7w6pjN79j5pHDSyYXa5EwaQE9FL5fx6Qa83XMn1HYuKm/extreme_ironing.jpg",
    "top_p": 1,
    "prompt": "What is unusual about this image?",
    "max_tokens": 1024,
    "temperature": 0.2
  },
  "logs": "The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.\nSetting `pad_token_id` to `eos_token_id`:2 for open-end generation.",
  "metrics": {
    "predict_time": 4.445047,
    "total_time": 144.838674
  },
  "output": [
    "The ",
    "unusual ",
    "aspect ",
    "of ",
    "this ",
    "image ",
    "is ",
    "that ",
    "a ",
    "man ",
    "is ",
    "standing ",
    "on ",
    "the ",
    "back ",
    "of ",
    "a ",
    "yellow ",
    "SUV, ",
    "ironing ",
    "clothes. ",
    "This ",
    "is ",
    "not ",
    "a ",
    "typical ",
    "scene, ",
    "as ",
    "one ",
    "would ",
    "expect ",
    "to ",
    "see ",
    "the ",
    "man ",
    "either ",
    "inside ",
    "the ",
    "vehicle ",
    "or ",
    "on ",
    "the ",
    "ground, ",
    "rather ",
    "than ",
    "standing ",
    "on ",
    "the ",
    "back ",
    "of ",
    "the ",
    "SUV. ",
    "The ",
    "act ",
    "of ",
    "ironing ",
    "clothes ",
    "while ",
    "standing ",
    "on ",
    "the ",
    "back ",
    "of ",
    "a ",
    "moving ",
    "vehicle ",
    "is ",
    "both ",
    "unusual ",
    "and ",
    "potentially ",
    "dangerous. "
  ],
  "started_at": "2024-02-01T11:47:12.371699Z",
  "status": "succeeded",
  "urls": {
    "stream": "https://streaming-api.svc.us.c.replicate.net/v1/streams/523buztgaalgf6aipreduz2zsudl3edwknku3wftgaw4sdlbdtta",
    "get": "https://api.replicate.com/v1/predictions/qlua7w3b5y2fcputi27al53qgy",
    "cancel": "https://api.replicate.com/v1/predictions/qlua7w3b5y2fcputi27al53qgy/cancel"
  },
  "version": "6d853ae87b782cd5f659578c807aa5cc03075e40b28fc3ff83e8aee98f9f94c7"
}

Generated in

4.5 seconds

Tweak it Report View full prediction

This output was created using a different version of the model, yorickvp/llava-v1.6-mistral-7b:6d853ae8.

Run time and cost

This model costs approximately $0.030 to run on Replicate, or 33 runs per $1, but this varies depending on your inputs. It is also open source and you can run it on your own computer with Docker.

This model runs on Nvidia L40S GPU hardware. Predictions typically complete within 32 seconds. The predict time for this model varies significantly based on the inputs.

Readme

Check out the different LLaVA’s on Replicate:

Name	Version	Base	Size	Finetunable
v1.5 - Vicuna-13B	v1.5	Vicuna	13B	Yes
v1.6 - Vicuna-13B	v1.6	Vicuna	13B	No
v1.6 - Vicuna-7B	v1.6	Vicuna	7B	No
v1.6 - Mistral-7B	v1.6	Mistral	7B	No
v1.6 - Nous-Hermes-2-34B	v1.6	Nous-Hermes-2	34B	No

🌋 LLaVA v1.6: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

[Project Page] [Demo] [Data] [Model Zoo]

Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

LLaVA v1.6 changes

LLaVA-1.6 is out! With additional scaling to LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post!

Summary

LLaVA represents a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking spirits of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.

yorickvp / llava-v1.6-mistral-7b

Input

Output

Run time and cost

Readme

🌋 LLaVA v1.6: Large Language and Vision Assistant

LLaVA v1.6 changes

Summary

Logs (qlua7w3b5y2fcputi27al53qgy)