yorickvp / llava-v1.6-vicuna-13b

LLaVA v1.6: Large Language and Vision Assistant (Vicuna-13B)

Run time and cost

This model costs approximately $0.011 to run on Replicate, or about 90 runs per $1, though this varies depending on your inputs. It is also open source, so you can run it on your own computer with Docker.

This model runs on Nvidia A40 (Large) GPU hardware. Predictions typically complete within 16 seconds.
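Here is a minimal sketch of calling the model through the Replicate Python client. The input field names (`image`, `prompt`) are assumptions based on the model's input schema as shown on its Replicate page; check the model page for the full list of parameters and pin a version hash if you need reproducible runs.

```python
# Minimal sketch: run yorickvp/llava-v1.6-vicuna-13b via the Replicate Python client.
# Requires `pip install replicate` and a REPLICATE_API_TOKEN environment variable.
import replicate

output = replicate.run(
    "yorickvp/llava-v1.6-vicuna-13b",  # optionally pin ":<version-hash>" for reproducibility
    input={
        "image": open("photo.jpg", "rb"),          # local file handle; a public URL string also works
        "prompt": "Describe this image in detail.",
    },
)

# The model streams text tokens, so join the pieces into a single string.
print("".join(output))
```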

Readme

Check out the different LLaVA models on Replicate:

| Name | Version | Base | Size | Finetunable |
|------|---------|------|------|-------------|
| v1.5 - Vicuna-13B | v1.5 | Vicuna | 13B | Yes |
| v1.6 - Vicuna-13B | v1.6 | Vicuna | 13B | No |
| v1.6 - Vicuna-7B | v1.6 | Vicuna | 7B | No |
| v1.6 - Mistral-7B | v1.6 | Mistral | 7B | No |
| v1.6 - Nous-Hermes-2-34B | v1.6 | Nous-Hermes-2 | 34B | No |

🌋 LLaVA v1.6: Large Language and Vision Assistant

Visual instruction tuning towards large language and vision models with GPT-4 level capabilities.

[Project Page] [Demo] [Data] [Model Zoo]

Improved Baselines with Visual Instruction Tuning [Paper]
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee

Visual Instruction Tuning (NeurIPS 2023, Oral) [Paper]
Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee (*Equal Contribution)

LLaVA v1.6 changes

LLaVA-1.6 is out! By scaling up LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and handle more tasks and applications than before. Check out the blog post!

Summary

LLaVA is an end-to-end trained large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding. It achieves impressive chat capabilities in the spirit of the multimodal GPT-4 and sets a new state-of-the-art accuracy on Science QA.