Cog implementation of moondream2.
Creator’s GitHub repo: https://github.com/vikhyat/moondream
Hugging Face model page: https://huggingface.co/vikhyatk/moondream2
X / Twitter of creator: https://x.com/vikhyatk
Benchmarks
moondream2 is a 1.86B parameter model initialized with weights from SigLIP and Phi 1.5.
Model | VQAv2 | GQA | TextVQA | TallyQA (simple) | TallyQA (full) |
---|---|---|---|---|---|
moondream1 | 74.7 | 57.9 | 35.6 | - | - |
moondream2 (latest) | 76.8 | 60.6 | 46.4 | 79.6 | 73.3 |
Usage
Using transformers (recommended)
pip install transformers timm einops
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

# Pin the model to a dated release so later updates do not change results.
model_id = "vikhyatk/moondream2"
revision = "2024-03-13"

# trust_remote_code=True is needed because the model code ships with the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)

# Encode the image once, then ask a question about the encoding.
image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.
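Because `encode_image` and `answer_question` are separate calls in the snippet above, one encoded image can presumably be reused for several questions. The following is a minimal sketch under that assumption; `ask_many` is a hypothetical helper and not part of moondream.

```python
def ask_many(model, tokenizer, image, questions):
    """Encode `image` once and answer each question against that encoding.

    Hypothetical helper: it relies only on the `encode_image` and
    `answer_question` calls shown in the usage snippet.
    """
    enc_image = model.encode_image(image)  # run the vision encoder a single time
    return {
        question: model.answer_question(enc_image, question, tokenizer)
        for question in questions
    }
```

With the `model`, `tokenizer`, and `image` objects from the snippet above, a call such as `ask_many(model, tokenizer, image, ["Describe this image.", "Are there people in this image?"])` would return a dictionary mapping each question to its answer.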
To enable Flash Attention on the text model, pass attn_implementation="flash_attention_2" when instantiating the model. This requires a CUDA GPU and the flash-attn package, with the model loaded in half precision.
import torch

# Load in half precision on the GPU with Flash Attention enabled.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision,
    torch_dtype=torch.float16, attn_implementation="flash_attention_2"
).to("cuda")
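If the same script has to run on machines with and without Flash Attention, the optional argument can be added conditionally. This is a sketch only; `build_load_kwargs` and its two boolean flags are hypothetical names introduced for illustration.

```python
def build_load_kwargs(revision, has_cuda, has_flash_attn):
    """Assemble keyword arguments for AutoModelForCausalLM.from_pretrained.

    Hypothetical helper: Flash Attention is requested only when the caller
    reports both a CUDA device and an installed flash-attn package.
    """
    kwargs = {"trust_remote_code": True, "revision": revision}
    if has_cuda and has_flash_attn:
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs
```

One way to derive the flags is `torch.cuda.is_available()` and `importlib.util.find_spec("flash_attn") is not None`. The result can then be unpacked into the call as `AutoModelForCausalLM.from_pretrained(model_id, **kwargs)`, leaving the choice of `torch_dtype` and device to the caller.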
Batch inference is also supported.
# Each prompt is paired with the image at the same index.
answers = model.batch_answer(
    images=[Image.open('<IMAGE_PATH_1>'), Image.open('<IMAGE_PATH_2>')],
    prompts=["Describe this image.", "Are there people in this image?"],
    tokenizer=tokenizer,
)
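For longer lists of images, it may help to split the work into smaller batches to limit memory use. The sketch below assumes that `batch_answer` returns one answer per input, in input order; `batch_answer_in_chunks` and `chunk_size` are hypothetical names, not part of moondream.

```python
def batch_answer_in_chunks(model, tokenizer, images, prompts, chunk_size=4):
    """Call model.batch_answer on consecutive slices and concatenate the results.

    Hypothetical helper. Assumes batch_answer returns a list with one answer
    per (image, prompt) pair, in the same order as the inputs.
    """
    if len(images) != len(prompts):
        raise ValueError("images and prompts must have the same length")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    answers = []
    for start in range(0, len(images), chunk_size):
        end = start + chunk_size
        answers.extend(
            model.batch_answer(
                images=images[start:end],
                prompts=prompts[start:end],
                tokenizer=tokenizer,
            )
        )
    return answers
```

A suitable `chunk_size` depends on image resolution and available memory, so the default of 4 is only a placeholder.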