Multi-Modal LLM / VLM Inference

Introduction

  • Multi-modal LLMs are LLMs capable of handling multiple types (modalities) of data, e.g. text, images, audio, and video.
  • In this workshop, we focus on Vision Language Models (VLMs), a subset of multi-modal LLMs that handle images and text.

VLM architectures

Image source: Sebastian Raschka

Inference in vLLM

$ vllm serve unsloth/Llama-3.2-11B-Vision-Instruct
  • Some other useful arguments (an example invocation follows):
    • --limit-mm-per-prompt: limit how many multi-modal items (e.g. images) are allowed per prompt
    • --allowed-local-media-path: allow the server to read local media files from the given directory
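
A sketch of a launch command using both flags; note that the value syntax for --limit-mm-per-prompt differs across vLLM versions (older releases take key=value pairs, newer ones accept JSON), so check vllm serve --help for your installation:

# bash
# Allow at most 2 images per prompt and permit reading media
# from /scratch/images (a hypothetical local path)
$ vllm serve unsloth/Llama-3.2-11B-Vision-Instruct \
    --limit-mm-per-prompt image=2 \
    --allowed-local-media-path /scratch/images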

Messages to LLM

messages = [
    {"role": "user", "content": "..."},
    ...
]
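
For comparison, a minimal text-only request against the server started above, using the OpenAI-compatible endpoint (assuming the default port 8000):

# python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="")
messages = [{"role": "user", "content": "Hello!"}]
response = client.chat.completions.create(
    model="unsloth/Llama-3.2-11B-Vision-Instruct", messages=messages
)
print(response.choices[0].message.content)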

Messages to multi-modal model

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in the image?"},
            {"type": "image_url", "image_url": {"url": "https://..."}}
        ]
    }
]

Send raw image data

  • Encode the image with base64; the resulting string is embedded in the message (see below)
# bash
data=$(base64 -w 0 image.jpg)  # -w 0 disables line wrapping (GNU coreutils; not needed on macOS)
# python
import base64

with open("image.jpg", "rb") as image_file:
    data = base64.b64encode(image_file.read()).decode("utf-8")
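
The raw base64 string is then wrapped in a data URL carrying the MIME type; a minimal sketch, assuming a JPEG image:

# python
image_url = f"data:image/jpeg;base64,{data}"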

Message with raw data

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in the image?"},
            {"type": "image_url", "image_url": {"url": "data"}}
        ]
    }
]

OpenAI Python SDK example

import base64
from openai import OpenAI

# Point the SDK at the local vLLM server; the key is unused unless the server was started with --api-key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="")

with open("../image/eso2105a.jpg", "rb") as image_file:
    data = base64.b64encode(image_file.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in the image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{data}"}},
        ],
    },
]
response = client.chat.completions.create(model="...", messages=messages)
print(response.choices[0].message.content)  # print only the model's text reply

Offline inference in Transformers

  • Use AutoProcessor instead of AutoTokenizer
  • Use AutoModelForImageTextToText instead of AutoModelForCausalLM
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_name = "unsloth/Llama-3.2-11B-Vision-Instruct"  # example; the same model served earlier

model = AutoModelForImageTextToText.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

url = "https://cdn.eso.org/images/screen/eso2105a.jpg"

Message to multi-modal model

  • In Transformers, use {"type": "image", "url": url} instead of {"type": "image_url", "image_url": {"url": url}}
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": url},
                {"type": "text", "text": "What is shown in the image?"},
            ],
        },
    ]

Raw data in message

with open("../image/eso2105a.jpg", "rb") as image_file:
    data = base64.b64encode(image_file.read()).decode("utf-8")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in the image?"},
            {"type": "image", "url": f"data:image/png;base64,{data}"},
        ],
    },
]

processed_chat = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
)
print(list(processed_chat.keys()))  # e.g. input_ids, attention_mask, pixel_values (exact keys depend on the model)
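
The processed inputs can be passed directly to generate; a minimal sketch, moving the tensors to the model's device first:

# python
processed_chat = processed_chat.to(model.device)
output = model.generate(**processed_chat, max_new_tokens=30)
print(processor.decode(output[0]))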

Attach image with processor

image = Image.open(requests.get(url, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in the image"}
        ]
    }
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image, input_text, add_special_tokens=False, return_tensors="pt"
).to(model.device)

output = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
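
The decoded string contains the prompt as well as the reply; to print only the newly generated tokens, slice off the prompt length (assuming a single sequence in the batch):

# python
prompt_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))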

Exercise

  • Write a jobscript to launch a vLLM server serving one VLM
  • Use your preferred way to send messages and images to the server; you can do this in the same jobscript
  • Use Transformers to load a VLM and handle a message with an image
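
A minimal jobscript sketch for the first item, assuming a Slurm cluster (submit with sbatch); the resource requests and the client script name are hypothetical and must be adapted to your site:

#!/bin/bash
#SBATCH --job-name=vllm-serve
#SBATCH --gpus=1            # hypothetical resource request; adapt to your cluster
#SBATCH --time=01:00:00

vllm serve unsloth/Llama-3.2-11B-Vision-Instruct &   # start the server in the background
sleep 120                                            # crude wait; polling /v1/models is more robust
python query_vlm.py                                  # hypothetical client using the SDK example above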

Reference

  • https://huggingface.co/learn/computer-vision-course/unit4/multimodal-models/pre-intro
  • https://magazine.sebastianraschka.com/p/understanding-multimodal-llms
  • https://arxiv.org/abs/2405.17927