
LLM Inference

LM Studio

LM Studio on Alvis

[Screenshot: lmstudio1]

LM Studio desktop app

[Screenshot: lmstudio2]

Chat interface

[Screenshot: lmstudio3]

Model download

[Screenshot: lmstudio4]

Select download directory

[Screenshot: lmstudio5]

Check downloaded models

[Screenshot: lmstudio6]

Load model and chat

[Screenshot: lmstudio7]

OpenAI-compatible API server

[Screenshot: lmstudio8]

Endpoints

  • Four endpoints:
    • /v1/models
    • /v1/chat/completions
    • /v1/completions
    • /v1/embeddings

Use OpenAI API

  • Get available models:
$ curl http://localhost:1234/v1/models
  • Chat with a model:
$ curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [
        { "role": "user", "content": "why is the sky blue" }
    ]
}'

OpenAI Python SDK

from openai import OpenAI

# LM Studio does not check the API key, but the client requires one to be set
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

model_list = client.models.list()
print(model_list)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "why is the sky blue?"}],
)
print(response.choices[0].message.content)
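
The /v1/completions endpoint listed above can be exercised the same way. A minimal sketch against the same server and loaded model; the prompt and token limit are illustrative:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Plain text completion against the /v1/completions endpoint
response = client.completions.create(
    model="llama-3.3-70b-instruct",
    prompt="The sky is blue because",
    max_tokens=64,
)
print(response.choices[0].text)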

Chat completion arguments
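
The chat completions endpoint accepts the usual OpenAI-style sampling arguments. A minimal sketch against the same LM Studio server; the values below are illustrative assumptions, not recommendations:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "why is the sky blue?"}],
    temperature=0.6,   # sampling temperature
    top_p=0.9,         # nucleus sampling cutoff
    max_tokens=256,    # cap on generated tokens
    stream=False,      # set to True to stream tokens as they are generated
)
print(response.choices[0].message.content)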

Command line tools

$ ~/.lmstudio/bin/lms status

   ┌ Status ───────────────────────────────────┐
   │                                           │
   │   Server:  ON  (Port: 1234)               │
   │                                           │
   │   Loaded Models                           │
   │     · llama-3.3-70b-instruct - 42.52 GB   │
   │                                           │
   └───────────────────────────────────────────┘

Command line tools (continued)

$ ~/.lmstudio/bin/lms ps

   LOADED MODELS

Identifier: llama-3.3-70b-instruct
  • Type:  LLM
  • Path: lmstudio-community/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q4_K_M.gguf
  • Size: 42.52 GB
  • Architecture: Llama

For more commands, run: ~/.lmstudio/bin/lms --help

Advanced settings

[Screenshot: lmstudio9]

Exercise

- Launch your own LM Studio on a compute node
- Use `curl` to get a response
- Use the OpenAI Python SDK to get a response

vLLM

OpenAI-Compatible API Server

  • Launch server: vllm serve unsloth/Llama-3.2-1B-Instruct
    • default URL: http://localhost:8000
    • set host: --host <host>
    • set port: --port <port>
    • More arguments

Once a server is launched, in another terminal:

  • Chat: vllm chat
  • Completion: vllm complete
  • Benchmark: vllm bench

Note:

- Using `unsloth/Llama-3.2-1B-Instruct` will download the model from Hugging Face to your `HF_HOME` directory.
- To load a local model, pass the snapshot path instead, e.g. `/..../models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e3...../`

Endpoints

  • A subset of the available endpoints:

    • /v1/models
    • /v1/responses
    • /v1/responses/{response_id}
    • /v1/chat/completions
    • /v1/completions
    • ...
    • /openapi.json
    • /docs
    • /health
    • ...
  • Similarly, get available models:

    $ curl http://localhost:8000/v1/models
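
Since the server is OpenAI-compatible, the Python SDK pattern from the LM Studio section also works here. A minimal sketch, assuming the server runs on the default http://localhost:8000 and serves unsloth/Llama-3.2-1B-Instruct:

from openai import OpenAI

# vLLM does not check the API key unless one is configured, but the client needs a value
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="unsloth/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response.choices[0].message.content)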
    

Offline inference (LLM class)

  • LLM python class
from vllm import LLM

# Initialize the vLLM engine.
llm = LLM(model="unsloth/Llama-3.2-1B-Instruct")
  • Arguments are similar to the ones used in vllm serve except for some missing features like pipeline parallelism.

LLM class methods

  • Chat
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=128,
    top_p=0.9,
)

messages = [
    {"role": "user", "content": "Why is the sky blue?"},
]

llm = LLM(
    model="unsloth/Llama-3.2-1B-Instruct",
    tensor_parallel_size=4,
)

output = llm.chat(messages, sampling_params, use_tqdm=False)
print(output[0].outputs[0].text)
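
Besides chat, the LLM class also provides a generate() method for plain text completion. A minimal sketch; the prompt and sampling values are illustrative:

from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.6, max_tokens=128, top_p=0.9)

llm = LLM(model="unsloth/Llama-3.2-1B-Instruct")

# generate() takes raw prompt strings instead of chat messages
outputs = llm.generate(["The sky is blue because"], sampling_params)
print(outputs[0].outputs[0].text)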

Exercise

- Launch your own vLLM server on a compute node
- Write a jobscript to launch vLLM
- Use `curl` in your jobscript to get a response from the vLLM server
- Write another Python file and use the OpenAI Python SDK to get a response
- Use the `LLM` class to load a model and generate some output

Huggingface Transformers

OpenAI-Compatible API Server

  • Launch server: transformers serve (no model is selected at launch; it is specified per request)
  • Endpoints:
    • /v1/chat/completions
    • /v1/responses
    • /v1/audio/transcriptions
    • /v1/models
  • Chat in another terminal: transformers chat --model-name-or-path openai/gpt-oss-20b (or use the OpenAI Python SDK, as sketched below)
  • More information
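
Because the server exposes an OpenAI-compatible /v1/chat/completions endpoint, the same Python SDK pattern can be used. A minimal sketch; it assumes transformers serve is reachable at http://localhost:8000 (adjust base_url to wherever your server listens) and that openai/gpt-oss-20b can be loaded:

from openai import OpenAI

# The API key is a placeholder; transformers serve does not require a real key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",   # model is selected per request
    messages=[{"role": "user", "content": "Explain what MXFP4 quantization is."}],
)
print(response.choices[0].message.content)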

Lower level operation

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True,
).to(model.device)

# do_sample=True is needed for temperature to take effect
outputs = model.generate(
    **inputs, max_new_tokens=200, do_sample=True, temperature=0.7
)

print(tokenizer.decode(outputs[0]))
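
To print only the newly generated text rather than the prompt plus completion, the prompt tokens can be sliced off before decoding; a short continuation of the block above:

# Continue from the block above: keep only the tokens generated after the prompt
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))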

Exercise

- Use the `AutoModelForCausalLM` and `AutoTokenizer` classes to load a model and generate some output

Other Tools