LLM Inference
LM Studio¶
LM Studio on Alvis¶

LM Studio desktop app¶

Chat interface¶

Model download¶

Select download directory¶

Check downloaded models¶

Load model and chat¶

OpenAI-compatible API server¶

Endpoints¶
- Four endpoints:
  - `/v1/models`
  - `/v1/chat/completions`
  - `/v1/completions`
  - `/v1/embeddings`
Use OpenAI API¶
- Get available models (a `curl` sketch is shown after the chat example below)
- Chat with a model:
$ curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [
      { "role": "user", "content": "why is the sky blue" }
    ]
  }'
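Listing the available models works the same way; a minimal sketch, assuming the server is running on the default port 1234 shown above:

$ curl http://localhost:1234/v1/models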
OpenAI Python SDK¶
from openai import OpenAI

# LM Studio ignores the API key, but the SDK requires a non-empty value
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

model_list = client.models.list()
print(model_list)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "why is the sky blue?"}],
)
print(response)
Command line tools¶
$ ~/.lmstudio/bin/lms status
┌ Status ───────────────────────────────────┐
│                                           │
│   Server:  ON  (Port: 1234)               │
│                                           │
│   Loaded Models                           │
│     · llama-3.3-70b-instruct - 42.52 GB   │
│                                           │
└───────────────────────────────────────────┘
Command line tools (continued)¶
$ ~/.lmstudio/bin/lms ps
LOADED MODELS
Identifier: llama-3.3-70b-instruct
• Type: LLM
• Path: lmstudio-community/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q4_K_M.gguf
• Size: 42.52 GB
• Architecture: Llama
More options: `~/.lmstudio/bin/lms --help`
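Not shown in the original slides: the same CLI can also start the server and load a model without the GUI. A minimal sketch, assuming the `lms server start`, `lms ls`, and `lms load` subcommands of the bundled CLI:

$ ~/.lmstudio/bin/lms server start
$ ~/.lmstudio/bin/lms ls                           # list downloaded models
$ ~/.lmstudio/bin/lms load llama-3.3-70b-instruct  # load a model by identifier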
Advanced settings¶

Exercise¶
- Launch your own LM Studio on a compute node (a jobscript sketch follows after this list)
- Use `curl` to get a response
- Use the OpenAI Python SDK to get a response
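One possible starting point for the first task, assuming Slurm batch submission on Alvis and LM Studio installed under `~/.lmstudio` as in the earlier slides. The project ID and GPU request are placeholders you must replace with your own:

#!/bin/bash
#SBATCH -A NAISS2024-X-YY        # placeholder project ID - use your own
#SBATCH --gpus-per-node=A40:1    # placeholder GPU request - adjust to your allocation
#SBATCH -t 00:30:00

# Start the headless server and load an already-downloaded model
~/.lmstudio/bin/lms server start
~/.lmstudio/bin/lms load llama-3.3-70b-instruct

# Query the server with curl, as in the earlier example
curl http://localhost:1234/v1/models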
vLLM¶
OpenAI-Compatible API Server¶
- Launch server: `vllm serve unsloth/Llama-3.2-1B-Instruct`
  - default URL: `http://localhost:8000`
  - set host: `--host <host>`
  - set port: `--port <port>`
  - more arguments: `vllm serve --help`

Once a server is launched, in another terminal:
- Chat: `vllm chat`
- Completion: `vllm complete`
- Benchmark: `vllm bench`
Note:
- Using `unsloth/Llama-3.2-1B-Instruct` will download the model from Hugging Face to your `HF_HOME` directory.
- To load a local model, you can pass the snapshot path, e.g.
`/..../models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e3...../`
Endpoints¶
- Some of the endpoints:
  - `/v1/models`
  - `/v1/responses`
  - `/v1/responses/{response_id}`
  - `/v1/chat/completions`
  - `/v1/completions`
  - ...
  - `/openapi.json`
  - `/docs`
  - `/health`
  - ...
- Similarly, get available models:
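For example, against the default URL of the server launched above:

$ curl http://localhost:8000/v1/models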
Offline inference (LLM class)¶
`LLM` Python class
- Arguments are similar to the ones used in `vllm serve`, except for some missing features like pipeline parallelism.
LLM class methods¶
- Chat
from vllm import LLM, SamplingParams

# Sampling settings for generation
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=128,
    top_p=0.9,
)

messages = [
    {"role": "user", "content": "Why is the sky blue?"},
]

# tensor_parallel_size=4 splits the model across 4 GPUs
llm = LLM(
    model="unsloth/Llama-3.2-1B-Instruct",
    tensor_parallel_size=4,
)

output = llm.chat(messages, sampling_params, use_tqdm=False)
print(output[0].outputs[0].text)
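Not in the original slides, but the `LLM` class also has a `generate` method for plain text completion; a minimal sketch reusing the `llm` and `sampling_params` objects from the block above:

# Reuses `llm` and `sampling_params` from the previous snippet
prompts = ["The sky appears blue because"]
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
print(outputs[0].outputs[0].text)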
Exercise¶
- Launch your own vLLM server on a compute node
- Write a jobscript to launch vLLM (a sketch follows after this list)
- Use `curl` in your jobscript to get a response from the vLLM server
- Write another Python file and use the OpenAI Python SDK to get a response
- Use the `LLM` class to load a model and generate some output
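One possible shape for the jobscript, as a sketch rather than a complete solution: the Slurm header values are placeholders, and it assumes a Python environment with vLLM is already available on the compute node. The `/health` endpoint from the earlier slide is used to wait for startup:

#!/bin/bash
#SBATCH -A NAISS2024-X-YY         # placeholder project ID - use your own
#SBATCH --gpus-per-node=A40:1     # placeholder GPU request - adjust to your allocation
#SBATCH -t 00:30:00

# Launch the server in the background and remember its PID
vllm serve unsloth/Llama-3.2-1B-Instruct --port 8000 &
SERVER_PID=$!

# Wait until the /health endpoint responds before sending requests
until curl -sf http://localhost:8000/health > /dev/null; do sleep 10; done

# Ask for a chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "unsloth/Llama-3.2-1B-Instruct",
       "messages": [{"role": "user", "content": "why is the sky blue?"}]}'

# Stop the server so the job can finish
kill $SERVER_PID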
Huggingface Transformers¶
OpenAI-Compatible API Server¶
- Launch server: `transformers serve` (a model is not selected yet)
- Endpoints:
  - `/v1/chat/completions`
  - `/v1/responses`
  - `/v1/audio/transcriptions`
  - `/v1/models`
- Chat in another terminal: `transformers chat --model-name-or-path openai/gpt-oss-20b`
- More information: see the Transformers documentation on serving
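Not in the original slides: this server can be queried the same way as the others. The base URL below is an assumption (the address and port are printed when `transformers serve` starts), so adjust it if yours differs:

$ curl http://localhost:8000/v1/models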
Lower level operation¶
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]
# Build the prompt with the model's chat template and move it to the model's device
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=200, temperature=0.7,
    do_sample=True,  # sampling must be enabled for temperature to take effect
)
print(tokenizer.decode(outputs[0]))
Exercise¶
- Use the `AutoModelForCausalLM` and `AutoTokenizer` classes to load a model and generate some output