LLM Inference
LM Studio¶
LM Studio on Alvis¶

LM Studio desktop app¶

Chat interface¶

Model download¶

Select download directory¶

Check downloaded models¶

Load model and chat¶

OpenAI-compatible API server¶

Endpoints¶
- Four endpoints:
  - `/v1/models`
  - `/v1/chat/completions`
  - `/v1/completions`
  - `/v1/embeddings`
Use OpenAI API¶
- Get available models (a `curl` sketch is shown after the chat example below)
- Chat with a model:
$ curl http://localhost:1234/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "llama-3.3-70b-instruct",
    "messages": [
      { "role": "user", "content": "why is the sky blue" }
    ]
  }'
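Listing the available models works the same way; a minimal sketch, assuming the server is running on the default port 1234 shown above:

$ curl http://localhost:1234/v1/models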
OpenAI Python SDK¶
from openai import OpenAI

# LM Studio ignores the API key, but the SDK requires a non-empty value
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

model_list = client.models.list()
print(model_list)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "why is the sky blue?"}],
)
print(response)
Command line tools¶
$ ~/.lmstudio/bin/lms status
┌ Status ───────────────────────────────────┐
│                                           │
│   Server:  ON  (Port: 1234)               │
│                                           │
│   Loaded Models                           │
│     · llama-3.3-70b-instruct - 42.52 GB   │
│                                           │
└───────────────────────────────────────────┘
Command line tools (continued)¶
$ ~/.lmstudio/bin/lms ps
LOADED MODELS
Identifier: llama-3.3-70b-instruct
• Type: LLM
• Path: lmstudio-community/Llama-3.3-70B-Instruct-GGUF/Llama-3.3-70B-Instruct-Q4_K_M.gguf
• Size: 42.52 GB
• Architecture: Llama
More options: `~/.lmstudio/bin/lms --help`
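Not shown in the original slides: the same CLI can also start the server and load a model without the GUI. A minimal sketch, assuming the `lms server start`, `lms ls`, and `lms load` subcommands of the bundled CLI:

$ ~/.lmstudio/bin/lms server start
$ ~/.lmstudio/bin/lms ls                           # list downloaded models
$ ~/.lmstudio/bin/lms load llama-3.3-70b-instruct  # load a model by identifier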
Advanced settings¶

Exercise¶
- Launch your own LM Studio on a compute node (a jobscript sketch follows after this list)
- Use `curl` to get a response
- Use the OpenAI Python SDK to get a response
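One possible starting point for the first task, assuming Slurm batch submission on Alvis and LM Studio installed under `~/.lmstudio` as in the earlier slides. The project ID and GPU request are placeholders you must replace with your own:

#!/bin/bash
#SBATCH -A NAISS2024-X-YY        # placeholder project ID - use your own
#SBATCH --gpus-per-node=A40:1    # placeholder GPU request - adjust to your allocation
#SBATCH -t 00:30:00

# Start the headless server and load an already-downloaded model
~/.lmstudio/bin/lms server start
~/.lmstudio/bin/lms load llama-3.3-70b-instruct

# Query the server with curl, as in the earlier example
curl http://localhost:1234/v1/models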
vLLM¶
OpenAI-Compatible API Server¶
- Launch server: `vllm serve unsloth/Llama-3.2-1B-Instruct`
  - default URL: `http://localhost:8000`
  - set host: `--host <host>`
  - set port: `--port <port>`
  - more arguments: `vllm serve --help`

Once a server is launched, in another terminal:
- Chat: `vllm chat`
- Completion: `vllm complete`
- Benchmark: `vllm bench`
Note:
- Using `unsloth/Llama-3.2-1B-Instruct` will download the model from Hugging Face to your `HF_HOME` directory.
- To load a local model, you can pass the snapshot path, e.g.
`/..../models--unsloth--Llama-3.2-1B-Instruct/snapshots/d2b9e3...../`
Endpoints¶
- Some of the endpoints:
  - `/v1/models`
  - `/v1/responses`
  - `/v1/responses/{response_id}`
  - `/v1/chat/completions`
  - `/v1/completions`
  - ...
  - `/openapi.json`
  - `/docs`
  - `/health`
  - ...
- Similarly, get available models:
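For example, against the default URL of the server launched above:

$ curl http://localhost:8000/v1/models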
Offline inference (LLM class)¶
`LLM` Python class
- Arguments are similar to the ones used in `vllm serve`, except for some missing features like pipeline parallelism.
LLM class methods¶
- Chat
from vllm import LLM, SamplingParams

# Sampling settings for generation
sampling_params = SamplingParams(
    temperature=0.6,
    max_tokens=128,
    top_p=0.9,
)

messages = [
    {"role": "user", "content": "Why is the sky blue?"},
]

# tensor_parallel_size=4 splits the model across 4 GPUs
llm = LLM(
    model="unsloth/Llama-3.2-1B-Instruct",
    tensor_parallel_size=4,
)

output = llm.chat(messages, sampling_params, use_tqdm=False)
print(output[0].outputs[0].text)
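Not in the original slides, but the `LLM` class also has a `generate` method for plain text completion; a minimal sketch reusing the `llm` and `sampling_params` objects from the block above:

# Reuses `llm` and `sampling_params` from the previous snippet
prompts = ["The sky appears blue because"]
outputs = llm.generate(prompts, sampling_params, use_tqdm=False)
print(outputs[0].outputs[0].text)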
Exercise¶
- Launch your own vLLM server on a compute node
- Write a jobscript to launch vLLM (a sketch follows after this list)
- Use `curl` in your jobscript to get a response from the vLLM server
- Write another Python file and use the OpenAI Python SDK to get a response
- Use the `LLM` class to load a model and generate some output
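One possible shape for the jobscript, as a sketch rather than a complete solution: the Slurm header values are placeholders, and it assumes a Python environment with vLLM is already available on the compute node. The `/health` endpoint from the earlier slide is used to wait for startup:

#!/bin/bash
#SBATCH -A NAISS2024-X-YY         # placeholder project ID - use your own
#SBATCH --gpus-per-node=A40:1     # placeholder GPU request - adjust to your allocation
#SBATCH -t 00:30:00

# Launch the server in the background and remember its PID
vllm serve unsloth/Llama-3.2-1B-Instruct --port 8000 &
SERVER_PID=$!

# Wait until the /health endpoint responds before sending requests
until curl -sf http://localhost:8000/health > /dev/null; do sleep 10; done

# Ask for a chat completion
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "unsloth/Llama-3.2-1B-Instruct",
       "messages": [{"role": "user", "content": "why is the sky blue?"}]}'

# Stop the server so the job can finish
kill $SERVER_PID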
Huggingface Transformers¶
OpenAI-Compatible API Server¶
- Launch server: `transformers serve` (a model is not selected yet)
- Endpoints:
  - `/v1/chat/completions`
  - `/v1/responses`
  - `/v1/audio/transcriptions`
  - `/v1/models`
- Chat in another terminal: `transformers chat --model-name-or-path openai/gpt-oss-20b`
- More information: see the Transformers documentation on serving
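Not in the original slides: this server can be queried the same way as the others. The base URL below is an assumption (the address and port are printed when `transformers serve` starts), so adjust it if yours differs:

$ curl http://localhost:8000/v1/models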
Lower level operation¶
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain what MXFP4 quantization is."},
]
# Build the prompt with the model's chat template and move it to the model's device
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True,
).to(model.device)

outputs = model.generate(
    **inputs, max_new_tokens=200, temperature=0.7,
    do_sample=True,  # sampling must be enabled for temperature to take effect
)
print(tokenizer.decode(outputs[0]))
Exercise¶
- Use the `AutoModelForCausalLM` and `AutoTokenizer` classes to load a model and generate some output