Evaluating LLMs
Motivation
- Tracking progress
- Reproducibility
- Helping make informed decisions
Caveats
- Apples-to-apples comparisons
  - Compare against models of similar size and datatype precision
- Benchmark quality varies
  - e.g. MMLU is known to contain erroneous questions
- Benchmark gaming
  - Optimizing too heavily for benchmarks can have side effects
Evaluation
- General benchmarks
- AI-based evaluations
- Task-specific benchmarks
Collections
- https://epoch.ai/benchmarks
- https://docs.nvidia.com/nemo/evaluator/latest/evaluation/benchmarks.html#benchmark-categories
- https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about
General benchmarks
AI-based evaluations
- LLM-as-a-judge
- Human-evaluate a subset of the outputs to validate the evaluator itself
- Make sure the judge prompt specifies the rating scale to be used (see the sketch below)
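A minimal LLM-as-a-judge sketch against an OpenAI-compatible chat endpoint; the URL, model name and prompt contents below are illustrative placeholders, and the system prompt spells out the exact rating scale the judge must use:

curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "judge-model",
    "temperature": 0,
    "messages": [
      {"role": "system",
       "content": "You are a strict grader. Rate the candidate answer for factual correctness against the reference answer on a scale from 1 (completely wrong) to 5 (fully correct). Reply with the number only."},
      {"role": "user",
       "content": "Question: ...\nReference answer: ...\nCandidate answer: ..."}
    ]
  }'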
Task-specific benchmarks
- Tool use: MCP-bench, tau2-Bench
- Coding: SWE-bench, HumanEval, WeirdML
- Math: U-Math, FrontierMath
Constructing your own benchmarks
- Good example: R&D-bench
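To make the idea concrete, here is a toy hand-rolled benchmark: exact-match scoring of prompt/answer pairs from a JSONL file against an OpenAI-compatible completions endpoint. The file name, endpoint URL and model id are illustrative placeholders, not part of the workshop set-up.

# Toy custom benchmark: exact-match accuracy over a JSONL file of
# {"prompt": ..., "answer": ...} records (placeholders: bench.jsonl, endpoint, model)
endpoint=http://localhost:8000/v1/completions
model=my-model
correct=0; total=0
while IFS= read -r line; do
  prompt=$(echo "$line" | jq -r .prompt)
  expected=$(echo "$line" | jq -r .answer | tr -d '[:space:]')
  # Query the completions endpoint with greedy decoding
  reply=$(curl -s "$endpoint" -H "Content-Type: application/json" \
    -d "$(jq -n --arg m "$model" --arg p "$prompt" \
          '{model: $m, prompt: $p, max_tokens: 16, temperature: 0}')" \
    | jq -r '.choices[0].text' | tr -d '[:space:]')
  total=$((total + 1))
  [ "$reply" = "$expected" ] && correct=$((correct + 1))
done < bench.jsonl
echo "Exact match: $correct/$total"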
The NeMo Evaluator collection
- A collection wrapping other evaluation harnesses and specific benchmarks
- Provides a collection of Docker containers
- Pull and convert to an Apptainer image with e.g.
apptainer pull docker://nvcr.io/nvidia/eval-factory/lm-evaluation-harness:25.10
NeMo Evaluator CLI
- `nemo-evaluator ls`: list available benchmarks
- `nemo-evaluator run_eval ...`: run one or more benchmarks
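For example, assuming the pull above produced an image file named lm-evaluation-harness_25.10.sif (the exact filename apptainer chooses may differ), the CLI can be invoked through the container:

# List the benchmarks shipped in the pulled container (filename is an assumption)
apptainer exec lm-evaluation-harness_25.10.sif nemo-evaluator ls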
NeMo Evaluator Python SDK
- Run benchmarks from Python code
- Configure your own benchmark
Exercise
- Try running evals against the vLLM endpoint from the instructor set-up
- Make sure to set `HF_HOME`, as datasets will be downloaded when running
- Launch an interactive job:
srun -A NAISS2025-22-1522 -t 30 -C NOGPU -c 2 --pty bash
- Run port forwarding to expose the port from alvis2
- Run with e.g.
apptainer exec container.sif nemo-evaluator run_eval --run_config config.yaml --eval_type=mmlu
- Config (port may be different):
target:
  api_endpoint:
    url: http://localhost:34253/v1/completions
    model_id: QuantTrio/GLM-4.5-Air-GPTQ-Int4-Int8Mix
    type: completions
Instructor set-up
We are running a vLLM instance on a compute node and forwarding its port to the
alvis2 login node. The relevant port number will be in
/mimer/NOBACKUP/groups/llm-workshop/exercises/day3/tools/_instructor/.vllm_alvis2_port.
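From an alvis2 login shell you can sanity-check that the endpoint is up by asking the vLLM OpenAI-compatible server for its model list (this assumes the port file above is readable to you):

# List models served by the instructor vLLM instance
curl -s http://localhost:$(cat /mimer/NOBACKUP/groups/llm-workshop/exercises/day3/tools/_instructor/.vllm_alvis2_port)/v1/models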
Interactive node set-up
1. Allocate an interactive job:
   srun -A NAISS2025-22-1522 -t 30 -C NOGPU --pty bash
2. Forward the port from alvis2
3. Set up the NeMo Evaluator config file
4. Set `HF_HOME` to `$TMPDIR` (usually this would be project storage)
5. Run parts of a benchmark with NeMo Evaluator
Steps 2 and onwards:
# Prepare local port
my_vllm_port=$(find_ports) # local port
echo my_vllm_port=$my_vllm_port
ssh -L $my_vllm_port:localhost:$(cat /mimer/NOBACKUP/groups/llm-workshop/exercises/day3/tools/_instructor/.vllm_alvis2_port) -fN alvis2 # port forwarding
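# Optional sanity check (assumes the forwarded endpoint is reachable):
# the vLLM OpenAI-compatible server should list the served model here
curl -s http://localhost:${my_vllm_port}/v1/models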
# Create config file
echo "
target:
  api_endpoint:
    url: http://localhost:${my_vllm_port}/v1/completions
    type: completions
    # url: http://localhost:${my_vllm_port}/v1/chat
    # type: chat
    model_id: QuantTrio/GLM-4.5-Air-GPTQ-Int4-Int8Mix
config:
  params:
    parallelism: 1
    request_timeout: 600
" > my_nemo_config.yaml
# Export HF_HOME so datasets are not downloaded into your home directory
# (exported so the variable is visible inside the container environment)
export HF_HOME="$TMPDIR/hf"
# Run partial benchmark
apptainer exec /apps/containers/NeMo/Evaluator/NeMo-Evaluator-LM-Evaluation-Harnesk-NGC-25.10.sif nemo-evaluator run_eval --run_config my_nemo_config.yaml --output_dir="${TMPDIR:-/tmp/}/$USER" --eval_type bbq --override="config.params.limit_samples=10"
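When the (partial) run finishes, results and logs end up under the --output_dir passed above; a quick way to see what was produced (the exact layout depends on the harness):

# Inspect the output directory of the run
ls -R "${TMPDIR:-/tmp/}/$USER"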
Possible complications
- Some LM Harness benchmarks require a tokenizer to be specified. Possible solutions:
  - Install NeMo Evaluator on top of vLLM and launch your model with the NeMo Evaluator launcher
  - Run LM Harness directly with a custom model, passing both the endpoint and the tokenizer (see the sketch below)
- Endpoint down
  - If you're not running this during the workshop session, you will have to launch your own vLLM instance
  - See `exercises/tools/_instructor/launch_GLM.sbatch`
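A hedged sketch of the second option, running lm-evaluation-harness directly against the same endpoint. It assumes `lm_eval` is installed (e.g. via `pip install lm-eval`), that `$my_vllm_port` is the forwarded port from the exercise above, and the task and sample limit are only examples:

# Point lm-evaluation-harness at the OpenAI-compatible completions endpoint,
# supplying the tokenizer explicitly
lm_eval --model local-completions \
  --model_args model=QuantTrio/GLM-4.5-Air-GPTQ-Int4-Int8Mix,base_url=http://localhost:${my_vllm_port}/v1/completions,tokenizer=QuantTrio/GLM-4.5-Air-GPTQ-Int4-Int8Mix,num_concurrent=1 \
  --tasks mmlu \
  --limit 10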