# Brief introduction to publicly available LLMs
Learning outcomes
- To understand the different categories of openness that LLMs come in
- To know which metrics to look at for your particular use case
## Arena
Open-weight models are steadily catching up with closed-source models[^1]. However, creating high-quality benchmarks remains an active area of research, as the existing ones are beginning to plateau.
## Categories
- LLMs come in a wide range of "openness".
- Public != Open.
- “Publicly available” means the model checkpoints are publicly accessible (license terms may still apply), while “closed source” means they are not.
| Category (ordered by openness) | Weights available? | Inference | Fine‑tuning | Redistribute weights / derivatives | Typical license | Examples |
|---|---|---|---|---|---|---|
| Open Source (OSI‑compatible) | ✅ Full | ✅ | ✅ | ✅ | Apache‑2.0 / MIT | Mistral 7B; OLMo 2 |
| Open Weights (restricted / gated) | ✅ Full | ✅ | ⚠️ License‑bound (e.g., research‑only / carve‑outs) | ⚠️ Only under the same custom terms | Custom terms (Llama / Gemma / RAIL) | Llama 3 (Meta Llama 3 Community License); Gemma 2 (Gemma Terms of Use); BLOOM (OpenRAIL) |
| Adapter‑only / Delta releases | ⚠️ Partial (adapters/deltas) | ✅ (after applying) | ✅ (adapters) | ✅ Adapters (base license applies) | Mixed | LoRA adapters over a base model |
| Proprietary API + fine‑tuning (FT) | ❌ | ⚠️ API-only | ⚠️ API‑only (no weights export) | ❌ | Vendor ToS | OpenAI (GPT‑4.1, o4‑mini FT/RFT); Cohere (Command R/R+ FT); Anthropic (Claude 3 Haiku FT via Bedrock) |
| Proprietary API‑only | ❌ | ⚠️ API-only | ❌ | ❌ | Vendor ToS | Google Gemini API |
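To make the first three rows concrete, here is a minimal sketch of what "weights available" buys you in practice, assuming the Hugging Face `transformers` and `peft` libraries. The base model id is just an illustrative open-weight checkpoint, and the adapter repository is hypothetical; gated models (e.g., Llama, Gemma) additionally require accepting the license on the Hub and authenticating before download.

```python
# Minimal sketch: open / open-weight checkpoints can be downloaded and run locally.
# Gated models additionally require accepting the vendor's terms and `huggingface-cli login`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # illustrative Apache-2.0 open-weight model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "In one sentence, what does 'open weights' mean?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Adapter-only / delta releases ship only the fine-tuned LoRA weights; you apply them
# on top of the base checkpoint yourself (the base model's license still governs the result).
from peft import PeftModel

adapter_id = "some-org/some-lora-adapter"  # hypothetical adapter repository
model = PeftModel.from_pretrained(model, adapter_id)
```

The API-only offerings in the last two rows sit at the other end: you can send requests and (sometimes) fine-tune through the vendor's service, but you never obtain the weights.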
## Leaderboard
Other notable leaderboards:
- HELM (Holistic Evaluation of Language Models, by Stanford)
- LMArena (crowdsourced head-to-head model comparisons ranked by human preference, from UC Berkeley / LMSYS)
## Benchmarks to consider
Focus on a small set of comparable metrics (most appear on the Open LLM Leaderboard or model cards):
Core capability benchmarks (higher is better unless noted; a short filtering sketch follows this list):
- MMLU-Pro[^2]: general academic/world knowledge
- GPQA[^3]: graduate-level Q&A written by domain experts (PhD-level)
- MuSR[^4]: multistep soft reasoning over long natural-language narratives
- MATH[^5]: high-school competition math problems
- IFEval[^7]: tests the ability to strictly follow verifiable instructions
- BBH[^6]: challenging reasoning & commonsense tasks (BIG-Bench Hard)
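If you pull these scores into a table (the file name and column names below are assumptions about a hand-made or exported CSV, not any leaderboard's fixed schema), a few lines of pandas are enough to rank models on exactly the metrics you care about:

```python
# Sketch: rank models on a chosen subset of benchmark scores.
# "leaderboard_export.csv" and its column names are assumptions; adapt to your export.
import pandas as pd

metrics = ["MMLU-Pro", "GPQA", "MuSR", "MATH", "IFEval", "BBH"]

df = pd.read_csv("leaderboard_export.csv")       # one row per model, one column per benchmark
df["mean_score"] = df[metrics].mean(axis=1)      # simple unweighted average across the metrics
ranked = df.sort_values("mean_score", ascending=False)
print(ranked[["model", *metrics, "mean_score"]].head(10).to_string(index=False))
```

For a real use case you would likely weight the metrics (e.g., IFEval heavily for agent-style workloads) rather than take a flat average.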
| Category | Benchmarks (examples) | Orgs with open weights that report them |
|---|---|---|
| General academic / world knowledge | MMLU, MMLU-Pro, CMMLU | Meta (LLaMA), Mistral, Cohere, DeepSeek |
| Domain expert level | GPQA, CEval, CMMLU | Meta (LLaMA papers mention expert subsets), Cohere (Command evals), DeepSeek (reports CEval/CMMLU/GPQA) |
| Reasoning with long context | MuSR, LongBench / long-context evals | Mistral (Mixtral with long context, reported evals), DeepSeek (long-context benchmarks in tech report) |
| High-school competition / advanced math | GSM8K, MATH, AIME | Meta (MATH, GSM8K), Mistral (GSM8K, MATH), Cohere (GSM8K), DeepSeek (MATH, GSM8K, AIME) |
| Instruction following / alignment | IFEval, instruction eval suites | Meta (instruction-tuned LLaMA), Cohere (Command-R+ evals), DeepSeek (instruction following evals) |
| Reasoning & commonsense | BBH, HellaSwag, Winogrande, PiQA, ARC, DROP | Meta (HellaSwag, BBH), Mistral (HellaSwag, Winogrande), Cohere (commonsense evals), DeepSeek (HellaSwag, BBH, PiQA, Winogrande, ARC, DROP) |
| Code completion & debugging | HumanEval, MBPP, LeetCode, Codeforces | Meta (HumanEval), Mistral (HumanEval, MBPP), Cohere (HumanEval, MBPP), DeepSeek (HumanEval, MBPP, LeetCode) |
Note: Multilingual and multimodal benchmarks are not covered here in detail.
Detailed benchmark coverage per open-weight model provider (a short sketch for querying this matrix follows the table):
| Benchmark | Meta (LLaMA) | Mistral | Cohere (Command-R+) | DeepSeek |
|---|---|---|---|---|
| MMLU / MMLU-Pro / CMMLU | ✅ | ✅ | ✅ | ✅ |
| GPQA / CEval (expert Q&A) | ⚪ (GPQA subsets in papers) | ⚪ (less common) | ✅ | ✅ |
| MuSR / LongBench / long-context evals | ⚪ (not main focus, context ≤32k) | ✅ (Mixtral-8x22B long context) | ⚪ | ✅ |
| GSM8K (math word problems) | ✅ | ✅ | ✅ | ✅ |
| MATH (competition-level) | ✅ | ✅ | ⚪ | ✅ |
| AIME (advanced math) | ⚪ | ⚪ | ⚪ | ✅ |
| IFEval / Instruction evals | ✅ | ⚪ | ✅ | ✅ |
| BBH (BigBench Hard) | ✅ | ⚪ | ⚪ | ✅ |
| HellaSwag | ✅ | ✅ | ⚪ | ✅ |
| Winogrande | ⚪ | ✅ | ⚪ | ✅ |
| PiQA | ⚪ | ⚪ | ⚪ | ✅ |
| ARC (AI2 Reasoning Challenge) | ⚪ | ⚪ | ⚪ | ✅ |
| DROP (reading comp / commonsense) | ⚪ | ⚪ | ⚪ | ✅ |
| HumanEval (code completion) | ✅ | ✅ | ✅ | ✅ |
| MBPP (Python problems) | ⚪ | ✅ | ✅ | ✅ |
| LeetCode / Codeforces evals | ⚪ | ⚪ | ⚪ | ✅ |
✅ = reported officially in model card / tech report / benchmarks page
⚪ = not a primary benchmark for that org (either not reported or only mentioned indirectly)
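The same matrix can be kept as plain data, so you can quickly check which providers officially report the benchmarks your use case depends on. A small sketch (coverage values transcribed from the table above, with ✅ → True and ⚪ → False; only a subset of rows is shown):

```python
# Which providers report all of the benchmarks we care about?
# Values are transcribed from the coverage table above (True = reported officially).
coverage = {
    "Meta (LLaMA)":        {"MMLU": True, "GPQA": False, "MuSR": False, "GSM8K": True,
                            "MATH": True, "IFEval": True, "BBH": True, "HumanEval": True},
    "Mistral":             {"MMLU": True, "GPQA": False, "MuSR": True, "GSM8K": True,
                            "MATH": True, "IFEval": False, "BBH": False, "HumanEval": True},
    "Cohere (Command-R+)": {"MMLU": True, "GPQA": True, "MuSR": False, "GSM8K": True,
                            "MATH": False, "IFEval": True, "BBH": False, "HumanEval": True},
    "DeepSeek":            {"MMLU": True, "GPQA": True, "MuSR": True, "GSM8K": True,
                            "MATH": True, "IFEval": True, "BBH": True, "HumanEval": True},
}

needed = {"MATH", "IFEval", "HumanEval"}  # benchmarks your use case depends on

for provider, reported in coverage.items():
    if all(reported.get(bench, False) for bench in needed):
        print(f"{provider} reports all of: {sorted(needed)}")
```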
So which models should you look for, given that this research spans a couple of years?

- Keep an eye on Chinese model makers: this past year has largely been theirs.
[^1]: The path forward for large language models in medicine is open. Nature.
[^2]: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv.
[^3]: GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv.
[^4]: MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning. arXiv.
[^5]: MATH: Measuring Mathematical Problem Solving With the MATH Dataset. arXiv.
[^6]: BBH: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv.
[^7]: IFEval: Instruction-Following Evaluation for Large Language Models. arXiv.