# Brief introduction to publicly available LLMs
Learning outcomes
- To understand the different categories of openness that LLMs come in
- To know which metrics to look at for your particular use case
## Arena
Open-weight models are steadily catching up with closed-source models[^1]. However, creating high-quality benchmarks remains an active area of research, as the existing ones are beginning to plateau.
## Categories
- LLMs come in a wide range of "openness".
- Public != Open.
- “Publicly available” means the model checkpoints are publicly accessible (license terms may still apply), while “closed source” means they are not.
| Category (ordered by openness) | Weights available? | Inference | Fine‑tuning | Redistribute weights / derivatives | Typical license | Examples |
|---|---|---|---|---|---|---|
| Open Source (OSI‑compatible) | ✅ Full | ✅ | ✅ | ✅ | Apache‑2.0 / MIT | Mistral 7B; OLMo 2 |
| Open Weights (restricted / gated) | ✅ Full | ✅ | ⚠️ License‑bound (e.g., research‑only / carve‑outs) | ⚠️ Only under the same custom terms | Custom terms (Llama / Gemma / RAIL) | Llama 3 (Meta Llama 3 Community License); Gemma 2 (Gemma Terms of Use); BLOOM (OpenRAIL) |
| Adapter‑only / Delta releases | ⚠️ Partial (adapters/deltas) | ✅ (after applying) | ✅ (adapters) | ✅ Adapters (base license applies) | Mixed | LoRA adapters over a base model |
| Proprietary API + fine‑tuning (FT) | ❌ | ⚠️ API-only | ⚠️ API‑only (no weights export) | ❌ | Vendor ToS | OpenAI (GPT‑4.1, o4‑mini FT/RFT); Cohere (Command R/R+ FT); Anthropic (Claude 3 Haiku FT via Bedrock) |
| Proprietary API‑only | ❌ | ⚠️ API-only | ❌ | ❌ | Vendor ToS | Google Gemini API |
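To make the first three rows concrete, here is a minimal sketch of what "weights available" buys you in practice, assuming the Hugging Face `transformers` and `peft` libraries. The base model id is just an illustrative open-weight checkpoint, and the adapter repository is hypothetical; gated models (e.g., Llama, Gemma) additionally require accepting the license on the Hub and authenticating before download.

```python
# Minimal sketch: open / open-weight checkpoints can be downloaded and run locally.
# Gated models additionally require accepting the vendor's terms and `huggingface-cli login`.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # illustrative Apache-2.0 open-weight model

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "In one sentence, what does 'open weights' mean?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Adapter-only / delta releases ship only the fine-tuned LoRA weights; you apply them
# on top of the base checkpoint yourself (the base model's license still governs the result).
from peft import PeftModel

adapter_id = "some-org/some-lora-adapter"  # hypothetical adapter repository
model = PeftModel.from_pretrained(model, adapter_id)
```

The API-only offerings in the last two rows sit at the other end: you can send requests and (sometimes) fine-tune through the vendor's service, but you never obtain the weights.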
## Leaderboard
Other notable leaderboards:
- HELM (Holistic Evaluation of Language Models, by Stanford)
- LMArena (crowdsourced head-to-head model comparisons ranked by human preference, from UC Berkeley / LMSYS)
## Benchmarks to consider
Focus on a small set of comparable metrics (most appear on the Open LLM Leaderboard or model cards):
Core capability benchmarks (higher is better unless noted; a short filtering sketch follows this list):
- MMLU-Pro[^2]: general academic/world knowledge
- GPQA[^3]: graduate-level Q&A written by domain experts (PhD-level)
- MuSR[^4]: multistep soft reasoning over long natural-language narratives
- MATH[^5]: high-school competition math problems
- IFEval[^7]: tests the ability to strictly follow verifiable instructions
- BBH[^6]: challenging reasoning & commonsense tasks (BIG-Bench Hard)
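If you pull these scores into a table (the file name and column names below are assumptions about a hand-made or exported CSV, not any leaderboard's fixed schema), a few lines of pandas are enough to rank models on exactly the metrics you care about:

```python
# Sketch: rank models on a chosen subset of benchmark scores.
# "leaderboard_export.csv" and its column names are assumptions; adapt to your export.
import pandas as pd

metrics = ["MMLU-Pro", "GPQA", "MuSR", "MATH", "IFEval", "BBH"]

df = pd.read_csv("leaderboard_export.csv")       # one row per model, one column per benchmark
df["mean_score"] = df[metrics].mean(axis=1)      # simple unweighted average across the metrics
ranked = df.sort_values("mean_score", ascending=False)
print(ranked[["model", *metrics, "mean_score"]].head(10).to_string(index=False))
```

For a real use case you would likely weight the metrics (e.g., IFEval heavily for agent-style workloads) rather than take a flat average.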
| Category | Benchmarks (examples) | Orgs with open weights that report them |
|---|---|---|
| General academic / world knowledge | MMLU, MMLU-Pro, CMMLU | Meta (LLaMA), Mistral, Cohere, DeepSeek |
| Domain expert level | GPQA, CEval, CMMLU | Meta (LLaMA papers mention expert subsets), Cohere (Command evals), DeepSeek (reports CEval/CMMLU/GPQA) |
| Reasoning with long context | MuSR, LongBench / long-context evals | Mistral (Mixtral with long context, reported evals), DeepSeek (long-context benchmarks in tech report) |
| High-school competition / advanced math | GSM8K, MATH, AIME | Meta (MATH, GSM8K), Mistral (GSM8K, MATH), Cohere (GSM8K), DeepSeek (MATH, GSM8K, AIME) |
| Instruction following / alignment | IFEval, instruction eval suites | Meta (instruction-tuned LLaMA), Cohere (Command-R+ evals), DeepSeek (instruction following evals) |
| Reasoning & commonsense | BBH, HellaSwag, Winogrande, PiQA, ARC, DROP | Meta (HellaSwag, BBH), Mistral (HellaSwag, Winogrande), Cohere (commonsense evals), DeepSeek (HellaSwag, BBH, PiQA, Winogrande, ARC, DROP) |
| Code completion & debugging | HumanEval, MBPP, LeetCode, Codeforces | Meta (HumanEval), Mistral (HumanEval, MBPP), Cohere (HumanEval, MBPP), DeepSeek (HumanEval, MBPP, LeetCode) |
Note: Multilingual and multimodal benchmarks are not covered here in detail.
Detailed benchmark coverage per open-weight model provider (a short sketch for querying this matrix follows the table):
| Benchmark | Meta (LLaMA) | Mistral | Cohere (Command-R+) | DeepSeek |
|---|---|---|---|---|
| MMLU / MMLU-Pro / CMMLU | ✅ | ✅ | ✅ | ✅ |
| GPQA / CEval (expert Q&A) | ⚪ (GPQA subsets in papers) | ⚪ (less common) | ✅ | ✅ |
| MuSR / LongBench / long-context evals | ⚪ (not main focus, context ≤32k) | ✅ (Mixtral-8x22B long context) | ⚪ | ✅ |
| GSM8K (math word problems) | ✅ | ✅ | ✅ | ✅ |
| MATH (competition-level) | ✅ | ✅ | ⚪ | ✅ |
| AIME (advanced math) | ⚪ | ⚪ | ⚪ | ✅ |
| IFEval / Instruction evals | ✅ | ⚪ | ✅ | ✅ |
| BBH (BigBench Hard) | ✅ | ⚪ | ⚪ | ✅ |
| HellaSwag | ✅ | ✅ | ⚪ | ✅ |
| Winogrande | ⚪ | ✅ | ⚪ | ✅ |
| PiQA | ⚪ | ⚪ | ⚪ | ✅ |
| ARC (AI2 Reasoning Challenge) | ⚪ | ⚪ | ⚪ | ✅ |
| DROP (reading comp / commonsense) | ⚪ | ⚪ | ⚪ | ✅ |
| HumanEval (code completion) | ✅ | ✅ | ✅ | ✅ |
| MBPP (Python problems) | ⚪ | ✅ | ✅ | ✅ |
| LeetCode / Codeforces evals | ⚪ | ⚪ | ⚪ | ✅ |
✅ = reported officially in model card / tech report / benchmarks page
⚪ = not a primary benchmark for that org (either not reported or only mentioned indirectly)
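The same matrix can be kept as plain data, so you can quickly check which providers officially report the benchmarks your use case depends on. A small sketch (coverage values transcribed from the table above, with ✅ → True and ⚪ → False; only a subset of rows is shown):

```python
# Which providers report all of the benchmarks we care about?
# Values are transcribed from the coverage table above (True = reported officially).
coverage = {
    "Meta (LLaMA)":        {"MMLU": True, "GPQA": False, "MuSR": False, "GSM8K": True,
                            "MATH": True, "IFEval": True, "BBH": True, "HumanEval": True},
    "Mistral":             {"MMLU": True, "GPQA": False, "MuSR": True, "GSM8K": True,
                            "MATH": True, "IFEval": False, "BBH": False, "HumanEval": True},
    "Cohere (Command-R+)": {"MMLU": True, "GPQA": True, "MuSR": False, "GSM8K": True,
                            "MATH": False, "IFEval": True, "BBH": False, "HumanEval": True},
    "DeepSeek":            {"MMLU": True, "GPQA": True, "MuSR": True, "GSM8K": True,
                            "MATH": True, "IFEval": True, "BBH": True, "HumanEval": True},
}

needed = {"MATH", "IFEval", "HumanEval"}  # benchmarks your use case depends on

for provider, reported in coverage.items():
    if all(reported.get(bench, False) for bench in needed):
        print(f"{provider} reports all of: {sorted(needed)}")
```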
So which models should you look for, given that this research spans a couple of years?

- Keep an eye on Chinese model makers: this past year has largely been theirs.
[^1]: The path forward for large language models in medicine is open. Nature.
[^2]: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark. arXiv.
[^3]: GPQA: A Graduate-Level Google-Proof Q&A Benchmark. arXiv.
[^4]: MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning. arXiv.
[^5]: MATH: Measuring Mathematical Problem Solving With the MATH Dataset. arXiv.
[^6]: BBH: Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv.
[^7]: IFEval: Instruction-Following Evaluation for Large Language Models. arXiv.