Post-training
- This phase consists of fine-tuning and/or RL.
- Do you even need to post-train your model? 🤔
    - When simple prompting and instruction following fail.
    - If your knowledge base keeps changing, query it with RAG instead.
    - If heavy domain knowledge is needed and the base model does not perform well, continue pre-training to inject more knowledge.
    - BUT if you want the model to follow almost all of your instructions tightly and to have improved targeted capabilities and reasoning, use fine-tuning/post-training.
Supervised Fine-tuning (SFT) 📖
- Imitation learning based on instruction prompts and responses
- Use cases:
    - pre-trained -> instruction-following model
    - non-reasoning -> reasoning model
    - non-tool-usage -> tool-usage model
    - transfer specific capabilities from a larger model to a smaller model
- Data curation:
    - Quality >> quantity: 1k high-quality samples > 1M mixed/low-quality samples. ⭐
    - Common methods:
        - Distillation: generate responses from a stronger/larger model.
        - Best-of-K / rejection sampling: generate multiple responses from the model and select the best among them.
        - Filtering: start from a large-scale dataset and filter by response quality and prompt diversity.
- Full fine-tune or PEFT? ⚖️
    - Both can be used. Full FT takes a lot of memory and is slower. PEFT (LoRA) saves a lot of memory, but learns less and forgets less.(1) See the sketch below.
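A minimal sketch of SFT with LoRA using TRL's `SFTTrainer` and PEFT's `LoraConfig`, assuming recent versions of both libraries; the model name, dataset, and hyperparameters are placeholders rather than recommendations, and exact argument names can differ slightly between TRL releases:

```python
# Minimal SFT + LoRA sketch (placeholder model/dataset; tune hyperparameters for your task).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Small chat-style instruction dataset with a "messages" column; swap in your own data.
dataset = load_dataset("trl-lib/Capybara", split="train")

# LoRA adapter: train a small number of extra parameters instead of the full model.
peft_config = LoraConfig(
    r=16,                          # adapter rank
    lora_alpha=32,                 # scaling factor
    lora_dropout=0.05,
    target_modules="all-linear",   # attach adapters to every linear layer
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="qwen3-sft-lora",         # placeholder output path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=2e-4,                  # LoRA usually tolerates a higher LR than full FT
    num_train_epochs=1,
    max_length=2048,                     # truncation length; see the max_length calculator below
    bf16=True,
)

trainer = SFTTrainer(
    model="Qwen/Qwen3-0.6B",             # placeholder base model
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,             # drop this line to do full fine-tuning instead
)
trainer.train()
```

Dropping `peft_config` turns the same script into a full fine-tune, at a much larger memory and time cost (see the memory note further below).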
Exercise
- For this and subsequent exercises we will use the TRL and PEFT libraries.
- Copy `llm-workshop/containers/post_train/post_train_env.sh` to your `~/portal/jupyter/`. Create the `~/portal/jupyter` dir if you don't have one already.
- Run a Jupyter notebook using the `post_train_env` environment, with your personal project directory as the working directory.
- Run `sft.ipynb`.
- Handy tools: max_length calculator. DeepSpeed memory calculator API.
DIY Exercise: gpt-oss PEFT, Qwen3-14B PEFT
Preference Optimization 🤝
- Contrastive learning based on positive and negative responses.
- With SFT, an LLM only reproduces patterns it learned from the data it was trained on.
- An LLM has more potential to learn if it is shown both good and bad examples of responses.
- This encourages the model to produce more "preferred" responses and discourages it from producing the other kind.
- Use cases:
    - Give a persona/identity.
    - Give safer responses.
    - Improve multilingual responses.
    - Improve instruction following.
- Data curation:
    - We need even less data than for SFT, since the model already follows our instructions nicely and has already gained domain knowledge.
    - We can leverage LLMs to generate strong/weak response pairs: a better model -> strong responses, a weaker/baseline model -> weak responses.
    - Alternatively, run a single LLM on the same prompt to produce strong/weak response pairs and use another "grader" LLM to score these outputs. (A minimal DPO training sketch follows below.)
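A minimal DPO sketch with TRL's `DPOTrainer`, assuming a preference dataset with `prompt`/`chosen`/`rejected` columns; the model and dataset names are placeholders, and the usual starting point for `model` is the SFT checkpoint from the previous step:

```python
# Minimal DPO sketch (placeholder model/dataset; argument names may vary across TRL versions).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen3-0.6B"                        # placeholder: usually your SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference data: each row pairs a prompt with a "chosen" (strong) and a "rejected" (weak) response.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

training_args = DPOConfig(
    output_dir="qwen3-dpo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-7,     # preference optimization typically uses a much smaller LR than SFT
    beta=0.1,               # strength of the penalty keeping the policy close to the reference model
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,   # called "tokenizer" in older TRL releases
)
trainer.train()
```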
Exercise
- Run a Jupyter notebook using the `post_train_env` environment, with your personal project directory as the working directory.
- Run `dpo.ipynb`.
Reinforcement Learning (RL) 🎮
- "A is better than B" is not always what we are looking for. We would like to have in-between steps to be correct too for problems that requires thinking longer.
- LLM can be given an environment that could include code unit tests, math verfiers, humans as judges or code executors in real-time as LLM is thinking.
- Reward signals can help guide LLMs to generate better code, solve math problems or plan in multiple steps. Caveat being reward models are difficult to create and LLMs are harder to stabilise and costly to train. Reward hacking is a challenging to overcome.
- Use cases:
    - When we can create verifiable reward signals (see the sketch below).
    - When tasks are multi-step.
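As an illustration of a verifiable reward, here is a rough GRPO sketch with TRL on GSM8K-style math data; the model name, hyperparameters, and the answer-parsing logic are assumptions for the example, not a recipe:

```python
# Minimal RL-with-verifiable-reward sketch using TRL's GRPO trainer.
import re

from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# GSM8K has "question" and "answer" columns; GRPOTrainer expects a "prompt" column.
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.rename_column("question", "prompt")

def correctness_reward(completions, answer, **kwargs):
    """Verifiable reward: 1.0 if the last number in the completion matches the reference answer."""
    rewards = []
    for completion, ref in zip(completions, answer):
        target = ref.split("####")[-1].strip().replace(",", "")   # GSM8K puts the final answer after "####"
        numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
        rewards.append(1.0 if numbers and numbers[-1] == target else 0.0)
    return rewards

training_args = GRPOConfig(
    output_dir="qwen3-grpo",          # placeholder output path
    num_generations=8,                # sample several completions per prompt; their reward spread drives the update
    per_device_train_batch_size=8,
    max_completion_length=512,
    bf16=True,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",          # placeholder model
    reward_funcs=correctness_reward,  # extra dataset columns (here "answer") are passed to the reward function
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```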
Exercise
- Run a Jupyter notebook using the `post_train_env` environment, with your personal project directory as the working directory.
- Run `rl.ipynb`.
Note on On-policy distillation
Methods like this lie in between preference optimization and RL, taking advantage of a reward signal coming from a teacher model. The student model can continuously absorb samples graded by the teacher and, without needing explicit preference labels, start to show capabilities of the teacher. This method is often simpler to implement than RL and is much cheaper in compute, too. So far, though, its success has mainly been shown for small to mid-sized (<30B) models, whereas RL tends to work better only for larger models (20B+).
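A rough, conceptual sketch of the per-token signal behind on-policy distillation (not a full training loop), assuming the student and teacher share a tokenizer; the model names are placeholders:

```python
# Conceptual sketch of the on-policy distillation signal, not a complete trainer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_id, teacher_id = "Qwen/Qwen3-0.6B", "Qwen/Qwen3-8B"   # placeholder model names
tok = AutoTokenizer.from_pretrained(student_id)
student = AutoModelForCausalLM.from_pretrained(student_id)
teacher = AutoModelForCausalLM.from_pretrained(teacher_id)

# 1) On-policy: the *student* generates the tokens it will be graded on.
prompt = tok("Explain LoRA in one sentence.", return_tensors="pt")
with torch.no_grad():
    sampled = student.generate(**prompt, max_new_tokens=64, do_sample=True)

def token_logprobs(model, input_ids):
    """Log-probability the model assigns to each token in input_ids (shifted by one position)."""
    with torch.no_grad():
        logits = model(input_ids).logits[:, :-1]
    return torch.log_softmax(logits, dim=-1).gather(-1, input_ids[:, 1:, None]).squeeze(-1)

# 2) Per-token signal: teacher log-prob minus student log-prob on the student's own tokens
#    (prompt tokens included here for simplicity). Its negated mean is a single-sample estimate
#    of the reverse KL(student || teacher) the student is trained to minimize; no preference
#    labels are needed.
advantage = token_logprobs(teacher, sampled) - token_logprobs(student, sampled)
print(advantage.mean())
```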
Note on Memory considerations
- In the case of full fine-tuning, a rule of thumb is ~16 GB of GPU memory per 1B parameters in the model (see the quick check below).
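A quick back-of-the-envelope check of where that figure comes from, assuming bf16 weights and gradients plus fp32 master weights and Adam moments (activations, KV cache, and framework overhead come on top):

```python
# Rough accounting behind the "~16 GB per 1B params" rule for full fine-tuning.
def full_ft_memory_gb(params_billion: float) -> float:
    bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 weights, bf16 grads, fp32 copy, Adam m, Adam v
    return params_billion * bytes_per_param  # 1e9 params and 1e9 bytes-per-GB cancel out

print(full_ft_memory_gb(1))   # ~16 GB per 1B params
print(full_ft_memory_gb(7))   # ~112 GB for a 7B model, before activations
```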
Resources 📚
- Frameworks for post-training(2):
| Framework | SFT | PO | RL | Multi-modal | FullFT | LoRA | Distributed |
|---|---|---|---|---|---|---|---|
| TRL | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Axolotl | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| OpenInstruct | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
| Unsloth | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| vERL | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Prime RL | ✅ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ |
| PipelineRL | ❌ | ❌ | ✅ | ❌ | ✅ | ✅ | ✅ |
| ART | ❌ | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ |
| TorchForge | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ | ✅ |
| NemoRL | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | ✅ |
| OpenRLHF | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | ✅ |
- Optimizing finetuning on GPU: transformers (GPU)