LLM parallelization schemes
Motivation¶
- Parallelization is necessary: large models exceed the memory and compute of a single device;
- Good vs. bad parallelization: the chosen scheme determines how well performance scales.
Strategies¶
Data parallelism (DP)¶

- Trivial to implement;
- Gradient-synchronization overhead during training;
- Memory cost: the full model is replicated on every device.
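A minimal sketch of DP with PyTorch DistributedDataParallel, assuming a toy model, random data, and a launch with torchrun (one process per GPU):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# one process per GPU, e.g. launched with: torchrun --nproc_per_node=4 train_ddp.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = DDP(torch.nn.Linear(1024, 1).cuda(), device_ids=[local_rank])  # full replica per GPU
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# each rank sees a different shard of the (toy) dataset
dataset = TensorDataset(torch.randn(4096, 1024), torch.randn(4096, 1))
loader = DataLoader(dataset, batch_size=64, sampler=DistributedSampler(dataset))

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
    loss.backward()          # gradients are averaged (all-reduced) across all replicas
    optimizer.step()
    optimizer.zero_grad()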
Overlapping in DP¶
- Overlapping communication with computation hides part of the synchronization overhead.
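The idea in its simplest form, sketched with torch.distributed (DDP automates this with gradient buckets); the tensors below are arbitrary stand-ins:

import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

grads = torch.randn(10_000_000, device="cuda")   # stand-in for already-computed gradients
work = dist.all_reduce(grads, async_op=True)     # start communication, do not block

x = torch.randn(4096, 4096, device="cuda")       # stand-in for the remaining backward compute,
y = x @ x                                        # which overlaps with the all-reduce in flight

work.wait()                                      # block only when the reduced gradients are needed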
Sharding in DP¶

- Sharding parameters, gradients, and optimizer state across replicas reduces memory usage.
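A sketch with PyTorch FSDP, which implements this kind of sharding; the stack of linear layers is just a stand-in for a real model:

import os
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

# plain DP keeps a full copy of everything on every GPU; FSDP instead shards
# parameters, gradients and optimizer state across ranks and gathers them on demand
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)]).cuda()
model = FSDP(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# the training loop itself stays the same as with plain DDP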
Tensor parallelism¶

- Row-wise vs. column-wise splitting of the weight matrices;
- Expensive communication inside every layer.
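A toy forward pass of a Megatron-style tensor-parallel MLP block (random weights, dimensions chosen only for illustration): the first matrix is split by columns, the second by rows, and one all-reduce combines the partial results:

import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
world = dist.get_world_size()
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

torch.manual_seed(0)                                       # every rank builds the same toy data
d_model, d_ff = 1024, 4096
x = torch.randn(8, d_model, device="cuda")                 # input activations, replicated on all ranks

# column parallelism: each rank holds a column slice of W1 and computes
# its slice of the hidden activations, with no communication
w1_shard = torch.randn(d_model, d_ff // world, device="cuda")
h_shard = torch.relu(x @ w1_shard)

# row parallelism: each rank holds the matching row slice of W2; the partial
# outputs must be summed across ranks, hence one all-reduce per block
w2_shard = torch.randn(d_ff // world, d_model, device="cuda")
y = h_shard @ w2_shard
dist.all_reduce(y)                                         # the expensive step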
Tensor parallelism (cont.)¶

- Tensor parallelism reduces memory usage;
- Efficiency drops with inter-node parallelization (slower interconnect between nodes).
Pipeline parallelism¶
- Training: the pipeline-bubble problem;
- Combine with DP for better performance.
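A forward-only toy sketch of the idea, assuming two GPUs: the model is cut into two stages and the batch into micro-batches; real schedules (GPipe, 1F1B) additionally interleave backward passes to shrink the bubble:

import torch

# stage 0 on GPU 0, stage 1 on GPU 1
stage0 = torch.nn.Linear(1024, 1024).to("cuda:0")
stage1 = torch.nn.Linear(1024, 10).to("cuda:1")

batch = torch.randn(256, 1024, device="cuda:0")
micro_batches = batch.chunk(8)                  # more micro-batches -> smaller pipeline bubble

outputs = []
for mb in micro_batches:
    h = stage0(mb)                              # stage 0 can start on the next micro-batch
    outputs.append(stage1(h.to("cuda:1")))      # while stage 1 is still busy with this one
out = torch.cat(outputs)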
Pipeline parallelism (cont.)¶

- More scheduling strategies exist;
- Trade-offs between load balancing, bubble size, memory, and communication;
- Implementation is not trivial.
Expert (MoE) parallelism¶

- Activate only a subset of experts per token;
- Reduces compute cost for huge models.
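A toy MoE layer with top-k routing (sizes are arbitrary, single device for simplicity); expert parallelism would place each expert on its own device and exchange tokens with all-to-all communication:

import torch

n_experts, top_k, d = 8, 2, 512
experts = torch.nn.ModuleList([torch.nn.Linear(d, d) for _ in range(n_experts)])
router = torch.nn.Linear(d, n_experts)

x = torch.randn(16, d)                                        # a batch of 16 token embeddings
weights, chosen = router(x).softmax(-1).topk(top_k, dim=-1)   # each token picks its top-k experts

out = torch.zeros_like(x)
for e, expert in enumerate(experts):
    token_idx, slot = (chosen == e).nonzero(as_tuple=True)    # tokens routed to expert e
    if token_idx.numel():
        out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
# only top_k of the n_experts experts run for each token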
Hybrid / 3D parallelism¶

- For really large models, one needs to combine these techniques (data + tensor + pipeline parallelism).
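A back-of-the-envelope sketch of how the three degrees factor the device count; the numbers are made up for illustration:

# hypothetical 3D layout for 64 GPUs (8 nodes x 8 GPUs each)
world_size = 64
tp = 8                           # tensor parallelism inside a node (fast intra-node links)
pp = 4                           # pipeline parallelism across groups of nodes
dp = world_size // (tp * pp)     # the remaining factor becomes the data-parallel degree: 2
assert tp * pp * dp == world_size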
Implementations¶
Distributed computing - MPI¶
# just a simple script
import horovod.tensorflow.keras as hvd   # or horovod.torch / horovod.keras, depending on the framework
...
hvd.init()
model = Model(...)
optimizer = ...                                   # any base optimizer (elided)
optimizer = hvd.DistributedOptimizer(optimizer)   # wrap it so gradients are averaged across ranks
model.compile( ... )
model.fit( ... )
...
# and run in a job script
srun python train.py
- General-purpose HPC communication;
- Integrates well with the cluster;
- Not as popular in the AI/ML ecosystem.
Distributed computing - Ray¶
import ray

ray.init()

@ray.remote
def preprocess(data):
    ...

@ray.remote
def train(model_name, dataset):
    ...

cleaned = [preprocess.remote(data) for data in dataset]
trained_models = [train.remote(model_name, data) for data in cleaned]
results = ray.get(trained_models)
- Python-native distributed orchestration;
Distributed computing - Ray (cont.)¶
# start ray head
srun -J "head ray node-step-%J" \
    apptainer exec ${SIF_IMAGE} ${RAY_CMD_HEAD} &
RAY_HEAD_PID=$!

# start ray worker
srun -J "worker ray node-step-%J" \
    apptainer exec ${SIF_IMAGE} ${RAY_CMD_WORKER} &
sleep 10

# start the actual script
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
    --port ${API_PORT} ${vllm_opts}
- Run in server/client mode;
- Needs more work to configure.
Popular training frameworks¶
- PyTorch DDP: standard (basic) DP in PyTorch;
- PyTorch FSDP: improved with sharding;
- DeepSpeed: implements advanced schemes;
- Megatron-LM: NVIDIA's implementation of 3D parallelism;
- Other options: Colossal-AI, FairScale, ...
Popular inference frameworks¶
- vLLM: PagedAttention and dynamic batching;
- DeepSpeed: fused attention;
- Triton Inference Server: NVIDIA's inference serving platform;
- Other frameworks: TensorRT-LLM, SGLang, Hugging Face TGI, ...
Summary¶
Take-home message¶
- Enough memory: use data parallelism;
- On a single node: prefer tensor parallelism;
- On many nodes: use pipeline parallelism;
- For training: see the Ultrascale Playbook;
- For inference: use an inference engine.