Model Parallelism

This section is available as slides presented at the workshop. This text version includes some additional notes. You can also access the slide version here.

Overview

  • Parallelization schemes that are useful for LLMs;
  • Common tools for running LLMs in parallel.

Strategies

Data parallelism (DP)


Image source: ultrascale playbook

  • Trivial to implement;
  • Syncing overhead for training;
  • Memory cost due to replication.
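
The core idea fits in a few lines of PyTorch: every rank keeps a full replica of the model, runs forward/backward on its own shard of the batch, and the gradients are averaged before the optimizer step. A minimal sketch, assuming the process group is already initialized and with loss_fn and batch_shard as placeholder names (this manual all-reduce is what DDP automates under the hood):

# minimal data-parallel step with manual gradient averaging
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch_shard):
    # every rank runs forward/backward on its own shard of the global batch
    loss = loss_fn(model(batch_shard))
    loss.backward()
    # average gradients across all replicas so the models stay in sync
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()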

Overlapping in DP


Image source: ultrascale playbook

  • Overlapping reduces communication overhead.
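
In PyTorch, DistributedDataParallel implements this overlap with gradient buckets: as soon as all gradients in a bucket are ready during the backward pass, their all-reduce is launched while the rest of the backward pass continues. A small sketch (bucket_cap_mb is DDP's knob for the bucket size):

# DDP overlaps gradient all-reduce with the backward pass via gradient buckets
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model, bucket_cap_mb=25)  # bucket size controls overlap granularity
# during loss.backward(), finished buckets are all-reduced in the background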

Sharding in DP


Image source: ultrascale playbook

  • Sharding reduces memory usage;
  • Three levels of sharding (optimizer states, gradients, weights);
  • ZeRO methods / FSDP2.
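
A minimal sketch of sharded data parallelism with PyTorch's FSDP wrapper (FSDP2 offers a similar fully_shard interface in newer releases); the sharding strategies map roughly onto the ZeRO levels:

# sharded data parallelism with PyTorch FSDP (ZeRO-style sharding)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# SHARD_GRAD_OP ~ ZeRO-2 (shard optimizer states + gradients)
# FULL_SHARD    ~ ZeRO-3 (shard optimizer states + gradients + weights)
sharded_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
# the training loop is unchanged; parameters are gathered on demand per layer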

Expert (MoE) parallelism


Image source: ultrascale playbook

  • Activate only a subset of experts per token;
  • Similar to sharding, but reduces the need for data exchange.
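
As a toy illustration of the routing, here is a top-k MoE layer in plain PyTorch, with router and experts as placeholder modules; in expert parallelism the experts live on different GPUs and the Python loop below is replaced by an all-to-all exchange of tokens:

# toy MoE routing: each token is processed by its top-k experts only
import torch

def moe_layer(x, router, experts, k=2):
    # x: (tokens, d_model); router: Linear(d_model, num_experts)
    scores = router(x).softmax(dim=-1)               # (tokens, num_experts)
    topk_scores, topk_idx = scores.topk(k, dim=-1)   # choose k experts per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(k):
            mask = topk_idx[:, slot] == e            # tokens routed to expert e
            if mask.any():
                out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out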

Tensor parallelism


Image source: ultrascale playbook

  • Row-wise vs. column-wise splitting;
  • Needs to be carefully designed.
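
The two splitting schemes can be checked on a single device with a few lines of PyTorch: splitting the weight along the output dimension (column parallel) means the partial results are concatenated (an all-gather in the distributed case), while splitting along the input dimension (row parallel) means they are summed (an all-reduce):

# single-device check of column- vs. row-parallel splits of a linear layer
import torch

x = torch.randn(8, 16)                      # (batch, d_in)
W = torch.randn(16, 32)                     # full weight matrix

# column parallel: split W along the output dimension
W1, W2 = W.chunk(2, dim=1)
y_col = torch.cat([x @ W1, x @ W2], dim=1)  # concatenation ~ all-gather

# row parallel: split W along the input dimension, split x to match
Wa, Wb = W.chunk(2, dim=0)
xa, xb = x.chunk(2, dim=1)
y_row = xa @ Wa + xb @ Wb                   # summation ~ all-reduce

assert torch.allclose(y_col, x @ W, atol=1e-5)
assert torch.allclose(y_row, x @ W, atol=1e-5)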

Tensor parallelism (cont.)


Image source: ultrascale playbook

  • Expensive communication;
  • Efficiency drops with inter-node parallelization.

Pipeline parallelism (PP)


Image source: ultrascale playbook

  • Training: pipeline bubble problem;
  • Can be mitigated by splitting the batch into smaller micro-batches (and accumulating the gradients).
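
Concretely, the batch is split into micro-batches and the gradients are accumulated across them, so that different pipeline stages can work on different micro-batches at the same time. A single-device sketch of the accumulation part, with model, optimizer, loss_fn and batch as placeholders:

# gradient accumulation over micro-batches (the scheduling across pipeline
# stages is handled by the pipeline framework; only the accumulation is shown)
def train_step(model, optimizer, loss_fn, batch, n_micro=4):
    optimizer.zero_grad()
    for micro in batch.chunk(n_micro):   # feed smaller micro-batches
        loss = loss_fn(model(micro)) / n_micro
        loss.backward()                  # gradients accumulate in .grad
    optimizer.step()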

Pipeline parallelism (cont.)


Image source: ultrascale playbook

  • More scheduling strategies exist;
  • Trade-offs between load balancing, bubble size, memory and communication;
  • Implementation is not trivial.

Hybrid / 3D parallelism


Image source: ultrascale playbook

  • For really large models one needs to combine these techniques.
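
In recent PyTorch releases such a combined layout can be described with a DeviceMesh, which provides one process group per parallel dimension. A sketch assuming 8 GPUs split 2-way along each of the data, pipeline and tensor dimensions:

# describe a 3D layout (DP x PP x TP) with a PyTorch DeviceMesh
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))
dp_group = mesh["dp"].get_group()   # process group for gradient all-reduce
tp_group = mesh["tp"].get_group()   # process group for tensor-parallel collectives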

Implementations

Distributed computing - MPI

# just a simple script
import horovod.keras as hvd   # Horovod builds on MPI for communication

...
hvd.init()
model = Model(...)
# wrap the local optimizer so gradients are averaged across all ranks
optimizer = hvd.DistributedOptimizer(optimizer)
model.compile(optimizer, ...)
model.fit(...)
...

# and run in a job script
srun python train.py
  • General-purpose HPC communication;
  • Integrates well with the cluster;
  • Not so popular in AI/ML.

Distributed computing - Ray

import ray

ray.init()

@ray.remote
def preprocess(data):
    ...

@ray.remote
def train(model_name, dataset):
    ...

# .remote() schedules the tasks and returns futures immediately
cleaned = [preprocess.remote(data) for data in dataset]
trained_models = [train.remote(model_name, data) for data in cleaned]

# block until all training tasks have finished
results = ray.get(trained_models)
  • Python-native distributed orchestration;

Distributed computing - Ray (cont.)

# start ray head
srun -J "head ray node-step-%J" \
  apptainer exec ${SIF_IMAGE} ${RAY_CMD_HEAD} &
RAY_HEAD_PID=$!

# start ray worker
srun -J "worker ray node-step-%J" \
  apptainer exec ${SIF_IMAGE} ${RAY_CMD_WORKER} &
sleep 10

# start the actual script
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
  --port ${API_PORT} ${vllm_opts}
  • Run in server/client mode;
  • Needs more work to configure.

Distributed computing - torchrun (Elastic Launch)

export GPUS_PER_NODE=4
export MASTER_ADDR=${PROEPI_HEAD_NODE}
export MASTER_PORT=${PROEPI_FREE_PORT}

srun python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
    --rdzv_id=$SLURM_JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    $ARGS
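
What $ARGS points to is site-specific. As a hypothetical example, the launched script only needs to read the environment variables that torchrun / torch.distributed.run sets (RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous address):

# sketch of a training script launched by torch.distributed.run / torchrun
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # picks up the rendezvous settings from the environment
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# ... build the model, wrap it in DDP/FSDP, train ...
dist.destroy_process_group()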

Summary

Take-home message

  • Enough memory: use data parallelism;
  • On a single node: prefer tensor parallelism;
  • On many nodes: use pipeline parallelism;
  • For training: see the ultrascale playbook;
  • For inference: use an inference engine.