Model Parallelism

This section is available as slides presented at the workshop. This text version includes some additional notes. You can also access the slide version here.

Overview

  • Parallelization schemes that are useful for LLMs;
  • Common tools for running LLMs in parallel.

Strategies

Data parallelism (DP)


Image source: ultrascale playbook

  • Trivial to implement;
  • Syncing overhead for training;
  • Memory cost due to replication.
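
The core idea fits in a few lines of PyTorch: every rank keeps a full replica of the model, runs forward/backward on its own shard of the batch, and the gradients are averaged before the optimizer step. A minimal sketch, assuming the process group is already initialized and with loss_fn and batch_shard as placeholder names (this manual all-reduce is what DDP automates under the hood):

# minimal data-parallel step with manual gradient averaging
import torch.distributed as dist

def data_parallel_step(model, loss_fn, batch_shard):
    # every rank runs forward/backward on its own shard of the global batch
    loss = loss_fn(model(batch_shard))
    loss.backward()
    # average gradients across all replicas so the models stay in sync
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= dist.get_world_size()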

Overlapping in DP


Image source: ultrascale playbook

  • Overlapping reduces communication overhead.
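
In PyTorch, DistributedDataParallel implements this overlap with gradient buckets: as soon as all gradients in a bucket are ready during the backward pass, their all-reduce is launched while the rest of the backward pass continues. A small sketch (bucket_cap_mb is DDP's knob for the bucket size):

# DDP overlaps gradient all-reduce with the backward pass via gradient buckets
from torch.nn.parallel import DistributedDataParallel as DDP

ddp_model = DDP(model, bucket_cap_mb=25)  # bucket size controls overlap granularity
# during loss.backward(), finished buckets are all-reduced in the background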

Sharding in DP


Image source: ultrascale playbook

  • Sharding reduces memory usage;
  • Three levels of sharding (optimizer states, gradients, weights);
  • ZeRO methods / FSDP2.
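
A minimal sketch of sharded data parallelism with PyTorch's FSDP wrapper (FSDP2 offers a similar fully_shard interface in newer releases); the sharding strategies map roughly onto the ZeRO levels:

# sharded data parallelism with PyTorch FSDP (ZeRO-style sharding)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# SHARD_GRAD_OP ~ ZeRO-2 (shard optimizer states + gradients)
# FULL_SHARD    ~ ZeRO-3 (shard optimizer states + gradients + weights)
sharded_model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)
# the training loop is unchanged; parameters are gathered on demand per layer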

Expert (MoE) parallelism


Image source: ultrascale playbook

  • Activate only a subset of experts per token;
  • Similar to sharding, but reduces the need for data exchange.
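
As a toy illustration of the routing, here is a top-k MoE layer in plain PyTorch, with router and experts as placeholder modules; in expert parallelism the experts live on different GPUs and the Python loop below is replaced by an all-to-all exchange of tokens:

# toy MoE routing: each token is processed by its top-k experts only
import torch

def moe_layer(x, router, experts, k=2):
    # x: (tokens, d_model); router: Linear(d_model, num_experts)
    scores = router(x).softmax(dim=-1)               # (tokens, num_experts)
    topk_scores, topk_idx = scores.topk(k, dim=-1)   # choose k experts per token
    out = torch.zeros_like(x)
    for e, expert in enumerate(experts):
        for slot in range(k):
            mask = topk_idx[:, slot] == e            # tokens routed to expert e
            if mask.any():
                out[mask] += topk_scores[mask, slot].unsqueeze(-1) * expert(x[mask])
    return out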

Tensor parallelism


Image source: ultrascale playbook

  • Row-wise vs. column-wise splitting;
  • Needs to be carefully designed.
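
The two splitting schemes can be checked on a single device with a few lines of PyTorch: splitting the weight along the output dimension (column parallel) means the partial results are concatenated (an all-gather in the distributed case), while splitting along the input dimension (row parallel) means they are summed (an all-reduce):

# single-device check of column- vs. row-parallel splits of a linear layer
import torch

x = torch.randn(8, 16)                      # (batch, d_in)
W = torch.randn(16, 32)                     # full weight matrix

# column parallel: split W along the output dimension
W1, W2 = W.chunk(2, dim=1)
y_col = torch.cat([x @ W1, x @ W2], dim=1)  # concatenation ~ all-gather

# row parallel: split W along the input dimension, split x to match
Wa, Wb = W.chunk(2, dim=0)
xa, xb = x.chunk(2, dim=1)
y_row = xa @ Wa + xb @ Wb                   # summation ~ all-reduce

assert torch.allclose(y_col, x @ W, atol=1e-5)
assert torch.allclose(y_row, x @ W, atol=1e-5)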

Tensor parallelism (cont.)


Image source: ultrascale playbook

  • Expensive communication;
  • Efficiency drops with inter-node parallelization.

Pipeline parallelism (PP)


Image source: ultrascale playbook

  • Training: pipeline bubble problem;
  • Can be mitigated by splitting the batch into smaller micro-batches (and accumulating the gradients).
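
Concretely, the batch is split into micro-batches and the gradients are accumulated across them, so that different pipeline stages can work on different micro-batches at the same time. A single-device sketch of the accumulation part, with model, optimizer, loss_fn and batch as placeholders:

# gradient accumulation over micro-batches (the scheduling across pipeline
# stages is handled by the pipeline framework; only the accumulation is shown)
def train_step(model, optimizer, loss_fn, batch, n_micro=4):
    optimizer.zero_grad()
    for micro in batch.chunk(n_micro):   # feed smaller micro-batches
        loss = loss_fn(model(micro)) / n_micro
        loss.backward()                  # gradients accumulate in .grad
    optimizer.step()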

Pipeline parallelism (cont.)


Image source: ultrascale playbook

  • More scheduling strategies exist;
  • Trade-offs between load balancing, bubble size, memory and communication;
  • Implementation is not trivial.

Hybrid / 3D parallelism


Image source: ultrascale playbook

  • For really large models one needs to combine these techniques.
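
In recent PyTorch releases such a combined layout can be described with a DeviceMesh, which provides one process group per parallel dimension. A sketch assuming 8 GPUs split 2-way along each of the data, pipeline and tensor dimensions:

# describe a 3D layout (DP x PP x TP) with a PyTorch DeviceMesh
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "pp", "tp"))
dp_group = mesh["dp"].get_group()   # process group for gradient all-reduce
tp_group = mesh["tp"].get_group()   # process group for tensor-parallel collectives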

Implementations

Distributed computing - MPI

# just a simple script
import horovod.keras as hvd   # Horovod builds on MPI for communication

...
hvd.init()
model = Model(...)
# wrap the local optimizer so gradients are averaged across all ranks
optimizer = hvd.DistributedOptimizer(optimizer)
model.compile(optimizer, ...)
model.fit(...)
...

# and run in a job script
srun python train.py
  • General-purpose HPC communication;
  • Integrates well with the cluster;
  • Not so popular in AI/ML.

Distributed computing - Ray

import ray

ray.init()

@ray.remote
def preprocess(data):
    ...

@ray.remote
def train(model_name, dataset):
    ...

# .remote() schedules the tasks and returns futures immediately
cleaned = [preprocess.remote(data) for data in dataset]
trained_models = [train.remote(model_name, data) for data in cleaned]

# block until all training tasks have finished
results = ray.get(trained_models)
  • Python-native distributed orchestration;

Distributed computing - Ray (cont.)

# start ray head
srun -J "head ray node-step-%J" \
  apptainer exec ${SIF_IMAGE} ${RAY_CMD_HEAD} &
RAY_HEAD_PID=$!

# start ray worker
srun -J "worker ray node-step-%J" \
  apptainer exec ${SIF_IMAGE} ${RAY_CMD_WORKER} &
sleep 10

# start the actual script
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
  --port ${API_PORT} ${vllm_opts}
  • Run in server/client mode;
  • Needs more work to configure.

Distributed computing - torchrun (Elastic Launch)

export GPUS_PER_NODE=4
export MASTER_ADDR=${PROEPI_HEAD_NODE}
export MASTER_PORT=${PROEPI_FREE_PORT}

srun python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
    --rdzv_id=$SLURM_JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    $ARGS
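
What $ARGS points to is site-specific. As a hypothetical example, the launched script only needs to read the environment variables that torchrun / torch.distributed.run sets (RANK, LOCAL_RANK, WORLD_SIZE and the rendezvous address):

# sketch of a training script launched by torch.distributed.run / torchrun
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")   # picks up the rendezvous settings from the environment
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
# ... build the model, wrap it in DDP/FSDP, train ...
dist.destroy_process_group()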

Summary

Take-home message

  • Enough memory: use data parallelism;
  • On a single node: prefer tensor parallelism;
  • On many nodes: use pipeline parallelism;
  • For training: see the ultrascale playbook;
  • For inference: use an inference engine.