This section is also available as slides, which were presented at the workshop. This text version includes some additional notes. You can also access the slide version here.


Image source for the figures in the slide version: ultrascale playbook.
# just a simple script (train.py)
import horovod.tensorflow.keras as hvd  # or horovod.torch / horovod.keras, depending on the framework
...
hvd.init()
model = Model(...)
# wrap the (elided) framework optimizer so gradients are averaged across all workers
optimizer = hvd.DistributedOptimizer(optimizer)
model.compile(optimizer=optimizer, ...)
model.fit(...)
...
# and run it in a job script
srun python train.py
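The elided parts of such a script usually also pin each process to one GPU, scale the learning rate with the number of workers, and broadcast the initial weights from rank 0. A minimal sketch of those steps with the Keras binding (the Adam optimizer and the learning rate here are illustrative, not taken from the slides):

import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

# pin this process to one GPU, selected by its local rank on the node
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

# a common heuristic: scale the learning rate by the number of workers
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(0.001 * hvd.size()))

callbacks = [
    # make sure all workers start from the same weights
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]

The wrapped optimizer and the callbacks are then passed to model.compile(...) and model.fit(...) as in the script above; Slurm starts one copy of the script per GPU.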
# the same pattern with Ray: functions become remote tasks
import ray

ray.init()

@ray.remote
def preprocess(data):
    ...

@ray.remote
def train(model_name, dataset):
    ...

# .remote() returns futures immediately; the tasks run in parallel on the cluster
cleaned = [preprocess.remote(...) for data in dataset]
trained_models = [train.remote(...) for data in cleaned]
# block until all training tasks have finished
results = ray.get(trained_models)
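On a cluster, the driver script usually does not start its own single-node Ray instance but attaches to an already running one, for example the head node started by the job script below. A minimal sketch of that variant:

import ray

# attach to the already running Ray cluster instead of starting a local one
ray.init(address="auto")
print(ray.cluster_resources())  # sanity check: shows the resources of all connected nodes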
srun -J "head ray node-step-%J" \
apptainer exec ${SIF_IMAGE} ${RAY_CMD_HEAD} &
RAY_HEAD_PID=$!
# start ray worker
srun -J "worker ray node-step-%J" \
apptainer exec ${SIF_IMAGE} ${RAY_CMD_WORKER} &
sleep 10
# start the actual script
apptainer exec ${SIF_IMAGE} vllm serve ${HF_MODEL} \
--port ${API_PORT} ${vllm_opts}export GPUS_PER_NODE=4
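Once the server is up, it exposes an OpenAI-compatible HTTP API, so it can be queried from any node that can reach it. A minimal sketch of a client request (the host, default port, and prompt are placeholders, not values from the job script):

import os
import requests

# the vLLM server speaks the OpenAI-compatible API under /v1
url = f"http://localhost:{os.environ.get('API_PORT', '8000')}/v1/completions"
payload = {
    "model": os.environ.get("HF_MODEL", "my-model"),  # same model name as passed to `vllm serve`
    "prompt": "Hello from the cluster!",
    "max_tokens": 32,
}
response = requests.post(url, json=payload, timeout=60)
print(response.json()["choices"][0]["text"])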
# multi-node PyTorch: srun starts one torchrun launcher per node,
# and each launcher spawns GPUS_PER_NODE worker processes
export GPUS_PER_NODE=4
export MASTER_ADDR=${PROEPI_HEAD_NODE}
export MASTER_PORT=${PROEPI_FREE_PORT}

srun python -u -m torch.distributed.run \
    --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
    --rdzv_id=$SLURM_JOB_ID --rdzv_backend=c10d --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    $ARGS
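On the Python side, the script launched via $ARGS typically initializes the process group from the environment variables that torchrun sets and wraps the model in DistributedDataParallel. A minimal sketch (the linear layer is a placeholder model, not from the slides):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT for us
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(128, 128).cuda(local_rank)  # placeholder model
model = DDP(model, device_ids=[local_rank])
# ... training loop: gradients are all-reduced across all processes automatically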