
GPUs

Learning outcomes

  • Practice using the UPPMAX documentation
  • I can see the GPU and memory usage of jobs
  • I know the correct flags to utilize GPUs
  • I understand the GPU configuration on Bianca

Hardware and flags

  • 10 nodes with 2 NVIDIA A100 40 GB GPUs each.
  • Node list: sens2025xxx-b[201-210]
  • All GPU nodes are fat nodes with at least 256 GB RAM, 16 CPU cores, and 2 GPUs per node.
  • To avoid GPU misuse, a project cannot request more than 7 GPU nodes in total.
  • SBATCH flags:
#SBATCH -A sens2025xxx
#SBATCH -p node
#SBATCH -N 1
#SBATCH -C gpu
#SBATCH --gpus-per-node=2   # number of GPUs per node
#SBATCH -t 1:00:00

nvidia-smi
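
A minimal usage sketch: if the script above is saved as gpu_job.sh (a hypothetical file name), it can be submitted and monitored like this:

sbatch gpu_job.sh   # submit the GPU job script shown above
squeue -u $USER     # check that the job is pending or running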

You can also request 1 GPU per node together with a few cores, so that the 2 GPUs of a node can be used by different jobs:

#SBATCH -A sens2025xxx
#SBATCH -p core
#SBATCH -n 8
#SBATCH -C gpu
#SBATCH --gpus-per-node=1
#SBATCH -t 1:00:00
  • Similarly, for an interactive session on a GPU node, use the same -C gpu and --gpus-per-node= flags, as in the sketch below.
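
For example, a one-GPU interactive session could be requested like this (a sketch; adjust the project, core count, and wall time to your needs):

interactive -A sens2025xxx -n 8 -C gpu --gpus-per-node=1 -t 1:00:00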

GPU accessibility check

  • Sanity-check that CUDA is set up properly by inspecting the CUDA environment variables:
env | grep CUDA
CUDA_VISIBLE_DEVICES=0

OR

echo $CUDA_VISIBLE_DEVICES
0

OR check by loading PyTorch (torch):

module load python_ML_packages/3.9.5-gpu
python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.get_device_properties(0)); print(torch.randn(1).cuda())"

output:

1.9.0+cu111
11.1
_CudaDeviceProperties(name='NVIDIA A100-PCIE-40GB', major=8, minor=0, total_memory=40326MB, multi_processor_count=108)
tensor([0.1014], device='cuda:0')
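
If you requested both GPUs on a node (--gpus-per-node=2), a quick additional check is to confirm that PyTorch sees both devices. A minimal sketch, using the same module as above:

python -c "import torch; print(torch.cuda.device_count()); print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])"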

Profiling

  • Monitor GPU utilization with nvidia-smi:

    nvidia-smi dmon -o DT
    nvidia-smi --format=noheader,csv --query-compute-apps=timestamp,gpu_name,pid,name,used_memory --loop=1 -f sample_run.log
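
A common pattern is to start the logging command in the background before your application and stop it when the application finishes. A minimal sketch, where train.py stands in for your own (hypothetical) GPU program:

nvidia-smi dmon -o DT -f gpu_dmon.log &   # start GPU monitoring in the background, logging to a file
MONITOR_PID=$!                            # remember the monitor's process ID
python train.py                           # run your GPU application (hypothetical example)
kill $MONITOR_PID                         # stop the monitor afterwards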
    

Tips

  • Use the --nv flag when running your Apptainer containers to access the GPUs on a GPU node (see the example after this list)
  • The correct CUDA modules get loaded automatically when you use python_ML_packages/3.9.5-gpu
  • Various CUDA libraries are available; list them with module spider cuda
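
For example, GPU access inside a container can be verified like this (a sketch; my_image.sif is a hypothetical container image):

apptainer exec --nv my_image.sif nvidia-smi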