ML with Julia
Questions
- Is Julia suitable for Machine Learning (ML)?
- Which machine learning tools are installed at the HPC centres?
- How to run Julia ML jobs on an HPC system (NSC, PDC, C3SE, UPPMAX, HPC2N, LUNARC)?
Objectives
- Short introduction to ML with Julia
- Overview of installed ML tools at Swedish HPC centres
- Workflow
- Show the structure of a suitable batch script
- Examples to try
- We will not learn about:
- How to write and optimize ML/DL code.
- How to use a multi-node setup for training models on CPU and GPU.
Machine Learning job on GPUs
Julia already has several packages for ML; one of them is Flux. We will work with one of
the test cases provided by Flux, which deals with a data set of tiny images (CIFAR10). (If you have not used Flux before, there is a minimal Flux sketch after the batch script below.) Follow these steps:
- Create an environment called ML, move to that environment directory, and activate it
- Fetch the vgg_cifar10.jl test case from the Flux model-zoo repo (wget https://raw.githubusercontent.com/FluxML/model-zoo/master/vision/vgg_cifar10/vgg_cifar10.jl)
- Load the CUDA toolkit 11.4.1
- Install (add) the following packages: CUDA, MLDatasets, MLUtils
- The first time you use the CIFAR10 data set, it will ask you to confirm the download; accept it. Do this in Julian mode, for example (the same commands appear in the Answer below):
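julia> using MLDatasets: CIFAR10
julia> x, y = CIFAR10(:train)[:]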
- Change the number of epochs in the vgg_cifar10.jl script from 50 to something shorter, like 5
- Submit the job with the script:
#!/bin/bash
#SBATCH -A hpc2n202w-xyz # your project_ID
#SBATCH -J job-serial # name of the job
#SBATCH -n 1 # nr. tasks #remove this line for UPPMAX
#SBATCH --time=00:15:00 # requested time
#SBATCH --error=job.%J.err # error file
#SBATCH --output=job.%J.out # output file
#SBATCH --gres=gpu:v100:1 # 1 GPU v100 card #remove this line for UPPMAX
# On Rackham use the following lines instead (remove one #), substituting them for the related HPC2N lines above
##SBATCH -M snowy
##SBATCH -p node
##SBATCH --gres=gpu:1
##SBATCH -N 1
##SBATCH --qos=short
ml purge > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64
ml CUDA/11.4.1
julia <fix-activate-environment> <fix-name-script>.jl
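For orientation, here is a minimal, self-contained sketch of what training with Flux looks like, as referenced in the introduction above. It is not part of the exercise: the toy data, model, and hyperparameters are invented for illustration, and it assumes a reasonably recent Flux (0.13.9 or later, for Flux.setup):

using Flux

# Toy data: learn y = 2x from 100 random points (invented for illustration)
x = rand(Float32, 1, 100)
y = 2 .* x

model = Dense(1 => 1)                    # a single linear layer
opt = Flux.setup(Descent(0.1f0), model)  # plain gradient descent

for epoch in 1:100
    # Gradient of the mean-squared error with respect to the model parameters
    grads = Flux.gradient(m -> Flux.mse(m(x), y), model)[1]
    Flux.update!(opt, model, grads)
end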
Answer
On UPPMAX (Rackham/Snowy):
ml julia/1.8.5
mkdir ML
cd ML
wget https://raw.githubusercontent.com/FluxML/model-zoo/master/vision/vgg_cifar10/vgg_cifar10.jl
julia
(v1.8) pkg> activate .
(ML) pkg> add CUDA
(ML) pkg> add Flux
(ML) pkg> add MLDatasets
(ML) pkg> add MLUtils
julia> using MLDatasets: CIFAR10
julia> x, y = CIFAR10(:train)[:]
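You can sanity-check the download in the same session; CIFAR10's training split contains 50 000 colour images of 32×32 pixels:
julia> size(x)   # (32, 32, 3, 50000)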
The batch script looks like:
#!/bin/bash -l
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job-serial # name of the job
#SBATCH -M snowy
#SBATCH -p node
#SBATCH --gres=gpu:1
#SBATCH -N 1
#SBATCH --time=00:15:00 # requested time
#SBATCH --qos=short
#SBATCH --error=job.%J.err # error file
#SBATCH --output=job.%J.out # output file
ml julia/1.8.5
julia --project=. vgg_cifar10.jl
On HPC2N (Kebnekaise):
ml Julia/1.8.5-linux-x86_64
ml CUDA/11.4.1
mkdir ML
cd ML
wget https://raw.githubusercontent.com/FluxML/model-zoo/master/vision/vgg_cifar10/vgg_cifar10.jl
julia
(v1.8) pkg> activate .
(ML) pkg> add CUDA
(ML) pkg> add Flux
(ML) pkg> add MLDatasets
(ML) pkg> add MLUtils
julia> using MLDatasets: CIFAR10
julia> x, y = CIFAR10(:train)[:]
The batch script looks like:
#!/bin/bash
#SBATCH -A hpc2n202w-xyz # your project_ID
#SBATCH -J job-serial # name of the job
#SBATCH -n 1 # nr. tasks
#SBATCH --time=00:20:00 # requested time
#SBATCH --error=job.%J.err # error file
#SBATCH --output=job.%J.out # output file
#SBATCH --gres=gpu:v100:1 # 1 GPU v100 card
ml purge > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64
ml CUDA/11.4.1
julia --project=. vgg_cifar10.jl
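In both cases you submit the batch script with sbatch and can follow it in the queue with squeue (the file name job.sh is just an assumed example; use whatever name you saved the script under):
sbatch job.sh      # submit the job; Slurm prints the job ID
squeue -u $USER    # list your running and pending jobs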
At HPC2N you can use the tool job-usage on the command line:
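job-usage <job-id>   # <job-id> is a placeholder: use the job ID printed by sbatch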
This will give you a URL that you can paste on your local browser. It would display statistics after a couple of minutes the job started.