ML with Julia
Questions
- Is Julia suitable for Machine Learning (ML)?
- Which machine learning tools are installed at HPC centres?
- How do I run Julia ML jobs on an HPC system (NSC, PDC, C3SE, UPPMAX, HPC2N, LUNARC)?
Objectives
- Short introduction to ML with Julia
- Overview of installed ML tools at Swedish HPC centres
- Workflow
- Show the structure of a suitable batch script
- Examples to try
- We will not learn about:
- How to write and optimize ML/DL code.
- How to use a multi-node setup for training models on CPUs and GPUs.
Machine Learning job on GPUs
Julia already has several packages for ML; one of them is Flux. We will work with one of the test cases provided by Flux, which deals with a data set of tiny images (CIFAR10).
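To give a feel for the library before the HPC workflow, here is a minimal, hypothetical Flux sketch (a toy dense classifier on random data, assuming Flux's explicit-style API from version 0.13 onwards; it is not the vgg_cifar10.jl code):

using Flux

# Toy model: 4 input features, 8 hidden units, 2 classes (illustrative only)
model = Chain(Dense(4 => 8, relu), Dense(8 => 2), softmax)

x = rand(Float32, 4, 16)                  # 16 random samples, one per column
y = Flux.onehotbatch(rand(1:2, 16), 1:2)  # random one-hot labels

# Cross-entropy loss in the explicit style: the model is the first argument
loss(m, x, y) = Flux.Losses.crossentropy(m(x), y)

opt = Flux.setup(Adam(), model)           # optimiser state for the model
for epoch in 1:10
    Flux.train!(loss, model, [(x, y)], opt)
end

Now follow these steps: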
- Create an environment called ML, move to that environment directory and activate it
- Fetch the vgg_cifar10.jl test case from the Flux model-zoo repo (wget https://raw.githubusercontent.com/FluxML/model-zoo/master/vision/vgg_cifar10/vgg_cifar10.jl)
- Load CUDA toolkit 11.4.1
- Install (add) the following packages: CUDA, MLDatasets, MLUtils
- The first time you use the CIFAR10 data set, it will ask you to confirm the download; accept it. Do this in Julian mode (see the REPL snippet after this list)
- Change the number of epochs in the vgg_cifar10.jl script from 50 to something shorter, like 5
- Submit the job with the batch script shown below
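The download prompt appears the first time you load CIFAR10; in Julian mode:

julia> using MLDatasets: CIFAR10
julia> x, y = CIFAR10(:train)[:]

The batch script: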
#!/bin/bash
#SBATCH -A hpc2n202w-xyz # your project_ID
#SBATCH -J job-serial # name of the job
#SBATCH -n 1 # nr. tasks #remove this line for UPPMAX
#SBATCH --time=00:15:00 # requested time
#SBATCH --error=job.%J.err # error file
#SBATCH --output=job.%J.out # output file
#SBATCH --gres=gpu:v100:1 # 1 GPU v100 card #remove this line for UPPMAX
# On Rackham use the following lines instead (remove one #), substituting the corresponding HPC2N lines above
##SBATCH -M snowy
##SBATCH -p node
##SBATCH --gres=gpu:1
##SBATCH -N 1
##SBATCH --qos=short
ml purge > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64
ml CUDA/11.4.1
julia <fix-activate-environment> <fix-name-script>.jl
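Before submitting, replace the two placeholders on the julia line: activate the environment (for example with the --project flag) and give the name of the script; the Answer below shows one way to do this. Save the batch script under a name of your choice and submit it with sbatch; you can follow its state with squeue -u $USER.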
Answer
On UPPMAX (Rackham, with the GPU job running on Snowy):
ml julia/1.8.5
mkdir ML
cd ML
wget https://raw.githubusercontent.com/FluxML/model-zoo/master/vision/vgg_cifar10/vgg_cifar10.jl
julia
(v1.8) pkg> activate .
(ML) pkg> add CUDA
(ML) pkg> add Flux
(ML) pkg> add MLDatasets
(ML) pkg> add MLUtils
julia> using MLDatasets: CIFAR10
julia> x, y = CIFAR10(:train)[:]
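Optionally, as a quick sanity check (assuming you are on a node with a GPU), you can verify that CUDA.jl can use the device:

julia> using CUDA
julia> CUDA.functional()
true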
The batch script looks like:
#!/bin/bash -l
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job-serial # name of the job
#SBATCH -M snowy
#SBATCH -p node
#SBATCH --gres=gpu:1
#SBATCH -N 1
#SBATCH --time=00:15:00 # requested time
#SBATCH --qos=short
#SBATCH --error=job.%J.err # error file
#SBATCH --output=job.%J.out # output file
ml julia/1.8.5
julia --project=. vgg_cifar10.jl
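The --project=. flag makes Julia activate the environment in the current directory, so the batch job does not need a separate pkg> activate . step.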
On HPC2N (Kebnekaise):
ml Julia/1.8.5-linux-x86_64
ml CUDA/11.4.1
mkdir ML
cd ML
wget https://raw.githubusercontent.com/FluxML/model-zoo/master/vision/vgg_cifar10/vgg_cifar10.jl
julia
(v1.8) pkg> activate .
(ML) pkg> add CUDA
(ML) pkg> add Flux
(ML) pkg> add MLDatasets
(ML) pkg> add MLUtils
julia> using MLDatasets: CIFAR10
julia> x, y = CIFAR10(:train)[:]
The batch script looks like:
#!/bin/bash
#SBATCH -A hpc2n202w-xyz # your project_ID
#SBATCH -J job-serial # name of the job
#SBATCH -n 1 # nr. tasks
#SBATCH --time=00:20:00 # requested time
#SBATCH --error=job.%J.err # error file
#SBATCH --output=job.%J.out # output file
#SBATCH --gres=gpu:v100:1 # 1 GPU v100 card
ml purge > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64
ml CUDA/11.4.1
julia --project=. vgg_cifar10.jl
At HPC2N you can use the tool job-usage on the command line, typically as job-usage <job_id> with the Slurm job ID of your job. This gives you a URL that you can paste into your local browser; it displays statistics starting a couple of minutes after the job has started.