Distributed parallelism

Learning outcomes

  • I can schedule jobs with distributed parallelism
  • I know the basic difference between threads and distributed memory in terms of memory sharing
  • I can explain how Julia/MATLAB/R code makes use of distributed parallelism

Why distributed parallelism is important

| Type of parallelism    | Number of cores | Number of nodes | Memory                       | Library                                      |
|------------------------|-----------------|-----------------|------------------------------|----------------------------------------------|
| Single-threaded        | 1               | 1               | As given by operating system | None                                         |
| Threaded/shared memory | Multiple        | 1               | Shared by all cores          | OpenMP                                       |
| Distributed            | Multiple        | 1 or multiple   | Distributed                  | Language package or MPI (e.g. OpenMPI/MPICH) |

What is distributed parallelism

  • Although threaded programming is convenient, because one can achieve considerable initial speedups with few code modifications, this approach does not scale beyond a few hundred cores (a single node).
  • Scalability can be achieved with distributed programming.
  • Here there is no common shared memory; instead, the individual processes (note the different terminology compared to threads in shared memory) have their own memory space.
  • If a process needs data from, or must transfer data to, another process, it does so by sending and receiving messages (a minimal MPI.jl sketch follows this list).
  • Must be used for jobs that span many nodes, for example a weather prediction.
  • Can also be used within a single node.
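As a minimal sketch of this message-passing model (the file name send_recv.jl and the array contents are placeholders, and the keyword form of MPI.Send/MPI.Recv! is the one documented for recent versions of the Julia package MPI.jl):

    # Minimal send/receive sketch with MPI.jl (assumes MPI.jl is installed)
    # Run with two processes, e.g.:  mpirun -np 2 julia send_recv.jl
    using MPI
    MPI.Init()

    comm = MPI.COMM_WORLD
    rank = MPI.Comm_rank(comm)

    if rank == 0
        data = [1.0, 2.0, 3.0]
        MPI.Send(data, comm; dest=1, tag=0)     # process 0 sends the array ...
        println("Rank 0 sent $data")
    elseif rank == 1
        buf = zeros(3)
        MPI.Recv!(buf, comm; source=0, tag=0)   # ... and process 1 receives it into its own memory
        println("Rank 1 received $buf")
    end

    MPI.Finalize()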

Types

  • A standard API for distributed computing is the Message Passing Interface (MPI). In general, MPI requires refactoring of your code.

    • External libraries (loaded as modules on the cluster)
    • There are two common versions
      • OpenMPI
      • MPICH (on Dardel)
  • In the distributed parallelization scheme the workers (processes) can share some common memory, but they can also exchange information by sending and receiving messages.

    • This functionality is often built into the language itself or provided by language packages (R, MATLAB, Julia); a small Julia sketch follows this list.
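For comparison, here is a minimal sketch of the language-native approach in Julia, using the Distributed standard library (the function f and the work items are only illustrative):

    # Minimal sketch of built-in distributed parallelism in Julia
    # Start Julia with worker processes, e.g.:  julia -p 4 script.jl
    using Distributed

    # Define the work function on the master process and on all workers
    @everywhere f(x) = sum(sin, 1:x)

    # pmap hands the work items out to the workers and collects the results
    results = pmap(f, [10_000, 20_000, 30_000, 40_000])
    println(results)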

Summary

  • Distributed memory

    • Tasks doing individual work
    • Data sent “on-demand” between the tasks in rather small “packages” (messages)
    • Suitable for tasks that do not need to share large amounts of memory
  • Distributed programming, 2 implementations

    • Native language, packages called “distributed” or “parallel” (or similar)
    • Message Passing Interface (MPI) relying on external libraries
  • Key words

    • tasks
    • processes
    • workers

Important

  • Cores are physical units on a node.
    • On some clusters where “hyperthreading” is activated, one can have 2 threads per core
    • In these cases Slurm counts these “sub-cores”
  • On a core (physical or hyper) you can define 1 thread (shared memory) or 1 task (distributed memory)
  • Hybrid parallelism combines threading and distributed (not covered here)
    • Example: One node with 20 cores runs 5 tasks with 4 threads each, every thread occupying 1 core.

Read more

Arnold (at the left): a robot that was controlled by MPI


How is it used in the programming languages of this course?

Language-specific nuances for distributed programming

  • R does not have a built-in multiprocessing mechanism like the other languages discussed in this course. Functions provided by certain packages (parallel, doParallel, etc.), for instance foreach, offer parallel features. The processes spawned by these functions have their own workspaces, which can lead to data replication.
  • MPI is supported in R through the Rmpi package.
  • In MATLAB one can use parpool('my-cluster',X), where X is the number of workers. The total number of processes spawned will always be X+1, where the extra process handles the overhead for the rest. See the MathWorks documentation for parpool.
  • MATLAB does not support MPI function calls in MATLAB code directly, although MPI can be used indirectly through MEX functions.
  • In Julia the mechanism is called Julia processes; these are activated by executing a script as julia -p X script.jl, where X is the number of processes. Code modifications are required to make use of the workers.
  • Julia also supports MPI through the package MPI.jl.

Packages and syntax

  • R packages
    • parallel
    • doParallel
    • Rmpi
    • pbdMPI (on Dardel)
  • R syntax for parallel/doParallel
    • makeCluster
    • registerDoParallel
    • foreach
    • stopCluster
  • Some R syntax for MPI
    • mpi.universe.size
  • MATLAB syntax
    • parpool
    • parcluster
    • parfor
    • parfeval
    • spmd
  • Julia packages
    • Distributed (native to Julia)
      • Convenient
      • Not difficult to code
    • SharedArrays
    • MPI
  • Julia syntax for Distributed/SharedArrays
    • addprocs(nworkers)
    • SharedVector
    • @everywhere
    • @sync
    • @distributed
    • @spawn

Allocating distributed jobs

  • Use --ntasks=<number> or -n <number>.
  • Use srun if you run a script.
  • Use -p <number of processes> together with julia or R to start a language shell with access to several cores.
  • Use an MPI or “distributed” version of your software: a ‘regular’ non-MPI/distributed version will just run the same command serially, perhaps many times at once!
  • Use mpirun if it is real MPI.

Warning

  • Check if the resources that you allocated are being used properly.
  • Monitor the usage of hardware resources with tools offered at your HPC center, for instance job-usage at HPC2N.
  • Here are some examples (of many) of things you will need to pay attention to when porting a parallel code from your laptop (or another HPC center) to our clusters:

We have a tool to monitor the usage of resources called job-usage at HPC2N.

If you are in an interactive node session, the top command will give you information about resource usage.


Exercises

Important

Compute allocations in this workshop

  • Pelle/Rackham: uppmax2025-2-360
  • Kebnekaise: hpc2n2025-151
  • Cosmos: lu2025-2-94
  • Tetralith: naiss2025-22-934
  • Dardel: naiss2025-22-934
  • Alvis: naiss2025-22-934

Storage space for this workshop

  • Rackham: /proj/r-matlab-julia-pelle
  • Kebnekaise: /proj/nobackup/fall-courses
  • Tetralith: /proj/courses-fall-2025/users/
  • Dardel: /cfs/klemming/projects/snic/courses-fall-2025
  • Alvis: /mimer/NOBACKUP/groups/courses-fall-2025/

Running parallel with distributed libraries

In this exercise we will run a parallelized code that performs a 2D integration:

\int^{\pi}_{0}\int^{\pi}_{0}\sin(x+y)dxdy = 0

One way to perform the integration is by creating a grid in the x and y directions. More specifically, one divides the integration range in both directions into n bins.
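With the midpoint rule and bin width h = \pi/n, this amounts to evaluating the integrand at the bin centres, which is exactly what the codes below do:

\int^{\pi}_{0}\int^{\pi}_{0}\sin(x+y)\,dx\,dy \approx h^{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\sin(x_i+y_j), \qquad x_i = h\,(i-\tfrac{1}{2}),\; y_j = h\,(j-\tfrac{1}{2})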

Here is a parallel code using the Distributed package in Julia (call it integration2d_distributed.jl):

integration2d_distributed.jl

   using Distributed
   using SharedArrays
   using LinearAlgebra
   using Printf
   using Dates

   # Add worker processes (replace with actual number of cores you want to use)
   nworkers = *FIXME*
   addprocs(nworkers)

   # Grid size
   n = 20000
   # Number of processes
   numprocesses = nworkers
   # Shared array to store partial sums for each process
   partial_integrals = SharedVector{Float64}(numprocesses)

   # Function for 2D integration using multiprocessing
   # the macro @everywhere instructs Julia to define this function on all workers
   @everywhere function integration2d_multiprocessing(n, numprocesses, processindex, partial_integrals)
       # Interval size (same for X and Y)
       h = π / n
       # Cumulative variable
       mysum = 0.0
       # Workload for each process
       workload = div(n, numprocesses)

       # Define the range of work for each process according to index
       begin_index = workload * (processindex - 1) + 1
       end_index = workload * processindex

       # Regular integration in the X axis
       for i in begin_index:end_index
           x = h * (i - 0.5)
           # Regular integration in the Y axis
           for j in 1:n
               y = h * (j - 0.5)
               mysum += sin(x + y)
           end
       end

       # Store the result in the shared array
       partial_integrals[processindex] = h^2 * mysum
   end

   # function for main
   function main()
       # Start the timer
       starttime = now()

       # Distribute tasks to processes
       @sync for i in 1:numprocesses
           @spawnat i integration2d_multiprocessing(n, numprocesses, i, partial_integrals)
       end

       # Calculate the total integral by summing over partial integrals
       integral = sum(partial_integrals)

       # end timing
       endtime = now()

       # Output results
       println("Integral value is $(integral), Error is $(abs(integral - 0.0))")
       println("Time spent: $(Dates.value(endtime - starttime) / 1000) sec")
   end

   # Run the main function
   main()

Run the code with the following batch script.

job.sh

=== "UPPMAX"

       #!/bin/bash -l
       #SBATCH -A naiss202X-XY-XYZ  # your project_ID
       #SBATCH -J job-serial        # name of the job
       #SBATCH -n *FIXME*           # nr. tasks/cores
       #SBATCH --time=00:20:00      # requested time
       #SBATCH --error=job.%J.err   # error file
       #SBATCH --output=job.%J.out  # output file

       ml julia/1.8.5

       julia integration2d_distributed.jl
=== "HPC2N"

        #!/bin/bash
       #SBATCH -A hpc2n202x-xyz     # your project_ID
       #SBATCH -J job-serial        # name of the job
       #SBATCH -n *FIXME*           # nr. tasks
       #SBATCH --time=00:20:00      # requested time
       #SBATCH --error=job.%J.err   # error file
       #SBATCH --output=job.%J.out  # output file

       ml purge  > /dev/null 2>&1
       ml Julia/1.9.3-linux-x86_64

       julia integration2d_distributed.jl
=== "LUNARC"

      #!/bin/bash
     #SBATCH -A lu202X-XX-XX      # your project_ID
     #SBATCH -J job-serial        # name of the job
     #SBATCH -n *FIXME*           # nr. tasks
     #SBATCH --time=00:20:00      # requested time
     #SBATCH --error=job.%J.err   # error file
     #SBATCH --output=job.%J.out  # output file
     # reservation (optional)
     #SBATCH --reservation=RPJM-course*FIXME*

     ml purge  > /dev/null 2>&1
     ml Julia/1.9.3-linux-x86_64

     julia integration2d_distributed.jl
=== "PDC"

      #!/bin/bash
     #SBATCH -A naiss202t-uv-wxyz # your project_ID
     #SBATCH -J job               # name of the job
     #SBATCH  -p shared           # name of the queue
     #SBATCH --ntasks=*FIXME*     # nr. of tasks
     #SBATCH --cpus-per-task=1    # nr. of cores per-task
     #SBATCH --time=00:03:00      # requested time
     #SBATCH --error=job.%J.err   # error file
     #SBATCH --output=job.%J.out  # output file

     # Load dependencies and Julia version
     ml PDC/23.12 julia/1.10.2-cpeGNU-23.12

     julia integration2d_distributed.jl

=== "NSC"

```bash

 #!/bin/bash
 #SBATCH -A naiss202t-uv-xyz  # your project_ID
 #SBATCH -J job-serial        # name of the job
 #SBATCH -n *FIXME*           # nr. tasks
 #SBATCH --time=00:20:00      # requested time
 #SBATCH --error=job.%J.err   # error file
 #SBATCH --output=job.%J.out  # output file

 # Load any modules you need, here for Julia
 ml julia/1.9.4-bdist

 julia integration2d_distributed.jl

```

Try different numbers of cores for this batch script (the FIXME string) using the sequence 1, 2, 4, 8, 12, and 14. Note: this number should match the number of processes (also a FIXME string) in the Julia script. Collect the timings that are printed out in job.*.out. According to these execution times, which number of cores gives the optimal (fastest) simulation?

Challenge: Increase the grid size (n) to 100000 and submit the batch job with 4 workers (in the Julia script) and request 5 cores in the batch script. Monitor the usage of resources with tools available at your center, for instance top (UPPMAX) or job-usage (HPC2N).

Here is a parallel code using the parallel and doParallel packages in R (call it integration2d.R). Note: check whether these packages are already installed for the required R version, otherwise install them with install.packages(). The recommended R setup for this exercise is ml GCC/12.2.0 OpenMPI/4.1.4 R/4.2.2 (HPC2N).

integration2d.R

     library(parallel)
     library(doParallel)

     # nr. of workers/cores that will solve the tasks
     nworkers <- *FIXME*

     # grid size
     n <- 840

     # Function for 2D integration (non-optimal implementation)
     integration2d <- function(n, numprocesses, processindex) {
       # Interval size (same for X and Y)
       h <- pi / n
       # Cumulative variable
       mysum <- 0.0
       # Workload for each process
       workload <- floor(n / numprocesses)

       # Define the range of work for each process according to index
       begin_index <- workload * (processindex - 1) + 1
       end_index <- workload * processindex

       # Regular integration in the X axis
       for (i in begin_index:end_index) {
         x <- h * (i - 0.5)
         # Regular integration in the Y axis
         for (j in 1:n) {
           y <- h * (j - 0.5)
           mysum <- mysum + sin(x + y)
         }
       }
       # Return the result
       return(h^2 * mysum)
     }

     # Set up the cluster for doParallel
     cl <- makeCluster(nworkers)
     registerDoParallel(cl)

         # Start the timer
         starttime <- Sys.time()

         # Distribute tasks to processes and combine the outputs into the results list
         results <- foreach(i = 1:nworkers, .combine = c) %dopar% { integration2d(n, nworkers, i) }

         # Calculate the total integral by summing over partial integrals
         integral <- sum(results)

         # End the timing
         endtime <- Sys.time()

         # Print out the result
         print(paste("Integral value is", integral, "Error is", abs(integral - 0.0)))
         print(paste("Time spent:", difftime(endtime, starttime, units = "secs"), "seconds"))

     # Stop the cluster after computation
     stopCluster(cl)

Run the code with the following batch script.

job.sh

=== "UPPMAX"

          #!/bin/bash -l
         #SBATCH -A naiss202u-wv-xyz  # your project_ID
         #SBATCH -J job-serial        # name of the job
          #SBATCH -n *FIXME*           # nr. tasks/cores
         #SBATCH --time=00:20:00      # requested time
         #SBATCH --error=job.%J.err   # error file
         #SBATCH --output=job.%J.out  # output file

         ml R_packages/4.1.1

         Rscript --no-save --no-restore integration2d.R
=== "HPC2N"

          #!/bin/bash
         #SBATCH -A hpc2n202w-xyz     # your project_ID
         #SBATCH -J job-serial        # name of the job
         #SBATCH -n *FIXME*           # nr. tasks
         #SBATCH --time=00:20:00      # requested time
         #SBATCH --error=job.%J.err   # error file
         #SBATCH --output=job.%J.out  # output file

         ml purge > /dev/null 2>&1
         ml GCC/12.2.0  OpenMPI/4.1.4 R/4.2.2
         Rscript --no-save --no-restore integration2d.R

=== "LUNARC"

 ```sh

      #!/bin/bash
      #SBATCH -A lu202u-wy-yz      # your project_ID
      #SBATCH -J job-serial        # name of the job
      #SBATCH -n *FIXME*           # nr. tasks
      #SBATCH --time=00:20:00      # requested time
      #SBATCH --error=job.%J.err   # error file
      #SBATCH --output=job.%J.out  # output file
      #SBATCH --reservation=RPJM-course*FIXME* # reservation (optional)

      ml purge > /dev/null 2>&1
      ml GCC/11.3.0  OpenMPI/4.1.4  R/4.2.1
      Rscript --no-save --no-restore integration2d.R
```

=== "PDC"

```bash

     #!/bin/bash
     #SBATCH -A naiss202t-uv-wxyz # your project_ID
     #SBATCH -J job               # name of the job
     #SBATCH  -p shared           # name of the queue
     #SBATCH --ntasks=*FIXME*     # nr. of tasks
     #SBATCH --cpus-per-task=1    # nr. of cores per-task
     #SBATCH --time=00:03:00      # requested time
     #SBATCH --error=job.%J.err   # error file
     #SBATCH --output=job.%J.out  # output file

     # Load dependencies and R version
     ml ...

     Rscript --no-save --no-restore integration2d.R
```

=== "NSC"

```bash

     #!/bin/bash
     #SBATCH -A naiss202t-uv-xyz  # your project_ID
     #SBATCH -J job-serial        # name of the job
     #SBATCH -n *FIXME*           # nr. tasks
     #SBATCH --time=00:20:00      # requested time
     #SBATCH --error=job.%J.err   # error file
     #SBATCH --output=job.%J.out  # output file

     # Load any modules you need, here for R
     ml R/4.4.0-hpc1-gcc-11.3.0-bare

     Rscript --no-save --no-restore integration2d.R

```

Try different numbers of cores for this batch script (the FIXME string) using the sequence 1, 2, 4, 8, 12, and 14. Note: this number should match the number of processes (also a FIXME string) in the R script. Collect the timings that are printed out in job.*.out. According to these execution times, which number of cores gives the optimal (fastest) simulation?

Challenge: Increase the grid size (n) to 10000 and submit the batch job with 4 workers (in the R script) and request 5 cores in the batch script. Monitor the usage of resources with tools available at your center, for instance top (UPPMAX) or job-usage (HPC2N).

Here is a parallel code using the parfor construct in MATLAB (call it integration2d.m).

integration2d.m
% Number of workers/processes
num_workers = *FIXME*;

% Use parallel pool with 'parfor'
parpool('profile-name',num_workers);  % Start parallel pool with num_workers workers

% Grid size
n = 6720;

% bin size
h = pi / n;

tic;  % Start timer
% Shared variable to collect partial sums
partial_integrals = 0.0;

% In Matlab one can use parfor to parallelize loops
parfor i = 1:n
    partial_integrals = partial_integrals + integration2d_partial(n,i);
end

% Compute the integral by multiplying by the bin size
integral = partial_integrals * h^2;
elapsedTime = toc;  % Stop timer

fprintf("Integral value is %e\n", integral);
fprintf("Error is %e\n", abs(integral - 0.0));
fprintf("Time spent: %.2f sec\n", elapsedTime);

% Clean up the parallel pool
delete(gcp('nocreate'));

% Function for the 2D integration only computes a single bin
function mysum = integration2d_partial(n,i)
    % bin size
    h = pi / n;
    % Partial summation
    mysum = 0.0;
        % A single bin is computed
        x = h * (i - 0.5);
        % Regular integration in the Y axis
        for j = 1:n
            y = h * (j - 0.5);
            mysum = mysum + sin(x + y);
        end
end

You can run this script directly from the MATLAB GUI. Try different numbers of workers (the FIXME string) using the sequence 1, 2, 4, 8, 12, and 14. Collect the timings that are printed out in the MATLAB command window. According to these execution times, which number of workers gives the optimal (fastest) simulation?

Challenge: Increase the grid size (n) to 100000 and submit the batch job with 4 workers. Monitor the usage of resources with tools available at your center, for instance top (UPPMAX), job-usage (HPC2N), or if you’re working in the GUI (e.g. on LUNARC), you can click Parallel and then Monitor Jobs. For job-usage, you can see the job ID if you type squeue --me on a terminal on Kebnekaise.

Extra exercises covering MPI

Julia: Find the difference in coding

In this exercise we will run a parallelized code that performs the 2D integration used above:

\int^{\pi}_{0}\int^{\pi}_{0}\sin(x+y)dxdy = 0
  • Note the differences in writing the code
  • MPI usually needs more rewriting
Julia scripts
Serial version (serial.jl):

# nr. of grid points
n = 100000

function integration2d_julia(n)
# interval size
h = π/n
# cummulative variable
mysum = 0.0
# regular integration in the X axis
for i in 0:n-1
    x = h*(i+0.5)
#   regular integration in the Y axis
    for j in 0:n-1
    y = h*(j + 0.5)
    mysum = mysum + sin(x+y)
    end
end
return mysum*h*h
end

res = integration2d_julia(n)
println(res)
Threaded version (threaded.jl):

using Base.Threads

# nr. of grid points
n = 100000

# nr. of threads
numthreads = nthreads()

# array for storing partial sums from threads
partial_integrals = zeros(Float64, numthreads)

function integration2d_julia_threaded(n,numthreads,threadindex)
# interval size
h = π/convert(Float64,n)
# cummulative variable
mysum = 0.0
# workload for each thread
workload = convert(Int64, n/numthreads)
# lower and upper integration limits for each thread
lower_lim = workload * (threadindex - 1)
upper_lim  = workload * threadindex -1

## regular integration in the X axis
for i in lower_lim:upper_lim
    x = h*(i + 0.5)
#   regular integration in the Y axis
    for j in 0:n-1
    y = h*(j + 0.5)
    mysum = mysum + sin(x+y)
    end
end
partial_integrals[threadindex] = mysum*h*h
return
end

# The threads can compute now the partial summations
@threads for i in 1:numthreads
    integration2d_julia_threaded(n,numthreads,threadid())
end

# The main thread now reduces the array
total_sum = sum(partial_integrals)

println("The integral value is $total_sum")
Distributed version (distributed.jl):

@everywhere begin
using Distributed
using SharedArrays
end

# nr. of grid points
n = 100000

# nr. of workers
numworkers = nworkers()

# array for storing partial sums from workers
partial_integrals = SharedArray( zeros(Float64, numworkers) )

@everywhere function integration2d_julia_distributed(n,numworkers,workerid,A::SharedArray)
# interval size
h = π/convert(Float64,n)
# cummulative variable
mysum = 0.0
# workload for each worker
workload = convert(Int64, n/numworkers)
# lower and upper integration limits for each thread
lower_lim = workload * (workerid - 2)
upper_lim = workload * (workerid - 1) -1

# regular integration in the X axis
for i in lower_lim:upper_lim
    x = h*(i + 0.5)
#   regular integration in the Y axis
    for j in 0:n-1
    y = h*(j + 0.5)
    mysum = mysum + sin(x+y)
    end
end
A[workerid-1] = mysum*h*h
return
end

# The workers can compute now the partial summations
@sync @distributed for i in 1:numworkers
    integration2d_julia_distributed(n,numworkers,myid(),partial_integrals)
end

# The main process now reduces the array
total_sum = sum(partial_integrals)

println("The integral value is $total_sum")
MPI version (mpi.jl):

using MPI
MPI.Init()

# Initialize the communicator
comm = MPI.COMM_WORLD
# Get the ranks of the processes
rank = MPI.Comm_rank(comm)
# Get the size of the communicator
size = MPI.Comm_size(comm)

# root process
root = 0

# nr. of grid points
n = 100000

function integration2d_julia_mpi(n,numworkers,workerid)

# interval size
h = π/convert(Float64,n)
# cummulative variable
mysum = 0.0
# workload for each worker
workload = div(n,numworkers)
# lower and upper integration limits for each thread
lower_lim = workload * workerid
upper_lim = workload * (workerid + 1) -1

# regular integration in the X axis
for i in lower_lim:upper_lim
    x = h*(i + 0.5)
#   regular integration in the Y axis
    for j in 0:n-1
    y = h*(j + 0.5)
    mysum = mysum + sin(x+y)
    end
end
partial_integrals = mysum*h*h
return partial_integrals
end

# The workers can compute now the partial summations
p = integration2d_julia_mpi(n,size,rank)

# The root process now reduces the array
integral = MPI.Reduce(p,+,root, comm)

if rank == root
println("The integral value is $integral")
end

MPI.Finalize()
Batch scripts

The corresponding batch scripts for these examples are given here:

=== "UPPMAX"

#!/bin/bash -l
#SBATCH -A naiss202t-uv-wxyz
#SBATCH -J job
#SBATCH -n 1
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml julia/1.8.5

# "time" command is optional
time julia serial.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml julia/1.8.5

# "time" command is optional
time julia -t 8 threaded.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml julia/1.8.5

# "time" command is optional
time julia -p 8 distributed.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml julia/1.8.5
ml gcc/11.3.0 openmpi/4.1.3
# "time" command is optional

# export the PATH of the Julia MPI wrapper
export PATH=~/.julia/bin:$PATH

time mpiexecjl -np 8 julia mpi.jl
=== "HPC2N"

#!/bin/bash
#SBATCH -A hpc2n202w-xyz
#SBATCH -J job
#SBATCH -n 1
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml purge  > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64

# "time" command is optional
time julia serial.jl
#!/bin/bash
#SBATCH -A hpc2n202w-xyz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml purge  > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64

# "time" command is optional
time julia -t 8 threaded.jl
#!/bin/bash
#SBATCH -A hpc2n202w-xyz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml purge  > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64

# "time" command is optional
time julia -p 8 distributed.jl
#!/bin/bash
#SBATCH -A hpc2n202w-xyz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml purge  > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64
ml foss/2021b

# export the PATH of the Julia MPI wrapper
export PATH=/home/u/username/.julia/bin:$PATH

time mpiexecjl -np 8 julia mpi.jl
=== "LUNARC"

#!/bin/bash
#SBATCH -A lu202w-x-yz
#SBATCH -J job
#SBATCH -n 1
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml purge  > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64

# "time" command is optional
time julia serial.jl
#!/bin/bash
#SBATCH -A lu202w-x-yz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml purge  > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64

# "time" command is optional
time julia -t 8 threaded.jl
#!/bin/bash
#SBATCH -A lu202w-x-yz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml purge  > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64

# "time" command is optional
time julia -p 8 distributed.jl
#!/bin/bash
#SBATCH -A lu202w-x-yz
#SBATCH -J job
#SBATCH -n 8
#SBATCH --time=00:10:00
#SBATCH --error=job.%J.err
#SBATCH --output=job.%J.out

ml purge  > /dev/null 2>&1
ml Julia/1.8.5-linux-x86_64
ml foss/2021b

# export the PATH of the Julia MPI wrapper
export PATH=/home/u/username/.julia/bin:$PATH

time mpiexecjl -np 8 julia mpi.jl
=== "PDC"

#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job               # name of the job
#SBATCH  -p shared           # name of the queue
#SBATCH  --ntasks=1          # nr. of tasks
#SBATCH --cpus-per-task=1    # nr. of cores per-task
#SBATCH --time=00:03:00      # requested time
#SBATCH --error=job.%J.err   # error file
#SBATCH --output=job.%J.out  # output file

# Load dependencies and Julia version
ml PDC/23.12 julia/1.10.2-cpeGNU-23.12

# "time" command is optional
time julia serial.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job               # name of the job
#SBATCH  -p shared           # name of the queue
#SBATCH  --ntasks=1          # nr. of tasks
#SBATCH --cpus-per-task=8    # nr. of cores per-task
#SBATCH --time=00:03:00      # requested time
#SBATCH --error=job.%J.err   # error file
#SBATCH --output=job.%J.out  # output file

# Load dependencies and Julia version
ml PDC/23.12 julia/1.10.2-cpeGNU-23.12

# "time" command is optional
time julia -t 8 threaded.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job               # name of the job
#SBATCH  -p shared           # name of the queue
#SBATCH  --ntasks=1          # nr. of tasks
#SBATCH --cpus-per-task=8    # nr. of cores per-task
#SBATCH --time=00:03:00      # requested time
#SBATCH --error=job.%J.err   # error file
#SBATCH --output=job.%J.out  # output file

# Load dependencies and Julia version
ml PDC/23.12 julia/1.10.2-cpeGNU-23.12

# "time" command is optional
time julia -p 8 distributed.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job               # name of the job
#SBATCH  -p shared           # name of the queue
#SBATCH  --ntasks=8          # nr. of tasks
#SBATCH --cpus-per-task=1    # nr. of cores per-task
#SBATCH --time=00:03:00      # requested time
#SBATCH --error=job.%J.err   # error file
#SBATCH --output=job.%J.out  # output file

# Load dependencies and Julia version
ml PDC/23.12 julia/1.10.2-cpeGNU-23.12

# export the PATH of the Julia MPI wrapper
export PATH=/cfs/klemming/home/u/username/.julia/bin:$PATH

time mpiexecjl -np 8 julia mpi.jl
=== "NSC"

#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job               # name of the job
#SBATCH -n 1                 # nr. of tasks
#SBATCH --time=00:03:00      # requested time
#SBATCH --error=job.%J.err   # error file
#SBATCH --output=job.%J.out  # output file

# Load dependencies and Julia version
ml julia/1.9.4-bdist

# "time" command is optional
time julia serial.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job               # name of the job
#SBATCH -n 8                 # nr. of tasks
#SBATCH --time=00:03:00      # requested time
#SBATCH --error=job.%J.err   # error file
#SBATCH --output=job.%J.out  # output file

# Load dependencies and Julia version
ml julia/1.9.4-bdist

# "time" command is optional
time julia -t 8 threaded.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job               # name of the job
#SBATCH -n 8                 # nr. of tasks
#SBATCH --time=00:03:00      # requested time
#SBATCH --error=job.%J.err   # error file
#SBATCH --output=job.%J.out  # output file

# Load dependencies and Julia version
ml julia/1.9.4-bdist

# "time" command is optional
time julia -p 8 distributed.jl
#!/bin/bash
#SBATCH -A naiss202t-uv-wxyz # your project_ID
#SBATCH -J job               # name of the job
#SBATCH -n 8                 # nr. of tasks
#SBATCH --time=00:03:00      # requested time
#SBATCH --error=job.%J.err   # error file
#SBATCH --output=job.%J.out  # output file

# Load dependencies and Julia version
ml buildtool-easybuild/4.8.0-hpce082752a2 foss/2023b
ml julia/1.9.4-bdist

# export the PATH of the Julia MPI wrapper
export PATH=/home/username/.julia/bin:$PATH

time mpiexecjl -np 8 julia mpi.jl

R: RMPI

RMPI

Short parallel example using the package “Rmpi” (“pbdMPI” on Dardel)

Short parallel example (UPPMAX), using the package “Rmpi”. We need to load both the module R/4.4.2-gfbf-2024a and the module R-bundle-CRAN/2024.11-foss-2024a, together with a suitable OpenMPI module, OpenMPI/5.0.3-GCC-13.3.0.

#!/bin/bash -l
#SBATCH -A uppmax2025-2-360
#Asking for 10 min.
#SBATCH -t 00:10:00
#SBATCH -n 8

export OMPI_MCA_mpi_warn_on_fork=0
export OMPI_MCA_btl_openib_allow_ib=1

ml purge > /dev/null 2>&1
ml R/4.4.2-gfbf-2024a
ml OpenMPI/5.0.3-GCC-13.3.0 R-bundle-CRAN/2024.11-foss-2024a R-bundle-Bioconductor/3.20-foss-2024a-R-4.4.2

mpirun -np 1 R CMD BATCH --no-save --no-restore Rmpi.R output.out 

Short parallel example (HPC2N), using the package “Rmpi”. Loading R/4.4.1 and its prerequisites, as well as R-bundle-CRAN/2024.06 and its prerequisites.

#!/bin/bash
#SBATCH -A hpc2n2025-151 # Change to your own project ID
#Asking for 10 min.
#SBATCH -t 00:10:00
#SBATCH -n 8

export OMPI_MCA_mpi_warn_on_fork=0

ml purge > /dev/null 2>&1
ml GCC/13.2.0 R/4.4.1
ml OpenMPI/4.1.6 R-bundle-CRAN/2024.06

mpirun -np 1 Rscript Rmpi.R 

Short parallel example (LUNARC), using the package “Rmpi”. Loading R/4.2.1 and its prerequisites.

#!/bin/bash
#SBATCH -A lu2025-2-94 # Change to your own project ID
# Asking for 10 min.
#SBATCH -t 00:10:00
#SBATCH -n 8

export OMPI_MCA_mpi_warn_on_fork=0

ml purge > /dev/null 2>&1
ml GCC/11.3.0  OpenMPI/4.1.4
ml R/4.2.1

mpirun -np 1 R CMD BATCH --no-save --no-restore Rmpi.R output.out

Short parallel example (NSC), using the package “pbdMPI”, as “Rmpi” does not work correctly on NSC. Loading R/4.2.2.

Note: for NSC you first need to install “pbdMPI” (module load R/4.2.2-hpc1-gcc-11.3.0-bare, start R, run install.packages('pbdMPI'), and pick a CRAN mirror (Denmark, Finland, Sweden, or another nearby one)).

#!/bin/bash
#SBATCH -A naiss2025-22-934 
# Asking for 15 min.
#SBATCH -t 00:15:00
#SBATCH -n 8
#SBATCH --exclusive 

ml purge > /dev/null 2>&1
ml R/4.2.2-hpc1-gcc-11.3.0-bare 

srun --mpi=pmix Rscript pbdMPI.R  

Short parallel example (PDC), using the package “pbdMPI”. Loading R/4.4.2.

Note: for PDC you first need to install “pbdMPI” (“Rmpi” does not work).

  • You can find the tarball in /cfs/klemming/projects/supr/courses-fall-2025/pbdMPI_0.5-4.tar.gz.
  • Copy it to your own subdirectory under the project directory and then do:
    • module load PDC/24.11 R/4.4.2-cpeGNU-24.11
    • R CMD INSTALL pbdMPI_0.5-4.tar.gz --configure-args=" --with-mpi-include=/opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/include --with-mpi-libpath=/opt/cray/pe/mpich/8.1.28/ofi/gnu/12.3/lib --with-mpi-type=MPICH2" --no-test-load
#!/bin/bash -l 
#SBATCH -A naiss2025-22-934
# Asking for 10 min.
#SBATCH -t 00:10:00
#SBATCH --nodes 2
#SBATCH --ntasks-per-node=8
#SBATCH -p main
#SBATCH --output=pbdMPI-test_%J.out 

# If you do ml purge you also need to restore the preloaded modules which you should have saved 
# when you logged in. Otherwise leave the two following lines outcommented. 
#ml purge > /dev/null 2>&1
#ml restore preload
ml PDC/24.11
ml R/4.4.2-cpeGNU-24.11

srun -n 4 Rscript pbdMPI.R

This R script uses package “Rmpi”.

# Load the R MPI package if it is not already loaded.
if (!is.loaded("mpi_initialize")) {
    library("Rmpi")
}
print(mpi.universe.size())
ns <- mpi.universe.size() - 1
mpi.spawn.Rslaves(nslaves=ns)
#
# In case R exits unexpectedly, have it automatically clean up
# resources taken up by Rmpi (slaves, memory, etc...)
.Last <- function(){
    if (is.loaded("mpi_initialize")){
        if (mpi.comm.size(1) > 0){
            print("Please use mpi.close.Rslaves() to close slaves.")
            mpi.close.Rslaves()
        }
        print("Please use mpi.quit() to quit R")
        .Call("mpi_finalize")
    }
}
# Tell all slaves to return a message identifying themselves
mpi.remote.exec(paste("I am",mpi.comm.rank(),"of",mpi.comm.size(),system("hostname",intern=T)))

# Test computations
x <- 5
x <- mpi.remote.exec(rnorm, x)
length(x)
x

# Tell all slaves to close down, and exit the program
mpi.close.Rslaves()

mpi.quit()

This R script uses package “pbdMPI”.

library(pbdMPI)

ns <- comm.size()

# Tell all R sessions to return a message identifying themselves
id <- comm.rank()
ns <- comm.size()
host <- system("hostname", intern = TRUE)
comm.cat("I am", id, "on", host, "of", ns, "\n", all.rank = TRUE)

# Test computations
x <- 5
x <- rnorm(x)
comm.print(length(x))
comm.print(x, all.rank = TRUE)

finalize()

Send the script to the batch system:

sbatch <batch script>

MATLAB: parfor and parfeval

Challenge 1. Create and run a parallel code

We have the following MATLAB code, which generates an array of 10000 random numbers and stores the sum of all elements in a variable called s:

r = rand(1,10000);
s = sum(r);

We now want to repeat these steps (generating the numbers and taking the sum) 6 times, so that the repetitions run at the same time. Use parfor to parallelize these steps. Once your code is parallelized, enclose it in a parpool section and send the job to the queue.

Solution
% Nr. of workers
nworkers = 6;

% Use parallel pool with 'parfor'
parpool('name-of-your-cluster',nworkers);  % Start parallel pool with nworkers workers

myarray = []; % Optional in this exercise to store partial results
parfor i=1:nworkers
   r = rand(1,10000);
   s = sum(r);
   myarray = [myarray,s];
end

myarray  % print out the results from the workers

% Clean up the parallel pool
delete(gcp('nocreate'));
Challenge 2. Run a parallel code with batch MATLAB function

The following function uses parfeval to do some computation (specifically, it takes the per-column average of a random matrix of size nsize, here equal to 1000):

function results = parfeval_mean(nsize)
   % Submit the computation asynchronously to the parallel pool ...
   f = parfeval(@mean, 1, rand(nsize));
   % ... then wait for it and collect the result (a 1 x nsize row vector)
   results = fetchOutputs(f);
end

Place this function in a file called parfeval_mean.m and submit this function with the MATLAB batch command.

Solution
c = parcluster('name-of-your-cluster');
j = c.batch(@parfeval_mean,1,{1000},'pool',1);
j.wait;                               % wait for the job to finish
t = j.fetchOutputs{:};                % fetch the results (row vector of column means)
fprintf('Column mean: %.5f \n', t);   % print out the results