Thread parallelism

Learning outcomes

  • I can schedule jobs with thread parallelism
  • I can explain how jobs with thread parallelism are scheduled
  • I can explain how Julia/MATLAB/R code makes use of thread parallelism
  • I can explain the results of a correct benchmark
  • I can explain the results of an incorrect benchmark
  • I can argue why I should stick to my programming language, even if it is not the fastest
For teachers

Teaching goals are:

  • Learners have scheduled and run a job that needs more cores, with a calculation in their favorite language
  • Learners understand when it is possible/impossible and/or useful/useless to run a job with multiple cores
  • Learners can argue why they should stick to their programming languages, even if it is not the fastest

Prior knowledge:

  • What is thread parallelism?
  • Why use thread parallelism?
  • Are there other ways to make your code run faster?
  • What is a benchmark?
  • Why do a benchmark?

Feedback:

  • When to use parallel computing?
  • When not to use parallel computing?
Status overview
| HPC cluster | Julia | MATLAB | R | Other comments |
|---|---|---|---|---|
| Alvis | Unknown | Unknown | Unknown | . |
| Bianca | Unknown | Unknown | Unknown | . |
| COSMOS | Yes | No | Yes | . |
| Dardel | Yes | No | Yes | . |
| Kebnekaise | Yes | No | Yes | . |
| LUMI | Unknown | Unknown | Unknown | . |
| Rackham | Yes | No | Yes | . |
| Pelle | Yes | No | Yes | . |
| Tetralith | Yes | No | Yes | Seems to eat up jobs |
Prefer this session as a video?

Then watch the YouTube video 'R-Julia-MATLAB course, advanced day: Thread parallelism'.

Why thread parallelism is important

Because it is one way to speed up (pun intended) the calculation.

Goal

In this session, we are going to benchmark thread parallelism, as we should not make claims about performance without measurements [CppCore Per.6].

flowchart TD
  user[User]
  benchmark_script[Benchmark script]
  slurm_script[Slurm script]
  r_script[R script]
  julia_script[Julia script]
  matlab_script[MATLAB script]
  user --> |Account, language| benchmark_script
  benchmark_script --> |Account, language, number of cores| slurm_script
  slurm_script --> julia_script
  slurm_script --> matlab_script
  slurm_script --> r_script

Benchmark script

benchmark_2d_integration.sh is the script that starts a benchmark, by submitting multiple jobs to the Slurm queue, using the Slurm script below.

The goal of the benchmark script is to do a fixed unit of work with increasingly more cores.

As the script itself only does light calculations, you can run it directly (i.e. on a login node, without scheduling it). Here is how to call the script:

bash benchmark_2d_integration.sh [account] [language]
Why not call the script with ./benchmark_2d_integration.sh?

Because that would require one extra step: making the script executable first (e.g. with chmod +x benchmark_2d_integration.sh).

For example:

bash benchmark_2d_integration.sh staff r

If you misspell an argument, the script will help you correct it.

Slurm script

This is the script that schedules a job with thread parallelism.

The goal of the script is to submit a calculation that uses thread parallelism, with a custom number of cores.

This Slurm script is called by the benchmark script, i.e. not directly by a user. If the Slurm script is absent, the benchmark script will (try to) download it for you.

How do I run it anyway?

You do not; instead, you run the benchmark script described above.

However, you can run it as such:

sbatch -A [account] -n [number_of_cores] do_[language]_2d_integration.sh

For example:

sbatch -A staff -n 1 do_r_2d_integration.sh

# On Dardel
sbatch -A staff -n 1 -p main do_r_2d_integration.sh

There are 3 Slurm scripts, 1 per language:

| Language | Script with calculation |
|---|---|
| Julia | do_julia_2d_integration.sh |
| MATLAB | do_matlab_2d_integration.sh |
| R | do_r_2d_integration.sh |

Each of these Slurm scripts is called by the benchmark script, where the benchmark script supplies the desired number of cores.

Language script

This is the code (in your favorite language) that performs a job with thread parallelism.

The goal of the language script is to have a fixed unit of work that can be done by a custom number of cores.

This language script is called by the Slurm script, i.e. not directly by a user. If the language script is absent, the benchmark script will (try to) download it for you.

How do I run it anyway?

Check the Slurm script for your favorite language.

In general, you can run it as such:

[interpreter] [script_name] [number_of_cores] [grid_size]

On a login node, use 1 core and a grid size of 1 to start the lightest calculation possible:

julia do_2d_integration.jl 1 1
Rscript do_2d_integration.R 1 1
| Language | Script with calculation |
|---|---|
| Julia | do_2d_integration.jl |
| MATLAB | do_2d_integration.m |
| R | do_2d_integration.R |
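
As a hedged sketch of how such a script can pick up its two positional arguments (the actual do_2d_integration.R may do this differently):

# Hedged sketch: read [number_of_cores] [grid_size] from the command line
args <- commandArgs(trailingOnly = TRUE)
n_workers <- as.integer(args[1])
grid_size <- as.integer(args[2])
cat("Using", n_workers, "workers at grid size", grid_size, "\n")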

Exercises

Exercise 1: start the benchmark on your HPC cluster

The goal of this exercise is to start the benchmark script on your HPC cluster, as well as some troubleshooting.

On your HPC cluster:

  • Download the benchmark script
How to do that?

There are many ways to do so.

One way is to download it directly from this course’s repository:

wget https://raw.githubusercontent.com/UPPMAX/R-matlab-julia-HPC/refs/heads/main/docs/advanced/thread_parallelism/benchmark_2d_integration.sh
  • Run the benchmark script, using the course's NAISS project and your favorite language
How to do that?

The ‘Benchmark script’ section shows how:

bash benchmark_2d_integration.sh staff r

You can use our projects overview page to find the course NAISS project for your HPC cluster.

  • Check the Slurm output files for problems. If there are problems: fix these, then run the benchmark script again
How to do that?

There are many ways to do so.

One way is to show all files with the .out extension:

cat *.out

Exercise 2: read the benchmark script

Now that the benchmark script is running, we have the time to figure out what it is doing.

  • What is the most important single line in this script, i.e. the line it is all about? Tip: start looking from the bottom of the script
Answer

For all HPC clusters except Dardel:

sbatch -A "${slurm_job_account}" -N "${n_nodes}" -n "${n_cores}" "${slurm_script_name}"

For the Dardel HPC cluster:

sbatch -A "${slurm_job_account}" -N "${n_nodes}" --ntasks "${n_cores}" -p shared "${slurm_script_name}"
  • In English, describe what the line does in general terms
Answer

Schedule to run …

  • on some account
  • with some number of nodes
  • with some number of cores
  • (on Dardel) on the shared partition
  • a script with some name
  • This line of code is part of a for loop. In English, what does the for loop achieve?
Answer

The for loop schedules the same calculation to be done with 1 core, then with 2 cores, then with 3 cores, and so on, up to 64 cores.

Exercise 3: read the Slurm script

The benchmark script submits the Slurm script of your favorite language multiple times to the queue: once with 1 core, once with 2 cores, etc.

  • What is the most important single line in this script, i.e. the line it is all about? Tip: start looking from the bottom of the script
Answer

The last line.

| Language | Most important line |
|---|---|
| Julia | `julia --threads "${SLURM_NPROCS}" do_2d_integration.jl "${SLURM_NPROCS}"` |
| MATLAB | `matlab -nodisplay -nosplash -nojvm -batch "run(\"${matlab_target_filename}\"); exit;"` |
| R | `Rscript --no-save --no-restore do_2d_integration.R "${SLURM_NPROCS}"` |
  • In English, describe what the line does in general terms. Tip: this is the same answer for all programming languages. Tip 2: assume 'process' is a synonym for 'core' here. Tip 3: the Julia line is closest to English.
Answer

Run a Julia/MATLAB/R script for the booked number of cores, without doing anything else (e.g. showing a splash screen or restoring a computational environment).

Exercise 4: read the language script

The Slurm script runs a script of your favorite language for a specified number of cores.

  • Locate the lines of code that make the calculation run in parallel.
Answer
Julia:

Threads.@threads for worker_index = 1:n_workers
   results[worker_index] = integration2d(grid_size, n_workers, worker_index)
end

MATLAB:

parfor worker_index = 1:n_workers
    partial_results(worker_index) = integration2d(grid_size, n_workers, worker_index);
end

R:

results_of_workers <- foreach(worker_index = 1:n_workers, .combine = c) %dopar% {
  integration2d(grid_size, n_workers, worker_index)
}
  • In English, describe what these lines do in general terms.
Answer

Each worker does its part of the calculation, after which the partial results are combined.

  • Optional: what is grid_size? What does it do? What would be a better variable name?
Answer

grid_size determines the accuracy of the calculation: the bigger grid_size, the smaller the intervals that are integrated.

A better variable name could be accuracy. However, with such a variable name, it is not obvious that its value ranges from 1 to infinity. With the name grid_size, this range is easier to intuit, as sizes are positive values by nature.

  • Locate the keyword that makes the calculation run in parallel. Or: locate the word that, when removed, would 'downgrade' the calculation to single-threaded.
Answer

The keyword is part of the line that starts the parallel loop:

| Language | Keyword to indicate a parallel calculation |
|---|---|
| Julia | Threads.@threads |
| MATLAB | parfor |
| R | %dopar% |
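
To see the R keyword in action outside the course scripts, here is a minimal self-contained sketch; it uses the doParallel package (the same backend the R language script relies on), and the worker count of 2 is arbitrary:

library(doParallel)

# Register a parallel backend with 2 workers
cl <- makeCluster(2)
registerDoParallel(cl)

# %dopar% runs the loop body on the workers; replacing it with %do%
# 'downgrades' the loop to a single-threaded calculation
squares <- foreach(i = 1:4, .combine = c) %dopar% {
  i * i
}
print(squares)  # 1 4 9 16

# Release the workers
stopCluster(cl)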
  • The function that is run in parallel (i.e. integration2d) is made suitable to be run in parallel. In English, describe which changes are made to make it suitable.
Answer

The function has three (instead of one) arguments:

| Language | Function signature |
|---|---|
| Julia | `function integration2d(grid_size::Int, n_workers::Int, worker_index::Int)` |
| MATLAB | `function integration2d(grid_size, n_workers, worker_index)` |
| R | `integration2d <- function(grid_size, n_workers, worker_index)` |

The extra arguments are n_workers and worker_index.

These allow the worker_index-th worker to know which share of the calculation it must do.

  • What do these changes achieve?
Answer

These changes give each thread the information it needs to run its part of the calculation.

  • Bonus: do you spot the bug in integration2d? Is this a problem? Why would someone keep it in anyway?
Answer

The calculation of begin_index and end_index:

Julia:

workload = fld(grid_size, n_workers)
begin_index = workload * (worker_index - 1) + 1
end_index = workload * worker_index

MATLAB:

grid_cells_per_worker = floor(grid_size / n_workers);
begin_index = grid_cells_per_worker * (worker_index - 1) + 1;
end_index = grid_cells_per_worker * worker_index;

R:

grid_cells_per_worker <- floor(grid_size / n_workers)
begin_index <- grid_cells_per_worker * (worker_index - 1) + 1
end_index <- grid_cells_per_worker * worker_index

To most clearly demonstrate this, imagine a grid_size of 3 and an n_workers of 2:

| Variable name | Value |
|---|---|
| grid_size | 3 |
| n_workers | 2 |
| grid_cells_per_worker | 3 / 2 (rounded down) = 1 |
| begin_index for first worker | 1 * (1 - 1) + 1 = 1 |
| end_index for first worker | 1 * 1 = 1 |
| begin_index for second worker | 1 * (2 - 1) + 1 = 2 |
| end_index for second worker | 1 * 2 = 2 |

This means that, although the calculation is split into 3 parts, only 2 of these are performed.

For bigger grid sizes, however, this problem becomes smaller: the fraction of skipped grid cells shrinks.

One would keep such a bug in for readability: this session is about thread parallelism, not about a more elaborate calculation of these indices.
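
As a hedged sketch (in R, with the variable names above), one possible fix is to let the last worker also take the leftover grid cells:

grid_cells_per_worker <- floor(grid_size / n_workers)
begin_index <- grid_cells_per_worker * (worker_index - 1) + 1
end_index <- grid_cells_per_worker * worker_index
if (worker_index == n_workers) {
  # The last worker also integrates the remaining grid cells
  end_index <- grid_size
}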

Exercise 5: share the results

By now, some of the calculations in exercise 1 will be finished. If not: no worries, just continue!

  • In the terminal on your favorite HPC cluster, run the following command to collect all results (it searches all Slurm output files for the comma-separated result lines and removes duplicates):
grep -EoRh "^[jmlr].*,.*" --include=*.out | sort | uniq

Exercise 6: analyse the results

Take it easy

Doing this analysis yourself is useful if you are fluent in analysing data.

If not, you are encouraged to follow the steps below:

  • Copy-paste the results of the previous exercise into a comma-separated file called my_results.csv.

These are the descriptions of the variables:

| Parameter | Description |
|---|---|
| language | Your programming language |
| hpc_cluster | Your HPC cluster |
| grid_size | Accuracy |
| n_workers | Number of cores used |
| core_secs | Core seconds used, i.e. the time used by all cores together |
  • Load the comma-separated file (my_results.csv) in a spreadsheet, or read it in your favorite programming language
  • Add a column called wall_clock_sec, which equals core_secs divided by n_workers. wall_clock_sec is the time it took the calculation to complete
  • Add a column called speedup, which equals the wall clock time for 1 core divided by the wall clock time for that number of cores
  • Plot the speedup (on the y axis) per number of workers (on the x axis); see the R sketch after this list
  • Compare your speedup with the Amdahl's Law figure of the previous session (above 'Exercises'). What do you estimate is the maximum speedup?
  • What do you estimate is the percentage of code that can be parallelized (i.e. the 'parallel portion' in the figure of Amdahl's Law)?
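
Below is a minimal R sketch of these steps, assuming my_results.csv contains the columns described above for a single language and HPC cluster; the nls fit at the end is an optional extra that estimates the parallel portion from Amdahl's law.

# Read the benchmark results collected in the previous exercise
results <- read.csv("my_results.csv")

# Wall clock time: the time it took the calculation to complete
results$wall_clock_sec <- results$core_secs / results$n_workers

# Speedup: wall clock time on 1 core divided by wall clock time on n cores
wall_clock_sec_1_core <- results$wall_clock_sec[results$n_workers == 1]
results$speedup <- wall_clock_sec_1_core / results$wall_clock_sec

# Plot the speedup (y axis) per number of workers (x axis)
plot(results$n_workers, results$speedup,
  xlab = "Number of workers", ylab = "Speedup")

# Optional: estimate the parallel portion p from Amdahl's law,
# speedup(n) = 1 / ((1 - p) + p / n)
fit <- nls(speedup ~ 1 / ((1 - p) + p / n_workers),
  data = results, start = list(p = 0.5))
coef(fit)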

Exercise 7: compare to others

Compare your results to those of others who ran the same benchmark, but on other HPC clusters and with other programming languages:

Benchmark results: core seconds

Benchmark: the total core seconds per number of workers

Benchmark results: efficiency

Benchmark: Efficiency per number of workers

Benchmark results: speedup

Benchmark: Speedup per number of workers

  • What do you notice?
Answer

This question is vague on purpose: there are many things going on here and many answers are correct:

  • Measurements are messier than you may have thought
  • Julia is 30x faster than R
  • The MATLAB code is single-threaded
  • R has the best parallel efficiency
  • HPC clusters differ
  • The points for Kebnekaise scatter more than those for other HPC clusters: this is because Kebnekaise is the most heterogeneous cluster (i.e. has the widest variety of hardware)

Exercise X1: job scheduling problem?

You have just submitted some multithreaded jobs to the queue. What went wrong here? Why is this a problem?

[richel@pelle1 thread_parallelism]$ squeue --me
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
             54197     pelle do_r_2d_   richel  R       0:14      1 p66
             54200     pelle do_r_2d_   richel  R       0:14      4 p[64-67]
             54216     pelle do_r_2d_   richel  R       0:14      3 p[104-106]
             54217     pelle do_r_2d_   richel  R       0:14      6 p[106-111]
             54169     pelle do_r_2d_   richel  R       0:15      1 p70
Answer

The multithreaded jobs are split over multiple nodes. This means that the booked cores do not all share memory, which thread parallelism relies on. This makes the calculation slower, as the results have to be shared by sending data between nodes.

Exercise X2: learn a faster programming language?

As can be seen in the benchmark, some programming languages are faster than others. This warrants the question: should you learn to program in the programming language that does calculations fastest?

The theoretically fastest programming languages allow you to write machine code. Assembler lets you do so directly. Some other languages (most notably C, C++ and Rust) allow you to insert machine code via inline assembly. Hence, these are the theoretically fastest languages.

To write fast code, should one learn those languages instead?

Below is a figure from [Prechelt, 2000]. It shows the distribution of runtime speeds of a certain problem (called z1000), for different programming languages.

Figure 2, from Prechelt, 2000

Take a close look at the figure. The paper gives advice on whether or not to learn a 'faster' programming language. What do you think the advice is?

Answer

The variance within a programming language is bigger than the variance between languages (adapted from figure 2 in [Prechelt, 2000]).

Instead of learning a faster language, learn how to be fast in your language.

  • Still, the Julia code is 30x faster than the R code. Why would this advice still hold?
Answer

Because none of the code is optimized for speed.

Both Julia and R can call C code, with C being among the fastest higher-level languages.

It would be interesting to see how this benchmark would look for optimized code.

  • Are there other factors that decide which programming language to use? If yes, name some.
Answer

There are many reasons to use a 'slower' programming language:

  • you already know the 'slower' programming language
  • you need access to specific libraries/packages
  • you have colleagues that are willing to teach you
  • you need to work with code that has been written by someone else

Where to go next?

If you want to scale up, distributed parallelism allows you to do a calculation on many computers.

Troubleshooting

T1. Invalid account or account/partition combination specified

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

You’ve specified the wrong account.

Run:

projinfo

T2. There is no package called ‘doParallel’

This is an R error.

You can find it by checking the log files:

cat *.out

In the example output below, it is clearly stated that there is no package called doParallel.

HPC cluster: tetralith
Slurm job account used: naiss2025-22-934
Number of cores booked in Slurm: 32
Error in library(doParallel, quietly = TRUE) : 
  there is no package called ‘doParallel’
Execution halted

To fix this, either:

  • load the correct module, or
  • install the package from the terminal.

To load the correct module, load the R module(s) as loaded by the do_r_2d_integration.sh script, for example:

module load R/4.4.0-hpc1-gcc-11.3.0-bare
Could you expand on that?

Open the do_r_2d_integration.sh script.

Search for the part where modules are loaded, which is at the bottom.

Find the lines where the modules are loaded for your favorite HPC cluster, e.g.

if [ ${hpc_cluster} == "rackham" ]
then
  module load R_packages/4.1.1 >/dev/null 2>&1
fi

Copy the part that loads the modules, excluding the output redirection (the > and everything after it), and run it in a terminal on your favorite HPC cluster:

module load R_packages/4.1.1

You have now loaded the packages needed for the calculation.

To install that package from the terminal, check this course’s material on how to do so.
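
As a minimal sketch, installing the package from an R session could look like this (the CRAN mirror URL is an assumption; your HPC cluster may recommend a specific mirror):

# Hedged sketch: install doParallel into your personal R library
install.packages("doParallel", repos = "https://cloud.r-project.org")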

T3. ‘namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.1.0 is required’

Error in loadNamespace(i, c(lib.loc, .libPaths()), versionCheck = vI[[i]]) : 
  namespace ‘rlang’ 0.4.12 is already loaded, but >= 1.1.0 is required
Calls: <Anonymous> ... waldo_compare -> loadNamespace -> namespaceImport -> loadNamespace
Execution halted

This only happens on Rackham, since 2025-09-25.

T4. Warning: Executing startup failed in matlabrc

Warning: Executing startup failed in matlabrc.
This indicates a potentially serious problem in your MATLAB setup, which should
be resolved as soon as possible.  Error detected was:
MATLAB:undefinedVarOrClass
Unable to resolve the name 'java.net.InetAddress.getLocalHost.getHostAddress'. 
Error using run
RUN cannot execute the file 'do_2d_integration.m 48'. RUN requires a valid
MATLAB script

References

  • [CppCore Per.6] C++ Core Guidelines, rule Per.6: 'Do not make claims about performance without measurements'. https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines
  • [Prechelt, 2000] Prechelt, L. (2000). An empirical comparison of seven programming languages. IEEE Computer, 33(10), 23-29.