Parallel computation

Learning outcomes

  • I can name and describe 3 types of parallel computation
  • I can explain at least 1 advantage of parallel computation
  • I can explain at least 2 disadvantages of parallel computation
  • I can explain how to use my computational resources effectively
For teachers

Teaching goals are:

  • Learners have scheduled and run a job that needs more cores, with a calculation in their favorite language
  • Learners understand when it is possible/impossible and/or useful/useless to run a job with multiple cores

Lesson plan:

  • 10 mins: Prior
  • 5 mins: Present
  • 15 mins: Challenge
  • 15 mins: Feedback

Prior:

  • In an HPC context, what is a core?
  • In an HPC context, what is a node?
  • In an HPC context, what are computational resources?
  • What is parallel computing?
  • How does parallel computing work?
  • When to use parallel computing?

Feedback:

  • When to use parallel computing?
  • When not to use parallel computing?
  • The main section is called ‘The ideal effectiveness of parallelism’. What does ‘ideal’ mean in this context? What could make parallelism less ideal?
Prefer this session as a video?

Then watch the YouTube video of the R-Julia-MATLAB course, advanced day: ‘Parallel computation’.

CoRA, the robotic platform for Arnold

Figure 1: CoRA, a robotic platform in which all computers (e.g. the one connected to the camera, another connected to the gripper, etc.) send messages to one receiving computer: an example of distributed parallelism

Why parallel computing is important

Most HPC clusters impose a maximum job duration of 7-10 days. Your calculation may take longer than that. One technique that may help is parallel computing, where multiple CPU cores work together on the same calculation.

HPC cluster architecture

Here is a simplified picture of HPC cluster architecture:

flowchart TD
  subgraph hpc_cluster[HPC cluster]
    subgraph node_1[Node]
      subgraph cpu_1_1[CPU]
        core_1_1_1[Core]
      end
    end
  end
| Term | What it loosely is | Amount |
| --- | --- | --- |
| Core | Something that does a calculation | One or more per CPU |
| CPU | A collection of cores that share the same memory | One or more per node |
| Node | A collection of CPUs that share the same memory | One or more per HPC cluster |
| HPC cluster | A collection of nodes | One or more per universe |
| The universe [1] | A collection of HPC clusters | One |
  • [1] this is a playful nod to distributed parallelism
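To relate these terms to a machine you are logged in to, here is a minimal Python sketch. One hedge: on a shared HPC cluster the scheduler may grant your job fewer cores than the node physically has; under Slurm, the `SLURM_CPUS_PER_TASK` environment variable reflects that allocation when it is set.

```python
import os

# Number of cores the operating system reports for this node
print("Cores on this node:", os.cpu_count())

# Under Slurm, a job may be allocated fewer cores than the node has
print("Allocated cores:", os.environ.get("SLURM_CPUS_PER_TASK", "not set"))
```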

Simplifications used

1 core is 1 worker is 1 thread is 1 procedure is 1 task

Why do we use this simplification?

Because we start from the assumption that we are reasonably smart, i.e. we will run as many threads as we have cores.

There is an exercise where this assumption is broken.

Types of ‘doing more things at the same time’

There are many types of ‘doing more things at the same time’. One way to distinguish them is by the extent of the parallelism:

| Extent | Parallelism |
| --- | --- |
| Core | Single-threaded (you already do this) |
| Node | Thread parallelism (this session) |
| HPC cluster | Distributed parallelism (next session) |
| The universe | Distributed parallelism |

Today, we will extend your toolkit from a single-threaded calculation (you already do this) to thread parallelism.
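As a first taste of thread parallelism, here is a minimal thread-pool sketch. It is written in Python purely for illustration; in this course you would do the same in R, Julia or MATLAB. One hedge: in CPython, the global interpreter lock limits the speedup of CPU-bound threads, so treat this as an illustration of the programming model, not of the performance.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Spread 8 independent calculations over 2 worker threads;
# map() returns the results in the original order.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(square, range(8)))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```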

The ideal effectiveness of parallelism

Before going into details, we will look at the effectiveness of parallelism in the most optimal case, with the goal that you can determine if it is worth it.

By now, you can probably guess that parallel computing spreads a calculation over multiple things that can calculate.

Imagine a calculation that takes 16 time units, represented as this:

1 core

Figure 2: a calculation of 16 time units run on 1 core, following the legend below:

| Square | A unit of calculation time that … |
| --- | --- |
| 🟥 | cannot be run in parallel |
| 🟩 | can be run in parallel |
| ◻ | is spent doing nothing |

This calculation time is expressed in a time unit such as seconds and hence can be split up into smaller blocks.

Using 2 cores, this results in:

2 cores

Figure 3: a calculation of 16 time units run on 2 cores, where each square is a time unit of calculation (see the legend above).

This takes the calculation down from 16 to 10 time units. The so-called ‘speedup’ of using two workers is 16 / 10 = 1.6.

How did you calculate the speedup exactly?

Following [Hennessy and Patterson, 2011] (section 1.8), the speedup n (as in the sentence ‘X is n times faster than Y’) equals:

n = t_y / t_x

where:

  • t_x is the time the new/enhanced process takes
  • t_y is the time the regular/unenhanced process takes

In this context, the ‘new/enhanced process’ is the calculation performed by multiple cores.

Do not be confused by another version of this formula, which is expressed in performance instead of calculation time (and where the numerator and denominator are swapped):

n =  performance_x / performance_y

where:

  • performance_x is the performance (e.g. tasks completed per time unit) the new/enhanced process does
  • performance_y is the performance of the regular process
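Both definitions give the same number, as a quick numeric check (using the values from the worked example of 16 time units) shows:

```python
t_regular = 16   # time units on 1 core (t_y)
t_enhanced = 10  # time units on 2 cores (t_x)

# Time-based definition: n = t_y / t_x
print(t_regular / t_enhanced)  # 1.6

# Performance-based definition: n = performance_x / performance_y
perf_regular = 1 / t_regular    # calculations per time unit
perf_enhanced = 1 / t_enhanced
print(round(perf_enhanced / perf_regular, 10))  # 1.6
```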
Isn’t that Gustafson’s Law?

Not directly.

We do use the same term ‘speedup’ as is calculated in Gustafson’s Law, yet we apply it to compare between a single-core and a multi-core process.

Gustafson’s Law predicts the maximum speedup, which is

S = s + (p * N) = N - ((N - 1) * s) = 1 + ((N - 1) * p)

where:

  • S is the speedup
  • s is the fraction of the calculation that cannot be parallelized. The ‘s’ stands for ‘serial’
  • p is the fraction of the calculation that can be parallelized (i.e. p = 1 - s)
  • N is the number of workers, in our case: cores
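The three forms of this formula are algebraically identical (because s + p = 1), which a quick check with example values confirms:

```python
s = 0.25  # serial fraction (example value)
p = 0.75  # parallel fraction, chosen so that s + p == 1
N = 4     # number of cores

form_1 = s + p * N
form_2 = N - (N - 1) * s
form_3 = 1 + (N - 1) * p
print(form_1, form_2, form_3)  # 3.25 3.25 3.25
```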

However, 4 (out of 20) calculation units are spent waiting. This means that only 16 / 20 = 80% of the calculation time is spent efficiently.

How did you calculate the efficiency exactly?

The efficiency, f, equals:

f = t_used_effectively / t_total

where:

  • t_used_effectively is the time spent on the calculation, summed over all cores
  • t_total is the total time spent by all cores

These two can be calculated as such:

t_used_effectively = s + p
t_total = (s + (p / N)) * N

where:

  • s is the calculation time that cannot be parallelized (4 time units in our example). The ‘s’ stands for ‘serial’
  • p is the calculation time that can be parallelized (12 time units in our example)
  • N is the number of workers, in our case: cores
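For the two-core case of the worked example (4 serial and 12 parallelizable time units), this works out as:

```python
s = 4   # time units that cannot be parallelized
p = 12  # time units that can be parallelized
N = 2   # number of cores

t_used_effectively = s + p           # useful work: 16 time units
t_total = (s + p / N) * N            # wall time (10) times cores (2) = 20
print(t_used_effectively / t_total)  # 0.8
```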

Here one can see this calculation for more cores:

| Number of cores | Time | Speedup | Efficiency |
| --- | --- | --- | --- |
| 1 | 16 | 16 / 16 = 100% | 16 / 16 = 100% |
| 2 | 10 | 16 / 10 = 160% | 16 / (10 × 2) = 80% |
| 3 | 8 | 16 / 8 = 200% | 16 / (8 × 3) ≈ 67% |
| 4 | 7 | 16 / 7 ≈ 229% | 16 / (7 × 4) ≈ 57% |
| 6 | 6 | 16 / 6 ≈ 267% | 16 / (6 × 6) ≈ 44% |
| 12 | 5 | 16 / 5 = 320% | 16 / (5 × 12) ≈ 27% |
| 24 | 4.5 | 16 / 4.5 ≈ 356% | 16 / (4.5 × 24) ≈ 15% |
| 48 | 4.25 | 16 / 4.25 ≈ 376% | 16 / (4.25 × 48) ≈ 8% |
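These quantities can be computed for any number of cores with a short script (assuming, as in the worked example, 4 serial and 12 perfectly parallelizable time units):

```python
s, p = 4, 12  # serial and parallelizable time units

for N in [1, 2, 3, 4, 6, 12, 24, 48]:
    time = s + p / N                    # wall time on N cores
    speedup = (s + p) / time
    efficiency = (s + p) / (time * N)
    print(f"{N:2d} cores: time = {time:5.2f}, "
          f"speedup = {speedup:4.0%}, efficiency = {efficiency:3.0%}")
```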

The upper bound on this speedup is known as Amdahl’s Law (after [Amdahl, 1967]) and, in its general form, is plotted like this:

Amdahl's law
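In formula form, Amdahl’s Law predicts the speedup on N cores from the serial fraction s. A small sketch (the function name is ours, not from the paper, which famously contains no equations):

```python
def amdahl_speedup(s, N):
    """Amdahl's law: speedup on N cores when a fraction s is serial."""
    return 1 / (s + (1 - s) / N)

# Serial fraction of the 16-time-unit example: 4 / 16 = 0.25
for N in [1, 2, 4, 16, 10**6]:
    print(N, round(amdahl_speedup(0.25, N), 2))

# However many cores are added, the speedup never exceeds 1 / s = 4.
```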

Exercises

Question -1: change your Zoom name

Change your Zoom name, to include your HPC cluster and favorite programming language, e.g. make Sven into Sven [Alvis, R].

You can do so by right-clicking the video of yourself, then clicking ‘Rename’.

Question 0: remember your Zoom room

HPC cluster name Main breakout room
Alvis Room 1
Bianca Room 2
COSMOS Room 3
Dardel Room 4
Kebnekaise Room 5
LUMI Room 6
Pelle Room 7
Rackham Room 8
Tetralith Room 9

Question 1

Which of the lines in the graph of Amdahl’s Law corresponds with the worked-out example of 16 time units?

Answer

The red dotted line.

Using a different unit (i.e. ‘relative speed’ instead of ‘speedup’) was done on purpose. It is easy to convert between the two: just take the inverse (i.e. divide 1 by the value you have).

Question 2

In the example of 16 time units, what is the shortest amount of time that can be spent on the calculation, given infinite resources?

Answer

The length of the calculation that cannot be run in parallel, which is 4 time units.

Question 3

In the example of 16 time units, what is the maximal possible speedup?

Answer

400%, as the calculation takes 16 units of work on a single core, and 4 time units on an infinite amount of cores.

In mathematical form, the speedup, S, equals:

S = t_regular / t_enhanced
S = 16 / 4
S = 4
S = 400%

where:

  • t_enhanced is the time the enhanced process takes
  • t_regular is the time the regular/unenhanced process takes

Question 4

For your research project, you need to run a lot of calculations. Each calculation takes 10 hours. How do you make optimal use of your computational resources?

Answer

Run each calculation on a single core, for 100% efficiency.

Question 5

For your research project, you also have a calculation that takes 11 days. Your HPC cluster allows a calculation of at most 10 days. Assume your HPC center will not extend your job (they will probably do so when asked: we are there to help!). How do you make optimal use of your computational resources?

Answer

If your calculation already has parallelism built in, then run the calculation on two cores: this only involves changing your Slurm script, at a low loss of computational resources.

If you are really tight on computational resources, you can implement a ‘save state’ (checkpoint) in your calculation, so that you can schedule two shorter runs in succession, each with 100% efficiency.

Alternatively, you can add thread parallelism to allow running with multiple cores.
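A ‘save state’ can be as simple as writing the loop state to disk every so often and reading it back on the next run. A minimal Python sketch (the file name `state.pkl` and the toy calculation are made up for illustration):

```python
import os
import pickle

STATE_FILE = "state.pkl"  # hypothetical checkpoint file

def load_state():
    """Resume from the checkpoint if one exists, else start fresh."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, "rb") as f:
            return pickle.load(f)
    return {"i": 0, "total": 0}

def save_state(state):
    with open(STATE_FILE, "wb") as f:
        pickle.dump(state, f)

state = load_state()
for i in range(state["i"], 1000):   # toy calculation: sum 0..999
    state["total"] += i
    state["i"] = i + 1
    if i % 100 == 0:
        save_state(state)           # in a real job: also watch the wall time
save_state(state)
print(state["total"])  # 499500
```

If the job is killed, the next submission picks up at the last checkpoint instead of redoing the whole calculation.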

Question 6

Your colleague runs many jobs with a lot of cores. ‘It is way faster!’, he/she states. That same colleague, however, also complains about long waiting times before his/her jobs start. How would you explain this situation?

Answer

The colleague used up (or: ‘wasted’) all his/her computational resources (commonly 10,000 core hours per month on UPPMAX).

Due to this, his/her jobs are only run when the HPC cluster has a low workload and activates the so-called ‘bonus queue’ (UPPMAX) or generally have to wait for their priority to go up again.

Question 7

Your colleague has also done the exercises in this session and decided to measure the same features of his/her code. Below you can see the plots produced by this benchmark. What seems to be the percentage of his/her code that can be run in parallel?

Benchmark core seconds

Benchmark: the total core seconds per number of workers

Benchmark efficiency

Benchmark: Efficiency per number of workers

Benchmark speedup

Benchmark: Speedup per number of workers

Answer

It is unknown what portion of the code can be run in parallel.

Instead, this benchmark shows how it looks if code is run in single-thread mode. In this case, the colleague forgot that the number of threads used must be specified when calling Julia.

Note that these plots are less clear than you may have expected:

  • The total core seconds are expected to increase linearly
  • The efficiency is expected to follow the dashed line
  • The speedup is expected to stay at 100%

Even in computer science, results can be messier than you would expect :-)

References

  • [Amdahl, 1967] Amdahl, Gene M. “Validity of the single processor approach to achieving large scale computing capabilities.” Proceedings of the April 18-20, 1967, spring joint computer conference. 1967. Fun fact: it contains a single hand-drawn plot and no equations
  • [Hennessy and Patterson, 2011] Hennessy, John L., and David A. Patterson. Computer architecture: a quantitative approach. Elsevier, 2011.