Parallel and multithreaded functions

Questions

  • What is parallel programming?

  • Why do we need it?

  • Where can I use it?

Objectives

  • Short introduction to parallel programming

  • Common paradigms for writing parallel code

What is parallel programming?

Parallel programming is the art of writing code that executes tasks on different computing units (cores) simultaneously. In the past, computers were shipped with a single core per Central Processing Unit (CPU) and could therefore only perform a single computation at a time (a serial program).

Nowadays computer architectures are more complex than the single-core CPU mentioned above. For instance, common architectures include those where several cores in a CPU share a common memory space, and also those where CPUs are connected through some network interconnect.

../_images/shared-distributed-mem.svg

Shared Memory and Distributed Memory architectures.

A more realistic picture of a computer architecture can be seen in the following figure, where we have 14 cores that share a common memory of 64 GB. These cores form a socket, and the two sockets shown in this picture constitute a node.

../_images/cpus.png

1 standard node on Kebnekaise @HPC2N

It is interesting to notice that there are different types of memory available to the cores, ranging from the L1 cache to the node’s main memory. In the former, the bandwidth can reach TB/s, while in the latter it is on the order of GB/s.

Now you can see that a single node already offers several computing units (cores) as well as a hierarchy of memory resources, an arrangement known as Non-Uniform Memory Access (NUMA).

Besides standard CPUs, nowadays one also finds Graphics Processing Unit (GPU) architectures in HPC clusters. A single K80 engine looks like this:

../_images/gpu.png

A single GPU engine of a K80 card. Each green dot represents a single-precision core, which runs at a frequency of 562 MHz. The cores are arranged in slots called streaming multiprocessors (SMX in the figure). Cores in the same SMX share some local, fast cache memory.

In a typical cluster, some GPUs are attached to a single node, resulting in a hybrid CPU-GPU architecture. The CPU component is called the host and the GPU part the device. One possible layout (Kebnekaise) is as follows:

../_images/cpu-gpu.png

Schematics of a hybrid CPU-GPU architecture. A K80 GPU card consisting of two engines is attached to a NUMA island, which in turn contains 14 cores. The NUMA island and the GPUs are connected through a PCI-E interconnect, which makes data transfer between the two components rather slow.

Why is parallel programming needed?

There is no “free lunch” when trying to use the features (computing/memory resources) of modern architectures. If you want your code to be aware of those features, you will need to add them either explicitly (by coding them yourself) or implicitly (by using libraries that were coded by others).

On your local machine, you may have a number of cores available and some memory attached to them, which can be exploited by a parallel program. These resources can be limited for running your data-production simulations, as you may use your local machine for other purposes such as writing a manuscript or making a presentation. One alternative to your local machine is a High Performance Computing (HPC) cluster; another is a cloud service. A common layout for the resources in an HPC cluster is shown in the figure below.

../_images/workflow-hpc.svg

High Performance Computing (HPC) cluster.

Although a serial application can run on such a cluster, it would not take advantage of most of the HPC resources. The situation would be similar to turning on many washing machines to wash a single item: energy is easily wasted.

../_images/laundry-machines.svg

Under-using a cluster.

Common parallel programming paradigms

Now the question is how to take advantage of modern architectures, which consist of many cores interconnected through networks and equipped with different types of memory. The Python, Julia, Matlab, and R languages have different tools and libraries that can help you get more out of your local machine or HPC cluster resources.

Threaded programming

To take advantage of the shared memory of the cores, threaded mechanisms can be used. Low-level programming languages, such as Fortran/C/C++, use OpenMP as the standard application programming interface (API) for parallelizing programs with a threaded mechanism. Here, all threads have access to the same data and can do computations simultaneously. This means that, without making any modification to our code, we can get the benefits of parallel computing by turning external threaded libraries on or off, for instance by setting environment variables such as OMP_NUM_THREADS.
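As a minimal sketch in Python (assuming NumPy is linked against a threaded BLAS backend such as OpenBLAS or MKL), the number of threads used by a matrix product can be controlled without touching the numerical code itself:

    import os
    # Must be set before NumPy (and its BLAS backend) is first imported;
    # on a cluster you would typically set this in the job script instead.
    os.environ.setdefault("OMP_NUM_THREADS", "4")

    import time
    import numpy as np

    a = np.random.rand(4000, 4000)
    b = np.random.rand(4000, 4000)

    t0 = time.perf_counter()
    c = a @ b  # a threaded BLAS spreads this product over several cores
    print(f"matrix product took {time.perf_counter() - t0:.2f} s")

Running the same script with OMP_NUM_THREADS set to 1 and then to 4 should show a clear difference in the timing.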

Higher-level languages have their own mechanisms to generate threads, and this can be confusing, especially if the code uses external libraries, for instance for linear algebra (LAPACK, BLAS, …). These libraries have their own threads (OpenMP, for example) and the code you are writing (R, Julia, Python, or Matlab) can also use its own internal threading mechanism.

Warning

  • Check if the libraries/packages that you are using have a threaded mechanism.

  • Monitor the usage of hardware resources with tools offered at your HPC center, for instance job-usage at HPC2N.

  • Here are some examples (of many) of what you will need to pay attention to when porting a parallel code from your laptop (or another HPC center) to our clusters:

For some linear algebra operations, NumPy supports threads (set with the OMP_NUM_THREADS variable). If your code calls these operations in a loop that is already parallelized over n processes, and you allocate n cores for this job, the job will exceed the allocated resources unless the number of threads is explicitly set to 1, as in the sketch below.
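A hedged sketch of this situation (the work function and matrix sizes are hypothetical): each worker process is restricted to a single BLAS thread so that the job fits within the allocated cores:

    import os
    os.environ["OMP_NUM_THREADS"] = "1"  # one BLAS thread per worker process

    from multiprocessing import Pool
    import numpy as np

    def work(seed):
        # hypothetical per-process task containing threaded linear algebra
        rng = np.random.default_rng(seed)
        m = rng.random((1000, 1000))
        return float(np.linalg.norm(m @ m))

    if __name__ == "__main__":
        # four worker processes, matching e.g. four allocated cores
        with Pool(processes=4) as pool:
            print(pool.map(work, range(4)))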

A common issue with shared memory programming is data racing, which happens when different threads write to the same memory address at the same time.
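The toy example below illustrates the idea (a sketch; whether lost updates are actually observable depends on the interpreter version): two threads perform unprotected read-modify-write updates on a shared counter, and a lock serializes them:

    import threading

    counter = 0
    lock = threading.Lock()

    def increment(n, use_lock):
        global counter
        for _ in range(n):
            if use_lock:
                with lock:
                    counter += 1  # read-modify-write done atomically
            else:
                counter += 1      # threads may interleave here and lose updates

    def run(use_lock):
        global counter
        counter = 0
        threads = [threading.Thread(target=increment, args=(100_000, use_lock))
                   for _ in range(2)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return counter

    print("without lock:", run(False))  # may be less than 200000
    print("with lock:   ", run(True))   # always 200000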

GPUs

Graphics processing unit (GPU) programming has similar patterns to shared memory programming, but there are major differences. For instance, in the former one works with highly optimized pieces of code called kernels that can run on thousands of cores. The APIs are also different, with CUDA (NVIDIA) and ROCm (AMD) being two of the most common ones in GPU programming.
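As one possible sketch of the Python side (assuming an NVIDIA GPU and the cupy package; Numba and PyCUDA are common alternatives), note how arrays live in device memory and must be copied back to the host explicitly:

    import cupy as cp

    a = cp.random.rand(4000, 4000)  # arrays are allocated in device (GPU) memory
    b = cp.random.rand(4000, 4000)
    c = a @ b                       # the matrix product runs as a GPU kernel
    result = cp.asnumpy(c)          # explicit copy back to host memory over PCI-E
    print(result.shape)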

Keep in mind

  • NVIDIA GPUs can be found at: HPC2N, UPPMAX, LUNARC, NSC, and C3SE.

  • AMD GPUs can be found at: HPC2N and PDC.

Distributed programming

Although threaded programming is convenient because one can achieve considerable initial speedups with few code modifications, this approach does not scale to more than hundreds of cores. Scalability can be achieved with distributed programming. Here, there is no common shared memory; instead, the individual processes (notice the different terminology from threads in shared memory) have their own memory space. If a process requires data from, or should transfer data to, another process, it does so by sending and receiving messages. A standard API for distributed computing is the Message Passing Interface (MPI). In general, MPI requires a refactoring of your code, as in the sketch below.
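A minimal sketch with mpi4py (assuming it is installed; launched with something like mpirun -n 2 python script.py): each rank is a separate process with its own memory, and data moves only through explicit messages:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()  # each process has a unique rank and its own memory

    if rank == 0:
        data = {"greeting": "hello", "value": 42}
        comm.send(data, dest=1, tag=0)     # explicit message to rank 1
    elif rank == 1:
        data = comm.recv(source=0, tag=0)  # blocking receive from rank 0
        print("rank 1 received:", data)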

Big data

Sometimes the workflow you are targeting doesn’t require extensive computation but mainly involves dealing with large amounts of data. An example is reading a column-structured file and applying some transformation per column. Fortunately, all languages covered in this course already have several tools to deal with big data. We list some of these tools in what follows, but notice that other tools doing similar jobs may be available for each language.
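In Python, for example, a large column-structured file can be processed in chunks with pandas so that the full file never has to fit in memory (the file and column names below are hypothetical):

    import pandas as pd

    total = 0.0
    # read 100,000 rows at a time instead of loading the whole file
    for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
        total += (chunk["value"] * 2.0).sum()  # per-column transformation

    print("sum of transformed column:", total)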



Exercises