Parallel and multithreaded functions

Questions

  • What is parallel programming?

  • Why do we need it?

  • When can I use it?

Objectives

  • Learn basic concepts in parallel programming

  • Gain knowledge of the tools for parallel programming in different languages

  • Become familiar with the tools to monitor the usage of resources

What is parallel programming?

Parallel programming is the science and art of writing code that executes tasks on different computing units (cores) simultaneously. In the past, computers were shipped with a single core per Central Processing Unit (CPU), and therefore only a single computation could be executed at a time (a serial program).

Nowadays computer architectures are more complex than the single-core CPU mentioned above. For instance, common architectures include those where several cores in a CPU share a common memory space, and also those where CPUs are connected through some network interconnect.

../_images/shared-distributed-mem.svg

Shared Memory and Distributed Memory architectures.

A more realistic picture of a computer architecture can be seen in the following figure, where we have 14 cores that share a common memory of 64 GB. These cores form a socket, and the two sockets shown in this figure constitute a node.

../_images/cpus.png

1 standard node on Kebnekaise @HPC2N

It is interesting to notice that there are different types of memory available to the cores, ranging from the L1 cache to the memory of the whole node. In the former, the bandwidth can be on the order of TB/s, while in the latter it is on the order of GB/s.

Now you can see that on a single node you already have several computing units (cores) and also a hierarchy of memory resources, an arrangement known as Non-Uniform Memory Access (NUMA).

Besides standard CPUs, nowadays one also finds Graphics Processing Unit (GPU) architectures in HPC clusters.

Why is parallel programming needed?

There is no “free lunch” when trying to use the features (computing/memory resources) of modern architectures. If you want your code to be aware of those features, you will need to add support for them either explicitly (by coding it yourself) or implicitly (by using libraries that were coded by others).

On your local machine, you may have a number of cores available, and some memory attached to them, which can be exploited by a parallel program. These resources can be limited for running your data-production simulations, as you may also use your local machine for other purposes such as writing a manuscript, making a presentation, etc. One alternative to your local machine is a High Performance Computing (HPC) cluster; another could be a cloud service. A common layout for the resources in an HPC cluster is shown in the figure below.

../_images/workflow-hpc.png

High Performance Computing (HPC) cluster.

Although a serial application can run on such a cluster, it would not take much advantage of the HPC resources. In fact, one can underuse the cluster if one allocates more resources than the simulation requires.

../_images/laundry-machines.svg

Under-using a cluster.

Warning

  • Check if the resources that you allocated are being used properly.

  • Monitor the usage of hardware resources with tools offered at your HPC center, for instance job-usage at HPC2N.

Common parallel programming paradigms

Now the question is how to take advantage of modern architectures, which consist of many cores, interconnected through networks, and with different types of memory available. Python, Julia, Matlab, and R have different tools and libraries that can help you get more out of your local machine or HPC cluster resources.

Threaded programming

To take advantage of the shared memory of the cores, threaded mechanisms can be used. Low-level programming languages, such as Fortran/C/C++, use OpenMP as the standard application programming interface (API) to parallelize programs through a threaded mechanism. Here, all threads have access to the same data and can do computations simultaneously. This means that, often without modifying our code at all, we can benefit from parallel computing simply by turning the threading of external libraries on or off through environment variables such as OMP_NUM_THREADS.
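
A minimal sketch of this idea in Python, assuming NumPy is linked against a threaded BLAS/OpenMP back end (depending on the build, the relevant variable may instead be OPENBLAS_NUM_THREADS or MKL_NUM_THREADS):

    import os

    # Set the thread count before the threaded library is imported.
    # (Hypothetical value; match it to the cores you actually allocated.)
    os.environ["OMP_NUM_THREADS"] = "4"

    import time
    import numpy as np

    # The matrix-matrix multiplication is dispatched to the underlying
    # BLAS library, which may execute it on several threads.
    a = np.random.rand(2000, 2000)
    b = np.random.rand(2000, 2000)

    start = time.perf_counter()
    c = a @ b
    print(f"matmul took {time.perf_counter() - start:.2f} s")

Running the same script with the variable set to 1 and then to the number of allocated cores gives a rough idea of how much the threaded library helps, without any change to the code itself.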

Higher-level languages have their own mechanisms to spawn threads, and this can be confusing, especially if the code uses external libraries, for instance for linear algebra (LAPACK, BLAS, …). These libraries have their own threads (OpenMP, for example), and the code you are writing (R, Julia, Python, or Matlab) can also have some internal threaded mechanism.
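
One way to see which thread pools are active underneath Python code is the threadpoolctl package (an assumption here: it is a third-party package, not mentioned elsewhere in this material, and must be installed separately):

    from threadpoolctl import threadpool_info, threadpool_limits
    import numpy as np

    # List the thread pools (OpenMP, OpenBLAS, MKL, ...) loaded by the
    # libraries currently in use.
    for pool in threadpool_info():
        print(pool.get("user_api"), pool.get("internal_api"),
              "threads:", pool.get("num_threads"))

    # Temporarily cap the number of threads used by those libraries.
    with threadpool_limits(limits=1):
        a = np.random.rand(1000, 1000)
        _ = a @ a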

Warning

  • Check if the libraries/packages that you are using have a threaded mechanism.

  • Monitor the usage of hardware resources with tools offered at your HPC center, for instance job-usage at HPC2N.

  • Here are some examples (of many) of what you will need to pay attention to when porting a parallel code from your laptop (or another HPC center) to our clusters:

For some linear algebra operations, NumPy supports threads (controlled with the OMP_NUM_THREADS variable). If your code calls these operations in a loop that is already parallelized over n processes, and you allocate n cores for this job, the job will exceed the allocated resources unless the number of threads is explicitly set to 1.
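
A sketch of how to avoid this oversubscription in Python (the work function and sizes here are made up): the parallelism comes from the processes, so each process is limited to a single library thread.

    import os

    # Force the threaded libraries to use one thread per process, since the
    # parallelism here comes from the processes themselves.
    os.environ["OMP_NUM_THREADS"] = "1"

    import numpy as np
    from multiprocessing import Pool

    def work(seed):
        # Each process does its own linear-algebra work on one core.
        rng = np.random.default_rng(seed)
        m = rng.random((500, 500))
        return float(np.linalg.eigvals(m).real.max())

    if __name__ == "__main__":
        # Match the number of processes to the cores you allocated.
        with Pool(processes=4) as pool:
            print(pool.map(work, range(4)))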

A common issue in shared-memory programming is a data race, which happens when different threads access the same memory address concurrently and at least one of them writes to it.
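
In Python, for instance, updating a shared variable from several threads is a read-modify-write operation that is not atomic; protecting it with a lock is one way to avoid the race (a minimal sketch):

    import threading

    counter = 0
    lock = threading.Lock()

    def increment(n):
        global counter
        for _ in range(n):
            # The lock makes the read-modify-write of the shared variable
            # atomic, so no updates are lost.
            with lock:
                counter += 1

    threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(counter)  # 400000 with the lock in place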

Distributed programming

Although threaded programming is convenient because one can achieve considerable initial speedups with few code modifications, this approach does not scale beyond hundreds of cores. Scalability can be achieved with distributed programming. Here, there is no common shared memory; instead, the individual processes (notice the different terminology from threads in shared memory) have their own memory space. Then, if a process requires data from, or should transfer data to, another process, it can do so by using send and receive operations to transfer messages. A standard API for distributed computing is the Message Passing Interface (MPI). In general, MPI requires some refactoring of your code.
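
A minimal sketch of send and receive with mpi4py (an assumption: an MPI library and the mpi4py package are available), typically launched with something like mpirun -n 2 python script.py, where the script name is hypothetical:

    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    if rank == 0:
        # Process 0 owns some data and sends it to process 1.
        data = {"step": 1, "values": [3.14, 2.72]}
        comm.send(data, dest=1, tag=11)
    elif rank == 1:
        # Process 1 receives the message into its own memory space.
        data = comm.recv(source=0, tag=11)
        print(f"rank {rank} received {data}")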

Big data

Sometimes the workflow you are targeting doesn't require extensive computations but mainly consists of dealing with large pieces of data. An example could be reading a column-structured file and applying some transformation per column. Fortunately, all languages covered in this course already have several tools to deal with big data. We list some of these tools in what follows, but notice that other tools doing similar jobs may be available for each language.
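
As a small illustration of the column-structured example in Python with pandas (the file name and column names are hypothetical), the file can be processed in chunks so that memory use stays bounded:

    import pandas as pd

    totals = None
    # Read the file in chunks of 100000 rows instead of all at once.
    for chunk in pd.read_csv("measurements.csv", chunksize=100_000):
        # Example per-column transformation: a unit conversion on one column
        # and a running sum over all numeric columns.
        chunk["temperature_K"] = chunk["temperature_C"] + 273.15
        partial = chunk.select_dtypes("number").sum()
        totals = partial if totals is None else totals.add(partial, fill_value=0)

    print(totals)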



Exercises