Multi-Threaded Programs

Last updated on 2024-09-04

Overview

Questions

  • What is the multi-threaded shared-memory programming model?

Objectives

  • Learn about multi-threaded computing and how to use it.
  • Understand the strengths and limitations of this approach.

Most programming languages have the ability to do multi-threaded computing. C/C++ and Fortran use OpenMP, which is by far the most extensive and well-developed of these packages. It uses pragma statements to control the parallelization of loops so that multiple compute cores work at the same time on different parts of the data. OpenMP is not only extremely efficient, it also provides very advanced features offering greater control over how the parallelization is done, all without encumbering the programmer too much. The pymp package for Python is a stripped-down version of OpenMP supporting just the basic functionality. It is an actively developed project, is very efficient, and is relatively easy to use as well. R takes a very different approach to multi-threading, using the mclapply() function as a multi-core replacement for the lapply() function. This operates similarly to OpenMP and pymp but uses a very different syntax. It is also not nearly as efficient and requires some non-default choices to make it perform better. Matlab also uses a different syntax in its Parallel Computing Toolbox, where the parfor command performs a parallel for loop.
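To give a flavor of how lightweight this can be in Python, here is a minimal pymp loop (a sketch only; the thread count and loop size are illustrative, and pymp is typically installed from PyPI as the pymp-pypi package):

```python
import pymp

# Fork 4 threads; each loop index is handled by exactly one of them.
with pymp.Parallel(4) as p:
    for i in p.range(8):
        print(f"Thread {p.thread_num} handled index {i}")
```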

All of these methods behave basically the same way, forking, or splitting off, extra compute threads when called. Each thread gets its own virtual memory space, meaning most large arrays are not copied during the initialization stage. Only if a thread writes to an array is that array copied into that thread's memory, and then only the page of memory (4096 bytes) that has been changed. This is called a copy-on-write method and is handled by the operating system. Forking is very efficient in this way, only doing the work it needs. For Python, this gets around the Global Interpreter Lock, which is designed to protect Python objects. Unfortunately the Windows operating system does not support the fork function, so you cannot run multi-threaded Python codes like pymp on Windows, at least not from PowerShell. However, if you have the Windows Subsystem for Linux (WSL) installed, it provides a robust Linux environment that bypasses this limitation, so pymp codes can be run in this manner.

The figure below illustrates how multi-threading works on a dot product between two vectors. Since the program uses shared memory, the initialization is done entirely on the main thread of processor 1. Then each of the 4 threads in this example computes a partial sum over the vector elements it is responsible for, so all 4 threads run at the same time but operate on different parts of the data. After they have all completed their parts of the computation, the main thread adds the partial sums together into the dot product and prints it out.

Figure: Shared-memory multi-threaded dot product, showing the memory layout of both vectors.

The multi-threaded dot product code
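The lesson's full example modifies a serial dot product to use pymp. A rough sketch of what that multi-threaded version looks like is below (the vector length, thread count, and variable names are illustrative assumptions, not the lesson's exact code):

```python
import numpy as np
import pymp

N = 10_000_000                      # vector length (illustrative value)
x = np.ones(N)                      # initialization happens on the main thread
y = np.ones(N)

nthreads = 4
partial = pymp.shared.array((nthreads,), dtype='float64')   # one slot per thread

with pymp.Parallel(nthreads) as p:
    psum = 0.0
    for i in p.range(N):            # static scheduling: indices divided among threads in advance
        psum += x[i] * y[i]
    partial[p.thread_num] = psum    # store this thread's partial sum

dot = partial.sum()                 # main thread combines the partial sums
print("Dot product =", dot)
```

Each thread writes only its own slot of the shared partial array, so no locking is needed, and the final reduction is done serially on the main thread.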

So parallelizing this program really only requires us to change around 11 lines of code, and from that we get the benefit of being able to apply much greater computing power. In Python, for example, we do have some control over how the parallelization works internally. Using p.range(N) in our for loop gives static scheduling, where each thread is responsible for a pre-determined set of indices at regular intervals, as in the figure above. If instead we use p.xrange(N), dynamic scheduling is used, where each index is assigned to the next available thread. This can be very useful if the amount of work in each pass through the loop varies greatly, since dynamic scheduling can produce much more efficient results when there is a great load imbalance.
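Switching the example above to dynamic scheduling is a one-line change (again a sketch, reusing the hypothetical arrays and thread count from the previous code block):

```python
with pymp.Parallel(nthreads) as p:
    psum = 0.0
    for i in p.xrange(N):           # dynamic scheduling: the next index goes to the next free thread
        psum += x[i] * y[i]
    partial[p.thread_num] = psum
```

For a dot product, where every iteration does the same amount of work, static scheduling is usually the better choice; dynamic scheduling pays off when iteration costs vary widely.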

Understanding what can cause inefficient scaling

A scaling study is designed to expose inefficiencies in a parallel code and to determine how many cores to use for a given problem size. That last part is important to understand. If there is too little work in each iteration of a loop, then loop overhead can limit scaling. Calculations on larger data sets usually scale better.

A loop may be very scalable in itself, but if there is too much time spent in the scalar part of the code like initialization, doing the reductions, or doing input at the beginning and output at the end, then the entire code may not scale well. Load imbalance can also be a problem. If the time it takes to pass through a loop varies, then using dynamic scheduling is very important.

Shared arrays are an extremely important part of multi-threaded packages. Since they do involve the copy-on-write mechanism, they can lead to inefficiency in the loop. In general this is minimal but something to be aware of.

Multi-threading packages like OpenMP and pymp provide mechanisms that force parts of a loop out of multi-threaded operation and back into single-threaded operation. These always lead to terrible scaling and should almost never be used.
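As an illustration of what such a construct looks like (and why it hurts), pymp exposes a lock on the parallel context; updating a single shared accumulator under that lock, as sketched here with the hypothetical arrays from the code above, makes every thread wait its turn and effectively serializes the loop:

```python
dot_shared = pymp.shared.array((1,), dtype='float64')

with pymp.Parallel(nthreads) as p:
    for i in p.range(N):
        with p.lock:                # only one thread at a time can execute this block
            dot_shared[0] += x[i] * y[i]
```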

Scaling Study of the Multi-Threaded Dot Product Code
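One simple way to carry out such a study is to time the parallel region for several thread counts on the same problem size (a sketch of the approach; the lesson's own scaling exercise may use different sizes and thread counts):

```python
import time

import numpy as np
import pymp

N = 10_000_000                      # illustrative problem size
x = np.ones(N)
y = np.ones(N)

for nthreads in (1, 2, 4, 8):
    partial = pymp.shared.array((nthreads,), dtype='float64')
    t0 = time.perf_counter()
    with pymp.Parallel(nthreads) as p:
        psum = 0.0
        for i in p.range(N):
            psum += x[i] * y[i]
        partial[p.thread_num] = psum
    elapsed = time.perf_counter() - t0
    print(f"{nthreads:2d} threads: dot = {partial.sum():.1f}, time = {elapsed:.2f} s")
```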

Key Points

  • Multi-threaded computing is powerful and fairly easy to use but only works on one compute node.
  • Understand key factors that can limit the efficient scaling of multi-threaded programs.