
Chapter 3 Parallelization and Hardware

This chapter presents the basic concepts of parallelization and multiprocessor programming [102, 103]. The terminology established in the following allows for a precise description and analysis of the developed algorithms. At the end of the chapter, the hardware resources used for the benchmarks presented in the remainder of this work are listed.

3.1 General Parallelization Strategies

Parallelization is the transformation of an algorithm so that it can be executed in parallel. The goal of parallelization is to reduce the run-time required for the execution of an algorithm by executing (some of) its instructions simultaneously. Two prerequisites have to be met to parallelize an algorithm: 1) the instructions of the algorithm have to allow reordering (concurrency), and 2) hardware has to be available to perform instructions simultaneously (parallel execution).

Concurrency

Concurrency is the ability to execute instructions, e.g., of an algorithm, out of order without affecting the final outcome [102]. If a problem does not allow for concurrency, its instructions are not parallelizable. To illustrate the difference, two situations are described in detail.

Consider two arrays storing a set of arbitrary numbers. The goal is to add both arrays together element-wise. A serial algorithm may add the first number of each array, then the second number, and so forth. For the result (the element-wise sum of the array numbers), the order in which the element pairs are added does not matter. Thus the example allows for concurrency, and it is possible to parallelize the algorithm and compute the sums of the individual element pairs simultaneously (see the sketch below).
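
A minimal serial sketch of this situation (hypothetical function and variable names) illustrates why the iterations are independent: each iteration only reads a[i] and b[i] and writes c[i], so the iterations may be reordered or executed simultaneously without changing the result.

#include <cstddef>
#include <vector>

// Element-wise addition: no iteration depends on the result of another one,
// so the loop iterations allow for concurrency.
std::vector<double> add(const std::vector<double>& a, const std::vector<double>& b)
{
    std::vector<double> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];
    return c;
}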

Now consider the task of finding the end of a linked list given its head (first element). The only way to reach the end of the list is to follow the pointer from the current element to the next element until there is no next element, i.e., the end is reached. Because there is only a single way to reach the end of the linked list and the steps require a definite ordering, the task is not parallelizable (see the sketch below).
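
A corresponding sketch (hypothetical type and function names, assuming a non-empty list): every loop iteration needs the pointer obtained in the previous iteration, so the steps cannot be reordered.

// Singly linked list: each step depends on the result of the previous one.
struct Node
{
    int   value;
    Node* next;  // nullptr marks the end of the list
};

Node* find_end(Node* head)
{
    Node* current = head;
    while (current->next != nullptr)
        current = current->next;  // the next step requires this result first
    return current;
}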

Parallel Execution

To actually reduce the run-time of the concurrent instructions (the goal of parallelization), the independent sets of instructions resulting from concurrency considerations have to be executed simultaneously. As mentioned previously, in this work only shared-memory compute systems are considered. The parallel execution on shared-memory compute systems is typically achieved by scheduling threads to different cores on a processor.

Figure 3.1 shows a schematic of a typical shared-memory parallel compute system. The depicted system has two sockets, each equipped with a processor consisting of four physical cores, each core supporting two-way simultaneous multithreading (i.e., support for executing two threads per core): a total of eight logical cores is thus provided by each processor.

(image)

Figure 3.1: Schematic of an exemplary shared-memory parallel compute system which has two sockets each equipped with a four core processor. The cores of a processor share the level 3 cache (L3), but have dedicated level 2 (L2) and level 1 instruction and data (L1i and L1d) caches. Each core has the capability to execute two threads (e.g., T0 and T1).

A processor has a hierarchy of caches to reduce the data access time from the main memory. The size of a cache is inversely related to its access time: the smaller the cache, the faster the access.

On a shared-memory parallel compute system, the main memory storing all data of a program is split into memory domains, one for each socket. Each core still has access to the entire main memory of the compute system. However, access to data residing in another memory domain comes with the cost of higher access latencies and lower bandwidths, as the data has to be transferred via an additional coherent link connecting the processor sockets and the associated memory domains. Such systems are labeled non-uniform memory access (NUMA) systems and are widespread in professional workstations and large-scale compute clusters/supercomputers.

Data exchange among threads being executed within the same process is inherently possible due to the shared access to the main memory. This data exchange mechanism has low overhead compared to alternatives like pipes and sockets offered by the operating system. However, reading from and writing to the same memory location has to be carefully implemented, potentially requiring dedicated mechanisms (locks or atomic operations) to obtain exclusive access to a memory location. This avoids data races (non-synchronized accesses to the same memory location), which may lead to unexpected results of the computation (undefined behavior); a sketch is given below.
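
A minimal sketch of such a protected update (hypothetical variable names): two threads increment a shared counter, and a lock serializes the concurrent read-modify-write sequences that would otherwise constitute a data race.

#include <mutex>
#include <thread>

int main()
{
    long counter = 0;
    std::mutex m;

    auto work = [&]() {
        for (int i = 0; i < 1000000; ++i)
        {
            std::lock_guard<std::mutex> lock(m);  // exclusive access to 'counter'
            ++counter;
        }
    };

    std::thread t0(work);
    std::thread t1(work);
    t0.join();
    t1.join();
    // With the lock, 'counter' is reliably 2000000; without it, the concurrent
    // updates would be a data race and the result would be unreliable.
}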

In this work, parallelism is classified into two types:

  • Coarse-grained

  • Fine-grained

Coarse-grained parallelism is characterized by a relatively large amount of work per thread. Synchronization and data exchanges between threads are typically costly; thus they have to be rare so that the overall overhead remains low. Coarse-grained parallelization suits algorithms where the run-time of the threads is predictable. This enables the creation of threads which will have (approximately) the same run-time, mitigating scenarios where a single long-running thread blocks the continuation of execution for the other threads.

Considering the previous example where two arrays are added together element-wise, the arrays could for instance be split into as many chunks of data as cores are available (see the sketch below). Data exchange has to be performed in the beginning, i.e., distributing the chunks of the arrays to the threads, and in the end, i.e., synchronizing the threads to ensure all of them have finished. The chunks of data created in this way are coarse (hence the name), because the number of instructions performed by a thread is high compared to the number of data exchanges between threads. However, the splitting of the data creates additional computations (overhead), because determining the chunk size also requires computations. If the number of computations per thread is low (which of course is problem-specific), the overhead diminishes all gains from parallel execution.
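
A coarse-grained sketch of the element-wise addition (hypothetical function names, using standard C++ threads): the arrays are split into one large chunk per hardware thread, and synchronization is only required once at the end.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

void add_coarse(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c)
{
    const std::size_t n           = a.size();
    const unsigned    num_threads = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunk       = (n + num_threads - 1) / num_threads;  // chunk-size overhead

    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
    {
        const std::size_t begin = t * chunk;
        const std::size_t end   = std::min(n, begin + chunk);
        threads.emplace_back([&a, &b, &c, begin, end]() {
            for (std::size_t i = begin; i < end; ++i)  // large amount of work per thread
                c[i] = a[i] + b[i];
        });
    }
    for (auto& thread : threads)
        thread.join();  // single synchronization point at the end
}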

Within the context of hierarchical grids, the computations on a single block typically correspond to a suitable chunk of data for a thread (balancing overhead and parallel performance).

Fine-grained parallelism, by contrast, is characterized by a relatively small amount of work per thread before synchronization between threads is required. Fine-grained parallelization suits algorithms where the run-time of the threads is unpredictable, but the synchronization costs are low.

Again considering the previous example where the two arrays are added together element-wise, the arrays are now split into significantly more chunks of data than cores are available. Thus a core is going to execute more than a single thread over the duration of the program. Whenever a core is assigned a new thread, synchronization is involved, i.e., identifying the threads that have not yet been processed. Thus the overall run-time is typically larger compared to the coarse-grained parallelization approach for the considered example of adding two arrays element-wise. However, in case the run-time of the operation performed on each array element is unpredictable, the fine-grained parallelization approach is superior because the run-times of the different threads balance each other. In the end, the cores finish the execution of all threads almost at the same time (almost no idling); a sketch is given below.
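
A fine-grained sketch of the same element-wise addition (hypothetical names): the work is split into many small chunks, and a pool of threads repeatedly fetches the next unprocessed chunk from a shared counter. Each fetch is a cheap synchronization point, and slow and fast chunks automatically balance each other across the cores.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

void add_fine(const std::vector<double>& a, const std::vector<double>& b,
              std::vector<double>& c, std::size_t chunk = 256)
{
    const std::size_t        n = a.size();
    std::atomic<std::size_t> next{0};  // shared counter of the next unprocessed chunk

    auto worker = [&]() {
        for (;;)
        {
            const std::size_t begin = next.fetch_add(chunk);  // cheap synchronization
            if (begin >= n)
                break;
            const std::size_t end = std::min(n, begin + chunk);
            for (std::size_t i = begin; i < end; ++i)  // small amount of work per chunk
                c[i] = a[i] + b[i];
        }
    };

    const unsigned num_threads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t)
        threads.emplace_back(worker);
    for (auto& thread : threads)
        thread.join();
}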

Programming Model

The OpenMP application programming interface (API) specification [104] is widely used to develop software for shared-memory systems and is also used in this work. OpenMP allows easy parallelization of existing programs, especially of for-loops, by using standard compiler directives, e.g., #pragma omp parallel for. By setting the number of threads, the range of the for-loop is automatically divided into as many chunks as there are threads, which are then executed in parallel (see the sketch below).
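
A minimal sketch of the element-wise addition using such a directive (hypothetical function name); the number of threads is typically set via the OMP_NUM_THREADS environment variable or the omp_set_num_threads() runtime function.

#include <vector>

void add_openmp(const std::vector<double>& a, const std::vector<double>& b,
                std::vector<double>& c)
{
    // The loop range is automatically divided among the threads of the parallel region.
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        c[i] = a[i] + b[i];
}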

To handle more complex parallelization scenarios where a straightforward parallelization of for-loops is not possible, OpenMP offers the concept of tasks. Conceptually, a task is similar to a thread; however, a task is managed by the OpenMP runtime instead of the operating system. OpenMP internally utilizes a thread pool to enable fast and low-overhead parallel execution of tasks. A thread pool has a fixed number of threads (managed by the operating system) which dynamically execute the scheduled tasks (managed by OpenMP); a sketch is given below.
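
A sketch of a typical task-based pattern (hypothetical type and function names), here processing the elements of a linked list, which is not directly expressible as a canonical for-loop: a single thread traverses the list and creates one task per node, and the tasks are executed by the threads of the surrounding parallel region (the thread pool).

struct Node
{
    int   value;
    Node* next;
};

void process(Node* node);  // per-node work, assumed to be independent between nodes

void process_list(Node* head)
{
    #pragma omp parallel
    {
        #pragma omp single  // only one thread traverses the list and creates tasks
        {
            for (Node* n = head; n != nullptr; n = n->next)
            {
                #pragma omp task firstprivate(n)  // executed by the thread pool
                process(n);
            }
        }
        // implicit barrier: all tasks have finished at the end of the parallel region
    }
}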

To understand the achievable performance gains from parallelization, the theoretical limits are explored in the next section.

Amdahl’s Law

The theoretically expected maximum speedup (run-time reduction factor) achievable through parallelization is given by Amdahl’s law. Most algorithms consist of parts which allow for concurrency and parts which do not. Let the fraction of an algorithm which allows for concurrency (relative to the full algorithm) be denoted by \(c\) and the number of used threads (equal to the number of available cores) by \(t\); then the maximum parallel speedup \(S\) (ignoring any overhead introduced by parallelization) is given by

\begin{align} S=\frac {1}{1-c+\frac {c}{t}}. \end{align} In the limit of an infinite number of available threads the parallel speedup is limited by

\begin{align} S=\frac {1}{1-c}, \end{align} which is inversely proportional to the fraction which does not allow for concurrency. Considering an algorithm with \(c=0.9\), i.e., 10 % of the algorithm is serial, the parallel speedup is limited to a factor of 10. Therefore, it is essential that even minor parts of an algorithm are parallelized (minimizing the serial part limiting the parallel speedup) if a highly parallel execution, i.e., on multi-core processors, is targeted.
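
For a finite number of threads, the achievable speedup is considerably lower than this limit; for example, inserting \(c=0.9\) and \(t=16\) into the first expression gives

\begin{align} S=\frac {1}{1-0.9+\frac {0.9}{16}}=\frac {1}{0.15625}=6.4. \end{align}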

In high performance computing, the efficiency of parallel algorithms is evaluated by two common analyses quantifying the scalability. A strong scaling analysis investigates how the run-time varies with the number of threads for a fixed total problem size. The parallel speedup is defined as the ratio of the single-threaded run-time to the multi-threaded run-time. Ideally, the run-time is reduced linearly as more threads are used. In practice, the run-time reduction typically saturates for a high number of threads due to Amdahl’s law.
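
Denoting the run-time obtained with \(t\) threads by \(T(t)\), the parallel speedup and the parallel efficiency commonly reported in a strong scaling analysis are

\begin{align} S(t)=\frac {T(1)}{T(t)}, \qquad E(t)=\frac {S(t)}{t}. \end{align}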

A weak scaling analysis investigates how the run-time varies with the number of threads for a fixed problem size per thread. For example, if the number of threads is doubled, the total problem size is also doubled; in the case of optimal weak scaling, the run-time remains the same. Amdahl’s law is not applicable to the weak scaling analysis, because the total problem size is not fixed, i.e., the problem size grows with the number of used threads.
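
Using the same notation, the weak scaling efficiency is commonly reported as

\begin{align} E_{\mathrm {weak}}(t)=\frac {T(1)}{T(t)}, \end{align} where \(T(t)\) now refers to the run-time for the problem size scaled with \(t\); it equals one in the case of optimal weak scaling.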

As will be shown in later chapters, this work primarily conducts strong scaling analyses as they allow for intuitive interpretation regarding parallel speedup when considering process TCAD simulation workflows.

The next section introduces the hardware used for benchmarking.

3.2 Benchmark Systems

This section gives an overview of all the compute systems used for evaluating the performance of the implementations of the proposed algorithms, henceforth denoted as benchmark systems. The benchmark systems are single compute nodes from two generations of Vienna Scientific Cluster1 (VSC) supercomputers and an industrial compute system (ICS). In Table 3.1 the key properties of the three available benchmark systems are summarized.

Table 3.1: Summary and key properties of the used benchmark systems.

                         VSC3        VSC4        ICS
  Frequency (GHz)        2.6         3.1         2.8
  Sockets                2           2           2
  Cores per CPU          8           24          10
  Logical cores per CPU  16          48          20
  L1i cache              32 KByte    32 KByte    32 KByte
  L1d cache              32 KByte    32 KByte    32 KByte
  L2 cache               256 KByte   1024 KByte  256 KByte
  L3 cache               20 MByte    33 MByte    26 MByte
  Main memory            64 GByte    96 GByte    226 GByte

1 The VSC is a collaboration of several Austrian universities that provides supercomputer resources and corresponding services [105].