Optimization techniques for scalability

SHARED MEMORY PARADIGM

Achieving scalable performance for parallel applications on modern shared-memory multiprocessing platforms is not straightforward. Apart from issues inherent to each application, for example the ability to efficiently decompose it into parallel tasks, other factors such as the interaction with the underlying architecture may affect scalability significantly. Often, even applications with ample and easy to extract parallelism (e.g., large independent parallel portions, sparse synchronization, etc.) fail to scale well on multiple cores due to architectural bottlenecks. Similarly, applications exhibiting good scalability on a specific platform may need extensive re-factoring to run equally well on a different one. Things can become even worse if the application itself raises inherent difficulties for efficient parallelization.

Some of the most common sources of poor scalability met in practice are the following:

a) Excessive read-write sharing, false sharing and misaligned memory accesses

One of the most essential features of modern multiprocessor systems is the cache coherence protocol. Its responsibility is to ensure that all the processors' caches share data from system memory properly and do not use stale data that has been modified in another processor's cache. In other words, the protocol makes it possible for all processors to work like they are all connected directly to a single, globally shared memory module, while actually working with caches and multiple cached copies of data. This abstraction of shared memory eases parallel programming significantly, because the parallel threads of an application can refer directly to memory locations (just as in sequential programs), but may hide serious performance pitfalls.

Excessive read-write sharing between processors on a centralized data structure can introduce a large amount of traffic on the bus. This is because on every write operation a processor has to invalidate all existing copies of the cache line holding the data structure, and on subsequent reads from other processors the modified cache line has to be transferred to their caches. This "ping-pong" effect usually introduces large performance penalties, not only because of the actual data transfers but also due to the large amount of protocol control messages being exchanged. The cost associated with each such read or write operation can be many times larger than an ordinary read or write to private data, and increases with the number of sharers, the distance (in terms of processor topology) of the requestors, the frequency of read-write operations, etc. Typical examples of variables being heavily shared across processors under a read-write pattern are reduction and synchronization variables.
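
As an illustration (a minimal sketch of ours, not taken from the original text; the array a and the function names are hypothetical), the following C/OpenMP fragment contrasts an accumulator that is read-write shared by all threads with the reduction clause, which keeps per-thread partial sums in private storage and combines them only once at the end of the loop:

 /* Every iteration updates the shared variable sum, so the cache line
    holding it "ping-pongs" between the cores. */
 double sum_shared(const double *a, int n)
 {
     double sum = 0.0;
     #pragma omp parallel for
     for (int i = 0; i < n; i++) {
         #pragma omp atomic
         sum += a[i];
     }
     return sum;
 }

 /* Each thread accumulates into a private copy of sum; the copies are
    combined once when the loop ends, avoiding the shared-line traffic. */
 double sum_reduction(const double *a, int n)
 {
     double sum = 0.0;
     #pragma omp parallel for reduction(+:sum)
     for (int i = 0; i < n; i++)
         sum += a[i];
     return sum;
 }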

Another common issue is when threads on different processors modify variables that happen to reside on the same cache line. This problem is known as false sharing, because it occurs unintentionally (i.e. threads do not actually access the same variable), but has all the performance drawbacks of the true read-write sharing discussed above. While the programmer usually cannot do much to avoid or eliminate true sharing at the application level, much more can be done to remedy false sharing, as we will discuss in the following sections.
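
A minimal sketch (ours, not from the original text) of how false sharing typically shows up and how it can be remedied by padding; the 64-byte cache-line size and the counter array are assumptions chosen for illustration:

 #include <omp.h>

 #define CACHE_LINE 64   /* assumed cache-line size in bytes */
 #define NTHREADS   8

 /* Plain per-thread counters: adjacent elements share a cache line, so
    updates from different threads invalidate each other's copies even
    though no element is truly shared. */
 long counters[NTHREADS];

 /* Padding each counter to a full cache line removes the false sharing. */
 struct padded_counter {
     long value;
     char pad[CACHE_LINE - sizeof(long)];
 };
 struct padded_counter padded[NTHREADS];

 void count_events(long iterations)
 {
     #pragma omp parallel num_threads(NTHREADS)
     {
         int id = omp_get_thread_num();
         for (long i = 0; i < iterations; i++)
             padded[id].value++;   /* each thread touches its own cache line */
     }
 }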

Finally, a not so common but extremely serious, in terms of performance, scenario is that of misaligned memory accesses. The x86 architecture is infamous for allowing the execution of loads and stores at unaligned addresses, i.e. at addresses that are not evenly divisible by the word size or multiples of it.

Practically, most compilers force variables of primitive data types to be aligned at a boundary that is a multiple of the word size, but it might be possible for a variable to be allocated at a misaligned address, e.g. through customized allocation mechanisms on the heap. The danger here is that the variable may end up spanning adjacent cache lines, both of which should be transferred each time it is referenced. This introduces enormous overhead when such variables are being shared by many processors, which scales linearly with the number of sharers. In practice, even when only two processors share a misaligned variable they can suffer an average access time as high as 50 times larger than the aligned case.
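
The following fragment (an illustrative sketch, not from the original text) shows how a custom allocation scheme can produce such a misaligned variable, and how posix_memalign can be used to request cache-line-aligned storage instead; the 64-byte alignment is an assumption:

 #include <stdlib.h>
 #include <stdint.h>

 /* Carving an 8-byte counter out of a raw byte buffer at an odd offset:
    x86 will execute loads and stores to it, but the value may straddle
    two cache lines, so every access touches both of them. */
 static char pool[256];
 static int64_t *misaligned_counter = (int64_t *)(pool + 3);

 /* Requesting explicitly aligned memory avoids the problem. */
 int64_t *alloc_aligned_counter(void)
 {
     void *p = NULL;
     if (posix_memalign(&p, 64, sizeof(int64_t)) != 0)
         return NULL;
     return (int64_t *)p;
 }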

b) Memory bandwidth saturation

Memory access can still become a serious bottleneck even when threads of a parallel application work mostly on private data, without incurring notable inter-processor traffic. This is particularly true for memory-intensive parallel applications with large working sets (e.g. streaming applications), which may suffer from memory bandwidth saturation as more threads are introduced. In such cases, the application will probably not scale as expected, however efficiently it is parallelized, and performance will be rather poor.

Each socket in the platform has a maximum bandwidth to memory that is shared by all processing elements it encompasses (cores, hardware threads, etc.). Depending on the architecture, even multiple sockets might share access to main memory through a common bus. Given that even under perfect conditions (e.g. full software optimizations) the memory subsystem cannot fulfill a single thread's requests without having its core stalled, we can imagine the amount of pressure put on the memory bus and how quickly it can be saturated when the number of cores increases. This is why the processor industry strives to provide improved bus speeds or alternate memory paths on each new processor generation (e.g. through Non-Uniform Memory Access designs), but unfortunately these enhancements have never been enough to make the memory subsystem keep pace with the increasing core counts.

c) Load imbalance

Most parallelism in high performance applications lies within loops, and static scheduling has long been the method of choice for parallelizing them. In static scheduling, the number of iterations is divided by the number of processors and each processor is assigned a specific chunk to process. While this approach is easy to implement, has low overhead and works well when all iterations of the loop perform approximately the same amount of work (e.g. in dense linear algebra computations), it results in poor performance and load imbalance when iterations perform unpredictable and varying amounts of work, usually as a result of different input data. This happens for example in sparse linear algebra kernels, where the sparsity pattern of the matrix determines the amount of computation associated with a specific iteration sub-range. Because at the end of each parallel loop all threads are synchronized before proceeding to the next phase, the parallel execution time of the loop is actually determined by the execution time of the slowest thread (which forms the critical path of parallel execution). Under considerable load imbalance, some threads may end up taking over the vast majority of computations while others sit idle, wasting CPU cycles and not contributing to overall progress. This leads to inefficient utilization of system resources, preventing scalability from reaching higher levels.
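
To make this concrete, consider the triangular loop below (a hypothetical sketch; N and process_pair are placeholders). The work per iteration shrinks with i, so the default static partitioning into equally sized blocks hands the first threads far more work than the last ones:

 #pragma omp parallel for          /* default (static) scheduling */
 for (int i = 0; i < N; i++)
     for (int j = i; j < N; j++)   /* iteration i performs N-i inner steps */
         process_pair(i, j);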

d) Synchronization overhead

Parallel applications use locks to synchronize entry to program regions where access should be atomic, usually to protect some shared resource from concurrent updates, such as a shared variable or an output file. Such regions are known as critical sections. While a thread is inside a critical section, no other thread can enter, implying that all threads requesting access must wait. As a result, the execution of critical sections is serialized. There are many parameters of lock usage and implementation that can affect the performance of a parallel application; the most important are the following:

  • critical section extent: Since critical sections constitute a serialization point in parallel execution, they should be as small as possible to reduce the amount of time other threads sit idle waiting to enter the critical section, and to therefore allow the application to scale efficiently.
  • lock contention: If there is high demand among threads for accessing the resource protected by a critical section, there could be simultaneously many threads contending for the corresponding lock. In the best case, the waiting time for a thread could scale linearly with the total number of competing threads. In practice, however, things can be even worse, as the period a thread spends waiting can introduce significant coherence traffic due to read-write sharing on the lock variable.
  • locking overhead: The operations to acquire and release a lock entail by themselves some measurable overhead. In general, there are two main classes of lock implementations with different characteristics with respect to cost and scalability: user-level spinning and OS-level blocking (see the sketch after this list). In the first case, threads continuously poll the value of a user-level lock variable until it appears to be free. Spin-based locks are efficient and thus preferable for small thread counts (e.g. less than 10) and short critical sections. Because they operate entirely in user space, acquire and release operations have low latency and are easier to implement. However, in critical sections that incur long waiting times (i.e., large extent and/or high contention) they can introduce notable coherence traffic. The same also happens for large thread counts. In these cases spin-based locks do not scale well and OS-based ones may perform better. A thread that uses an OS-based lock goes to sleep in kernel mode if it fails to acquire the lock. This action, as well as waking up a sleeping thread, comes at the cost of a system call. OS-based locks are harder to implement and penalize short critical sections (e.g., critical sections whose execution time is almost equal to or less than the invocation time of the acquire/release operations), but they are much more efficient and scalable for long-term waiting and large thread counts, since waiters do not consume resources (CPU, memory).
  • lock granularity: The granularity is a measure of the amount of data the lock is protecting. In many cases, the programmer has the freedom to select between coarse-grain and fine-grain schemes, as regards the implementation of synchronized access on a shared data structure. In the former, a small number of locks is utilized to protect large segments in the data structure. This approach is easy in its conception and implementation, but entails a lot of performance drawbacks, i.e. large critical sections and increased likelihood for contention. Conversely, fine-grain schemes use many locks to protect small pieces of data. They are usually more difficult and error-prone to program, but yield smaller critical sections with reduced contention. However, such small critical sections, even completely uncontended, can introduce overhead in two ways: first, because of the additional bus traffic due to the large number of lock variables, and second, because of the increased locking overhead associated with each memory operation on the protected data structure (compared to a coarser-grain scheme). Finding the best granularity level is not trivial and requires considering all the aforementioned factors.
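
As a rough illustration of the two lock classes discussed in the list above (a simplified sketch, not a production implementation), the fragment below contrasts a user-level test-and-set spinlock built on GCC's atomic builtins with a POSIX mutex, which blocks in the kernel when contended:

 #include <pthread.h>

 /* User-level spinning: acquire/release are a few instructions, but a
    waiting thread burns CPU cycles and generates coherence traffic. */
 static volatile int spin = 0;

 static void spin_lock(void)   { while (__sync_lock_test_and_set(&spin, 1)) ; }
 static void spin_unlock(void) { __sync_lock_release(&spin); }

 /* OS-level blocking: a thread that fails to acquire the lock sleeps in
    the kernel, paying a system call but freeing the CPU for long waits
    and large thread counts. */
 static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;

 static void blocking_lock(void)   { pthread_mutex_lock(&mtx); }
 static void blocking_unlock(void) { pthread_mutex_unlock(&mtx); }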

e) Task granularity

Finally, a possible reason for poor performance can be a wrong choice of parallel task granularity. Granularity is the amount of work executed by a task. The programmer can usually control it by changing the number of consecutive iterations assigned to a parallel thread, or by changing the loop level in a loop nest at which parallelization is applied. If granularity is too coarse, performance may suffer from load imbalance, since it is more likely to find uneven work in large iteration chunks. If granularity is too fine and many small chunks are assigned to each thread, better load balancing can be achieved but performance may suffer from parallelization overhead. This overhead is usually the aggregate cost of the following operations: the cost of assigning a parallel task (e.g., a chunk of iterations) to a thread, the cost of sharing data between threads and the cost of thread synchronization. The cost of a single task assignment is usually independent of the task granularity, but with a large number of fine-grain tasks its aggregate contribution grows. Furthermore, a fine-grain scheme may require more frequent data sharing and synchronization between threads compared to a coarser-grain one, both of which affect performance negatively, as we have already discussed.
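
For instance (an illustrative sketch; M, N and work are placeholders), in the loop nest below the granularity can be controlled by choosing which loop level to parallelize: the outer loop yields a few coarse tasks, while parallelizing the inner loop creates many fine tasks and repeats the parallelization overhead M times:

 /* Coarse-grain: one task per outer iteration, M tasks in total. */
 #pragma omp parallel for
 for (int i = 0; i < M; i++)
     for (int j = 0; j < N; j++)
         work(i, j);

 /* Fine-grain: the parallel region is entered M times, each time
    distributing N small tasks; overhead grows, but load balancing
    opportunities increase. */
 for (int i = 0; i < M; i++) {
     #pragma omp parallel for
     for (int j = 0; j < N; j++)
         work(i, j);
 }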


OpenMP

OpenMP is a successful programming model for shared memory parallel architectures, particularly popular in the HPC community. It provides a simple and portable high-level interface for developing parallel applications in C, C++ and Fortran, across a wide range of processor architectures and operating systems. It consists of a set of compiler directives (pragmas), library routines and environment variables. Its simplicity lies in the fact that a serial application can be transformed into a parallel one in a rather straightforward manner, just by adding compiler pragmas chosen from a small set and without major source code restructuring. An advantage of this incremental parallelization approach is that OpenMP pragmas are ignored when sequential compilation options are used, so a single source code version can be maintained for both the serial and the parallel versions of an application. OpenMP is supported by all mainstream compilers, including gcc and the Intel compilers.

OpenMP is based on the fork-join threading model. Each time an omp parallel pragma is encountered, the executing thread (known as the master thread) spawns a specified number of slave threads that run concurrently on the different cores of the platform. Using a number of work-sharing constructs (for, do, sections, single, master) the programmer specifies how work (loop iterations, code blocks) should be assigned to one or all of the threads. The actual work distribution is performed by the runtime system. Since OpenMP is a shared-memory programming model, most variables in the source code are visible to all threads by default. Sometimes, however, it is desirable to keep some variables private (i.e., local to each thread), or to specify in more detail how a variable is treated in its transition from the sequential part to the parallel region. For this reason OpenMP provides data sharing attribute clauses (shared, private, firstprivate, lastprivate, reduction, etc.), which can be appended to the work-sharing constructs to guide the compiler/runtime system on how to treat each variable in the associated code block. Finally, an important set of extensions provided by OpenMP are the clauses related to synchronization (critical, atomic, barrier, nowait), which we further explore in the following paragraphs.
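
The following small sketch (ours, for illustration; the function and variable names are hypothetical) shows these elements combined in a parallel loop with explicit data-sharing clauses and a reduction:

 /* Scale a vector and accumulate the sum of the scaled values. */
 double scale_and_sum(const double *x, double *y, int n, double factor)
 {
     int i;
     double sum = 0.0;
     /* x, y, n and factor are shared by all threads; the loop index i is
        private to each thread; sum is a reduction variable whose private
        partial results are combined when the loop ends. */
     #pragma omp parallel for shared(x, y, n, factor) private(i) reduction(+:sum)
     for (i = 0; i < n; i++) {
         y[i] = factor * x[i];
         sum += y[i];
     }
     return sum;
 }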

Using OpenMP to cope with scalability problems

OpenMP can help a parallel application improve its scalability in the following ways:

i) To balance the load in loops with varying or unpredictable work in each iteration step, OpenMP provides two dynamic scheduling options to be used in the work-sharing loop constructs, as an alternative to the default static scheduling policy. The first one, dynamic, dynamically assigns chunks of contiguous iterations to threads. The chunk size, i.e. the number of iterations handed to a thread at each assignment, can optionally be specified as an argument. Whenever a thread "consumes" its chunk of iterations, it requests and is assigned the next available chunk by the runtime system. In this way threads are always kept busy, system resources are utilized efficiently and the load imbalance is reduced. As we mentioned in the previous section, small iteration chunks achieve better load balancing but entail larger overhead, and vice-versa. There is no general recipe for choosing the optimal chunk size for every dynamically scheduled loop; the user should experiment with many options to find the one that makes the best compromise between low overhead and good scalability. The following example shows how an OpenMP parallel for loop can be annotated to use dynamic scheduling.


Default (static) scheduling, chunk size = N/#threads:

 #pragma omp parallel for
 for (i=0; i<N; i++)
 {
   varying_work(i);
 }

Dynamic scheduling with chunk size 10:

 #pragma omp parallel for schedule(dynamic,10)
 for (i=0; i<N; i++)
 {
   varying_work(i);
 }


The second dynamic policy, guided, initially assigns large iteration chunks to threads and reduces them exponentially with each successive assignment. Due to the "geometry" of the chunk allocation, guided scheduling usually has less overhead than dynamic. The minimum chunk size below which no further reduction is allowed can optionally be specified through an argument, as with the dynamic policy.
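
For example, the loop used above could request guided scheduling with a minimum chunk of 16 iterations as follows (an illustrative variant of the earlier snippets):

 #pragma omp parallel for schedule(guided,16)
 for (i=0; i<N; i++)
 {
   varying_work(i);
 }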

Another possible scenario is when the load of a loop is unbalanced in a clearly coarse-grain fashion: work is uneven across large iteration sub-ranges, but within a given sub-range consecutive iterations have almost equal weight. In that case, it could be more profitable to use static scheduling with a small chunk size instead of paying the overhead of dynamic scheduling. By specifying a chunk size C in static scheduling, a thread is assigned C consecutive iterations and this assignment repeats cyclically, i.e. every C*T iterations, T being the number of threads. The key point here is that the iteration assignment for each thread is pre-computed at compile time, so no runtime scheduling overhead is paid, yet the loop's slowly changing nature combined with the cyclic work assignment guarantees a fair level of load balancing.


Default (static) scheduling, chunk size = N/#threads:

 #pragma omp parallel for
 for (i=0; i<N; i++)
 {
   slowly_varying_work(i);
 }

Static scheduling with chunk size 5:

 #pragma omp parallel for schedule(static,5)
 for (i=0; i<N; i++)
 {
   slowly_varying_work(i);
 }

ii) For critical section synchronization, OpenMP provides the critical and atomic clauses. The first one executes the enclosed code block atomically, using lock acquire and release methods as implemented internally in the library. Although the OpenMP standard does not specify how these methods should be implemented, the most popular OpenMP libraries (e.g. gcc, Intel) use blocking-style acquire/release methods, meaning that threads are de-scheduled in the kernel when they fail to acquire the lock. In any case, the critical clause ties the programmer to a particular lock implementation, with all its pros and cons. However, as we have already discussed, there is no single lock implementation that performs equally well under all circumstances, and in that sense the critical clause does not provide flexibility.

The atomic clause instructs the compiler to use atomic instructions provided by the underlying architecture in order to execute the specified block atomically. Many architectures extend their instruction set with annotations that enable simple read-modify-write memory operations (e.g. addition, subtraction, increment, logical or, etc.) to execute atomically, as an indivisible operation. At the low level, the hardware guarantees atomicity by issuing special signals to lock all or part of the memory subsystem while the corresponding instruction executes. For example, the x86 architecture uses the lock prefix to annotate certain types of instructions, which asserts a #LOCK signal on the bus. Atomic instructions are much more efficient than lock-based implementations, since the lock acquire and release operations are realized entirely in hardware and usually require only a few cycles. However, they have some disadvantages: first, they are not appropriate for every use, but only for small critical sections consisting of a single instruction that the underlying ISA can execute atomically. If the OpenMP compiler decides that an atomic code block cannot be transformed into an equivalent atomic instruction, it falls back to ordinary lock acquire/release operations, as in the critical clause. Second, even hardware locking does not scale well as the number of threads increases. Thus, under certain conditions (e.g. heavy contention) atomic instructions may themselves become a bottleneck.
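
As a small sketch (ours; counter and increment are placeholder variables), a simple shared update can be written with either clause; critical always goes through the runtime's lock, while atomic lets the compiler map the update to a single hardware read-modify-write instruction (e.g. a lock-prefixed add on x86) when possible:

 /* Runtime lock acquire/release around the block. */
 #pragma omp critical
 {
     counter += increment;
 }

 /* Mapped to an atomic hardware instruction when the ISA supports it. */
 #pragma omp atomic
 counter += increment;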

Global critical sections, as implemented with the critical clause, may unnecessarily harm performance because every thread competes to acquire a single, global lock, often without actually needing to do so. Consider, for example, a loop in which no concurrent updates of the same element of an array B (by means of a "process()" function) are allowed, while the index of the element of B to be updated in each iteration step is known only at runtime, through accessing an index array A. The naive approach is to use a critical clause to protect the execution of "process()". This, however, permits only one thread to enter the critical section at a time, even if the other threads intend to work on different elements of B. What we actually need is finer-grain control over the critical section, with lock acquire and release operations on a per-element basis. This can be achieved in OpenMP by using a distinct lock for each element of B and acquiring the right one through the index computed at runtime. In that version, the library routines omp_set_lock and omp_unset_lock are used to directly acquire and release, respectively, the appropriate lock. Both versions are sketched below.
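
Since the code snippets referred to above are not reproduced in this revision, the following sketch reconstructs the two versions described in the text; the array sizes, the locks array, the initialization code and the process() prototype are assumptions used only for illustration:

 #include <omp.h>

 #define M  1000000    /* number of loop iterations (assumed)     */
 #define NB 1024       /* number of elements of B (assumed)       */

 int A[M];             /* A[i] selects the element of B to update */
 double B[NB];
 omp_lock_t locks[NB]; /* one lock per element of B               */

 void process(double *elem);   /* assumed to update one element of B */

 /* Naive version: a single global critical section serializes all calls
    to process(), even when threads touch different elements of B. */
 void update_naive(void)
 {
     #pragma omp parallel for
     for (int i = 0; i < M; i++) {
         #pragma omp critical
         process(&B[A[i]]);
     }
 }

 /* Fine-grain version: each element of B has its own lock, so only true
    conflicts on the same element are serialized. */
 void update_fine_grain(void)
 {
     for (int j = 0; j < NB; j++)   /* one-time lock initialization */
         omp_init_lock(&locks[j]);

     #pragma omp parallel for
     for (int i = 0; i < M; i++) {
         omp_set_lock(&locks[A[i]]);
         process(&B[A[i]]);
         omp_unset_lock(&locks[A[i]]);
     }
 }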

POSIX threads

Generic Optimization Guidelines

MESSAGE PASSING PARADIGM

MPICH

Profilers

gprof

valgrind

Oprofile

Use cases
