Optimization techniques for scalability

From HP-SEE Wiki

Revision as of 11:54, 30 March 2012 by Roczei (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Contents

SHARED MEMORY PARADIGM

Achieving scalable performance for parallel applications on modern shared-memory multiprocessing platforms is not straightforward. Apart from issues inherent to each application, for example the ability to efficiently decompose it in parallel tasks, other factors such as the interaction with the underlying architecture may affect scalability significantly. Often, even applications with ample and easy to extract parallelism (e.g., large independent parallel portions, sparse synchronization, etc.) fail to scale well on multiple cores due to architectural bottlenecks. Similarly, applications exhibiting good scalability on a specific platform may need extensive re-factoring to run equally well on a different one. Things can become even worse if the application itself raises inherent difficulties for efficient parallelization.

Some of the most common sources of poor scalability met in practice are the following:

a) Excessive read-write sharing, false sharing and misaligned memory accesses

One of the most essential features of modern multiprocessor systems is the cache coherence protocol. Its responsibility is to ensure that all the processors' caches share data from system memory properly and do not use stale data that has been modified in another processor's cache. In other words, the protocol makes it possible for all processors to work like they are all connected directly to a single, globally shared memory module, while actually working with caches and multiple cached copies of data. This abstraction of shared memory eases parallel programming significantly, because the parallel threads of an application can refer directly to memory locations (just as in sequential programs), but may hide serious performance pitfalls.

Excessive read-write sharing between processors on a centralized data structure can introduce a large amount of traffic on the bus. This is because on every write operation a processor has to invalidate all existing copies of the cache line holding the data structure, and on subsequent reads from other processors the modified cache line has to be transferred to their caches. This "ping-pong" effect usually introduces large performance penalties, not only because of the actual data transfers but also due to the large amount of protocol control messages being exchanged. The cost associated with each such read or write operation can be many times larger than an ordinary read or write to private data, and increases with the number of sharers, the distance (in terms of processor topology) of the requestors, the frequency of read-write operations, etc.. Typical examples of variables being heavily shared across processors under a read-write pattern are reduction and synchronization variables.

Another common issue is when threads on different processors modify variables that happen to reside on the same cache line. This problem is known as false sharing, because it occurs unintentionally (i.e. threads do not actually access the same variable), but has all the performance drawbacks of true read-write sharing discussed above. While the programmer cannot usually do many things to avoid or eliminate true sharing at the application level, he can do much more to remedy false sharing, as we will discuss in following sections.

Finally, a not so common but extremely serious, in terms of performance, scenario is that of misaligned memory accesses. The x86 architecture is infamous for allowing the execution of loads and stores at unaligned addresses, i.e. at addresses that are not evenly divisible by the word size or multiples of it.

Practically, most compilers force variables of primitive data types to be aligned at a boundary that is a multiple of the word size, but it might be possible for a variable to be allocated at a misaligned address, e.g. through customized allocation mechanisms on the heap. The danger here is that the variable may end up spanning adjacent cache lines, both of which should be transferred each time it is referenced. This introduces enormous overhead when such variables are being shared by many processors, which scales linearly with the number of sharers. In practice, even when only two processors share a misaligned variable they can suffer an average access time as high as 50 times larger than the aligned case.

b) Memory bandwidth saturation

Memory access can still become a serious bottleneck even when threads of a parallel application work mostly on private data, without incurring notable inter-processor traffic. This is particularly true for memory-intensive parallel applications with large working sets (e.g. streaming applications), which may suffer from memory bandwidth saturation as more threads are introduced. In such cases, the application will not probably scale as expected, however efficiently parallelized it is, and the performance will be rather poor.

Each socket in the platform has a maximum bandwidth to memory thatis shared by all processing elements it encompasses (cores, hardware threads, etc.). Depending on the architecture, even multiple sockets might share access to main memory through a common bus. Given that even under perfect conditions (e.g. full software optimizations) the memory subsystem cannot fulfill a single thread's requests without having its core stalled, we can imagine the amount of pressure put on memory bus and how quickly it can be saturated when the number of cores increases. This is why the processor industry strives to provide improved bus speeds or alternate memory paths on each new processor generation (e.g. through Non-Uniform Memory Access designs), but unfortunately these enhancements have never been enough to make the memory subsystem keep pace with the increasing core counts.

c) Load imbalance

Most parallelism in high performance applications lies within loops, and static scheduling has long been the method of choice for parallelizing them. In static scheduling, the number of iterations is divided by the number of processors and each processor is assigned a specific chunk to process. While this approach is easy to implement, has low overhead and works well when all iterations of the loop perform approximately the same amount of work (e.g. in dense linear algebra computations), it results in poor performance and load imbalance when iterations perform unpredictable and varying amounts of work, usually as a result of different input data. This happens for example in sparse linear algebra kernels, where the sparsity pattern of the matrix determines the amount of computations associated with a specific iteration sub-range. Because at the end of each parallel loop all threads are synchronized before proceeding to the next phase, the parallel execution time of the loop is actually determined by the execution time of the slowest thread (which forms the critical path of parallel execution). Under considerable load imbalance, some threads may end up taking over the vast majority of computations while others sitting idle, wasting CPU cycles and not contributing to overall progress. This leads to inefficient utilization of system resources, preventing scalability from reaching higher levels.

d) Synchronization overhead

Parallel applications use locks to synchronize entry to program regions where access should be atomic, usually to protect some shared resource from concurrent updates, such as a shared variable or an output file. Such regions are known as critical sections.While a thread is inside a critical section no other thread can enter, implying that all threads requesting access should wait. As a result, the execution of critical sections is serialized. There are many parameters of lock usage and implementation that can affect performance of a parallel application. Below we mention the most important of them:

  • critical section extent: Since critical sections constitute a serialization point in parallel execution, they should be as small as possible to reduce the amount of time other threads sit idle waiting to enter the critical section, and to therefore allow the application to scale efficiently.
  • lock contention: If there is high demand among threads for accessing the resource protected by a critical section, there could be simultaneously many threads contending for the corresponding lock. In the best case, the waiting time for a thread could scale linearly with the total number of competing threads. In practice, however, things can be even worse, as the period a thread spends waiting can introduce significant coherence traffic due to read-write sharing on the lock variable.
  • locking overhead: The operations to acquire and release a lock entail by themselves some measurable overhead. In general, there are two main classes of lock implementations with different characteristics with respect to cost and scalability: user-level spinning and OS-level blocking. In the first case, threads poll continuously the value of a user-level lock variable until it appears to be free. Spin-based locks are efficient and thus preferable for small thread counts (e.g. less than 10) and short critical sections. Because they operate entirely at user space, acquire and release operations have low latency and are easier to implement. However, in critical sections thatincur long waiting times (i.e., large extent and/or high contention) they can introduce notable coherence traffic. The same happens also for large thread counts. In these cases spin-based locks do not scale well and OS-based ones may perform better. A thread that uses an OS-based lock goes to sleep in kernel mode if it fails to acquire the lock. This action, as well as the action of waking up a sleeping thread comes at the cost of executing a system call. It is harder to implement, it penalizes short critical sections (e.g., critical sections with execution time almost equal or less than the invocation time of acquire/release operations) but it is much more efficient and scalable for long-term waiting and large thread counts, since waiters do not consume resources (CPU, memory).
  • lock granularity: The granularity is a measure of the amount of data the lock is protecting. In many cases, the programmer has the freedom to select between coarse-grain and fine-grain schemes, as regards the implementation of synchronized access on a shared data structure. In the former, a small number of locks is utilized to protect large segments in the data structure. This approach is easy in its conception and implementation, but entails a lot of performance drawbacks, i.e. large critical sections and increased likelihood for contention. Conversely, fine-grain schemes use many locks to protect small pieces of data. They are usually more difficult and error-prone to program, but yield smaller critical sections with reduced contention. However, such small critical sections, even completely uncontended, can introduce overhead in two ways: first, because of the additional bus traffic due to the large number of lock variables, and second, because of the increased locking overhead associated with each memory operation on the protected data structure (compared to a coarser-grain scheme). Finding the best granularity level is not trivial and requires considering all the aforementioned factors.

e) Task granularity

Finally, a possible reason for bad performance can be the wrong choice of parallel tasks granularity. Granularity is the amount of work executed by a task. The programmer can control it usually by means of changing the number of consecutive iterations assigned to a parallel thread, or changing the loop level in a loop nest where parallelization is applied. If granularity is too coarse, performance may suffer from load imbalance since it is more likely to find uneven work in large iteration chunks. If granularity is too fine and many small chunks are assigned to each thread, better load balancing can be achieved but performance could suffer from parallelization overhead. This overhead is usually the aggregate cost of the following operations: the cost of assigning a parallel task (e.g., a chunk of iterations) to a thread, the cost of sharing data between threads and the cost of thread synchronization. The cost of task assignment is usually irrelevant to the task granularity, but due to the large number of fine-grain tasks it is increased overall. Furthermore, a fine-grain scheme may require more frequent data sharing and synchronization between threads compared to a coarser-grain one, both of which affect performance negatively, as we have already discussed.


OpenMP

POSIX threads

Generic Optimization Guidelines

MESSAGE PASSING PARADIGM

MPICH

Profilers

gprof

valgrind

Oprofile

Use cases

Personal tools