Optimization techniques for scalability

From HP-SEE Wiki

(Difference between revisions)
Jump to: navigation, search
(gprof)
Line 317: Line 317:
= Profilers=
= Profilers=
== gprof ==
== gprof ==
 +
 +
Gprof is a widely used profiling tool for Unix systems thatproduces an execution profile of C and Fortran programs. It can show the application call graph, which represents the calling relationships between functions in the program, and the percentage of total execution time spent in each function. The tool first instruments the application executable with information that will help build the call graph. It then runs the application and samples periodically the execution to track what code is currently being executed. This process is known as statistical sampling. Instrumentation can accurately determine the number of times a function is called, while sampling approximates the time spent within each function. Because of the bookkeeping overhead of instrumentation and the use of a special system call needed for sampling, it is likely that gprof will slow down the application execution.
 +
 +
To profile an application with gprof, the first step is to compile its source code with the -pg option of gcc. It is sometimes useful to turn on symbol information using the -g3 option of gcc, which is necessary if the user wishes the source code to be annotated with profiling information. Then, the user can run the instrumented application and a special output file is generated (by default gmon.out). Finally, the user can invoke gprof as follows to display the profiling results:
 +
 +
  gprof [--flat-profile --graph --annotated-source] ./program
 +
 +
The --flat-profile option prints the total amount of time spent and the number of calls to each function. The --graph option shows the call graph. The --annotated-source option prints profiling information next to the original source code.
 +
The main advantage of gprof is that it offers a "quick-and-dirty" way to identify hot-spots in an application, so that the programmer can focus on the most time-consuming parts of it as candidates for parallelization. Unfortunately, the tool can only profile the "user-level" part of an application, meaning that it cannot profile external shared libraries (e.g. libc) or system calls. A serious sideeffect of this is that multithreaded applications cannot be profiled. Actually, in such cases profiling information is gathered only for the main application thread, which is not helpful.
 +
== valgrind ==
== valgrind ==
== Oprofile ==
== Oprofile ==
= Use cases =
= Use cases =

Revision as of 12:41, 30 March 2012

Contents

SHARED MEMORY PARADIGM

Achieving scalable performance for parallel applications on modern shared-memory multiprocessing platforms is not straightforward. Apart from issues inherent to each application, for example the ability to efficiently decompose it in parallel tasks, other factors such as the interaction with the underlying architecture may affect scalability significantly. Often, even applications with ample and easy to extract parallelism (e.g., large independent parallel portions, sparse synchronization, etc.) fail to scale well on multiple cores due to architectural bottlenecks. Similarly, applications exhibiting good scalability on a specific platform may need extensive re-factoring to run equally well on a different one. Things can become even worse if the application itself raises inherent difficulties for efficient parallelization.

Some of the most common sources of poor scalability met in practice are the following:

a) Excessive read-write sharing, false sharing and misaligned memory accesses

One of the most essential features of modern multiprocessor systems is the cache coherence protocol. Its responsibility is to ensure that all the processors' caches share data from system memory properly and do not use stale data that has been modified in another processor's cache. In other words, the protocol makes it possible for all processors to work like they are all connected directly to a single, globally shared memory module, while actually working with caches and multiple cached copies of data. This abstraction of shared memory eases parallel programming significantly, because the parallel threads of an application can refer directly to memory locations (just as in sequential programs), but may hide serious performance pitfalls.

Excessive read-write sharing between processors on a centralized data structure can introduce a large amount of traffic on the bus. This is because on every write operation a processor has to invalidate all existing copies of the cache line holding the data structure, and on subsequent reads from other processors the modified cache line has to be transferred to their caches. This "ping-pong" effect usually introduces large performance penalties, not only because of the actual data transfers but also due to the large amount of protocol control messages being exchanged. The cost associated with each such read or write operation can be many times larger than an ordinary read or write to private data, and increases with the number of sharers, the distance (in terms of processor topology) of the requestors, the frequency of read-write operations, etc.. Typical examples of variables being heavily shared across processors under a read-write pattern are reduction and synchronization variables.

Another common issue is when threads on different processors modify variables that happen to reside on the same cache line. This problem is known as false sharing, because it occurs unintentionally (i.e. threads do not actually access the same variable), but has all the performance drawbacks of true read-write sharing discussed above. While the programmer cannot usually do many things to avoid or eliminate true sharing at the application level, he can do much more to remedy false sharing, as we will discuss in following sections.

Finally, a not so common but extremely serious, in terms of performance, scenario is that of misaligned memory accesses. The x86 architecture is infamous for allowing the execution of loads and stores at unaligned addresses, i.e. at addresses that are not evenly divisible by the word size or multiples of it.

Practically, most compilers force variables of primitive data types to be aligned at a boundary that is a multiple of the word size, but it might be possible for a variable to be allocated at a misaligned address, e.g. through customized allocation mechanisms on the heap. The danger here is that the variable may end up spanning adjacent cache lines, both of which should be transferred each time it is referenced. This introduces enormous overhead when such variables are being shared by many processors, which scales linearly with the number of sharers. In practice, even when only two processors share a misaligned variable they can suffer an average access time as high as 50 times larger than the aligned case.

b) Memory bandwidth saturation

Memory access can still become a serious bottleneck even when threads of a parallel application work mostly on private data, without incurring notable inter-processor traffic. This is particularly true for memory-intensive parallel applications with large working sets (e.g. streaming applications), which may suffer from memory bandwidth saturation as more threads are introduced. In such cases, the application will not probably scale as expected, however efficiently parallelized it is, and the performance will be rather poor.

Each socket in the platform has a maximum bandwidth to memory thatis shared by all processing elements it encompasses (cores, hardware threads, etc.). Depending on the architecture, even multiple sockets might share access to main memory through a common bus. Given that even under perfect conditions (e.g. full software optimizations) the memory subsystem cannot fulfill a single thread's requests without having its core stalled, we can imagine the amount of pressure put on memory bus and how quickly it can be saturated when the number of cores increases. This is why the processor industry strives to provide improved bus speeds or alternate memory paths on each new processor generation (e.g. through Non-Uniform Memory Access designs), but unfortunately these enhancements have never been enough to make the memory subsystem keep pace with the increasing core counts.

c) Load imbalance

Most parallelism in high performance applications lies within loops, and static scheduling has long been the method of choice for parallelizing them. In static scheduling, the number of iterations is divided by the number of processors and each processor is assigned a specific chunk to process. While this approach is easy to implement, has low overhead and works well when all iterations of the loop perform approximately the same amount of work (e.g. in dense linear algebra computations), it results in poor performance and load imbalance when iterations perform unpredictable and varying amounts of work, usually as a result of different input data. This happens for example in sparse linear algebra kernels, where the sparsity pattern of the matrix determines the amount of computations associated with a specific iteration sub-range. Because at the end of each parallel loop all threads are synchronized before proceeding to the next phase, the parallel execution time of the loop is actually determined by the execution time of the slowest thread (which forms the critical path of parallel execution). Under considerable load imbalance, some threads may end up taking over the vast majority of computations while others sitting idle, wasting CPU cycles and not contributing to overall progress. This leads to inefficient utilization of system resources, preventing scalability from reaching higher levels.

d) Synchronization overhead

Parallel applications use locks to synchronize entry to program regions where access should be atomic, usually to protect some shared resource from concurrent updates, such as a shared variable or an output file. Such regions are known as critical sections.While a thread is inside a critical section no other thread can enter, implying that all threads requesting access should wait. As a result, the execution of critical sections is serialized. There are many parameters of lock usage and implementation that can affect performance of a parallel application. Below we mention the most important of them:

  • critical section extent: Since critical sections constitute a serialization point in parallel execution, they should be as small as possible to reduce the amount of time other threads sit idle waiting to enter the critical section, and to therefore allow the application to scale efficiently.
  • lock contention: If there is high demand among threads for accessing the resource protected by a critical section, there could be simultaneously many threads contending for the corresponding lock. In the best case, the waiting time for a thread could scale linearly with the total number of competing threads. In practice, however, things can be even worse, as the period a thread spends waiting can introduce significant coherence traffic due to read-write sharing on the lock variable.
  • locking overhead: The operations to acquire and release a lock entail by themselves some measurable overhead. In general, there are two main classes of lock implementations with different characteristics with respect to cost and scalability: user-level spinning and OS-level blocking. In the first case, threads poll continuously the value of a user-level lock variable until it appears to be free. Spin-based locks are efficient and thus preferable for small thread counts (e.g. less than 10) and short critical sections. Because they operate entirely at user space, acquire and release operations have low latency and are easier to implement. However, in critical sections thatincur long waiting times (i.e., large extent and/or high contention) they can introduce notable coherence traffic. The same happens also for large thread counts. In these cases spin-based locks do not scale well and OS-based ones may perform better. A thread that uses an OS-based lock goes to sleep in kernel mode if it fails to acquire the lock. This action, as well as the action of waking up a sleeping thread comes at the cost of executing a system call. It is harder to implement, it penalizes short critical sections (e.g., critical sections with execution time almost equal or less than the invocation time of acquire/release operations) but it is much more efficient and scalable for long-term waiting and large thread counts, since waiters do not consume resources (CPU, memory).
  • lock granularity: The granularity is a measure of the amount of data the lock is protecting. In many cases, the programmer has the freedom to select between coarse-grain and fine-grain schemes, as regards the implementation of synchronized access on a shared data structure. In the former, a small number of locks is utilized to protect large segments in the data structure. This approach is easy in its conception and implementation, but entails a lot of performance drawbacks, i.e. large critical sections and increased likelihood for contention. Conversely, fine-grain schemes use many locks to protect small pieces of data. They are usually more difficult and error-prone to program, but yield smaller critical sections with reduced contention. However, such small critical sections, even completely uncontended, can introduce overhead in two ways: first, because of the additional bus traffic due to the large number of lock variables, and second, because of the increased locking overhead associated with each memory operation on the protected data structure (compared to a coarser-grain scheme). Finding the best granularity level is not trivial and requires considering all the aforementioned factors.

e) Task granularity

Finally, a possible reason for bad performance can be the wrong choice of parallel tasks granularity. Granularity is the amount of work executed by a task. The programmer can control it usually by means of changing the number of consecutive iterations assigned to a parallel thread, or changing the loop level in a loop nest where parallelization is applied. If granularity is too coarse, performance may suffer from load imbalance since it is more likely to find uneven work in large iteration chunks. If granularity is too fine and many small chunks are assigned to each thread, better load balancing can be achieved but performance could suffer from parallelization overhead. This overhead is usually the aggregate cost of the following operations: the cost of assigning a parallel task (e.g., a chunk of iterations) to a thread, the cost of sharing data between threads and the cost of thread synchronization. The cost of task assignment is usually irrelevant to the task granularity, but due to the large number of fine-grain tasks it is increased overall. Furthermore, a fine-grain scheme may require more frequent data sharing and synchronization between threads compared to a coarser-grain one, both of which affect performance negatively, as we have already discussed.


OpenMP

OpenMP is a successful programming model for shared memory parallel architectures, particularly popular in the HPC community. It provides a simple and portable high-level interface for developing parallel applications in C, C++ and Fortran, in a wide range of processor architectures and operating systems. It consists of a set of compiler directives (pragmas), library routines and environment variables. Its simplicity lies in the fact that a serial application can be transformed into a parallel one in a rather straightforward manner, just by adding certain compiler pragmas chosen from a small set, and without requiring major source code restructuring. An advantage of this incremental parallelization approach is that OpenMP pragmas are ignored when sequential compiling options are used, therefore a single source code version needs to be maintained both for serial and parallel applications. OpenMP is supported by all mainstream compilers, including gcc and Intel compilers.

OpenMP is based on the fork-join threading model. Each time an omp parallel pragma is encountered, the executing thread (known as master thread) spawns a specified number of slave threads thatrun concurrently on the different cores of the platform. Using a number of work-sharing constructs (for, do, sections, single, master) the programmer specifies how work (loop iterations, code blocks) should be assigned to one or all of the threads. The actual work distribution is performed by the runtime system. Since OpenMP is a shared-memory programming model, most variables in the source code are visible to all threads by default. But sometimes it is desirable to keep some variables private (i.e., local to each thread), or to specify in more detail how a variable mutates in its transition from the sequential part to the parallel region. For this reason OpenMP provides data sharing attribute clauses (shared, private, firstprivate, lastprivate, reduction, etc.) which can be appended to the work-sharing constructs in order to guide the compiler/runtime system on how to treat each variable in the associated code block. Finally, an important set of extensions provided by OpenMP are clauses related to synchronization (critical, atomic, barrier, nowait), which we will further explore in the following paragraphs.

Using OpenMP to cope with scalability problems

OpenMP can help a parallel application improve its scalability in the following ways:

i) To balance the load in loops with varying or unpredictable work in each iteration step, OpenMP provides two dynamic scheduling options to be used in the work-sharing loop constructs, as an alternative to the default static scheduling policy. The first one, dynamic, assigns dynamically chunks of contiguous iterations to threads. The chunk size, i.e. the number of iterations assigned to each thread, can be optionally specified as an argument. Whenever a thread "consumes" its chunk of iterations, it requests and gets assigned by the runtime system the next available. In this way threads are kept always busy, system resources are utilized efficiently and the load imbalance is reduced. As we mentioned in the previous section, small iteration chunks achieve better load balancing but entail larger overhead, and vice-versa. There is no general recipe for choosing the optimal chunk size for every dynamically scheduled loop; the user should experiment with many options to find the one that makes the best compromise between low overhead and good scalability. In the following example it is shown how an OpenMP parallel for loop can be annotated to use dynamic scheduling.


Default (static) scheduling chunk size = N/#threads Dynamic scheduling with chunk size 10
 #pragma omp parallel for 
 for (i=0; i<N; i++) 
 {
   varying_work(i);
 }
 #pragma omp parallel for schedule(dynamic,10)
 for (i=0; i<N; i++) 
 {
    varying_work(i);
 }


The second dynamic policy, guided, assigns initially large iteration chunks to threads, reducing them exponentially with each successive assignment. Due to the "geometry" of the chunks allocation, guided scheduling has usually less overhead than dynamic. The minimum size of chunk size beyond of which no further reduction is allowed can be optionally specified through an argument, as in the case of dynamic policy.

Another possible scenario is when the load of a loop is unbalanced in a clearly coarse-grain fashion: work is uneven across large iteration sub-ranges, but within a certain sub-range consecutive iterations almost have an equal weight. In that case, it could be more profitable to use static scheduling with a small chunk size instead of paying the overhead of dynamic scheduling. By specifying a chunk size C in static scheduling, a thread is assigned C consecutive iterations and this assignment repeats cyclically, i.e. every C*T iterations, T being the number of threads. The key concept here is that the iteration assignment for each thread is pre-computed at compile time, thus no runtime overhead must be paid, yet the loop's slowly changing nature combined with the cyclic work assignment guarantee a fair level of load balancing.


Default (static) scheduling chunk size = N/#threads Static scheduling with chunk size 5
 #pragma omp parallel for
 for (i=0; i<N; i++) 
 {
   slowly_varying_work(i);
 }
 #pragma omp parallel for schedule(static,5)
 for (i=0; i<N; i++) 
 {
   slowly_varying_work(i);
 }


ii)For critical section synchronizationOpenMP provides the critical and atomic clauses. The first one executes atomically the enclosed code block using lock acquire and release methods, as implemented internally in the library. Although the OpenMP standard does not specify how these methods should be implemented, the most popular OpenMP libraries (e.g. gcc, Intel) use blocking-style acquire/release methods, meaning that threads are de-scheduled in the kernel when they fail to acquire the lock. In any case, the critical clause restricts programmer to use a particular lock implementation, with all its pros and cons. However, as we have already discussed, there is no single lock implementation that could perform equally well under all circumstances, and in that sense the critical clause does not provide flexibility.

The atomic clause instructs the compiler to use atomic instructions provided by the underlying architecture in order to execute atomically the specified block. Many architecture extend their instruction set with annotations that enable simple read-modify-write memory operations (e.g. addition, subtraction, increase, logical or, etc.) to execute atomically, as an indivisible operation. At the low level, the hardware guarantees atomicity by issuing special signals to lock the entire or part of the memory subsystem while the corresponding instruction executes. For example, the x86 architecture uses the lock prefix to annotate certain types of instructions, which further asserts a #LOCK signal on the bus. Atomic instructions are much more efficient compared to lock-based implementations, since the lock acquire and release operations are realized entirely in hardware and usually require only a few cycles. However, they have some disadvantages: first, they are not appropriate for any use, but only for small critical sections consisting of single instruction that the underlying ISA can execute atomically. If the OpenMP compiler decides that an atomic code block cannot be transformed into an equivalent atomic instruction, it falls back to ordinary lock acquire/release operations as in the critical clause. Second, even hardware locking does not scale well as the thread number increases. Thus, using atomic instructions under certain conditions (e.g. heavy contention) may become a bottleneck.

Global critical sections, as implemented with the critical clause, may unnecessarily harm performance because every thread is actually competing to acquire a single, global lock, without often needing to do so. Let's consider for example the following code snippet on the left, where in each loop iteration no concurrent updates on the same element of B (by means of "process()" function) are allowed. The index of the element of B that needs to be updated in each iteration step is known only at runtime, through accessing array A. The naive approach is to use a critical clause and protect the execution of "process()". This, however, permits only one thread to enter the critical section each time, even if others intend to work on different elements of B. What we actually need is a more fine-grain control over the critical section, with lock release and acquire operations on a per-element basis. This can be achieved in OpenMP using a distinct lock for each element of B, and acquiring the right one using the corresponding index computed at runtime. This is shown on the code snippet on the right. Here, the library methods omp_set_lock and omp_unset_lock are used to directly acquire and release, respectively, the appropriate lock.


Coarse-grain locking Fine-grain locking
 #pragma omp parallel for
 for (i=0; i<N; i++) 
 {
   j = A[i];
   #pragma omp critical
   process(B[j]);
 }
 omp_lock_t locks[N];
 ...
 #pragma omp parallel for
 for (i=0; i<N; i++) 
 { 
   j = A[i];
   omp_set_lock(&locks[j]);
   process(B[j]);
   omp_unset_lock(&locks[j]);
 }


Many threading libraries, including OpenMP, provide alternatives to OS-blocking or spin-waiting when lock acquisition fails. Non-blocking synchronization mechanisms allow a thread to immediately return on an unsuccessful attempt to acquire the lock, instead of waiting until it succeeds. If the application can be refactored to provide alternate execution paths for threads failing to acquire a lock, then it could benefit from such mechanisms and increase the percentage of time where threads perform useful work. The method that OpenMP provides for this purpose is omp_test_lock; the thread first tests the condition of the lock, and only if it appears to be free it attempts to acquire it, otherwise the thread immediately returns.

iii) To minimize read-write sharing, a common practice in parallel processing is to decentralize shared memory references as much as possible, keep them local on each thread, until the very last moment they need to be merged or communicated between threads. This approach not only reduces coherence traffic due to data localization, but also eliminates synchronization and all its sideeffects. Consider, for example, a simple parallel for loop that counts the zero elements of an array. A naive implementation is depicted in the following code snippet on the left. On each encounter of a zero, this version suffers the overhead of executing an atomic instruction (locking part of the memory subsystem), plus the cost of transferring the cache line containing the variable "zeros" to the current processor's cache (if it is not already there). The suggested implementation is shown on the right. It uses per-thread local variables ("my_zeros") to keep the partial sums of zero elements found by each thread. After the parallel loop execution, the partial sums are reduced to a global sum in a second parallel section. Depending on the zero pattern of the array, the second version can provide dramatically better scalability compared to the first one.


Naive version Thread-local version
 int zeros = 0;
 #pragma omp parallel for
 for (i=0; i<N; i++) 
 {
   if (!A[i]) 
   {
     #pragma omp atomic 
     zeros++;
   }
 }
 int zeros = 0;
 int my_zeros = 0;
 #pragma omp threadprivate(my_zeros)
 
 #pragma omp parallel for
 for (i=0; i<N; i++) 
 {
   if (!A[i]) 
     my_zeros++;
 }
 
 #pragma omp parallel shared(zeros)
 {
    #pragma omp atomic
    zeros += my_zeros;
 }


iv) Regarding task granularity, there is no certain recipe for choosing the best one, as in the chunk size case. Nevertheless, a general rule of thumb is that one should start with coarse-grain tasks and refine them progressively until a fair amount of load balancing is achieved. This means that in a loop nest the outer loop level should be considered first for parallelization before examining the inner levels. Similarly, in a single loop the programmer should start with large chunk sizes and gradually decrease them, again until good scalability is attained. Finally, the programmer should be particularly careful not to choose loops whose parallelization would be unprofitable. These are usually short-duration loops that have small trip count, few instructions within their body, or both. Sometimes it is preferable to let such loops execute sequentially, since the total parallelization cost would be larger than their sequential execution time.

POSIX threads

POSIX threads (Pthreads) provide a portable, low-level interface for raw thread management and synchronization. The programmer is responsible to manage every detail of parallel application: he needs to create, schedule and coordinate threads, define precisely the function executed by each one and orchestrate work distribution and synchronization on shared data. Because of this, Pthreads offer maximum flexibility, but at the cost of high programming effort, error proneness, debugging time, and often performance loss. We could say that they represent somehow the "assembly language" of parallel programming.

The Pthreads API defines a rich set of C types, functions and constants for thread management (create, join, yield, kill, etc.), synchronization (locks, mutexes, barriers, condition variables, etc.) and thread attribute configuration (e.g. thread stack size). The functions and types are usually prefixed by "pthread_".

Using an explicit threading API such as Pthreads, the programmer can gain more fine-grain control over what is being executed by each thread and on which processor, therefore enabling for better interaction between the parallel application and the architecture. This functionality can be supported in OpenMP, as well, but not in a natural and incremental way, since OpenMP lacks appropriate extensions to explicitly control thread-processor mapping. On the other hand, in Pthreads there is no such thing as a runtime system to automatically schedule loop iterations to threads. As a result, the programmer has the responsibility to efficiently manage threads, by avoiding situations that could severely harm performance (e.g. excessive thread creation/destruction, thread oversubscription or undersubscription, etc.), and provide good scalability, load balancing, and adaptation under varying execution environments (different thread counts, architectures, operating systems, etc.).

Using Pthreads to cope with scalability problems

The OpenMP optimizations discussed in the previous section apply also to Pthreads, yet it is generally more tedious to implement them using the low-level API of Pthreads. The following table presents the correspondence between OpenMP and Pthreads with respect to features provided by each to support these optimizations.


OpenMP Pthreads
Load balancing schedule(dynamic|guided|static) clause requires manual orchestration
Synchronization critical sections critical clause pthread_mutex_*, pthread_spin_*, pthread_rwlock_* functions
atomic operations atomic clause compiler-specific atomic builtins e.g., __synch_* intrinsics in gcc and Intel compilers
non-blocking synchronization omp_test_lock function *_trylock and *_timedlock functions
Per-thread variables threadprivate clause

1) pthread_key_*, pthread_setspecific, pthread_getspecific functions

2) special compiler extensions, e.g. __thread storage class specifier in gcc and Intel compilers

Note: the previous apply to variables with global scope, which are thread-shared by default in Pthreads. Variables local in a thread routine are always thread-local.

Task granularity same rules apply in both cases


We note that, regarding synchronization, Pthreads provide three main lock versions: mutexes (OS-based blocking locks), spin-locks (user-level spin-based locks) and readers-writer locks. In this way, the programmer has the flexibility to choose the option that best fits the profile of a given critical section (w.r.t. thread count, contention, etc.). Readers-writer locks are particularly profitable when the synchronization pattern is such that the critical section is accessed mostly for reading and less for writing. These locks allow concurrent access to multiple threads that intend to read within the critical section, but restrict access to a single thread for writing. Most of the aforementioned lock implementations come with their non-blocking variants, by means of either try-lock functions (similar to omp_test_lock) or timed-lock functions. In the latter, a thread blocks until either it acquires the lock or the specified timeout expires.

In Pthreads the programmer needs to define explicitly the way that work is being distributed to threads, the data range that each thread processes and how data is being shared betweenthreads. Despite the increased flexibility this model offers, performance may still be limited due to inefficient interaction of threads with the parallel architecture. More specifically, a typical OS scheduler will usually assign a thread to a specific core according to some policy, trying to avoid thread bouncing between cores and establish a "1-1" mapping. Yet, the scheduler has no knowledge about each thread's working set or the sharing and communication patterns of the application, and, hence, cannot decide how threads should be optimally scheduled to mitigate performance problems due to inefficient utilization of memory hierarchy. Below we describe some situations where user-directed thread scheduling can offer performance advantages compared to the "application-agnostic" OS scheduler's decisions:

Case 1 - constructive cache sharing: Modern parallel architectures are characterized by complex cache organizations, with different cache levels being shared hierarchically among cores. Within a single chip, fast and small caches (e.g., L1) are usually private or shared by a small number of hardware threads, while large but slow ones (e.g. L3) are shared by all cores. When a team of threads in the parallel application work on the same or adjacent data, it might be beneficial to place them on the same chip, under the fastest level of cache that they can all share. In this way, a cache block will need to be fetched only by the first thread that requests it, and afterwards can be reused in the shared cache by the rest threads of the team. On the contrary, if threads were distributed to different chips on the platform, their common data should be transferred multiple times from main memory and placed in each chip's cache as a separate copy. At best, this would increase the pressure on the memory bus. At worst, if threads are further updating these data, there would be additional coherence traffic due to multiple bounces of the corresponding cache lines between chips.

Case 2 - destructive cache sharing: The opposite situation can also be true: when application threads work mostly on disjoint data, then having them under a common cache can be harmful. Threads compete for the same cache space, evicting each other’s data from cache. This reduces the effective cache capacity for each thread, increases the amount of data that need to be fetched multiple times as a result of inter-thread evictions and thus consumes bus bandwidth. On the contrary, if threads in such scenarios are scheduled under separate caches, cache thrashing and additional bus pressure can be minimized.

Case 3 - NUMA-aware thread placement: In the last years, processor industry has made a shift from shared-bus multiprocessors to Non-Uniform Memory Access (NUMA) designs, in an attempt to enhance scalability. Typical examples of such technologies are AMD's HyperTransport and Intel's Quick-Path Interconnect. Such systems incorporate multiple integrated memory controllers and buses, so that each chip (or cluster of chips) can have a separate path to main memory. Potentially, this can eliminate bus contention and provide better scalability, but at the cost of asymmetric memory access times. That is, a processor can use its local bus to access an adjacent memory node in a fast and uncontended way, but accessing a remote node incurs additional latency. As in the previous cases, an arbitrary placement of threads on cores may not take advantage of NUMA capabilities. For example, a pathologic situation is when threads have disjoint working sets but all of them are allocated on a single memory node. This can lead to bus saturation as all threads contend for the same bus. Instead, a much better approach would be to evenly distribute data sets across memory nodes, and then schedule each thread to the core that is local to the memory node containing its data set. Things can be even worse when data/thread placement happens to be so bad, that each thread's accesses are satisfied at their majority by remote nodes. This will significantly degrade performance, even if the parallel application is not so memory-intensive to saturate the bus.

In all aforementioned scenarios performance problems can be solved if the programmer explicitly controls the mapping of application threads to processors. Using processor affinity functions provided by some Pthreads implementations (e.g. Linux), the programmer can bind threads to specific processors on the platform, and thus impose certain mappings between each thread's data and the location of memory hierarchy it resides. Furthermore, using memory affinity functions provided by special NUMA libraries, he can control what data is being allocated on which memory node in a NUMA system, move data between nodes, etc.. All these optimizations help to mitigate bandwidth saturation problems and allow better cache utilization.

The Pthreads implementation in Linux provides the pthread_setaffinity_np and pthread_getaffinity_np functions to set and get, respectively, the processor affinity of a single thread. By processor affinity we mean the set of processors where a thread is allowed to run, and is defined by means of a bit-mask which is known as the CPUset. By setting a single bit in the CPUset of a thread, we can restrict it to run on the corresponding processor. Once the processor affinity of a thread is set, the OS scheduler will take it into account and will not move the thread to a different processor than this specified. The CPUsets can be manipulated using the CPU_* macros defined in "sched.h" header. The user can find information about the processors id's, topology, cache sharing, etc., using the /sys/devices/system/cpu/ pseudo file system.

In NUMA architectures, one can use functions provided by the libnuma library available in Linux, in order to specify a certain page-allocation policy for each thread or to explicitly control allocation of data. For example, using the numa_alloc_onnode function the programmer directs OS to dynamically allocate data on a specific memory node. Note that many of these functions serve as hints to the OS, in the sense that allocation is usually done in a best-effort manner; if data cannot be allocated to a preferred memory node, e.g. because it has run out of pages, it will be allocated somewhere else. Similarly to CPUsets, libnuma's functions work with node sets, which are bit-masks specifying the memory nodes where a function's actions should be applied. The user can learn about the memory nodes id's and the set of cores that are local to each node via the /sys/devices/system/node/ file system.

Generic Optimization Guidelines

In the following sections we give some generic guidelines on how to cope with the rest scalability issues mentioned in section 1. These guidelines can by applied regardless of the parallelization library used.

Avoiding false-sharing and misaligned memory accesses

To correct falsesharing, we have to make sure that the variables causing it are allocated in memory with enough space between them, so that they cannot reside on the same cache line. A common case where falsesharing is met in practice is between two variables declared consecutively, e.g. within a structure, as shown in the following code snippet on the left. Most compilers will allocate structure's variables in contiguous memory locations in order to save space, so it is quite likely that "var1" and "var2" will fall into the same cache line. To avoid this, we can use compiler directives to force variables to be placed at a different cache line, by aligning them at the cache line boundary (64 bytes, in most systems). The code on the right shows how this is achieved in the case of gcc compiler. Intel compiler provides the __declspec (align(N)) directive, N being the alignment boundary.


Default case (causes false-sharing) Properly aligned variables to avoid false-sharing (gcc)
 struct shared_data {
   int var1; //updated by thread1
   int var2; //updated by thread2
 }
 struct shared_data {
   int var1 __attribute__ ((aligned (64)));
   int var1 __attribute__ ((aligned (64)));
 };


Another problematic case is when using an array of structures and the aggregate size of the structure's variables is not a multiple of the cache line size. For the same reason as in the previous example, it might be possible that a new array element does not start from a new cache line, and therefore accessing consecutive array elements may cause falsesharing. The following example presents such a scenario, with an array of structures representing threads' arguments. What is needed in that case is to pad the structure at the end, in order to make sure that the next element begins on a new cache line, and also align the array itself (its first element) at the cache line boundary.


Default case (causes false-sharing) Properly aligned array elements to avoid false-sharing (gcc)
 struct thread_data { 
   //total size: 3*4 = 12 bytes
   int my_x; 
   int my_y;
   int my_sum; 
 }
 
 struct thread_data args[8];
 struct thread_data { 
   int my_x; 
   int my_y;
   int my_sum; 
   int pad[13]; //adds 13*4=52 dummy bytes to fill
                     //the remainder of the cache line
 }
 
 //align array to begin at the beginning of a cache line
 struct thread_data args[8] __attribute__ ((aligned (64)));


The same alignment directives can be used to cope with the less frequent, yet possible scenario of a single variable being allocated at a misaligned memory address, in such a way that it spans consecutive cache lines.

Other memory optimizations

Placing each shared variable in the program on a separate cache line would be the ideal solution to solve issues such as falsesharing. This approach is impractical, however, since it would unnecessarily increase the memory footprint of the program and would not allow to take advantage of spatial locality in shared cache configurations, e.g. when threads work on adjacent data. What we actually need is a more judicious classification of shared variables according to their usage pattern.

Global variables that are read-only (or rarely written) should be grouped together, in hope that the compiler will allocate them in a separate memory region. In this way, we decrease the likelihood that a variable that is mostly written lies somewhere between the read-only data, which could be another source of falsesharing. Furthermore, in NUMA platforms, when threads share a large amount of variables in a read-only fashion, it could make sense to have them replicated on each memory node. This would allow all threads to access the variables at the highest speed, without putting all pressure on a single memory bus.

Global read-write variables that are often used close to each other in time can also be grouped, e.g. by being declared consecutively or within a structure. This would reduce their memory footprint and potentially increase the spatial locality of the program, since multiple variables could be transferred with a few cache line fetches.

Global read-write variables that are mostly accessed by different threads should be placed on a different cache line. As explained in the previous section, this is achieved by aligning each variable to the cache line boundary, and then add padding to fill the remainder of the cache line.

Profilers

gprof

Gprof is a widely used profiling tool for Unix systems thatproduces an execution profile of C and Fortran programs. It can show the application call graph, which represents the calling relationships between functions in the program, and the percentage of total execution time spent in each function. The tool first instruments the application executable with information that will help build the call graph. It then runs the application and samples periodically the execution to track what code is currently being executed. This process is known as statistical sampling. Instrumentation can accurately determine the number of times a function is called, while sampling approximates the time spent within each function. Because of the bookkeeping overhead of instrumentation and the use of a special system call needed for sampling, it is likely that gprof will slow down the application execution.

To profile an application with gprof, the first step is to compile its source code with the -pg option of gcc. It is sometimes useful to turn on symbol information using the -g3 option of gcc, which is necessary if the user wishes the source code to be annotated with profiling information. Then, the user can run the instrumented application and a special output file is generated (by default gmon.out). Finally, the user can invoke gprof as follows to display the profiling results:

 gprof [--flat-profile --graph --annotated-source] ./program

The --flat-profile option prints the total amount of time spent and the number of calls to each function. The --graph option shows the call graph. The --annotated-source option prints profiling information next to the original source code. The main advantage of gprof is that it offers a "quick-and-dirty" way to identify hot-spots in an application, so that the programmer can focus on the most time-consuming parts of it as candidates for parallelization. Unfortunately, the tool can only profile the "user-level" part of an application, meaning that it cannot profile external shared libraries (e.g. libc) or system calls. A serious sideeffect of this is that multithreaded applications cannot be profiled. Actually, in such cases profiling information is gathered only for the main application thread, which is not helpful.

valgrind

Oprofile

Use cases

Personal tools