From HP-SEE Wiki

Shared memory paradigm

Achieving scalable performance for parallel applications on modern shared-memory multiprocessing platforms is not straightforward. Apart from issues inherent to each application, for example the ability to efficiently decompose it in parallel tasks, other factors such as the interaction with the underlying architecture may affect scalability significantly. Often, even applications with ample and easy to extract parallelism (e.g., large independent parallel portions, sparse synchronization, etc.) fail to scale well on multiple cores due to architectural bottlenecks. Similarly, applications exhibiting good scalability on a specific platform may need extensive re-factoring to run equally well on a different one. Things can become even worse if the application itself raises inherent difficulties for efficient parallelization.

Some of the most common sources of poor scalability met in practice are the following:

a) Excessive read-write sharing, false sharing and misaligned memory accesses

One of the most essential features of modern multiprocessor systems is the cache coherence protocol. Its responsibility is to ensure that all the processors' caches share data from system memory properly and do not use stale data that has been modified in another processor's cache. In other words, the protocol makes it possible for all processors to work like they are all connected directly to a single, globally shared memory module, while actually working with caches and multiple cached copies of data. This abstraction of shared memory eases parallel programming significantly, because the parallel threads of an application can refer directly to memory locations (just as in sequential programs), but may hide serious performance pitfalls.

Excessive read-write sharing between processors on a centralized data structure can introduce a large amount of traffic on the bus. This is because on every write operation a processor has to invalidate all existing copies of the cache line holding the data structure, and on subsequent reads from other processors the modified cache line has to be transferred to their caches. This "ping-pong" effect usually introduces large performance penalties, not only because of the actual data transfers but also due to the large amount of protocol control messages being exchanged. The cost associated with each such read or write operation can be many times larger than an ordinary read or write to private data, and increases with the number of sharers, the distance (in terms of processor topology) of the requestors, the frequency of read-write operations, etc.. Typical examples of variables being heavily shared across processors under a read-write pattern are reduction and synchronization variables.

Another common issue is when threads on different processors modify variables that happen to reside on the same cache line. This problem is known as false sharing, because it occurs unintentionally (i.e. threads do not actually access the same variable), but has all the performance drawbacks of true read-write sharing discussed above. While the programmer cannot usually do many things to avoid or eliminate true sharing at the application level, he can do much more to remedy false sharing, as we will discuss in following sections.

Finally, a not so common but extremely serious, in terms of performance, scenario is that of misaligned memory accesses. The x86 architecture is infamous for allowing the execution of loads and stores at unaligned addresses, i.e. at addresses that are not evenly divisible by the word size or multiples of it.

Practically, most compilers force variables of primitive data types to be aligned at a boundary that is a multiple of the word size, but it might be possible for a variable to be allocated at a misaligned address, e.g. through customized allocation mechanisms on the heap. The danger here is that the variable may end up spanning adjacent cache lines, both of which should be transferred each time it is referenced. This introduces enormous overhead when such variables are being shared by many processors, which scales linearly with the number of sharers. In practice, even when only two processors share a misaligned variable they can suffer an average access time as high as 50 times larger than the aligned case.

b) Memory bandwidth saturation

Memory access can still become a serious bottleneck even when threads of a parallel application work mostly on private data, without incurring notable inter-processor traffic. This is particularly true for memory-intensive parallel applications with large working sets (e.g. streaming applications), which may suffer from memory bandwidth saturation as more threads are introduced. In such cases, the application will not probably scale as expected, however efficiently parallelized it is, and the performance will be rather poor.

Each socket in the platform has a maximum bandwidth to memory thatis shared by all processing elements it encompasses (cores, hardware threads, etc.). Depending on the architecture, even multiple sockets might share access to main memory through a common bus. Given that even under perfect conditions (e.g. full software optimizations) the memory subsystem cannot fulfill a single thread's requests without having its core stalled, we can imagine the amount of pressure put on memory bus and how quickly it can be saturated when the number of cores increases. This is why the processor industry strives to provide improved bus speeds or alternate memory paths on each new processor generation (e.g. through Non-Uniform Memory Access designs), but unfortunately these enhancements have never been enough to make the memory subsystem keep pace with the increasing core counts.

c) Load imbalance

Most parallelism in high performance applications lies within loops, and static scheduling has long been the method of choice for parallelizing them. In static scheduling, the number of iterations is divided by the number of processors and each processor is assigned a specific chunk to process. While this approach is easy to implement, has low overhead and works well when all iterations of the loop perform approximately the same amount of work (e.g. in dense linear algebra computations), it results in poor performance and load imbalance when iterations perform unpredictable and varying amounts of work, usually as a result of different input data. This happens for example in sparse linear algebra kernels, where the sparsity pattern of the matrix determines the amount of computations associated with a specific iteration sub-range. Because at the end of each parallel loop all threads are synchronized before proceeding to the next phase, the parallel execution time of the loop is actually determined by the execution time of the slowest thread (which forms the critical path of parallel execution). Under considerable load imbalance, some threads may end up taking over the vast majority of computations while others sitting idle, wasting CPU cycles and not contributing to overall progress. This leads to inefficient utilization of system resources, preventing scalability from reaching higher levels.

d) Synchronization overhead

Parallel applications use locks to synchronize entry to program regions where access should be atomic, usually to protect some shared resource from concurrent updates, such as a shared variable or an output file. Such regions are known as critical sections.While a thread is inside a critical section no other thread can enter, implying that all threads requesting access should wait. As a result, the execution of critical sections is serialized. There are many parameters of lock usage and implementation that can affect performance of a parallel application. Below we mention the most important of them:

critical section extent: Since critical sections constitute a serialization point in parallel execution, they should be as small as possible to reduce the amount of time other threads sit idle waiting to enter the critical section, and to therefore allow the application to scale efficiently.
lock contention: If there is high demand among threads for accessing the resource protected by a critical section, there could be simultaneously many threads contending for the corresponding lock. In the best case, the waiting time for a thread could scale linearly with the total number of competing threads. In practice, however, things can be even worse, as the period a thread spends waiting can introduce significant coherence traffic due to read-write sharing on the lock variable.
locking overhead: The operations to acquire and release a lock entail by themselves some measurable overhead. In general, there are two main classes of lock implementations with different characteristics with respect to cost and scalability: user-level spinning and OS-level blocking. In the first case, threads poll continuously the value of a user-level lock variable until it appears to be free. Spin-based locks are efficient and thus preferable for small thread counts (e.g. less than 10) and short critical sections. Because they operate entirely at user space, acquire and release operations have low latency and are easier to implement. However, in critical sections thatincur long waiting times (i.e., large extent and/or high contention) they can introduce notable coherence traffic. The same happens also for large thread counts. In these cases spin-based locks do not scale well and OS-based ones may perform better. A thread that uses an OS-based lock goes to sleep in kernel mode if it fails to acquire the lock. This action, as well as the action of waking up a sleeping thread comes at the cost of executing a system call. It is harder to implement, it penalizes short critical sections (e.g., critical sections with execution time almost equal or less than the invocation time of acquire/release operations) but it is much more efficient and scalable for long-term waiting and large thread counts, since waiters do not consume resources (CPU, memory).
lock granularity: The granularity is a measure of the amount of data the lock is protecting. In many cases, the programmer has the freedom to select between coarse-grain and fine-grain schemes, as regards the implementation of synchronized access on a shared data structure. In the former, a small number of locks is utilized to protect large segments in the data structure. This approach is easy in its conception and implementation, but entails a lot of performance drawbacks, i.e. large critical sections and increased likelihood for contention. Conversely, fine-grain schemes use many locks to protect small pieces of data. They are usually more difficult and error-prone to program, but yield smaller critical sections with reduced contention. However, such small critical sections, even completely uncontended, can introduce overhead in two ways: first, because of the additional bus traffic due to the large number of lock variables, and second, because of the increased locking overhead associated with each memory operation on the protected data structure (compared to a coarser-grain scheme). Finding the best granularity level is not trivial and requires considering all the aforementioned factors.

e) Task granularity

Finally, a possible reason for bad performance can be the wrong choice of parallel tasks granularity. Granularity is the amount of work executed by a task. The programmer can control it usually by means of changing the number of consecutive iterations assigned to a parallel thread, or changing the loop level in a loop nest where parallelization is applied. If granularity is too coarse, performance may suffer from load imbalance since it is more likely to find uneven work in large iteration chunks. If granularity is too fine and many small chunks are assigned to each thread, better load balancing can be achieved but performance could suffer from parallelization overhead. This overhead is usually the aggregate cost of the following operations: the cost of assigning a parallel task (e.g., a chunk of iterations) to a thread, the cost of sharing data between threads and the cost of thread synchronization. The cost of task assignment is usually irrelevant to the task granularity, but due to the large number of fine-grain tasks it is increased overall. Furthermore, a fine-grain scheme may require more frequent data sharing and synchronization between threads compared to a coarser-grain one, both of which affect performance negatively, as we have already discussed.

OpenMP

OpenMP is a successful programming model for shared memory parallel architectures, particularly popular in the HPC community. It provides a simple and portable high-level interface for developing parallel applications in C, C++ and Fortran, in a wide range of processor architectures and operating systems. It consists of a set of compiler directives (pragmas), library routines and environment variables. Its simplicity lies in the fact that a serial application can be transformed into a parallel one in a rather straightforward manner, just by adding certain compiler pragmas chosen from a small set, and without requiring major source code restructuring. An advantage of this incremental parallelization approach is that OpenMP pragmas are ignored when sequential compiling options are used, therefore a single source code version needs to be maintained both for serial and parallel applications. OpenMP is supported by all mainstream compilers, including gcc and Intel compilers.

OpenMP is based on the fork-join threading model. Each time an omp parallel pragma is encountered, the executing thread (known as master thread) spawns a specified number of slave threads thatrun concurrently on the different cores of the platform. Using a number of work-sharing constructs (for, do, sections, single, master) the programmer specifies how work (loop iterations, code blocks) should be assigned to one or all of the threads. The actual work distribution is performed by the runtime system. Since OpenMP is a shared-memory programming model, most variables in the source code are visible to all threads by default. But sometimes it is desirable to keep some variables private (i.e., local to each thread), or to specify in more detail how a variable mutates in its transition from the sequential part to the parallel region. For this reason OpenMP provides data sharing attribute clauses (shared, private, firstprivate, lastprivate, reduction, etc.) which can be appended to the work-sharing constructs in order to guide the compiler/runtime system on how to treat each variable in the associated code block. Finally, an important set of extensions provided by OpenMP are clauses related to synchronization (critical, atomic, barrier, nowait), which we will further explore in the following paragraphs.

Using OpenMP to cope with scalability problems

OpenMP can help a parallel application improve its scalability in the following ways:

i) To balance the load in loops with varying or unpredictable work in each iteration step, OpenMP provides two dynamic scheduling options to be used in the work-sharing loop constructs, as an alternative to the default static scheduling policy. The first one, dynamic, assigns dynamically chunks of contiguous iterations to threads. The chunk size, i.e. the number of iterations assigned to each thread, can be optionally specified as an argument. Whenever a thread "consumes" its chunk of iterations, it requests and gets assigned by the runtime system the next available. In this way threads are kept always busy, system resources are utilized efficiently and the load imbalance is reduced. As we mentioned in the previous section, small iteration chunks achieve better load balancing but entail larger overhead, and vice-versa. There is no general recipe for choosing the optimal chunk size for every dynamically scheduled loop; the user should experiment with many options to find the one that makes the best compromise between low overhead and good scalability. In the following example it is shown how an OpenMP parallel for loop can be annotated to use dynamic scheduling.

Default (static) scheduling chunk size = N/#threads	Dynamic scheduling with chunk size 10
#pragma omp parallel for for (i=0; i<N; i++) { varying_work(i); }	#pragma omp parallel for schedule(dynamic,10) for (i=0; i<N; i++) { varying_work(i); }

The second dynamic policy, guided, assigns initially large iteration chunks to threads, reducing them exponentially with each successive assignment. Due to the "geometry" of the chunks allocation, guided scheduling has usually less overhead than dynamic. The minimum size of chunk size beyond of which no further reduction is allowed can be optionally specified through an argument, as in the case of dynamic policy.

Another possible scenario is when the load of a loop is unbalanced in a clearly coarse-grain fashion: work is uneven across large iteration sub-ranges, but within a certain sub-range consecutive iterations almost have an equal weight. In that case, it could be more profitable to use static scheduling with a small chunk size instead of paying the overhead of dynamic scheduling. By specifying a chunk size C in static scheduling, a thread is assigned C consecutive iterations and this assignment repeats cyclically, i.e. every C*T iterations, T being the number of threads. The key concept here is that the iteration assignment for each thread is pre-computed at compile time, thus no runtime overhead must be paid, yet the loop's slowly changing nature combined with the cyclic work assignment guarantee a fair level of load balancing.

Default (static) scheduling chunk size = N/#threads	Static scheduling with chunk size 5
#pragma omp parallel for for (i=0; i<N; i++) { slowly_varying_work(i); }	#pragma omp parallel for schedule(static,5) for (i=0; i<N; i++) { slowly_varying_work(i); }

ii)For critical section synchronizationOpenMP provides the critical and atomic clauses. The first one executes atomically the enclosed code block using lock acquire and release methods, as implemented internally in the library. Although the OpenMP standard does not specify how these methods should be implemented, the most popular OpenMP libraries (e.g. gcc, Intel) use blocking-style acquire/release methods, meaning that threads are de-scheduled in the kernel when they fail to acquire the lock. In any case, the critical clause restricts programmer to use a particular lock implementation, with all its pros and cons. However, as we have already discussed, there is no single lock implementation that could perform equally well under all circumstances, and in that sense the critical clause does not provide flexibility.

The atomic clause instructs the compiler to use atomic instructions provided by the underlying architecture in order to execute atomically the specified block. Many architecture extend their instruction set with annotations that enable simple read-modify-write memory operations (e.g. addition, subtraction, increase, logical or, etc.) to execute atomically, as an indivisible operation. At the low level, the hardware guarantees atomicity by issuing special signals to lock the entire or part of the memory subsystem while the corresponding instruction executes. For example, the x86 architecture uses the lock prefix to annotate certain types of instructions, which further asserts a #LOCK signal on the bus. Atomic instructions are much more efficient compared to lock-based implementations, since the lock acquire and release operations are realized entirely in hardware and usually require only a few cycles. However, they have some disadvantages: first, they are not appropriate for any use, but only for small critical sections consisting of single instruction that the underlying ISA can execute atomically. If the OpenMP compiler decides that an atomic code block cannot be transformed into an equivalent atomic instruction, it falls back to ordinary lock acquire/release operations as in the critical clause. Second, even hardware locking does not scale well as the thread number increases. Thus, using atomic instructions under certain conditions (e.g. heavy contention) may become a bottleneck.

Global critical sections, as implemented with the critical clause, may unnecessarily harm performance because every thread is actually competing to acquire a single, global lock, without often needing to do so. Let's consider for example the following code snippet on the left, where in each loop iteration no concurrent updates on the same element of B (by means of "process()" function) are allowed. The index of the element of B that needs to be updated in each iteration step is known only at runtime, through accessing array A. The naive approach is to use a critical clause and protect the execution of "process()". This, however, permits only one thread to enter the critical section each time, even if others intend to work on different elements of B. What we actually need is a more fine-grain control over the critical section, with lock release and acquire operations on a per-element basis. This can be achieved in OpenMP using a distinct lock for each element of B, and acquiring the right one using the corresponding index computed at runtime. This is shown on the code snippet on the right. Here, the library methods omp_set_lock and omp_unset_lock are used to directly acquire and release, respectively, the appropriate lock.

Coarse-grain locking

Fine-grain locking

 #pragma omp parallel for
 for (i=0; i<N; i++) 
 {
   j = A[i];
   #pragma omp critical
   process(B[j]);
 }

 omp_lock_t locks[N];
 ...
 #pragma omp parallel for
 for (i=0; i<N; i++) 
 { 
   j = A[i];
   omp_set_lock(&locks[j]);
   process(B[j]);
   omp_unset_lock(&locks[j]);
 }

Many threading libraries, including OpenMP, provide alternatives to OS-blocking or spin-waiting when lock acquisition fails. Non-blocking synchronization mechanisms allow a thread to immediately return on an unsuccessful attempt to acquire the lock, instead of waiting until it succeeds. If the application can be refactored to provide alternate execution paths for threads failing to acquire a lock, then it could benefit from such mechanisms and increase the percentage of time where threads perform useful work. The method that OpenMP provides for this purpose is omp_test_lock; the thread first tests the condition of the lock, and only if it appears to be free it attempts to acquire it, otherwise the thread immediately returns.

iii) To minimize read-write sharing, a common practice in parallel processing is to decentralize shared memory references as much as possible, keep them local on each thread, until the very last moment they need to be merged or communicated between threads. This approach not only reduces coherence traffic due to data localization, but also eliminates synchronization and all its sideeffects. Consider, for example, a simple parallel for loop that counts the zero elements of an array. A naive implementation is depicted in the following code snippet on the left. On each encounter of a zero, this version suffers the overhead of executing an atomic instruction (locking part of the memory subsystem), plus the cost of transferring the cache line containing the variable "zeros" to the current processor's cache (if it is not already there). The suggested implementation is shown on the right. It uses per-thread local variables ("my_zeros") to keep the partial sums of zero elements found by each thread. After the parallel loop execution, the partial sums are reduced to a global sum in a second parallel section. Depending on the zero pattern of the array, the second version can provide dramatically better scalability compared to the first one.

Naive version

Thread-local version

 int zeros = 0;
 #pragma omp parallel for
 for (i=0; i<N; i++) 
 {
   if (!A[i]) 
   {
     #pragma omp atomic 
     zeros++;
   }
 }

 int zeros = 0;
 int my_zeros = 0;
 #pragma omp threadprivate(my_zeros)
 
 #pragma omp parallel for
 for (i=0; i<N; i++) 
 {
   if (!A[i]) 
     my_zeros++;
 }
 
 #pragma omp parallel shared(zeros)
 {
    #pragma omp atomic
    zeros += my_zeros;
 }

iv) Regarding task granularity, there is no certain recipe for choosing the best one, as in the chunk size case. Nevertheless, a general rule of thumb is that one should start with coarse-grain tasks and refine them progressively until a fair amount of load balancing is achieved. This means that in a loop nest the outer loop level should be considered first for parallelization before examining the inner levels. Similarly, in a single loop the programmer should start with large chunk sizes and gradually decrease them, again until good scalability is attained. Finally, the programmer should be particularly careful not to choose loops whose parallelization would be unprofitable. These are usually short-duration loops that have small trip count, few instructions within their body, or both. Sometimes it is preferable to let such loops execute sequentially, since the total parallelization cost would be larger than their sequential execution time.

POSIX threads

POSIX threads (Pthreads) provide a portable, low-level interface for raw thread management and synchronization. The programmer is responsible to manage every detail of parallel application: he needs to create, schedule and coordinate threads, define precisely the function executed by each one and orchestrate work distribution and synchronization on shared data. Because of this, Pthreads offer maximum flexibility, but at the cost of high programming effort, error proneness, debugging time, and often performance loss. We could say that they represent somehow the "assembly language" of parallel programming.

The Pthreads API defines a rich set of C types, functions and constants for thread management (create, join, yield, kill, etc.), synchronization (locks, mutexes, barriers, condition variables, etc.) and thread attribute configuration (e.g. thread stack size). The functions and types are usually prefixed by "pthread_".

Using an explicit threading API such as Pthreads, the programmer can gain more fine-grain control over what is being executed by each thread and on which processor, therefore enabling for better interaction between the parallel application and the architecture. This functionality can be supported in OpenMP, as well, but not in a natural and incremental way, since OpenMP lacks appropriate extensions to explicitly control thread-processor mapping. On the other hand, in Pthreads there is no such thing as a runtime system to automatically schedule loop iterations to threads. As a result, the programmer has the responsibility to efficiently manage threads, by avoiding situations that could severely harm performance (e.g. excessive thread creation/destruction, thread oversubscription or undersubscription, etc.), and provide good scalability, load balancing, and adaptation under varying execution environments (different thread counts, architectures, operating systems, etc.).

Using Pthreads to cope with scalability problems

The OpenMP optimizations discussed in the previous section apply also to Pthreads, yet it is generally more tedious to implement them using the low-level API of Pthreads. The following table presents the correspondence between OpenMP and Pthreads with respect to features provided by each to support these optimizations.

		OpenMP	Pthreads
Load balancing		schedule(dynamic\|guided\|static) clause	requires manual orchestration
Synchronization	critical sections	critical clause	pthread_mutex_, pthread_spin_, pthread_rwlock_* functions
	atomic operations	atomic clause	compiler-specific atomic builtins e.g., __synch_* intrinsics in gcc and Intel compilers
	non-blocking synchronization	omp_test_lock function	_trylock and _timedlock functions
Per-thread variables		threadprivate clause	1) pthread_key_*, pthread_setspecific, pthread_getspecific functions 2) special compiler extensions, e.g. __thread storage class specifier in gcc and Intel compilers Note: the previous apply to variables with global scope, which are thread-shared by default in Pthreads. Variables local in a thread routine are always thread-local.
Task granularity		same rules apply in both cases

We note that, regarding synchronization, Pthreads provide three main lock versions: mutexes (OS-based blocking locks), spin-locks (user-level spin-based locks) and readers-writer locks. In this way, the programmer has the flexibility to choose the option that best fits the profile of a given critical section (w.r.t. thread count, contention, etc.). Readers-writer locks are particularly profitable when the synchronization pattern is such that the critical section is accessed mostly for reading and less for writing. These locks allow concurrent access to multiple threads that intend to read within the critical section, but restrict access to a single thread for writing. Most of the aforementioned lock implementations come with their non-blocking variants, by means of either try-lock functions (similar to omp_test_lock) or timed-lock functions. In the latter, a thread blocks until either it acquires the lock or the specified timeout expires.

In Pthreads the programmer needs to define explicitly the way that work is being distributed to threads, the data range that each thread processes and how data is being shared betweenthreads. Despite the increased flexibility this model offers, performance may still be limited due to inefficient interaction of threads with the parallel architecture. More specifically, a typical OS scheduler will usually assign a thread to a specific core according to some policy, trying to avoid thread bouncing between cores and establish a "1-1" mapping. Yet, the scheduler has no knowledge about each thread's working set or the sharing and communication patterns of the application, and, hence, cannot decide how threads should be optimally scheduled to mitigate performance problems due to inefficient utilization of memory hierarchy. Below we describe some situations where user-directed thread scheduling can offer performance advantages compared to the "application-agnostic" OS scheduler's decisions:

Case 1 - constructive cache sharing: Modern parallel architectures are characterized by complex cache organizations, with different cache levels being shared hierarchically among cores. Within a single chip, fast and small caches (e.g., L1) are usually private or shared by a small number of hardware threads, while large but slow ones (e.g. L3) are shared by all cores. When a team of threads in the parallel application work on the same or adjacent data, it might be beneficial to place them on the same chip, under the fastest level of cache that they can all share. In this way, a cache block will need to be fetched only by the first thread that requests it, and afterwards can be reused in the shared cache by the rest threads of the team. On the contrary, if threads were distributed to different chips on the platform, their common data should be transferred multiple times from main memory and placed in each chip's cache as a separate copy. At best, this would increase the pressure on the memory bus. At worst, if threads are further updating these data, there would be additional coherence traffic due to multiple bounces of the corresponding cache lines between chips.

Case 2 - destructive cache sharing: The opposite situation can also be true: when application threads work mostly on disjoint data, then having them under a common cache can be harmful. Threads compete for the same cache space, evicting each other’s data from cache. This reduces the effective cache capacity for each thread, increases the amount of data that need to be fetched multiple times as a result of inter-thread evictions and thus consumes bus bandwidth. On the contrary, if threads in such scenarios are scheduled under separate caches, cache thrashing and additional bus pressure can be minimized.

Case 3 - NUMA-aware thread placement: In the last years, processor industry has made a shift from shared-bus multiprocessors to Non-Uniform Memory Access (NUMA) designs, in an attempt to enhance scalability. Typical examples of such technologies are AMD's HyperTransport and Intel's Quick-Path Interconnect. Such systems incorporate multiple integrated memory controllers and buses, so that each chip (or cluster of chips) can have a separate path to main memory. Potentially, this can eliminate bus contention and provide better scalability, but at the cost of asymmetric memory access times. That is, a processor can use its local bus to access an adjacent memory node in a fast and uncontended way, but accessing a remote node incurs additional latency. As in the previous cases, an arbitrary placement of threads on cores may not take advantage of NUMA capabilities. For example, a pathologic situation is when threads have disjoint working sets but all of them are allocated on a single memory node. This can lead to bus saturation as all threads contend for the same bus. Instead, a much better approach would be to evenly distribute data sets across memory nodes, and then schedule each thread to the core that is local to the memory node containing its data set. Things can be even worse when data/thread placement happens to be so bad, that each thread's accesses are satisfied at their majority by remote nodes. This will significantly degrade performance, even if the parallel application is not so memory-intensive to saturate the bus.

In all aforementioned scenarios performance problems can be solved if the programmer explicitly controls the mapping of application threads to processors. Using processor affinity functions provided by some Pthreads implementations (e.g. Linux), the programmer can bind threads to specific processors on the platform, and thus impose certain mappings between each thread's data and the location of memory hierarchy it resides. Furthermore, using memory affinity functions provided by special NUMA libraries, he can control what data is being allocated on which memory node in a NUMA system, move data between nodes, etc.. All these optimizations help to mitigate bandwidth saturation problems and allow better cache utilization.

The Pthreads implementation in Linux provides the pthread_setaffinity_np and pthread_getaffinity_np functions to set and get, respectively, the processor affinity of a single thread. By processor affinity we mean the set of processors where a thread is allowed to run, and is defined by means of a bit-mask which is known as the CPUset. By setting a single bit in the CPUset of a thread, we can restrict it to run on the corresponding processor. Once the processor affinity of a thread is set, the OS scheduler will take it into account and will not move the thread to a different processor than this specified. The CPUsets can be manipulated using the CPU_* macros defined in "sched.h" header. The user can find information about the processors id's, topology, cache sharing, etc., using the /sys/devices/system/cpu/ pseudo file system.

In NUMA architectures, one can use functions provided by the libnuma library available in Linux, in order to specify a certain page-allocation policy for each thread or to explicitly control allocation of data. For example, using the numa_alloc_onnode function the programmer directs OS to dynamically allocate data on a specific memory node. Note that many of these functions serve as hints to the OS, in the sense that allocation is usually done in a best-effort manner; if data cannot be allocated to a preferred memory node, e.g. because it has run out of pages, it will be allocated somewhere else. Similarly to CPUsets, libnuma's functions work with node sets, which are bit-masks specifying the memory nodes where a function's actions should be applied. The user can learn about the memory nodes id's and the set of cores that are local to each node via the /sys/devices/system/node/ file system.

Generic Optimization Guidelines

In the following sections we give some generic guidelines on how to cope with the rest scalability issues mentioned in section 1. These guidelines can by applied regardless of the parallelization library used.

Avoiding false-sharing and misaligned memory accesses

To correct falsesharing, we have to make sure that the variables causing it are allocated in memory with enough space between them, so that they cannot reside on the same cache line. A common case where falsesharing is met in practice is between two variables declared consecutively, e.g. within a structure, as shown in the following code snippet on the left. Most compilers will allocate structure's variables in contiguous memory locations in order to save space, so it is quite likely that "var1" and "var2" will fall into the same cache line. To avoid this, we can use compiler directives to force variables to be placed at a different cache line, by aligning them at the cache line boundary (64 bytes, in most systems). The code on the right shows how this is achieved in the case of gcc compiler. Intel compiler provides the __declspec (align(N)) directive, N being the alignment boundary.

Default case (causes false-sharing)	Properly aligned variables to avoid false-sharing (gcc)
struct shared_data { int var1; //updated by thread1 int var2; //updated by thread2 }	struct shared_data { int var1 __attribute__ ((aligned (64))); int var1 __attribute__ ((aligned (64))); };

Another problematic case is when using an array of structures and the aggregate size of the structure's variables is not a multiple of the cache line size. For the same reason as in the previous example, it might be possible that a new array element does not start from a new cache line, and therefore accessing consecutive array elements may cause falsesharing. The following example presents such a scenario, with an array of structures representing threads' arguments. What is needed in that case is to pad the structure at the end, in order to make sure that the next element begins on a new cache line, and also align the array itself (its first element) at the cache line boundary.

Default case (causes false-sharing)

Properly aligned array elements to avoid false-sharing (gcc)

 struct thread_data { 
   //total size: 3*4 = 12 bytes
   int my_x; 
   int my_y;
   int my_sum; 
 }
 
 struct thread_data args[8];

 struct thread_data { 
   int my_x; 
   int my_y;
   int my_sum; 
   int pad[13]; //adds 13*4=52 dummy bytes to fill
                     //the remainder of the cache line
 }
 
 //align array to begin at the beginning of a cache line
 struct thread_data args[8] __attribute__ ((aligned (64)));

The same alignment directives can be used to cope with the less frequent, yet possible scenario of a single variable being allocated at a misaligned memory address, in such a way that it spans consecutive cache lines.

Other memory optimizations

Placing each shared variable in the program on a separate cache line would be the ideal solution to solve issues such as falsesharing. This approach is impractical, however, since it would unnecessarily increase the memory footprint of the program and would not allow to take advantage of spatial locality in shared cache configurations, e.g. when threads work on adjacent data. What we actually need is a more judicious classification of shared variables according to their usage pattern.

Global variables that are read-only (or rarely written) should be grouped together, in hope that the compiler will allocate them in a separate memory region. In this way, we decrease the likelihood that a variable that is mostly written lies somewhere between the read-only data, which could be another source of falsesharing. Furthermore, in NUMA platforms, when threads share a large amount of variables in a read-only fashion, it could make sense to have them replicated on each memory node. This would allow all threads to access the variables at the highest speed, without putting all pressure on a single memory bus.

Global read-write variables that are often used close to each other in time can also be grouped, e.g. by being declared consecutively or within a structure. This would reduce their memory footprint and potentially increase the spatial locality of the program, since multiple variables could be transferred with a few cache line fetches.

Global read-write variables that are mostly accessed by different threads should be placed on a different cache line. As explained in the previous section, this is achieved by aligning each variable to the cache line boundary, and then add padding to fill the remainder of the cache line.

Profilers

gprof

Gprof is a widely used profiling tool for Unix systems thatproduces an execution profile of C and Fortran programs. It can show the application call graph, which represents the calling relationships between functions in the program, and the percentage of total execution time spent in each function. The tool first instruments the application executable with information that will help build the call graph. It then runs the application and samples periodically the execution to track what code is currently being executed. This process is known as statistical sampling. Instrumentation can accurately determine the number of times a function is called, while sampling approximates the time spent within each function. Because of the bookkeeping overhead of instrumentation and the use of a special system call needed for sampling, it is likely that gprof will slow down the application execution.

To profile an application with gprof, the first step is to compile its source code with the -pg option of gcc. It is sometimes useful to turn on symbol information using the -g3 option of gcc, which is necessary if the user wishes the source code to be annotated with profiling information. Then, the user can run the instrumented application and a special output file is generated (by default gmon.out). Finally, the user can invoke gprof as follows to display the profiling results:

 gprof [--flat-profile --graph --annotated-source] ./program

The --flat-profile option prints the total amount of time spent and the number of calls to each function. The --graph option shows the call graph. The --annotated-source option prints profiling information next to the original source code. The main advantage of gprof is that it offers a "quick-and-dirty" way to identify hot-spots in an application, so that the programmer can focus on the most time-consuming parts of it as candidates for parallelization. Unfortunately, the tool can only profile the "user-level" part of an application, meaning that it cannot profile external shared libraries (e.g. libc) or system calls. A serious sideeffect of this is that multithreaded applications cannot be profiled. Actually, in such cases profiling information is gathered only for the main application thread, which is not helpful.

valgrind

Valgrind is a debugging and profiling suite for Linux executable. It consists of a core that performs program instrumentation and a set of tools for profiling and detecting memory management and threading bugs. Its architecture is modular, which means that new tools can be created and integrated seamlessly to the existing structure. In general, the profiling performed by Valgrind is solely based on instrumentation and thus is more time-consuming than hardware-assisted methods (e.g. OProfile). Furthermore, some Valgrind profiling tools simulate the program behavior using synthetic hardware models in software, failing to capture all parameters of real hardware. For example, Cachegrind, the Valgrind tool for cache profiling, allows user to specify some of the most common cache parameters (e.g. size, associativity, etc.) in order to simulate a cache hierarchy as close as possible to the native platform, yet it fails to simulate other architectural parameters such as the hardware prefetcher, which contribute significantly to the final performance. For these reasons, Valgrind should not be generally preferred for profiling purposes against hardware-assisted approaches, but could be used for debugging purposes instead.

The most common Valgrind tools are the following:

Memcheck: detects memory-management problems such as illegal accesses, use of uninitialized pointers, memory leaks, bad memory de-allocations, etc.
Cachegrind: performs cache profiling by simulating program execution under a user-specified cache hierarchy, and provides feedback about the source code lines that triggered the majority of cache misses.
Callgrind: is an extension to Cachegrind that additionally provides information about call graphs.
Massif: performs detailed heap profiling by taking regular snapshots of a program's heap, and shows heap usage over time, including information about which parts of the program are responsible for most memory allocations.

For multithreaded programs, Valgrind can seem helpful through a number of tools for detecting threading and synchronization errors. The most prominent of these is Helgrind, which can detect three classes of errors: misuses of the Pthreads API (e.g. unlocking a not-locked mutex, unlocking a mutex held by another thread, etc.), potential deadlocks arising from lock ordering problems (e.g. acquiring two locks in reverse order by two threads could lead to deadlock), and data races (accessing shared memory without appropriate synchronization). Another useful tool is DRD (Data-Race Detector). Generally, it detects the same classes of errors as Helgrind (Pthreads API misuses, deadlocks and data races), but it offers additionally the opportunity to detect situations of excessive lock contention. Using the "exclusive-threshold" command line option the tool reports whether any mutex or writer lock has been held longer than a specified duration. A similar option ("shared-threshold") is provided for reader locks.

The user is referred to Valgrind's home page (http://valgrind.org/) for a detailed description of the functionality and the command line options of each tool.

Oprofile

OProfile is a low overhead, system-wide profiler for Linux that uses performance monitoring hardware incorporated in most modern processors to spot locations and possible causes of performance bottlenecks. In contrast to gprof, OProfile collects system-wide profiling information rather than application specific, for all threads on all processors in the system, and regardless of whether they execute user-level, shared library or kernel code. Apart from counting CPU cycles to measure where execution time is spent, the tool can also measure low-level, architecture-specific micro-architectural events such as cache misses, branch mispredictions, memory references or different types of instructions.

OProfile utilizes the underlying processor's performance counters to measure such events. Like gprof, it uses statistical sampling to associate their occurrence with specific locations in the program. This means that it does not record for every event where it happened, but for every S event. The lower the value of S, the more accurate the results but the higher the overhead inserted. Generally, by setting S to some moderate value we can have low overhead and at the same time quite accurate results. A list of common processors supported by OProfile is given in the following table.

Processor	Model
Intel	Atom, Core, Core2, Core i7, Nehalem, Pentium 4 (HT/non-HT), Pentium M, Pentium II/III/Pro, Itanium, Itanium 2
AMD	Athlon, family 10, family 11h, AMD64 (Hammer)
IBM	PowerPC 970, PowerPC 970MP, Cell Broadband Engine, Power 4/5/5+/5++/6/7, iseries, pseries, s390, s390x

OProfile's operation is based on three major components: a kernel module that controls the event counting facilities of the processor(s), a daemon that collects samples and saves them to disk, and a set of tools that analyze the samples and show how they correlate to the executed application.

Configuration and use

The basic steps to use OProfile are the following:

Setup

The first thing we need is to load the kernel module by running the following command as root:

 opcontrol --init

To verify that the module was successfully loaded, we can the "lsmod" command and the module should appear as "oprofile". Furthermore, a "/dev/oprofile" entry should be available, with the file "/dev/oprofile/cpu_type" containing the specific model of the CPU being monitored. To get a list of events supported by the current CPU, we can simply execute:

  opcontrol --list-events

Each event usually provides a set of unit masks. These masks can be used to further qualify the event according to certain aspects of its occurrence. For example, in many Intel processors we can use unit masks for data cache events to select only these that concern a specific MESI state (i.e., modified, exclusive, shared, invalid) or a combination of states (by logically disjugating the unit masks). In general, it is advised to consult the processor manuals to get a detailed description of events and possible fallacies regarding their measurement and interpretation.

The next step is to configure OProfile to collect data for a specific event. This is done with the following command line:

 opcontrol --setup 
           [--event=:name:count:unitmask:kernel:user ]
           [--separate=cpu ]
           [--no-vmlinux | --vmlinux=kernel ]

The "event" argument specifies one or more events that will be sampled. Note that, depending on the event type and the processor model, it is possible that the specified events cannot be counted simultaneously. In this case, separate runs of the application are needed, one for each event. The field "name" refers to the event name, as reported with the "list-events" option. The field "count" specifies that the processor will be sampled every "count" occurrences of the event to track the current value of the program counter. For rare events a small value may be needed to get a sufficient number of samples. If this field is left empty, a default value will be used. "unitmask" specifies a unit mask for the event, otherwise a default one will be used (usually the logical or of all supported unit masks). The "kernel" and "user" fields specify whether OProfile should sample when the processor is running in kernel or user mode, respectively.

If no event is specified at all, then a default one will be used. This is usually a time-based event (e.g. processor cycles) configured at a sample rate of 100,000 occurrences. It can be reported using the following command:

 ophelp --get-default-event

The following table lists the time-based events for some common processor models.

Processor model	Event name
Intel Core, Core2, Core i7, Nehalem	CPU_CLK_UNHALTED
AMD	CPU_CLK_UNHALTED
Intel Pentium 4	GLOBAL_POWER_EVENTS
IBM PowerPC 970, Power 4/5/6	CYCLES

The argument "no-vmlinux" specifies that we are interested in getting profile information for user-level code only. If we want to attribute profiling samples to specific kernel functions, then the "vmlinux" option can be used, along with an uncompressed and unstripped image of the current kernel given as argument. The argument "separate=cpu" specifies that samples should be separated for each CPU in the system. This is particularly useful in order to profile parallel applications whose threads are mapped to different processors.

Profile application

To profile the application, we need to perform the following steps, one after the other:

 opcontrol --start
 ./program
 opcontrol --dump

With the "start" option, the profiling daemon is launched. To verify that it is actually running, we check that there is a process named "oprofiled". Once launched, the daemon starts collecting statistics for all programs running on all processors of the system. Therefore we can immediately start the application that we want to profile. After it finishes, we should collect the profiling data. By giving the "dump" option, we ensure that all the current profiling data is flushed to some special sample files before they can be analyzed.

Because OProfile automatically collects data after it has started on any program that runs, we must take care to minimize "noise" inserted by other applications that run concurrently with the application under test. For this reason, we should ensure that the intervals between "opcontrol --start" and application's launch, and between application's termination and "opcontrol --dump" are as short as possible. In any case, we should strive to use a dedicated and "empty" execution environment for the profiling data to be as accurate as possible.

The daemon can be stopped by executing the command:

 opcontrol --shutdown

To save the profiling data for future reference we can use the command:

 opcontrol --save=session_name

This allows us to keep data from multiple experiments separate. The "session_name" can be given later as argument to a reporting tool to display the profile for a specific session.

Results analysis and report

After dumping samples to disk we can use the reporting tools of OProfile to analyze and display profile results. We execute the following command:

 opreport --threshold=1 session:session_name

We get a report for the specified session, containing overall system profile regarding the events that were monitored, for all executables that ran in the profile period and had 1% or more of the samples ("threshold"). The report lists in descending order the number of events that occurred and in which executable. For example, for a simple multithreaded program ("mt_test") executed on a dual-core system, the output produced is like the following:

 CPU: Core 2, speed 800 MHz (estimated)
 Counted INST_RETIRED_ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 10000
 Samples on CPU 0
 Samples on CPU 1
                         cpu:0|                     cpu:1|
       samples     |      %|  samples|            %|
 ------------------------------------
  554786        77.9251    418087 69.4634  mt_test
   94429         13.2635     88184 14.6514   no-vmlinux
   15267           2.1444     12080  2.0070    libc-2.12.2.so
    7151            1.0044     16694  2.7736    chrome

For this session we counted the number of instructions retired on each processor, with a sample rate of 10000 instructions. We can see that for our multithreaded application there were counted 554K and 418K samples on cpu0 and cpu1, respectively, which corresponded to 77.9% and 69.4% of all instructions retired during the profile period. The rest instructions were executed in the Linux kernel ("no-vmlinux"), in glibc and the "chrome" application. Other applications that gathered less than 1% of total instructions are not shown. Note that the output of opreport contains sample count and not event count. To find the actual number of events, we have to multiply the sample count by the sample rate (10000 in the example).

It is also possible to restrict opreport to output results for a certain executable using the "image:" argument. Furthermore, using the "--symbols" option, we can see what functions called in the executable have the most samples attributed to them. The prerequisite to have access to such information is to turn on symbol information while compiling the application using the -g option of gcc. In the previous example, to produce output specific for "mt_test" application we could simply execute:

  opreport --threshold=1 session:session_name image:mt_test --symbols

and get the following results:

 CPU: Core 2, speed 800 MHz (estimated)
 Counted INST_RETIRED_ANY_P events (number of instructions retired) with a unit mask of 0x00 (No unit mask) count 10000
 Samples on CPU 0
 Samples on CPU 1
 samples  %        samples  %        symbol name
 554786   100.000  418087   100.000  thread_fn

This shows that 100% of the samples for the "mt_test" application in both processors where collected while executing "thread_fn" function, with the samples being actually distributed unevenly to processors.

It is possible to further refine the profiling results by attributing samples to specific source code lines or assembly instructions. This can be done using Oprifile commandopannotate as follows:

 opannotate --source session:session_name image:mt_test

The output we get is the original source code, annotated with the profiling results as follows:

 /* 
  * Total samples for file : "/home/user/mt_test.c"
  * 
  * 554786 100.0  418087 100.0
  */
 ...
                                              :void* thread_fn(void *args)
                                              :{ /* thread_fn total: 554786 100.0 418087 100.0 */
               ...                            :  ...
                                              :    if ( ta->id == 0 ) { // thread0 code
 222654 40.1333     0       0   :        for ( i = 0; i < iters0; i++ )
 332132 59.8667     0       0   :            s.val1++;
                                              :
                                              :    } else if ( ta->id == 1 ) { // thread1 code
       0        153575 36.7328   :        for ( i = 0; i < iters1; i++ )
       0        264512 63.2672   :            s.val2++;
                                              :    }

With threads being mapped on different processors, we see how samples are attributed to the lines corresponding to each thread's source code.

OProfile is a very useful tool but there are circumstances under which it can provide inaccurate results. First, the processor does not always trigger a sample interrupt on the exact instruction that caused the event. This is due to the out-of-order instruction execution performed by most modern processors, whichcould cause a sample to be attributed to an instruction near the one that actually triggered the event. Second, another possible cause for weird results are the aggressive optimizations done for performance reasons my most compilers. These optimizations transform the program to an intermediate form that often has little similarity with the original source code, in terms of line-by-line correspondence. Therefore, a general advice for the user is not to interpret the results of OProfile literally, but to treat them mostly as a hint on which regions of source code to look for hotspots.

Detecting scalability bottlenecks

Performance counters help us gain an insight into the way that applications interact with the underlying architecture. In the following sections we give generic guidelines on how to detect some of the common scalability problems mentioned in introduction using the performance counters statistics produced by OProfile.

Load imbalance and parallelization overhead

A measure of work performed by each thread is the total number of instructions retired by each one. By retired instructions, we mean those that were fetched and successfully executed by a processor until completion. Therefore, a clear indication of load imbalance in a parallel application is when the instruction counts are distributed among processors quite unevenly (this implies a 1-1 mapping between application threads and processors/cores). On the other hand, when we see roughly the same number of instructions in all processors we can be sure that the total computations were distributed equally to all threads.

Even in the latter case, however, things may not be as ideal as they seem. Specifically, if the total instructions of the parallel and serial versions of the application vary significantly, then we should suspect increased overhead due to parallelization. As we have already discussed, this overhead is usually due to frequent work distribution (e.g., fine-grain parallel tasks), excessive thread synchronization or other operations performed by the parallel runtime system. Of course, this argument is only valid when the parallel application has not undergone major structural changes with respect to its serial counterpart, as happens for example with most OpenMP programs. Otherwise, the extra parallel overhead should be probably attributed to the design of the parallel algorithm itself.

The following table lists the specific even name for retired instructions for some common processor models.

Processor model	Event name
Intel Core, Core2, Core i7, Nehalem	INST_RETIRED
AMD	RETIRED_INSNS (Athlon), RETIRED_INSTRUCTIONS
Intel Pentium 4	INSTR_RETIRED
IBM PowerPC 970, Power 4/5/6	PM_INST_CMPL_GRP3 (970, Power 4), PM_GRP_CMPL_GRP2 (Power 5), PM_INST_CMPL_GRP1 (Power 6,7)

Synchronization overhead

The basic indication of excessive synchronization is when threads spend a large amount of time either in critical sections code or in lock methods waiting for lock acquisition. In either case, the default time-based events mentioned in Section 3.2.2 can be used to find out whether these code locations have a relatively large number of samples attributed to them. Moreover, when threads wait a lot to acquire a lock (e.g. in high lock contention or large critical sections) and OS-based synchronization mechanisms are used, there should be long waiting periods in the kernel which should appear as high percentage of samples in kernel code ("no-vmlinux" indication as we saw in Section 3.2.2). In the same scenario but now using user-level spin locks, the waiting overhead should translate to excessive read-write sharing, whose detection is described in a next section.

Some architecture that support atomic instructions provide events for counting the number of cycles that a specific part of memory hierarchy (e.g. caches, bus) was locked as a result of an atomic instruction. Such events are for example the BUS_LOCK_CLOCKS in Intel Core/Core2 or the CACHE_LOCK_CYCLES in Intel Nehalem/Core i7. If the ratio of locked cycles to total cycles is rather increased, then this would probably indicate high contention while executing atomic operations. Other architectures such as AMD provide events like DCACHE_MISS_LOCKED_INSTRUCTIONS, which, as their name says, count the data cache misses triggered by locked memory operations. If those misses contribute significantly to the total data cache misses (DATA_CACHE_MISSES event), then we should suspect increased contention due to atomic operations. The rationale here is that uncontended atomic operations should not generally incur a lot of cache misses, since the corresponding protected variables are usually few. Thus the majority of "locked misses" should be attributed to continuous bouncing of the cache line holding the locked variable.

Memory bandwidth saturation

As we have already discussed, a memory-intensive parallel application that has saturated the bus cannot probably benefit from additional threads and will not scale. To identify bus saturation we need to measure the parallel application bandwidth usage and compare it against the maximum attainable bandwidth of our system. If they are nearly equal, then this indicates saturation. To get the achievable bandwidth, we could consult the hardware specifications of our system, or even better, use a benchmark that measures the sustainable bandwidth, such as STREAM from University of Virginia (http://www.cs.virginia.edu/stream/).

Bandwidth is the number of bytes transferred per second. Therefore, to find the bandwidth consumed by a specific part of a parallel application, we need to divide the total bytes transferred by all threads during that part, by its duration. Below we give for various architectures and processor models the events that need to be counted and the corresponding formulas to estimate the bandwidth.

Intel Core/Core2/Pentium 4:

The events that need to be counted are BUS_TRAN_MEM (with unit mask for "all agents" category in Core2), and CPU_CLK_UNHALTED with unit mask for "core cycles" category. The first one counts transactions initiated by any agent attached to the bus, and the second counts the number of cycles while the core was not in halt state. Assuming 64-byte transactions, the bandwidth can be estimated using the following formula:

 B = 64 * BUS_TRAN_MEM * cpu_frequency / CPU_CLK_UNHALTED

For Pentium 4, the formula changes to:

 B = 64 * IOQ_ALLOCATION * cpu_frequency / GLOBAL_POWER_EVENTS

AMD: For family 11h and Hammer, the required events are SYSTEM_READ_RESPONSES, QUADWORD_WRITE_TRANSFERS (family 11h perform 8-byte write transfers) and CPU_CLK_UNHALTED. The bandwidth is given by:

 B = ( 64*SYSTEM_READ_RESPONSES + 8*QUADWORD_WRITE_TRANSFERS ) *
cpu_frequency / CPU_CLK_UNHALTED

Family 10 processors perform 16-byte write transfers and the formula changes as follows:

 B = ( 64*NORTHBRIDGE_READ_RESPONSES + 16*OCTWORD_WRITE_TRANSFERS ) * cpu_frequency / CPU_CLK_UNHALTED

Intel Nehalem/Core i7:

The latest Intel Nehalem/Core i7 processor models feature an "uncore" part that is external to all cores contained in a socket and shared by all of them. It usually consists of an L3 cache and the memory controller. The events required for measuring bandwidth are triggered in the uncore part of the processor, but for reasons unknown to us, OProfile does not currently support them. In any case, the consumed bandwidth can be derived by the following formula:

 B = 64 * (UNC_QMC_NORMAL_READS.ANY + UNC_QMC_WRITES.FULL.ANY) * cpu_frequency / CPU_CLK_UNHALTED

True and false read-write sharing

In some Intel architectures, performance counters can help us spot locations in the code exhibiting intense read-write sharing on global or dynamically allocated shared data. Unfortunately, there is no means to automatically distinguish false from true sharing. Instead, the user should inspect the code parts triggering the related events, and deduce whether data are being read-write shared on purpose or because different global variables that happen to reside on the same cache line are being modified.

On processors based on Intel Core2 architecture, the events of interest are MEM_LOAD_RETIRED (retired load operations) with unit mask for "L2 cache line missed by retired loads" category, and EXT_SNOOP with unit masks for "all agents" and "HITM" snoops. The latter event represents the responses sent by the coherence protocol each time a requested cache line is found modified in another core's cache. If the total EXT_SNOOP events correspond to a measurable percentage of total retired instructions, then there is true or false sharing and we should inspect the code that triggered the MEM_LOAD_RETIRED events or near locations to detect the actual cause.

For Intel Nehalem/Core i7 processors, the event of interest is MEM_UNCORE_RETIRED with unit masks for "other_core_l2_hitm" and "remote_cache_local_home_hit" categories. This event counts the number of memory load instructions retired where the memory reference hit modified data either in another cache on the same socket or on a remote socket cache. Similarly, if there is high occurrence of such events we should check the locations in the code where they are concentrated.

Misaligned accesses

Finally, most x86-based architectures provide events to count and detect misaligned data references. For Intel Core, Core2, Core i7 and Nehalem architectures the specific event is MISALIGN_MEM_REF, and on AMD processors it is MISALIGNED_DATA_REFS. Again, the user should look where such events are concentrated in the code and apply the appropriate techniques to eliminate misalignment.

Compiler usage guideline for performance optimization

Free compilers

GCC, the GNU Compiler Collection

GCC stands for the GNU Compiler Collection and it includes front ends for compiling C, C++, Objective-C, Fortran, Java, Ada, and Go source code files. Of particular importance in the HPC regime are the C/C++ and Fortran languages as most applications designed for such platforms are written in one or more of these language specifications. Currently the latest stable release of GCC is 4.6.1.

As most compiler families do GCC provides four main types of options, which are information related options (i.e. warning options, debugging options, assembly listings, language options), optimization related options, preprocessor options and architecture specification options. Of particular importance when dealing with performance studies are the optimization and architecture dependent options. Using the first set of options the compiler attempts to improve the performance and/or code size at the expense of compilation time and possibly the ability to debug the program.

Optimization options

In general, a compiler performs optimizations based on the knowledge it has of the program. Compiling multiple files at once to a single output file mode allows the compiler to use information gained from all of the files when compiling each one of them. The most basic optimization flag for the GCC family of compilers is ‘-O’ or ‘-O1’. When using this flag the compiler will try to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time. The next level of optimization is turned on with the ‘-O2’ flag in which case almost all supported optimizations that do not involve a space-speed tradeoff are applied. The ‘-O3’ optimization option includes in addition to ‘-O2’ the options ‘-finline-functions’, ‘-funswitch-loops’, ‘-fpredictive-commoning’, ‘-fgcse-after-reload’, ‘-ftree-vectorize’ and ‘-fipa-cp-clone options’, which belong to the family of interprocedural analysis optimizations. Finally the most aggresive optimization option is the ‘-Ofast’ option thatincludes ‘-ffast-math’ and the Fortran-specific options ‘-fno-protect-parens’ and ‘-fstack-arrays’ on top of ‘-O3’. This option should be used with causion since numeric computations performed by the program will not comply with the exact implementation of IEEE or ISO rules/specifications.

Architecture dependent options

The architecture specification options are, in general, related with the target architecture the program will be executed on. Thus, if the type of architecture is known beforehand such options should be used as the resulting executable will include instructions that match the instruction sets of the target architecture.

If the architecture is not known beforehand the ‘-march=generic’ flag should be used at compile time. The resulting executable will be optimized for the most common IA32/AMD64/EM64T processors. If, in contrast, the CPUtype is known the appropriate value for march should be used instead. Thus, for example, to target the Intel Core i7 CPU the ‘-march=corei7’ flag should be used.

Vendor specific compilers

Intel compiler

Intel C++ Compiler (known as icc or icl) and Intel Fortran Compiler (known as IFORT) are part of the Intel Compiler Suite a set of compilers available for GNU/Linux, Mac OS and Microsoft Windows systems. Both compilers support IA-32 and Intel 64 processors and certain non-Intel ones as AMD processors.

Intel tunes its compilers to optimize for its hardware platforms to minimize stalls and to produce code that executes in the fewest number of cycles. The compilers support three separate high-level techniques for optimizing the compiled program: inter procedural optimization (IPO), profile-guided optimization (PGO), and high-level optimizations (HLO). It also supports tools and techniques for adding and maintaining parallelism to applications.

Flag	Purpose
-O0	No optimization
-O1	Optimize for size
-O2	Optimize for speed and enable some optimization.
-O3	Enable all optimizations as O2, and intensive loop optimizations
-xO	Enables SSE3, SSE2 and SSE instruction sets optimizations for non-Intel CPUs
-fast	Shorthand "-O3 -ipo -static -xHOST -no-prec-div". Note that the processor specific optimization flag (-xHOST) will optimize for the processor compiled on—it is the only flag of -fast, which may be overridden.
-prof_gen	Compile the program and instrument it for a profile generating run.
-prof_use	May only be used after running a program that was previously compiled using prof_gen. Uses profile information during each step of the compilation process.

IBM BlueGene compiler

In order to use effectively the IBM Blue Gene supercomputer, you have to use the IBM XL family of compilers that are optimized for that infrastructure and allow you to develop C, C++ and Fortran applications. With the Blue Gene P the following products are compatible:

IBM XL C/C++ Advanced Edition for Blue Gene/P, V9.0
IBM XL Fortran Advanced Edition for Blue Gene/P, V11.1

The bg-prefixed and bg*_r commands on the SLES 10 platform are for cross-compiling applications for use on the Blue Gene P computer. The compiler invocations that are not prefixed with blrts_ or bg create executable targeted for the SLES 10 platform, and are provided only for testing and debugging purposes Compilations most commonly occur on the Front End Node. The resulting program can run on the Blue Gene/P™ system without manually copying the executable to the Service Node. The bg* and bg*_rcompiler invocation commands set certain default compiler options to maximize the use of the Blue Gene/P architecture. There are special options that can be used in order to compile optimized code:

Flag	Purpose
-qarch=[450 \| 450d]	Specifies which instructions the compiler can generate. Suboptions include: 450- Generates code for the single floating-point unit (FPU) only. This option avoids SIMD instructions being generated. 450d- Generates parallel instructions for the 450d Double Hummer dual FPU. This is the default. Note that if you encounter problems with code generation, try resetting this option to -qarch=450.
-qnoautoconfig	Prevents optimization levels -O4 and -O5 from resetting the -qarch setting to auto, thereby preserving the -qarch setting for the target architecture
-qstaticlink (C/C++)	Although shared libraries are supported on the Blue Gene P platform, note that -qstaticlink and -qstaticlink=libgcc are enabled by default. To use shared libraries, you must set the -qnostaticlink or -qnostaticlink=libgcc option.
-qtune=450	Optimizes code for the 450 family of processors. This is the default for -qarch=450, -qarch=450d , or when no -qarch or -qtune settings are specified and the bg prefixed commands are used.

AMD compiler - x86 Open64 Compiler Suite

The x86 Open64 compiler system offers a high level of advanced optimizations, multi-threading, and processor support that includes global optimization, vectorization, inter-procedural analysis, feedback directed optimizations, loop transformations, and code generation which helps extract the optimal performance from each x86 processor core.Webpage: http://developer.amd.com/tools/open64/Pages/default.aspx.

These optimizations can be modified with compiler flags:

FDO Options (Control Feedback Directed Optimizations)
Global Options (Control Global Options)
General Optimizations (Control General Optimizations)
IPO Options (Inter-procedural Optimizations Options)
LNO Options (Options that Control Loop Nest Optimizations)

Flag	Purpose
-O2	Default optimization level; equivalent to “-O”. Performs a set of extensive global
-O3	“-O2” plus many more aggressive optimizations; in particular, “-O3” turns on LNO.
-Ofast	Expands into “-O3”, “-OPT:Ofast”, “-ipa”, and a few other aggressive optimizations.
-LNO	-LNO Enables loop nest optimizations, including vectorization and generation of prefetch instructions.
-ipa	Performs inter-procedural analysis and optimizations. Optimizes across function and file boundaries.
-fb-create, -fb-opt	Turns on profile-guided (feedback-directed) optimizations. Requires separate compilations.
-apo	Enables automatic parallelization of loops.
-mso	Performs multi-core processor scalability optimizations. (Open64 Release 4.2.3 or later.)
-march	Generates instructions for specific processor type. Use “-march=Barcelona” when targeting Quad-core AMD OpteronTM processors or later.
-np	Turns on support for OpenMP (version 2.5).
-HP	Specifies the number of huge (2 MB) pages used for the bss, data, text, and heap segments. Note: this feature may not be available on all operating systems.

Optimization techniques for scalability