Parallel code

From HP-SEE Wiki


Message Passing Model

Section contributed by IFIN-HH

In most distributed memory systems, parallelization is achieved by using one of the various implementations of the widely adopted Message Passing Interface (MPI) standard [mpis]. MPI is a set of specifications for writing message-passing programs, that is, parallel programs in which processes communicate through explicit messages. Two versions of the MPI standard are currently in use, MPI-1 and MPI-2, and various library implementations of these are tuned for specific target platforms.
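As an illustration of the model, the following minimal C sketch (our own example, not taken from an HP-SEE application) passes a single integer from rank 0 to rank 1 with the blocking MPI_Send/MPI_Recv pair:

 /* Minimal MPI-1 point-to-point example: rank 0 sends one integer
    to rank 1, which receives and prints it. */
 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char **argv)
 {
     int rank, value = 0;
     MPI_Status status;
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
     if (rank == 0) {
         value = 42;
         MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
     } else if (rank == 1) {
         MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
         printf("rank 1 received %d\n", value);
     }
 
     MPI_Finalize();
     return 0;
 }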

The standard specifications, and the implementations used in the framework of the HP-SEE ecosystem, are briefly described in HP-SEE deliverable D8.1 [D8.1].

Here a more detailed discussion is presented of the message passing issues relevant to the migration, adaptation and optimization of parallel applications on the HPC infrastructure, together with examples drawn from the developers' experience.

The discussion is restricted to the libraries that are effectively in use in HP-SEE: implementations of MPI-1 (MPICH1 and its derivatives MPICH-MX and MVAPICH), Open MPI (which implements MPI-2), and MPICH2 (together with its derivatives MVAPICH2, MPIX, and Intel MPI), which implements both MPI-1 and MPI-2.

REFERENCES

[mpis] MPI standard, http://www.mcs.anl.gov/research/projects/mpi/

[D8.1] HP-SEE deliverable D8.1, Software Scalability Analysis and Interoperability Issues Assessment

MPICH implementations

Introduction contributed by IFIN-HH

Proposed as a freely available and portable implementation of the MPI standard, MPICH has evolved along with it: from the MPICH1 implementation [mpc1], which fulfils the MPI-1 specifications and partially supports some MPI-2 features such as parallel I/O, to MPICH2 [mpc2], which is fully compatible with the MPI-2 version of the standard. Although development of MPICH1 has been frozen at version 1.2.7p1 since 2005, with the intention that it be replaced by MPICH2, it remains the most widely used MPI implementation worldwide.

REFERENCES

[mpc1] MPICH1, http://www.mcs.anl.gov/research/projects/mpi/mpich1-old/

[mpc2] MPICH2, http://www.mcs.anl.gov/research/projects/mpich2/

Open MPI

Introduction

Implementations of Open MPI are used in the HP-SEE infrastructure by ...

NEURON ParallelContext

Section contributed by IMBB-FORTH (CMSLTM application)

The preferred MPI environment is Open MPI (openmpi_gcc-1.4.3). NEURON was compiled with parallel support (MPI) using the gcc compiler:

 ./configure --without-iv --with-paranrn --prefix=/home/gkastel/src/nrn-7.1

We used NEURON's ParallelContext to distribute the simulations of the individual neurons evenly across the nodes.


Porting between MPI implementations

Section contributed by IICT-BAS

The MPI standard has been developed with the aim of simplifying the development of cross-platform applications that use the distributed memory model, as opposed to SMP. The first version of the MPI standard is well supported by the various MPI implementations, thus ensuring that a program tested with one such implementation will work correctly with another. Within the MPI specification there is some freedom of design choices, which are well documented and should serve as a warning to the user not to rely on specific implementation details. These considerations mostly affect the so-called asynchronous or non-blocking operations. For example, MPI_Isend is the non-blocking version of MPI_Send. When a thread uses this function to send data, the function returns immediately, usually before the data has actually been sent. This means that the user must not modify the buffer passed as an argument until it is certain that the send has completed, which can be ensured by invoking MPI_Wait. Although the use of non-blocking operations adds complexity to the program, it also enables overlap between communication and computation, thus increasing parallel efficiency.
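A minimal C sketch of this pattern (illustrative only; the buffer size, tag and partner ranks are arbitrary choices of ours) is:

 /* Sketch: non-blocking send with MPI_Isend/MPI_Wait.
    The send buffer must not be modified between MPI_Isend and
    the matching MPI_Wait. */
 #include <mpi.h>
 #include <stdio.h>
 
 #define N 1024
 
 int main(int argc, char **argv)
 {
     int rank, i, buf[N];
     MPI_Request req;
     MPI_Status  status;
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
     if (rank == 0) {
         for (i = 0; i < N; i++) buf[i] = i;
         MPI_Isend(buf, N, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
         /* ... computation that does NOT touch buf may overlap here ... */
         MPI_Wait(&req, &status);   /* only now is it safe to reuse buf */
     } else if (rank == 1) {
         MPI_Recv(buf, N, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
         printf("rank 1 received %d integers\n", N);
     }
 
     MPI_Finalize();
     return 0;
 }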

Version 2 of the MPI standard added new advanced features, such as parallel I/O, dynamic process management and remote memory operations. The implementation of these features varies among MPI libraries and may lead to portability problems.
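As one illustration of an MPI-2 feature, the following C sketch (our own example; the file name and block size are arbitrary) uses the parallel I/O interface to let each rank write its own block of integers into a shared file:

 /* Sketch of MPI-2 parallel I/O: each rank writes its block at a
    rank-dependent offset of a shared file. */
 #include <mpi.h>
 
 int main(int argc, char **argv)
 {
     int rank, buf[4] = {0, 1, 2, 3};
     MPI_File fh;
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
     MPI_File_open(MPI_COMM_WORLD, "out.dat",
                   MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
     MPI_File_write_at(fh, (MPI_Offset)rank * sizeof(buf),
                       buf, 4, MPI_INT, MPI_STATUS_IGNORE);
     MPI_File_close(&fh);
 
     MPI_Finalize();
     return 0;
 }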

Some MPI implementations offer a way to discover the network topology, which may be extremely useful for achieving good parallel efficiency, especially when running on heterogeneous resources, but the use of such information may also lead to portability problems for the application.
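A portable way to recover at least node-level placement information, without relying on implementation-specific topology interfaces, is to compare the names returned by MPI_Get_processor_name; the short C sketch below (our own illustration) simply prints the name reported by each rank:

 /* Portable sketch: each rank reports the node it runs on; grouping
    ranks by the reported name is an implementation-independent way
    to recover node-level placement information. */
 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char **argv)
 {
     int rank, len;
     char name[MPI_MAX_PROCESSOR_NAME];
 
     MPI_Init(&argc, &argv);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
     MPI_Get_processor_name(name, &len);
     printf("rank %d runs on %s\n", rank, name);
 
     MPI_Finalize();
     return 0;
 }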

Shared Memory Model

OpenMP

Pthreads

Hybrid Programming

Section contributed by IPB

Since general purpose processors are no longer getting much faster, HPC hardware has moved towards massively parallel computers: many compute nodes coupled by high-speed interconnects, where each compute node has several shared-memory sockets and each socket holds a multi-core processor. Furthermore, today's hybrid architectures couple compute nodes with highly specialized computing elements, such as Cell processors or general purpose GPUs. On the other hand, there is a logical programming model that maps onto this hybrid hardware: OpenMP, which corresponds to one multi-core processor, and MPI, which corresponds to the massively parallel machine as a whole.

The combination of MPI and OpenMP therefore comes naturally and matches these hardware trends. Hybrid MPI/OpenMP programming is especially suitable for applications that exhibit two levels of parallelism: coarse-grained (MPI) and fine-grained (OpenMP). In addition, some applications show an unbalanced workload at the MPI level, which can be mitigated with OpenMP by assigning a different number of threads to each MPI process.

Introducing OpenMP into an existing MPI code also brings in OpenMP's drawbacks, such as limitations in the control of work distribution and synchronization, the overhead introduced by thread creation, and the dependence on compiler quality and runtime support. For applications that exhibit only one level of parallelism there may be no benefit in the hybrid approach.

When development starts from a sequential code, the hybrid programming approach has to include a decomposition of the problem for MPI parallelization, followed by the insertion of OpenMP directives. There are two general hybrid programming models: non-overlapping and overlapping communication and computation. In the first case MPI is called by the master thread outside the parallel regions, while in the overlapping model some of the threads communicate while the rest execute other parts of the application. A sketch of the non-overlapping model is given below.
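The following C sketch illustrates the non-overlapping model under simple assumptions (an illustrative partial-sum computation; all names are ours): the OpenMP threads perform the local work, and MPI is called only outside the parallel region.

 /* Sketch of the non-overlapping hybrid model: OpenMP threads do the
    local computation, then MPI is called outside the parallel region. */
 #include <mpi.h>
 #include <omp.h>
 
 int main(int argc, char **argv)
 {
     const int n = 1000000;
     double local = 0.0, global = 0.0;
     int i;
 
     MPI_Init(&argc, &argv);
 
     /* fine-grained parallelism: OpenMP loop inside each MPI process */
     #pragma omp parallel for reduction(+:local)
     for (i = 0; i < n; i++)
         local += 1.0 / (double)n;
 
     /* coarse-grained parallelism: MPI reduction across processes,
        called by a single thread outside the parallel region */
     MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
 
     MPI_Finalize();
     return 0;
 }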

In many cases, however, introducing threads is not straightforward, and can even lead to a degradation of performance. The MPI standard and its implementations define four levels of thread safety: MPI_THREAD_SINGLE, where only one thread of execution exists; MPI_THREAD_FUNNELED, where a process may be multithreaded but only the thread that initialized MPI makes MPI calls; MPI_THREAD_SERIALIZED, where multiple threads may make MPI calls, but not simultaneously; and MPI_THREAD_MULTIPLE, where multiple threads may call MPI at any time. Using MPI_THREAD_FUNNELED is the easiest choice, but can be far from optimal with a large number of threads per MPI process. On the other hand, performance issues plague implementations of the more natural MPI_THREAD_MULTIPLE mode and, while its use can be expected to benefit application performance, in practice it also requires significant work on refactoring the application's structure and its data distribution. The programming model becomes even more complex if GPGPUs, which are now typically integrated into the compute nodes of newly designed and procured HPC systems, are added to the existing hierarchy. However, we will not consider this here, and instead focus only on the hybrid MPI/OpenMP programming model.
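A minimal sketch of requesting and checking a thread support level (here MPI_THREAD_FUNNELED, matching the non-overlapping model above; the error handling is our own choice) could look as follows:

 /* Sketch: requesting a thread support level with MPI_Init_thread
    and checking what the library actually provides. */
 #include <mpi.h>
 #include <stdio.h>
 
 int main(int argc, char **argv)
 {
     int provided;
 
     /* ask for FUNNELED: only the thread that called MPI_Init_thread
        (the OpenMP master) will make MPI calls */
     MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
 
     if (provided < MPI_THREAD_FUNNELED) {
         fprintf(stderr, "insufficient MPI thread support\n");
         MPI_Abort(MPI_COMM_WORLD, 1);
     }
 
     /* ... hybrid MPI/OpenMP work as in the previous sketch ... */
 
     MPI_Finalize();
     return 0;
 }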

CUDA
