MVAPICH

Section contributed by IPB

InfiniBand, 10GigE/iWARP and RDMA over Converged Ethernet (RoCE) are emerging high-performance networking technologies which deliver low latency and high bandwidth to HPC users and enjoy widespread acceptance due to their open standards. MVAPICH is an open-source MPI implementation developed at the Network-Based Computing Laboratory (NBCL) of the Ohio State University. It exploits the novel features and mechanisms of these networking technologies. Currently, there are two versions of this MPI library: MVAPICH, with MPI-1 semantics, and MVAPICH2, with MPI-2 semantics.

These MPI implementations are used by many institutions worldwide (national laboratories, research institutes, universities, and industry), and several InfiniBand systems using MVAPICH/MVAPICH2 are present in the TOP500 ranking. Many InfiniBand, 10GigE/iWARP and RoCE vendors, server vendors, systems integrators and Linux distributors have incorporated MVAPICH/MVAPICH2 into their software stacks. MVAPICH and MVAPICH2 are also available with the OpenFabrics Enterprise Distribution (OFED) stack (www.openfabrics.org) and through the public anonymous MVAPICH SVN repository. Both MVAPICH and MVAPICH2 distributions are available under BSD licensing.

At the Institute of Physics Belgrade, the MVAPICH MPI implementations are used within the PARADOX cluster, as it provides an InfiniBand interconnect between its nodes.

MVAPICH features

MVAPICH is an implementation of the MPI-1 standard, based on MPICH and MVICH (MPI for Virtual Interface Architecture). The latest release is MVAPICH 1.2 (which includes MPICH 1.2.7). MVAPICH 1.2 supports the following underlying transport interfaces:

  • High-Performance support with scalability for OpenFabrics/Gen2 interface to work with InfiniBand and other RDMA interconnects.
  • High-Performance support with scalability for OpenFabrics/Gen2-RDMAoE interface.
  • High-Performance support with scalability (for clusters with multi-thousand cores) for OpenFabrics/Gen2-Hybrid interface to work with InfiniBand.
  • Shared-memory-only channel, which is useful for running MPI jobs on multi-processor systems without using any high-performance network: for example, multi-core servers, desktops, and laptops, as well as clusters with serial nodes.
  • The InfiniPath interface for InfiniPath adapters.
  • The standard TCP/IP interface (provided by MPICH) to work with a range of networks. This interface can be used with IPoIB support of InfiniBand also.

In addition, MVAPICH 1.2 supports many features for high performance, scalability, portability and fault tolerance. It also supports a wide range of platforms (architectures, operating systems, compilers and InfiniBand adapters).

MVAPICH2 features

This is an MPI-2 implementation (conforming to the MPI 2.2 standard) which includes all MPI-1 features. It is based on MPICH2 and MVICH. The latest release is MVAPICH2 1.8 (which includes MPICH2 1.4.1p1). The current release supports ten underlying transport interfaces, the most important being:

  • OFA-IB-CH3: This interface supports all InfiniBand compliant devices based on the OpenFabrics Gen2 layer. This interface has the most features and is most widely used.
  • OFA-IB-Nemesis: This interface supports all InfiniBand compliant devices based on the OpenFabrics libibverbs layer with the emerging Nemesis channel of the MPICH2 stack.
  • OFA-RoCE-CH3: This interface supports the emerging RoCE (RDMA over Converged Ethernet) interface for Mellanox ConnectX-EN adapters with 10GigE switches.
  • Shared-Memory-CH3: This interface provides native shared memory support on multi-core platforms where communication is required only within a node, such as SMP-only systems, laptops, etc.
  • TCP/IP-CH3: The standard TCP/IP interface (provided by MPICH2) to work with a range of network adapters supporting TCP/IP interface. This interface can be used with IPoIB (TCP/IP over InfiniBand network) support of InfiniBand also.
  • Shared-Memory-Nemesis: This interface provides native shared memory support on multi-core platforms where communication is required only within a node, such as SMP-only systems, laptops, etc.

MVAPICH2 supports a wide range of platforms (architectures, operating systems, compilers, Mellanox and QLogic InfiniBand adapters, iWARP adapters, RoCE adapters, and network adapters supporting the uDAPL interface). It also provides many features, including high-performance communication support for NVIDIA GPUs with IPC, collective and non-contiguous datatype support, a shared memory interface, fast process-level fault tolerance with checkpoint-restart, etc.

MVAPICH and MVAPICH2 usage

Compiling MPI applications

MVAPICH and MVAPICH2 provide a variety of MPI compiler wrappers to support applications written in different programming languages. mpicc, mpiCC, mpif77, or mpif90 can be used to compile applications; the correct wrapper should be selected depending on the programming language of the MPI application. After proper installation and configuration of MVAPICH/MVAPICH2, these compiler wrappers are available in the bin subdirectory of the MVAPICH installation directory.

 # Compiling MPI application in C 
 $ mpicc -o mpi_app.x mpi_app.c 
 
 # Compiling MPI application in C++ 
 $ mpiCC -o mpi_app.x mpi_app.cc 
 
 # Compiling MPI application in Fortran 77 
 $ mpif77 -o mpi_app.x mpi_app.f 
 
 # Compiling MPI application in Fortran 90 
 $ mpif90 -o mpi_app.x mpi_app.f90
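
For reference, a minimal MPI program in C that could be compiled with the mpicc wrapper above might look as follows; the file name mpi_app.c is only illustrative:

 #include <stdio.h>
 #include <mpi.h>
 
 int main(int argc, char *argv[])
 {
     int rank, size;
 
     MPI_Init(&argc, &argv);               /* initialize the MPI environment */
     MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* rank of the calling process */
     MPI_Comm_size(MPI_COMM_WORLD, &size); /* total number of processes */
     printf("Hello from rank %d of %d\n", rank, size);
     MPI_Finalize();                       /* shut down the MPI environment */
     return 0;
 }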

Run MPI applications

There are many different ways to run MVAPICH MPI applications; here we present the most common ones:

  • Run MPI applications using mpirun_rsh

The MVAPICH developers suggest that users use mpirun_rsh for job start-up for all interfaces, both for MVAPICH and MVAPICH2. The mpirun_rsh scheme provides fast and scalable job start-up, scaling to clusters with several thousand computing nodes. Either ssh or rsh should be enabled between the front nodes and the computing nodes, and a user should be able to log in to the remote nodes without a password or other authentication challenge. All host names should resolve to the same IP address on all machines. Usage example:

 $ mpirun_rsh -np 4 -hostfile hosts ./cpi

The mpirun_rsh hostfile format allows users to specify hostnames, one per line, optionally with a multiplier and a host channel adapter (HCA) specification. The multiplier allows users to specify a blocked distribution of MPI ranks using one line per hostname, while the HCA specification allows users to force an MPI rank to use a particular HCA, as in the sample hostfile and launch example below.

 #Sample hostfile for mpirun_rsh
 host1 # rank 0 will be placed on host1
 host2:2 # rank 1 and 2 will be placed on host2
 host3:hca1 # rank 3 will be on host3 and will use hca1
 host4:4:hca2 # ranks 4 through 7 will be on host4 and will use hca2
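
Assuming the sample hostfile is saved as hosts, the eight ranks described above could be launched as shown below; mpi_app.x stands for the executable built in the compilation examples:

 # Launch 8 MPI ranks using the sample hostfile
 $ mpirun_rsh -np 8 -hostfile hosts ./mpi_app.x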
  • Run MPI applications using SLURM

MPI applications compiled with both MVAPICH implementations can be launched using SLURM, an open-source resource manager designed by Lawrence Livermore National Laboratory (https://computing.llnl.gov/linux/slurm/):

 #MVAPICH
 $ srun -n 2 --mpi=mvapich ./a.out

or

 #MVAPICH2
 $ srun -n 2 ./a.out 

The use of SLURM enables many desirable features, such as explicit CPU and memory binding.
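
As an illustration, a minimal SLURM batch script for an MVAPICH2 application might look like the sketch below; the resource values are placeholders, and the exact spelling of the binding option (--cpu_bind here) depends on the installed SLURM version:

 #!/bin/bash
 #SBATCH --job-name=mvapich2_job    # job name
 #SBATCH --ntasks=16                # number of MPI ranks
 #SBATCH --time=00:30:00            # wall-clock time limit
 
 # Launch the MPI application; --cpu_bind=cores binds each rank to a core
 srun --cpu_bind=cores ./a.out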

  • Run MPI applications on PBS/Torque Clusters

On clusters managed by PBS/Torque, MVAPICH2 MPI applications can be launched using the mpiexec launcher developed by the Ohio Supercomputer Center (OSC) (http://www.osc.edu/~djohnson/mpiexec/). An example of using this launcher:

 $ /path/to/osc/mpiexec -np 4 ./a.out
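
For batch jobs, a sketch of a PBS/Torque submission script using this launcher could look as follows; the resource requests are placeholders, and the launcher path should be adjusted to the local installation:

 #!/bin/bash
 #PBS -N mvapich2_job          # job name
 #PBS -l nodes=2:ppn=4         # 2 nodes, 4 processes per node
 #PBS -l walltime=00:30:00     # wall-clock time limit
 
 cd $PBS_O_WORKDIR             # run from the directory of submission
 # The OSC mpiexec obtains the node allocation from PBS/Torque
 /path/to/osc/mpiexec -np 8 ./a.out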
  • Run MPI applications using TotalView Debugger support

Both MVAPICH and MVAPICH2 provide support for the TotalView debugger. The following commands give an example of how to build and run an MPI application with TotalView support:

 #Compile MPI application with debug symbols
 $ mpicc -g -o prog prog.c
 #Define the correct path to TotalView as the TOTALVIEW variable
 $ export TOTALVIEW=<path to TotalView>
 #Run MPI application
 $ mpirun_rsh -tv -np 2 n0 n1 prog
  • Run using a profiling library

All MPI-2 functions of MVAPICH2 support the MPI profiling interface. This allows MVAPICH2 to be used with a variety of profiling libraries for MPI applications. To use Scalasca (www.scalasca.org), a user should configure Scalasca by supplying the --mpi=mpich2 option:

 $ ./configure --mpi=mpich2

Once the installation is done, a user will be able to use Scalasca with MVAPICH2.
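
A typical workflow with the Scalasca 1.x command-line syntax might then be along the following lines; whether the launcher (mpirun_rsh here) is recognized depends on the Scalasca version, and the measurement directory name is shown as a placeholder:

 # Build the MPI application with Scalasca instrumentation
 $ scalasca -instrument mpicc -o mpi_app.x mpi_app.c
 # Run the instrumented application and collect a measurement
 $ scalasca -analyze mpirun_rsh -np 4 -hostfile hosts ./mpi_app.x
 # Examine the collected measurement
 $ scalasca -examine <measurement directory>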

  • Run Memory Intensive Applications on Multi-core Systems with MVAPICH

Process-to-CPU mapping may affect application performance on multi-core systems, especially for memory-intensive applications. If the number of processes is smaller than the number of CPU cores, it is preferable to distribute the processes over different chips to avoid memory contention, because CPU cores on the same chip usually share the memory controller. To use MVAPICH CPU mapping, a user first has to enable CPU affinity (VIADEV_USE_AFFINITY) and then use the run-time environment variable VIADEV_CPU_MAPPING to specify the CPU/core mapping. On a system with two quad-core chips, where cores 0-3 are on one chip and cores 4-7 are on the other, if a user needs to run an application with 2 processes, the following mapping will give the best performance:

 $ mpirun_rsh -np 2 n0 n0 VIADEV_CPU_MAPPING=0:4 ./a.out

In this case process 0 will be mapped to core 0 and process 1 will be mapped to core 4.
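
If CPU affinity is not already enabled in the local MVAPICH configuration, both variables can be passed on the mpirun_rsh command line, for example:

 # Enable CPU affinity and pin process 0 to core 0 and process 1 to core 4
 $ mpirun_rsh -np 2 n0 n0 VIADEV_USE_AFFINITY=1 VIADEV_CPU_MAPPING=0:4 ./a.out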

  • Running on Clusters with GPU Accelerators

MVAPICH2 works on clusters with GPU accelerators. By default, MVAPICH2 works out-of-the-box on GPU clusters without any special configuration. The "GPU-Direct" technology from NVIDIA and Mellanox enables the use of RDMA for memory buffers used by GPU devices. Once GPU-Direct is installed, no further configuration of MVAPICH2 is required: RDMA support within MVAPICH2 will automatically be enabled. Even if the cluster does not have GPU-Direct, the user does not need to modify the application to copy memory buffers into communication buffers; MVAPICH2 will perform a pipelined memory copy internally, using the R3 protocol.
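
If the application itself passes GPU device memory buffers directly to MPI calls, MVAPICH2 1.8 additionally provides the MV2_USE_CUDA run-time parameter to enable this device-buffer path; whether it is needed on a given installation should be checked against the MVAPICH2 user guide, and the application name below is a placeholder:

 # Enable communication directly from/to GPU device memory (MVAPICH2 1.8)
 $ mpirun_rsh -np 2 n0 n1 MV2_USE_CUDA=1 ./gpu_mpi_app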

  • Running MVAPICH2 in Multi-threaded Environments

MVAPICH2 binds processes to processor cores for optimal performance. However, in multi-threaded environments it might be desirable to have each thread compute on a separate processor core; this is especially true for OpenMP+MPI programs. In MVAPICH2, processor core mapping is turned off in the following way. The application then runs at the MPI_THREAD_MULTIPLE threading level if the user requested it in MPI_Init_thread; otherwise, it runs at the MPI_THREAD_SINGLE threading level.

 $ mpirun_rsh -np 2 n0 n1 MV2_ENABLE_AFFINITY=0 ./openmp+mpi_app
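
A minimal hybrid OpenMP+MPI program requesting the MPI_THREAD_MULTIPLE level could look like the sketch below; the OpenMP compiler flag (-fopenmp) mentioned in the comment is the GCC spelling and may differ for other compilers:

 /* Compile, for example, with: mpicc -fopenmp -o openmp+mpi_app openmp+mpi_app.c */
 #include <stdio.h>
 #include <omp.h>
 #include <mpi.h>
 
 int main(int argc, char *argv[])
 {
     int rank, provided;
 
     /* Request full multi-threading support from the MPI library */
     MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 
     /* Each OpenMP thread reports its identity within the MPI rank */
     #pragma omp parallel
     printf("Rank %d, thread %d of %d\n",
            rank, omp_get_thread_num(), omp_get_num_threads());
 
     MPI_Finalize();
     return 0;
 }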