New programming languages and models

CAPS HMPP

CAPS HMPP is a complete software solution for the parallelization of legacy codes, targeting the efficient usage of multicore and GPGPU resources. Much like OpenMP, HMPP takes the form of a set of directives that can be used to speed up the execution of a code while preserving its portability across infrastructures. HMPP can also be applied to codes that are already parallelized (with either MPI or OpenMP).

HMPP directives added to the application source code do not change the semantics of the original code. They address the remote execution of functions or regions of code on GPUs and many-core accelerators, as well as the transfer of data to and from the target device memory.

When implementing HMPP directives on top of an existing code, there are three steps to consider:

  * declaration of the codelets (kernels) that contain a critical amount of computation
  * data management to and from the target device (i.e. GPGPU) memory
  * optimization of kernel performance and data synchronization

In practice, HMPP handles and compiles separately the application to be executed on the native host and the GPU-accelerated codelet functions, which are attached to the native application as software plugins. In effect, this means that the resulting application can be executed on a host with or without an accelerator resource (e.g. a GPGPU): the codelets are executed on such hardware only when it is present on the host.

Note that the codelets are translated into the NVIDIA CUDA and OpenCL languages by the HMPP backend and are thus compiled with the existing tools for these extensions in the software stack.
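
The following minimal C sketch illustrates the general shape of these steps; the scale_vector function, the problem size and the CUDA target are illustrative assumptions and not an excerpt from the HMPP documentation:

  #include <stdio.h>

  /* Step 1: declare a codelet. The HMPP backend translates this function
     into a CUDA (or OpenCL) kernel; the label "scalevec" ties the
     declaration to its callsite below. */
  #pragma hmpp scalevec codelet, target=CUDA, args[v].io=inout
  void scale_vector(int n, float alpha, float v[n])
  {
      for (int i = 0; i < n; i++)
          v[i] *= alpha;
  }

  int main(void)
  {
      enum { N = 1 << 20 };
      static float v[N];
      for (int i = 0; i < N; i++)
          v[i] = (float)i;

      /* Steps 2 and 3: the callsite triggers the transfer of v to the
         device and the remote execution of the codelet. If no accelerator
         (or no generated codelet) is present, the native host version of
         scale_vector runs instead. */
      #pragma hmpp scalevec callsite
      scale_vector(N, 2.0f, v);

      printf("v[1] = %f\n", v[1]);
      return 0;
  }

Such a file is typically built by invoking the HMPP compiler driver in front of the regular host compiler, so that the codelets are extracted and translated while the rest of the program is compiled natively.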

OpenACC

OpenACC consists of a set of standardized, high-level pragmas that enable C/C++ and Fortran programmers to extend their code to massively parallel processors with much of the convenience of OpenMP. The OpenACC standard thus preserves the familiarity of OpenMP code annotation while extending the execution model to encompass devices that reside in separate memory spaces. To support coprocessors, OpenACC pragmas annotate data placement and transfer (which OpenMP does not cover) as well as loop and block parallelism.

Note that, like OpenMP, OpenACC provides portability across operating systems, host CPUs and accelerator resources.

The details of data management and parallelism are implicit in the programming model and are managed by OpenACC API-enabled compilers and runtimes. The programming model thus allows the programmer to express data management, guidance on the mapping of loops onto an accelerator, and similar performance-related details in a clear way.
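
As a rough illustration of this style of annotation, the following C sketch offloads a SAXPY-type loop; the array names, the problem size and the scaling constant are illustrative assumptions:

  #include <stdio.h>

  #define N 1000000

  int main(void)
  {
      static float x[N], y[N];
      const float a = 2.0f;

      for (int i = 0; i < N; i++) {
          x[i] = (float)i;
          y[i] = 1.0f;
      }

      /* The data construct copies x to the device and copies y both in and
         out, so the loop itself causes no additional host-device traffic. */
      #pragma acc data copyin(x[0:N]) copy(y[0:N])
      {
          /* The parallel loop directive asks the compiler to map the loop
             iterations onto the accelerator's parallel hardware. */
          #pragma acc parallel loop
          for (int i = 0; i < N; i++)
              y[i] = a * x[i] + y[i];
      }

      printf("y[1] = %f\n", y[1]);
      return 0;
  }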

So far, OpenACC directives can be used via the CAPS HMPP product and via the latest versions of the PGI Compiler Suite.

MPI-ACC

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement frameworks, thus providing applications with no direct mechanism to perform end-to-end data movement. MPI-ACC is an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC’s runtime system enables several key optimizations, including pipelining of data transfers and balancing of communication based on accelerator and node architecture. MPI-ACC can use both the CUDA and OpenCL accelerator programming interfaces.

Reference: Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Wu-chun Feng, Keith R. Bisset, and Rajeev Thakur, "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems," in Proceedings of the 14th IEEE International Conference on High Performance Computing and Communications (HPCC), Liverpool, UK, 2012.

Link: http://synergy.cs.vt.edu/pubs/papers/aji-hpcc12-mpiacc.pdf
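
MPI-ACC's own programming interface (which, according to the paper above, identifies device buffers through MPI attributes) is not reproduced here. The hedged C sketch below only shows the manual host staging that plain MPI requires today, and which an integrated framework such as MPI-ACC removes by accepting the device buffer directly and pipelining the GPU and network transfers inside the runtime; the buffer names and sizes are illustrative:

  #include <mpi.h>
  #include <cuda_runtime.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int n = 1 << 20;
      float *d_buf, *h_buf;
      cudaMalloc((void **)&d_buf, n * sizeof(float));
      h_buf = (float *)malloc(n * sizeof(float));

      if (rank == 0) {
          /* Plain MPI: explicit device-to-host staging before the send,
             because MPI_Send only understands host memory. */
          cudaMemcpy(h_buf, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
          MPI_Send(h_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(h_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          /* ...and explicit host-to-device staging on the receiving side.
             With MPI-ACC the device buffer itself could be handed to the
             MPI call and this staging would happen inside the runtime. */
          cudaMemcpy(d_buf, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);
      }

      free(h_buf);
      cudaFree(d_buf);
      MPI_Finalize();
      return 0;
  }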

Intel LEO

Intel LEO (Language Extensions for Offload) is a set of high-level coprocessor offload directives intended for use with the new Intel MIC processors. These directives are inserted into high-level source code to tell the compiler to execute specific code on the accelerator. Intel does not support OpenACC directives. LEO is a less restrictive and more generalized set of offload directives than OpenACC, since it allows the programmer to offload virtually any function, or even a whole application, to the MIC hardware. The MIC architecture is based on the simpler Pentium architecture, which is more suitable for a manycore throughput processor. MIC cores are relatively slow, but they have almost all the functional capabilities of Xeon cores. Thus a MIC can behave as a general-purpose CPU, but with limited single-thread performance and a smaller memory.
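
A minimal C sketch of the LEO offload style is given below; the square function, the buffer size and the use of the inout clause are illustrative assumptions based on the Intel compiler's offload pragma:

  #include <stdio.h>
  #include <stdlib.h>

  #define N 1000000

  /* Functions (and global variables) used inside an offload region must be
     marked for the coprocessor build as well. */
  __attribute__((target(mic))) void square(float *v, int n)
  {
      for (int i = 0; i < n; i++)
          v[i] = v[i] * v[i];
  }

  int main(void)
  {
      float *v = (float *)malloc(N * sizeof(float));
      for (int i = 0; i < N; i++)
          v[i] = (float)i;

      /* The offload pragma copies v to the MIC card, executes the statement
         there, and copies v back when the offloaded statement completes. */
      #pragma offload target(mic) inout(v : length(N))
      square(v, N);

      printf("v[3] = %f\n", v[3]);
      free(v);
      return 0;
  }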

Links:
