New programming languages and models

From HP-SEE Wiki

OpenACC

OpenACC is an initiative from CAPS, Cray, NVIDIA and PGI to provide a new open parallel programming standard. The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators.

OpenACC consists of a set of standardized, high-level pragmas that enable C/C++ and Fortran programmers to extend their code to massively parallel processors with much of the convenience of OpenMP. Thus, the OpenACC standard preserves the familiarity of OpenMP code annotation while extending the execution model to encompass devices that reside in separate memory spaces. To support coprocessors, OpenACC pragmas annotate data placement and transfer (which OpenMP does not cover) as well as loop and block parallelism.

Note that, like OpenMP, OpenACC provides portability across operating systems, host CPUs and accelerator resources. The details of data management and parallelism are implicit in the programming model and are managed by OpenACC API-enabled compilers and runtimes. The programming model thus allows the programmer to express data management, guidance on the mapping of loops onto an accelerator, and similar performance-related details in a clear way.
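
As a brief illustration, the following C sketch (a hypothetical saxpy example, not taken from the OpenACC specification) shows how a data directive and a parallel loop directive might be combined; exact clause names should be checked against the OpenACC documentation and the compiler in use.

  /* OpenACC sketch: the data directive describes placement and transfer,
     the parallel loop directive offloads the loop to the accelerator.
     Built, for example, with an OpenACC-enabled compiler such as PGI. */
  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      const int n = 1 << 20;
      const float a = 2.0f;
      float *x = malloc(n * sizeof(float));
      float *y = malloc(n * sizeof(float));

      for (int i = 0; i < n; ++i) {
          x[i] = 1.0f;
          y[i] = 2.0f;
      }

      /* Copy x to the device; copy y in and back out. */
      #pragma acc data copyin(x[0:n]) copy(y[0:n])
      {
          /* Ask the compiler to run the loop on the attached accelerator. */
          #pragma acc parallel loop
          for (int i = 0; i < n; ++i)
              y[i] = a * x[i] + y[i];
      }

      printf("y[0] = %f (expected 4.0)\n", y[0]);
      free(x);
      free(y);
      return 0;
  }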

OpenACC compiler support is provided by the latest versions of the PGI Compiler Suite and by the CAPS and Cray compilers.

More information on OpenACC programming is included on the OpenACC page.

OpenHMPP

By combining parallel and heterogeneous cores, manycore processors offer tremendous performance potential at very low power. New parallel languages such as CUDA and OpenCL have emerged for fine-grained programming of manycore architectures. Being a higher-level abstraction, directive-based programming models:

  • Minimize code restructuring;
  • Keep applications hardware independent;
  • And ensure their portability across new generations of hardware.

Heterogeneous processor technology is here to stay, and porting legacy codes to GPU computing is a major challenge.

OpenHMPP

The OpenHMPP standard can be considered an extension of the OpenACC technology. This extension notably features multi-GPU computations, the integration of external libraries such as NVIDIA CUBLAS in a single directive, and the ability to easily tune kernels to get the best out of modern architectures. This syntax has been available since CAPS HMPP 3.1 and is interoperable with the other parts of HMPP.

CAPS OpenHMPP compiler

CAPS HMPP is a complete software solution for the parallelization of legacy codes, targeting the efficient usage of multicore and GPGPU resources. Much like OpenMP, HMPP provides a set of directives that can be used to speed up the execution of a code while preserving its portability across infrastructures. HMPP can be applied to codes that are already parallelized (with either MPI or OpenMP).

HMPP directives added to the application source code do not change the semantics of the original code. They address the remote execution of functions or regions of code on GPUs and many-core accelerators, as well as the transfer of data to and from the target device memory.

When implementing HMPP directives on top of an existing code, there are three steps to consider (a short directive sketch follows the list):

  * declaration of kernels (codelets) that contain a critical amount of computation
  * data management to and from the target device (i.e. GPGPU) memory
  * optimization of kernel performance and data synchronization
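
A minimal sketch of these steps is given below, using the codelet/callsite directive form described in the OpenHMPP documentation; the example itself is hypothetical and clause details may differ between HMPP versions.

  #include <stdio.h>

  #define N 1024

  /* Step 1: declare a codelet, i.e. a pure function holding the critical
     computation, to be generated for a CUDA target. The args clause
     annotates the direction of data transfers. */
  #pragma hmpp vadd codelet, target=CUDA, args[c].io=out
  void vadd(int n, float a[N], float b[N], float c[N])
  {
      for (int i = 0; i < n; ++i)
          c[i] = a[i] + b[i];
  }

  int main(void)
  {
      static float a[N], b[N], c[N];

      for (int i = 0; i < N; ++i) {
          a[i] = (float)i;
          b[i] = 2.0f * (float)i;
      }

      /* Steps 2 and 3: the callsite directive triggers the data transfers
         and the remote execution of the codelet; when no accelerator is
         present, the native host version of vadd is executed instead. */
      #pragma hmpp vadd callsite
      vadd(N, a, b, c);

      printf("c[10] = %f (expected 30.0)\n", c[10]);
      return 0;
  }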

In practice, HMPP compiles the application for execution on the native host separately from the GPU-accelerated codelet functions, which are implemented on top of the native application as software plugins. In effect, this means that the resulting application can be executed on a host with or without an accelerator resource (e.g. a GPGPU), as the codelets are executed on such hardware only when it is found on the host.

Note that the codelets are translated into the NVIDIA CUDA and OpenCL languages by the HMPP backend and are thus compiled with the existing tools for these extensions in the software stack.

MPI-ACC

Data movement in high-performance computing systems accelerated by graphics processing units (GPUs) remains a challenging problem. Data communication in popular parallel programming models, such as the Message Passing Interface (MPI), is currently limited to the data stored in the CPU memory space. Auxiliary memory systems, such as GPU memory, are not integrated into such data movement frameworks, thus providing applications with no direct mechanism to perform end-to-end data movement. MPI-ACC is an integrated and extensible framework that allows end-to-end data movement in accelerator-based systems. MPI-ACC provides productivity and performance benefits by integrating support for auxiliary memory spaces into MPI. MPI-ACC's runtime system enables several key optimizations, including pipelining of data transfers and balancing of communication based on accelerator and node architecture. MPI-ACC can use both the CUDA and OpenCL accelerator programming interfaces.
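
The snippet below is only a conceptual sketch of what such end-to-end movement looks like from the application's point of view: a CUDA device buffer is handed directly to MPI point-to-point calls, as in GPU-integrated MPI implementations in general. The exact way MPI-ACC marks accelerator buffers is described in the referenced paper and is not reproduced here.

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      const int n = 1024;
      float *dev_buf;                              /* GPU memory, not host memory */
      cudaMalloc((void **)&dev_buf, n * sizeof(float));

      if (rank == 0) {
          cudaMemset(dev_buf, 0, n * sizeof(float));
          /* The device pointer is passed directly to MPI; the runtime is
             responsible for (and can pipeline) the GPU-to-network transfer. */
          MPI_Send(dev_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(dev_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      cudaFree(dev_buf);
      MPI_Finalize();
      return 0;
  }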

Reference: Ashwin M. Aji, James Dinan, Darius Buntinas, Pavan Balaji, Wu-Chun Feng, Keith R. Bisset, and Rajeev Thakur, "MPI-ACC: An Integrated and Extensible Approach to Data Movement in Accelerator-Based Systems", in Proceedings of the 14th IEEE International Conference on High Performance Computing and Communications (HPCC), Liverpool, UK, 2012.

Link: http://synergy.cs.vt.edu/pubs/papers/aji-hpcc12-mpiacc.pdf

Intel LEO

Intel LEO (Language Extensions for Offload) is a set of high-level coprocessor offload directives intended for use with the new Intel MIC processors. These directives can be inserted into high-level source code to tell the compiler to execute specific code on the accelerator. Intel does not support OpenACC directives. LEO is a less restrictive and more generalized set of offload directives than OpenACC, since it allows the programmer to offload virtually any function or even a whole application to the MIC hardware. The MIC architecture is based on the simpler Pentium architecture, which is more suitable for a manycore throughput processor. MIC cores are relatively slow, but they have almost all the functional capabilities of Xeon cores. Thus MIC can behave as a general-purpose CPU, but with limited single-thread performance and smaller memory.
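
A minimal sketch of a LEO offload region in C is shown below (a hypothetical example; the exact clause syntax should be checked against the Intel compiler documentation).

  #include <stdio.h>

  #define N 1024

  int main(void)
  {
      static float a[N], b[N];

      for (int i = 0; i < N; ++i)
          a[i] = (float)i;

      /* Offload the following block to the first MIC coprocessor; the
         in/out clauses describe the data moved to and from device memory. */
      #pragma offload target(mic:0) in(a) out(b)
      {
          for (int i = 0; i < N; ++i)
              b[i] = 2.0f * a[i];
      }

      printf("b[10] = %f (expected 20.0)\n", b[10]);
      return 0;
  }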

Links:

GASPI / PGAS

Parallel software is currently based mainly on the MPI standard, which was established in 1994 and has since dominated applications development. The adaptation of parallel software to current hardware, dominated by higher core counts per CPU and by heterogeneous systems, has highlighted significant weaknesses of MPI that limit the scalability of applications on heterogeneous multi-core systems. As a result of both this hardware development and the objective of scaling to even higher CPU counts, new demands are placed on programming models in terms of a flexible thread model, asynchronous communication, and the management of storage subsystems with varying bandwidth and latency. This challenge to the software industry, also known as the "Multicore Challenge", stimulates the development of new programming models and programming languages and leads to new challenges for mathematical modeling, algorithms and their implementation in software.

PGAS (Partitioned Global Address Space) programming models have been discussed as an alternative to MPI for some time. The PGAS approach offers the developer an abstract shared address space that simplifies the programming task and at the same time facilitates data locality, thread-based programming and asynchronous communication.

GASPI, which stands for Global Address Space Programming Interface, is, as the name suggests, a partitioned global address space (PGAS) API. The GASPI standard is focused on three key objectives: scalability, flexibility and fault tolerance. It follows a single program multiple data (SPMD) approach and offers a small yet powerful API composed of synchronization primitives, synchronous and asynchronous collectives, fine-grained control over one-sided read and write communication primitives, global atomics, passive receives, communication groups and communication queues. The goal of the GASPI project is to develop a suitable programming tool for the wider HPC community by defining a standard, built on the PGAS API of Fraunhofer ITWM, that provides a reliable basis for future developments. Furthermore, an implementation of the standard as a highly portable open source library will be available. The standard will also define interfaces for performance analysis, for which tools will be developed in the project. The evaluation of the libraries is done via the parallel re-implementation of industrial applications, up to and including production status.

Essentially, GASPI uses one-sided RDMA-driven communication in a PGAS environment. As such, GASPI aims to initiate a paradigm shift from bulk-synchronous two-sided communication patterns towards an asynchronous communication and execution model.
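
A short sketch of this one-sided model is given below, using function names from the GASPI specification as implemented in the GPI-2 library; treat the exact signatures as indicative and check them against the specification.

  #include <GASPI.h>

  int main(int argc, char **argv)
  {
      gaspi_proc_init(GASPI_BLOCK);

      gaspi_rank_t rank, nprocs;
      gaspi_proc_rank(&rank);
      gaspi_proc_num(&nprocs);

      /* Each process exposes a segment of the partitioned global address space. */
      gaspi_segment_create(0, 1024 * sizeof(double), GASPI_GROUP_ALL,
                           GASPI_BLOCK, GASPI_MEM_INITIALIZED);

      if (rank == 0 && nprocs > 1) {
          /* One-sided RDMA write of 128 doubles into rank 1's segment,
             combined with a notification the target can wait on. */
          gaspi_write_notify(0, 0, 1, 0, 0, 128 * sizeof(double),
                             0, 1, 0, GASPI_BLOCK);
          gaspi_wait(0, GASPI_BLOCK);              /* flush queue 0 */
      } else if (rank == 1) {
          gaspi_notification_id_t id;
          gaspi_notify_waitsome(0, 0, 1, &id, GASPI_BLOCK);
      }

      gaspi_barrier(GASPI_GROUP_ALL, GASPI_BLOCK);
      gaspi_proc_term(GASPI_BLOCK);
      return 0;
  }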

Links:

OpenSHMEM

OpenSHMEM is an effort to create a specification for a standardized API for parallel programming in the Partitioned Global Address Space. Along with the specification the project is also creating a reference implementation of the API. This implementation attempts to be portable, to allow it to be deployed in multiple environments, and to be a starting point for implementations targeted to particular hardware platforms. It will also serve as a springboard for future development of the API.
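
As a brief illustration, the sketch below uses the OpenSHMEM C API for a one-sided put between processing elements (PEs); function names follow recent versions of the specification and vary slightly between specification versions.

  #include <shmem.h>
  #include <stdio.h>

  static long src  = 0;       /* symmetric object: exists on every PE  */
  static long dest = -1;      /* symmetric object: written remotely    */

  int main(void)
  {
      shmem_init();
      int me   = shmem_my_pe();
      int npes = shmem_n_pes();

      src = me;

      /* Each PE writes its value into 'dest' on the next PE (one-sided put). */
      shmem_long_put(&dest, &src, 1, (me + 1) % npes);

      /* The barrier also ensures completion of outstanding puts. */
      shmem_barrier_all();

      printf("PE %d received %ld\n", me, dest);

      shmem_finalize();
      return 0;
  }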

Links:

Julia

Julia is a high-level, high-performance dynamic programming language for technical computing, with syntax that is familiar to users of other technical computing environments. It provides a sophisticated compiler, distributed parallel execution, numerical accuracy, and an extensive mathematical function library. The library, mostly written in Julia itself, also integrates mature C and Fortran libraries for linear algebra, random number generation, FFTs, and string processing. More libraries continue to be added over time. Julia programs are organized around defining functions, and overloading them for different combinations of argument types (which can also be user-defined).

Julia language advantages:

  • Free and open source (MIT licensed)
  • Syntax similar to MATLAB
  • Designed for parallelism and distributed computation (multicore and cluster)
  • C functions called directly (no wrappers or special APIs needed)
  • Powerful shell-like capabilities for managing other processes
  • Lisp-like macros and other meta-programming facilities
  • User-defined types are as fast and compact as built-ins
  • LLVM-based just-in-time (JIT) compiler that allows Julia to approach and often match the performance of C/C++
  • An extensive mathematical function library (written in Julia)
  • Integrated mature C and Fortran libraries for linear algebra, random number generation, FFTs, and string processing

Julia does not impose any particular style of parallelism on the user. Instead, it provides a number of key building blocks for distributed computation, making it flexible enough to support a number of styles of parallelism, and allowing users to add more. Another important HPC feature of Julia is a native parallel computing model based on two primitives: remote references and remote calls. Julia uses message passing behind the scenes but does not require the user to control the environment explicitly like MPI. Communication in Julia is generally “one-sided,” meaning the programmer needs to manage only one processor explicitly in a two-processor operation. Julia also has support for distributed arrays.

Links:

Rootbeer

The Rootbeer GPU compiler is a project intended to ease the use of Graphics Processing Units from within Java. Rootbeer is more advanced than CUDA or OpenCL Java language bindings. With bindings, the developer must serialize complex graphs of objects into arrays of primitive types; with Rootbeer this is done automatically. Also, with language bindings the developer must write the GPU kernel in CUDA or OpenCL, whereas with Rootbeer a static analysis of the Java bytecode is performed (using Soot) and CUDA code is generated automatically.

Links:

Chapel

Chapel (Cascade High Productivity Language) is a parallel programming language developed by Cray. Chapel strives to vastly improve the programmability of large-scale parallel computers while keeping or improving the performance and portability of current programming models like MPI. Chapel provides a higher level of expression than current programming languages and tries to make a better distinction between algorithmic expression and data structure implementation details.

Chapel supports a multithreaded parallel programming model at a high level by supporting abstractions for data parallelism, task parallelism, and nested parallelism. It enables optimizations for the locality of data and computation in the program via abstractions for data distribution and data-driven placement of subcomputations. It allows for code reuse and generality through object-oriented concepts and generic programming features.

Links:
