From HP-SEE Wiki

OpenACC is an initiative from CAPS, Cray, NVIDIA and PGI to provide a new open parallel programming standard. The OpenACC Application Program Interface describes a collection of compiler directives that specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators.

OpenACC comprises a set of standardized, high-level pragmas that let C/C++ and Fortran programmers extend their code to massively parallel processors with much of the convenience of OpenMP. The OpenACC standard thus preserves the familiarity of OpenMP code annotation while extending the execution model to encompass devices that reside in separate memory spaces. To support such coprocessors, OpenACC pragmas annotate data placement and transfer (beyond what OpenMP annotates) as well as loop and block parallelism.

Like OpenMP, OpenACC provides portability across operating systems, host CPUs and accelerator resources. The details of data management and parallelism are implicit in the programming model and are managed by OpenACC API-enabled compilers and runtimes. The programming model thus lets the programmer state clearly how data is to be managed and give guidance on mapping loops onto an accelerator, along with similar performance-related details.

OpenACC compiler support is provided by the latest versions of the PGI Compiler Suite and of the CAPS and Cray compilers.


OpenACC Compilers

CAPS OpenACC Compiler

At ISC’12 in Hamburg, Germany, CAPS announced the 3.1.1 release of its HMPP Workbench compiler and demonstrated the same OpenACC directives compiled to run on NVIDIA, AMD and Intel accelerators. Based on a common set of directives for the C and Fortran languages, the CAPS OpenACC Compiler lets you incrementally develop or port existing applications to many-core hardware without the complexity usually associated with many-core programming. CAPS compilers now give programmers broad flexibility to accelerate applications on the hardware platform of their choice.

PGI Accelerator Compilers

PGI 2010 and later releases include the PGI Accelerator Fortran and C99 compilers supporting x64+NVIDIA systems running under Linux, Mac OS X and Windows. PGI introduced support for OpenACC directives with Release 2012 version 12.6 of the PGI Accelerator compilers. Using PGI Accelerator™ compilers, programmers can accelerate applications on x64+accelerator platforms by adding OpenACC compiler directives to existing high-level standard-compliant Fortran and C programs and then recompiling with appropriate compiler options.

Execution Model

OpenACC targets a host-directed execution model where the sequential code runs on a conventional processor and computationally intensive parallel pieces of code are offloaded to the accelerator device (such as a GPU) under control of the host. The device executes parallel regions or kernels regions. Even in accelerator-targeted regions, the host must orchestrate the execution by allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the parallel region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory. In most cases, the host can queue a sequence of operations to be executed on the device, one after the other.

Memory Model

The most significant difference between a host-only program and a host+accelerator program is that the memory on the accelerator may be completely separate from host memory. In the OpenACC model, data movement between the memories is implicit and managed by the compiler, based on directives from the programmer. Caches are also managed by the compiler in a similar way. However, in order to effectively accelerate a given region of code, the programmer must be aware of the memory bandwidth between the host memory and device memory, the device memory size and any other characteristics of the potentially separate memories.


Runtime Library Definitions

To use OpenACC runtime library routines and datatypes:

  • C/C++: #include "openacc.h"
  • Fortran: use openacc or #include "openacc_lib.h"

OpenACC Directive Format

In C and C++, OpenACC directives are specified with the #pragma mechanism. The syntax of an OpenACC directive is

#pragma acc directive-name [clause [[,] clause]...] new-line 

In Fortran, OpenACC directives are specified in free-form source files as

!$acc directive-name [clause [[,] clause]...]

OpenACC pragmas

Parallel Construct

When the program encounters an accelerator parallel construct, gangs of workers are created to execute the accelerator parallel region. Once the gangs are created, the number of gangs and the number of workers in each gang remain constant for the duration of that parallel region. One worker in each gang begins executing the code in the structured block of the construct.

#pragma acc parallel [clause [[,] clause]...] new-line
    structured block

!$acc parallel [clause [[,] clause]...]
    structured block
!$acc end parallel
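
As a minimal sketch of the parallel construct in C, the function below offloads an element-wise vector addition. The function name `vector_add` and its signature are assumptions made for illustration; the data clauses come from the clause list further down this page. When the file is built without OpenACC support the pragmas are ignored and the loop simply runs on the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: out[i] = x[i] + y[i] for i in [0, n). */
void vector_add(const float *x, const float *y, float *out, int n)
{
    assert(x != NULL && y != NULL && out != NULL && n >= 0);

    /* Create the gangs of workers for this region; copyin/copyout describe
       which arrays travel to the device and which travel back. */
    #pragma acc parallel copyin(x[0:n], y[0:n]) copyout(out[0:n])
    {
        /* Share the iterations among the gangs; without a loop directive
           every gang would execute the whole loop redundantly. */
        #pragma acc loop
        for (int i = 0; i < n; ++i) {
            out[i] = x[i] + y[i];
        }
    }
}
```

The loop directive inside the region is what turns the redundant per-gang execution described above into work sharing.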

Kernels Construct

The compiler will break the code in the kernels region into a sequence of accelerator kernels. Typically, each loop nest will be a distinct kernel. When the program encounters a kernels construct, it will launch the sequence of kernels in order on the device. The number and configuration of gangs of workers and vector length may be different for each kernel.

#pragma acc kernels [clause [[,] clause]...] new-line
    structured block

!$acc kernels [clause [[,] clause]...]
    structured block
!$acc end kernels
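
The following sketch illustrates a kernels region that contains two loop nests, which a compiler would typically turn into two kernels launched one after the other. The function `scale_then_shift` is a hypothetical example; the `restrict` qualifiers are an assumption added so that the compiler can tell the two arrays do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: scales a in place, then sets b[i] = a[i] + shift. */
void scale_then_shift(float *restrict a, float *restrict b, int n,
                      float factor, float shift)
{
    assert(a != NULL && b != NULL && n >= 0);

    #pragma acc kernels copy(a[0:n]) copyout(b[0:n])
    {
        /* First loop nest: typically compiled into its own kernel. */
        for (int i = 0; i < n; ++i) {
            a[i] = a[i] * factor;
        }

        /* Second loop nest: a separate kernel, launched after the first. */
        for (int i = 0; i < n; ++i) {
            b[i] = a[i] + shift;
        }
    }
}
```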

Data Construct

The data construct defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the region, whether data should be copied from the host to the device memory upon region entry, and copied from the device to host memory upon region exit.

#pragma acc data [clause [[,] clause]...] new-line
    structured block

!$acc data [clause [[,] clause]...]
    structured block
!$acc end data
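
A data region is most useful when several compute regions share the same arrays. The sketch below, with the hypothetical function `square_plus_one`, keeps an intermediate array on the device for the whole region so that it is never transferred. It uses the combined `parallel loop` form, which the OpenACC specification defines but this page does not otherwise cover.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical example: out[i] = in[i] * in[i] + 1.
   Returns 0 on success, -1 if the temporary buffer cannot be allocated. */
int square_plus_one(const float *restrict in, float *restrict out, int n)
{
    assert(in != NULL && out != NULL && n > 0);

    float *restrict tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL) {
        return -1;
    }

    /* in: copied to the device on entry; tmp: allocated on the device only;
       out: copied back to the host on exit. */
    #pragma acc data copyin(in[0:n]) create(tmp[0:n]) copyout(out[0:n])
    {
        #pragma acc parallel loop present(in[0:n], tmp[0:n])
        for (int i = 0; i < n; ++i) {
            tmp[i] = in[i] * in[i];
        }

        #pragma acc parallel loop present(tmp[0:n], out[0:n])
        for (int i = 0; i < n; ++i) {
            out[i] = tmp[i] + 1.0f;
        }
    }

    free(tmp);
    return 0;
}
```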

Loop Construct

The OpenACC loop directive applies to a loop which must immediately follow this directive. The loop directive can describe what type of parallelism to use to execute the loop and declare loop-private variables and arrays and reduction operations.

#pragma acc loop [clause [[,] clause]...] new-line
    for loop

!$acc loop [clause [[,] clause]...]
    do loop
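
To illustrate a reduction declared on a loop directive, here is a small dot-product sketch. The function name `dot_product` is hypothetical, and placing `sum` in a `copy` clause on the enclosing parallel construct is one conservative way, assumed here, to make sure the reduced value returns to the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: returns the sum of x[i] * y[i] for i in [0, n). */
double dot_product(const double *x, const double *y, int n)
{
    assert(x != NULL && y != NULL && n >= 0);

    double sum = 0.0;

    #pragma acc parallel copyin(x[0:n], y[0:n]) copy(sum)
    {
        /* Each thread accumulates a private partial sum; the partial sums
           are combined with + when the loop finishes. */
        #pragma acc loop reduction(+:sum)
        for (int i = 0; i < n; ++i) {
            sum += x[i] * y[i];
        }
    }

    return sum;
}
```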

Cache Directive

The cache directive may appear at the top of (inside of) a loop. It specifies array elements or subarrays that should be fetched into the highest level of the cache for the body of the loop.

#pragma acc cache ( list ) new-line 
!$acc cache ( list )
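
A typical candidate for the cache directive is a stencil, where neighbouring iterations reuse the same input elements. The sketch below is an illustration under assumptions: the function `smooth3` is hypothetical, and whether the hint has any effect depends on the compiler and the device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: 3-point moving average of the interior elements.
   out[0] and out[n-1] are left unchanged. */
void smooth3(const float *restrict in, float *restrict out, int n)
{
    assert(in != NULL && out != NULL && n >= 3);

    #pragma acc parallel loop copyin(in[0:n]) copy(out[0:n])
    for (int i = 1; i < n - 1; ++i) {
        /* Hint: keep the three elements starting at in[i-1] in the fastest
           level of the cache for this loop body. */
        #pragma acc cache(in[i-1:3])
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
    }
}
```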

Declare Directive

A declare directive is used in the declaration section of a Fortran subroutine, function, or module, or following a variable declaration in C or C++. It can specify that a variable or array is to be allocated in the device memory for the duration of the implicit data region of a function, subroutine or program, and specify whether the data values are to be transferred from the host to the device memory upon entry to the implicit data region, and from the device to the host memory upon exit from the implicit data region. These directives create a visible device copy of the variable or array.

#pragma acc declare declclause [[,] declclause]... new-line
!$acc declare declclause [[,] declclause]...
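
As a sketch of the C form, the fragment below places a declare directive right after a file-scope array so that a device copy exists for the implicit data region of the program. The names `taps`, `NUM_TAPS` and `apply_taps` are hypothetical, and the combined `parallel loop` form is taken from the OpenACC specification rather than from this page.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_TAPS 3

/* Small lookup table with a device copy initialised from the host values. */
static float taps[NUM_TAPS] = {0.25f, 0.5f, 0.25f};
#pragma acc declare copyin(taps)

/* Hypothetical example: out[i] = x[i] * taps[i % NUM_TAPS]. */
void apply_taps(const float *restrict x, float *restrict out, int n)
{
    assert(x != NULL && out != NULL && n >= 0);

    /* taps needs no data clause here: the declare directive above already
       made it visible on the device. */
    #pragma acc parallel loop copyin(x[0:n]) copyout(out[0:n])
    for (int i = 0; i < n; ++i) {
        out[i] = x[i] * taps[i % NUM_TAPS];
    }
}
```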

Update Directive

The update directive is used within an explicit or implicit data region to update all or part of a host memory array with values from the corresponding array in device memory, or to update all or part of a device memory array with values from the corresponding array in host memory.

#pragma acc update clause [[,] clause]... new-line
!$acc update clause [[,] clause]...
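
The update directive is handy when the host needs an intermediate value while the data stays resident on the device. In the sketch below, the hypothetical function `double_and_trace` refreshes only the first element of the host array after each step; the `trace` array lives on the host only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: doubles every element of a `steps` times and
   records a[0] after each step in trace[0..steps-1]. */
void double_and_trace(float *restrict a, int n, float *restrict trace, int steps)
{
    assert(a != NULL && trace != NULL && n > 0 && steps >= 0);

    #pragma acc data copy(a[0:n])
    {
        for (int s = 0; s < steps; ++s) {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; ++i) {
                a[i] *= 2.0f;
            }

            /* Bring just one element back to the host; the rest of the
               array is transferred only when the data region ends. */
            #pragma acc update host(a[0:1])
            trace[s] = a[0];
        }
    }
}
```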

Wait Directive

The wait directive causes the program to wait for completion of an asynchronous activity, such as an accelerator parallel or kernels region or update directive.

#pragma acc wait [( scalar-integer-expression )] new-line
!$acc wait [( scalar-integer-expression )]
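
The sketch below pairs the wait directive with the async clause from the clause list: two independent regions are queued with different async values and the host waits for each only when it needs the result. The function `fill_async` and the choice of the values 1 and 2 are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: fills a[i] = i and b[i] = 2 * i asynchronously and
   returns a[0] + a[n-1] + b[n-1]. */
float fill_async(float *restrict a, float *restrict b, int n)
{
    assert(a != NULL && b != NULL && n > 0);

    /* Queue both regions; the host continues without waiting. */
    #pragma acc parallel loop async(1) copyout(a[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = (float)i;
    }

    #pragma acc parallel loop async(2) copyout(b[0:n])
    for (int i = 0; i < n; ++i) {
        b[i] = 2.0f * (float)i;
    }

    /* Wait only for the activity queued with async value 1. */
    #pragma acc wait(1)
    float partial = a[0] + a[n - 1];

    /* Wait for all remaining asynchronous activity before reading b. */
    #pragma acc wait
    return partial + b[n - 1];
}
```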

The list below shows the clauses used in OpenACC version 1.0 pragmas (parallel, kernels, data, loop, declare and update):

host( list )
device( list )
collapse( n )
gang [( scalar-integer-expression )]
worker [( scalar-integer-expression )]
vector [( scalar-integer-expression )]
async [( scalar-integer-expression )]
num_gangs ( scalar-integer-expression )
num_workers ( scalar-integer-expression )
vector_length ( scalar-integer-expression )
reduction ( operator : list )
copy( list )
copyin( list )
copyout( list )
create( list )
present( list )
present_or_copy ( list )
present_or_copyin ( list )
present_or_copyout ( list )
present_or_create ( list )
deviceptr( list )
device_resident ( list )
private( list )
firstprivate( list )

Descriptions of the data clauses, and of the use cases each one applies to, can be found in the OpenACC 1.0 specification (OpenACC.1.0.pdf) and the OpenACC API Quick Reference Guide.

Internal Control Variables

An OpenACC implementation acts as if there are internal control variables (ICVs) that control the behavior of the program. These ICVs are initialized by the implementation, and may be given values through environment variables and through calls to OpenACC API routines. The program can retrieve values through calls to OpenACC API routines.

The ICVs are:

acc-device-type-var - controls which type of accelerator device is used.

acc-device-num-var - controls which accelerator device of the selected type is used.

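The ways to modify and retrieve each ICV are:

acc-device-type-var - modified through the ACC_DEVICE_TYPE environment variable or acc_set_device_type; retrieved through acc_get_device_type.

acc-device-num-var - modified through the ACC_DEVICE_NUM environment variable or acc_set_device_num; retrieved through acc_get_device_num.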

Runtime Library Routines

The table below shows a list of runtime routines provided by OpenACC:

Routine - Description
acc_get_num_devices ( devicetype ) - Returns the number of accelerator devices of the specified type.
acc_set_device_type ( devicetype ) - Sets the accelerator device type to use for this host thread.
acc_get_device_type ( ) - Returns the accelerator device type that is being used by this host thread.
acc_set_device_num ( devicenum, devicetype ) - Sets the accelerator device number to use for this host thread.
acc_get_device_num ( devicetype ) - Returns the accelerator device number that is being used by this host thread.
acc_async_test ( expression ) - Returns nonzero or .TRUE. if all asynchronous activities with the given expression have been completed; otherwise returns zero or .FALSE.
acc_async_test_all ( ) - Returns nonzero or .TRUE. if all asynchronous activities have been completed; otherwise returns zero or .FALSE.
acc_async_wait ( expression ) - Waits until all asynchronous activities with the given expression have been completed.
acc_async_wait_all ( ) - Waits until all asynchronous activities have been completed.
acc_init ( devicetype ) - Initializes the runtime system and sets the accelerator device type to use for this host thread.
acc_shutdown ( devicetype ) - Disconnects this host thread from the accelerator device.
acc_on_device ( devicetype ) - In an OpenACC parallel or kernels region, this is used to take different execution paths depending on whether the program is running on an accelerator or on the host.
acc_malloc ( size_t ) - Returns the address of memory allocated on the accelerator device.
acc_free ( void* ) - Frees memory allocated by acc_malloc.
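
A short sketch of calling the runtime library from C follows. It assumes that the compiler defines the `_OPENACC` macro when OpenACC is enabled, as the specification requires, so that the same source also builds without OpenACC. The function `init_accelerator` is hypothetical, and `acc_device_not_host` is one of the device types defined by the specification.

```c
#include <assert.h>

#ifdef _OPENACC
#include <openacc.h>   /* OpenACC runtime routines and datatypes */
#endif

/* Hypothetical helper: returns the number of attached non-host devices and
   initialises the runtime if at least one is present. Returns 0 when the
   program was built without OpenACC support. */
int init_accelerator(void)
{
    int num_devices = 0;

#ifdef _OPENACC
    num_devices = acc_get_num_devices(acc_device_not_host);
    if (num_devices > 0) {
        acc_init(acc_device_not_host);
    }
#endif

    assert(num_devices >= 0);   /* sanity check on the reported count */
    return num_devices;
}
```

Calling such a helper once at start-up moves the device initialisation cost out of the first accelerated region.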

OpenACC Simple Use Case

We briefly discuss here the development effort needed to implement a simple matrix-matrix multiplication algorithm using OpenACC directives on GPU resources, and showcase timing and performance results obtained with several development approaches (a simple algorithm for the CPU, a simple algorithm using CUDA, cuBLAS and OpenACC).
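
The OpenACC variant of such a matrix-matrix multiplication might look roughly like the sketch below. It is an illustrative assumption, not the code behind any timing results: the function name `matmul_acc`, the row-major flat storage and the choice of `collapse(2)` on a combined `parallel loop` are all made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: c = a * b for square n x n matrices stored
   row-major in flat arrays of n * n elements. */
void matmul_acc(const float *restrict a, const float *restrict b,
                float *restrict c, int n)
{
    assert(a != NULL && b != NULL && c != NULL && n > 0);

    /* Collapse the i and j loops into one iteration space that is spread
       over the accelerator; each iteration runs the k loop sequentially. */
    #pragma acc parallel loop collapse(2) \
        copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}
```

Compared with the plain CPU loop nest, the only change is the added directive, which reflects the small development effort that the directive-based approach aims for.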

