From HP-SEE Wiki
OpenACC is an initiative from CAPS, CRAY, NVIDIA and PGI to provide a new open parallel programming standard. The OpenACC Application Program Interface describes a collection of compiler directives to specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator, providing portability across operating systems, host CPUs and accelerators.
OpenACC comprises a set of standardized, high-level pragmas that enable C/C++ and Fortran programmers to offload their code to massively parallel processors with much of the convenience of OpenMP. The OpenACC standard thus preserves the familiarity of OpenMP code annotation while extending the execution model to encompass devices that reside in separate memory spaces. To support coprocessors, OpenACC pragmas annotate data placement and transfer (in addition to what OpenMP provides) as well as loop and block parallelism.
Note that, like OpenMP, OpenACC provides portability across operating systems, host CPUs and accelerator resources. The details of data management and parallelism are implicit in the programming model and are managed by OpenACC API-enabled compilers and runtimes. The programming model thus allows the programmer to give clear guidance on data management and on the mapping of loops onto an accelerator, while leaving the remaining performance-related details to the compiler and runtime.
OpenACC compiler support is provided by recent versions of the PGI Compiler Suite and by the CAPS and Cray compilers.
CAPS OpenACC Compiler
At ISC'12 in Hamburg, Germany, CAPS announced the 3.1.1 release of its HMPP Workbench compiler, and demonstrated the same OpenACC directives compiled to run on NVIDIA, AMD and Intel accelerators. Based on a common set of directives for the C and Fortran languages, the CAPS OpenACC compiler lets you incrementally develop or port existing applications to many-core hardware without the complexity usually associated with many-core programming. CAPS compilers now give programmers broad flexibility to accelerate applications on the hardware platform of their choice.
PGI Accelerator Compilers
PGI 2010 and later releases include the PGI Accelerator Fortran and C99 compilers supporting x64+NVIDIA systems running under Linux, Mac OS X and Windows. PGI introduced support for OpenACC directives with Release 2012 version 12.6 of the PGI Accelerator compilers. Using PGI Accelerator™ compilers, programmers can accelerate applications on x64+accelerator platforms by adding OpenACC compiler directives to existing high-level standard-compliant Fortran and C programs and then recompiling with appropriate compiler options.
OpenACC targets a host-directed execution model where the sequential code runs on a conventional processor and computationally intensive parallel pieces of code are offloaded to the accelerator device (such as a GPU) under control of the host. The device executes parallel regions or kernels regions. Even in accelerator-targeted regions, the host must orchestrate the execution by allocating memory on the accelerator device, initiating data transfer, sending the code to the accelerator, passing arguments to the parallel region, queuing the device code, waiting for completion, transferring results back to the host, and deallocating memory. In most cases, the host can queue a sequence of operations to be executed on the device, one after the other.
The most significant difference between a host-only program and a host+accelerator program is that the memory on the accelerator may be completely separate from host memory. In the OpenACC model, data movement between the memories is implicit and managed by the compiler, based on directives from the programmer. Caches are also managed by the compiler in a similar way. However, in order to effectively accelerate a given region of code, the programmer must be aware of the memory bandwidth between host memory and device memory, the device memory size, and any other characteristics of the potentially separate memories.
Runtime Library Definitions
To use OpenACC runtime library routines and datatypes:
- C/C++: #include "openacc.h"
- Fortran: use openacc or #include "openacc_lib.h"
OpenACC Directive Format
In C and C++, OpenACC directives are specified with the #pragma mechanism. The syntax of an OpenACC directive is
#pragma acc directive-name [clause [[,] clause]...] new-line
In Fortran, OpenACC directives are specified in free-form source files as
!$acc directive-name [clause [[,] clause]...]
When the program encounters an accelerator parallel construct, gangs of workers are created to execute the accelerator parallel region. Once the gangs are created, the number of gangs and the number of workers in each gang remain constant for the duration of that parallel region. One worker in each gang begins executing the code in the structured block of the construct.

C/C++:

 #pragma acc parallel [clause [[,] clause]...] new-line
   structured block

Fortran:

 !$acc parallel [clause [[,] clause]...]
   structured block
 !$acc end parallel
The compiler will break the code in the kernels region into a sequence of accelerator kernels. Typically, each loop nest will be a distinct kernel. When the program encounters a kernels construct, it will launch the sequence of kernels in order on the device. The number and configuration of gangs of workers and the vector length may be different for each kernel.

C/C++:

 #pragma acc kernels [clause [[,] clause]...] new-line
   structured block

Fortran:

 !$acc kernels [clause [[,] clause]...]
   structured block
 !$acc end kernels
The data construct defines scalars, arrays and subarrays to be allocated in the device memory for the duration of the region, whether data should be copied from the host to the device memory upon region entry, and copied from the device to host memory upon region exit.

C/C++:

 #pragma acc data [clause [[,] clause]...] new-line
   structured block

Fortran:

 !$acc data [clause [[,] clause]...]
   structured block
 !$acc end data
The OpenACC loop directive applies to a loop which must immediately follow this directive. The loop directive can describe what type of parallelism to use to execute the loop and declare loop-private variables and arrays and reduction operations.

C/C++:

 #pragma acc loop [clause [[,] clause]...] new-line
   for loop

Fortran:

 !$acc loop [clause [[,] clause]...]
   do loop
The cache directive may appear at the top of (inside of) a loop. It specifies array elements or subarrays that should be fetched into the highest level of the cache for the body of the loop.

C/C++:

 #pragma acc cache ( list ) new-line

Fortran:

 !$acc cache ( list )
A declare directive is used in the declaration section of a Fortran subroutine, function, or module, or following a variable declaration in C or C++. It can specify that a variable or array is to be allocated in the device memory for the duration of the implicit data region of a function, subroutine or program, and specify whether the data values are to be transferred from the host to the device memory upon entry to the implicit data region, and from the device to the host memory upon exit from the implicit data region. These directives create a visible device copy of the variable or array.

C/C++:

 #pragma acc declare declclause [[,] declclause]... new-line

Fortran:

 !$acc declare declclause [[,] declclause]...
The update directive is used within an explicit or implicit data region to update all or part of a host memory array with values from the corresponding array in device memory, or to update all or part of a device memory array with values from the corresponding array in host memory.

C/C++:

 #pragma acc update clause [[,] clause]... new-line

Fortran:

 !$acc update clause [[,] clause]...
The wait directive causes the program to wait for completion of an asynchronous activity, such as an accelerator parallel or kernels region or update directive.

C/C++:

 #pragma acc wait [( scalar-integer-expression )] new-line

Fortran:

 !$acc wait [( scalar-integer-expression )]
The table below lists the clauses used in OpenACC version 1.0 pragmas and the directives that accept each clause:

| Clause | parallel | kernels | loop | data | declare | update |
|---|---|---|---|---|---|---|
| host( list ) |  |  |  |  |  | √ |
| device( list ) |  |  |  |  |  | √ |
| collapse( n ) |  |  | √ |  |  |  |
| gang [( scalar-integer-expression )] |  |  | √ |  |  |  |
| worker [( scalar-integer-expression )] |  |  | √ |  |  |  |
| vector [( scalar-integer-expression )] |  |  | √ |  |  |  |
| async [( scalar-integer-expression )] | √ | √ |  |  |  | √ |
| num_gangs ( scalar-integer-expression ) | √ |  |  |  |  |  |
| num_workers ( scalar-integer-expression ) | √ |  |  |  |  |  |
| vector_length ( scalar-integer-expression ) | √ |  |  |  |  |  |
| reduction ( operator : list ) | √ |  | √ |  |  |  |
| copy( list ) | √ | √ |  | √ | √ |  |
| copyin( list ) | √ | √ |  | √ | √ |  |
| copyout( list ) | √ | √ |  | √ | √ |  |
| create( list ) | √ | √ |  | √ | √ |  |
| present( list ) | √ | √ |  | √ | √ |  |
| present_or_copy ( list ) | √ | √ |  | √ | √ |  |
| present_or_copyin ( list ) | √ | √ |  | √ | √ |  |
| present_or_copyout ( list ) | √ | √ |  | √ | √ |  |
| present_or_create ( list ) | √ | √ |  | √ | √ |  |
| deviceptr( list ) | √ | √ |  | √ | √ |  |
| device_resident ( list ) |  |  |  |  | √ |  |
| private( list ) | √ |  | √ |  |  |  |
| firstprivate( list ) | √ |  |  |  |  |  |
Internal Control Variables
An OpenACC implementation acts as if there are internal control variables (ICVs) that control the behavior of the program. These ICVs are initialized by the implementation, and may be given values through environment variables and through calls to OpenACC API routines. The program can retrieve values through calls to OpenACC API routines.
The ICVs are:
- acc-device-type-var - controls which type of accelerator device is used.
- acc-device-num-var - controls which accelerator device of the selected type is used.

| ICV | Ways to modify values | Way to retrieve value |
|---|---|---|
| acc-device-type-var | ACC_DEVICE_TYPE environment variable, acc_set_device_type | acc_get_device_type |
| acc-device-num-var | ACC_DEVICE_NUM environment variable, acc_set_device_num | acc_get_device_num |
Runtime Library Routines
The table below shows a list of runtime routines provided by OpenACC:

| Routine | Description |
|---|---|
| acc_get_num_devices ( devicetype ) | Returns the number of accelerator devices of the specified type |
| acc_set_device_type ( devicetype ) | Sets the accelerator device type to use for this host thread |
| acc_get_device_type ( ) | Returns the accelerator device type that is being used by this host thread |
| acc_set_device_num ( devicenum, devicetype ) | Sets the accelerator device number to use for this host thread |
| acc_get_device_num ( devicetype ) | Returns the accelerator device number that is being used by this host thread |
| acc_async_test ( expression ) | Returns nonzero or .TRUE. if all asynchronous activities with the given expression have been completed; otherwise returns zero or .FALSE. |
| acc_async_test_all ( ) | Returns nonzero or .TRUE. if all asynchronous activities have been completed; otherwise returns zero or .FALSE. |
| acc_async_wait ( expression ) | Waits until all asynchronous activities with the given expression have been completed |
| acc_async_wait_all ( ) | Waits until all asynchronous activities have been completed |
| acc_init ( devicetype ) | Initializes the runtime system and sets the accelerator device type to use for this host thread |
| acc_shutdown ( devicetype ) | Disconnects this host thread from the accelerator device |
| acc_on_device ( devicetype ) | In an OpenACC parallel or kernels region, this is used to take different execution paths depending on whether the program is running on an accelerator or on the host |
| acc_malloc ( size_t ) | Returns the address of memory allocated on the accelerator device |
| acc_free ( void* ) | Frees memory allocated by acc_malloc |
OpenACC Simple Use Case
We briefly discuss here the development effort needed to implement a simple matrix-matrix multiplication algorithm using OpenACC directives on GPU resources, and showcase timing and performance results obtained via several development approaches (a simple algorithm for the CPU, a simple algorithm using CUDA, cuBLAS, and OpenACC).