OpenACC Use Case

From HP-SEE Wiki

Revision as of 06:11, 16 July 2013 by Pkoro (Talk | contribs)

Matrix-matrix multiplication

We briefly discuss the development effort needed to implement a simple matrix-matrix multiplication algorithm on GPU resources using OpenACC directives, and we present timing and performance results obtained with several development approaches (a plain CPU implementation, a CUDA implementation, CuBLAS and OpenACC).

Introduction

OpenACC is an initiative from CAPS, CRAY, NVIDIA and PGI to provide a new open parallel programming standard.

The Application Program Interface provides a collection of compiler directives that mark loops and regions of standard C, C++ and Fortran code to be offloaded from a host CPU to an attached accelerator. As with OpenMP, this directive-based approach keeps the code portable across operating systems, host CPUs and accelerator resources.

Effort required

To get a rough estimate of the effort required to port an existing code to a GPU-enabled infrastructure using OpenACC directives, we used a simple matrix-matrix multiplication algorithm as a starting point.

As Listing 1 shows, the changes required to parallelize the loop nest that performs the matrix-matrix multiplication are minimal (as is also the case with OpenMP). Note that the input and output blocks of memory to be transferred to and from the GPU card must be defined explicitly within the pragma statement. (The Fortran syntax is analogous.)

 #pragma acc kernels copyin(a,b) copyout(c)
 for (i = 0; i < SIZE; i++) {
   for (j = 0; j < SIZE; j++) {
     c[i][j] = 0.0;  /* clear the output before accumulating */
     for (k = 0; k < SIZE; k++) {
       c[i][j] += a[i][k] * b[k][j];
     }
   }
 }


Further fine tuning of the parallelization strategy on the GPU requires additional pragmas. In the example given in Listing 2 we explicitly declare the innermost loop sequential and then map the two outer loops onto 512 gangs and vectors of 512 threads, respectively, to be executed on the GPU concurrently.

 #pragma acc kernels copyin(a,b) copyout(c)
 {
   #pragma acc loop independent gang(512)
   for (i = 0; i < SIZE; i++) {
     #pragma acc loop independent vector(512)
     for (j = 0; j < SIZE; j++) {
       c[i][j] = 0.0;  /* clear the output before accumulating */
       #pragma acc loop seq
       for (k = 0; k < SIZE; k++) {
         c[i][j] += a[i][k] * b[k][j];
       }
     }
   }
 }


Benchmark tests

To estimate the performance gain over the plain CPU algorithm we used a Tesla M2090 GPU card at the HPCG site. Our CPU results were obtained at the same site, specifically on an Intel Xeon E5649 (2.53 GHz) CPU.

In Figure 1 we compare the results obtained with the non-OpenACC code compiled with gcc and PGI without any optimization options, and with PGI using the "-fast" optimization option, which vectorizes the matrix product on the CPU.
