OpenACC Use Case


Matrix-matrix multiplication

We briefly discuss the development effort needed to implement a simple matrix-matrix multiplication algorithm using OpenACC directives on GPU resources, and we showcase timing and performance results obtained via several development approaches (a simple algorithm for the CPU, a simple algorithm using CUDA, CuBLAS, and OpenACC).

Introduction

OpenACC is an initiative from CAPS, Cray, NVIDIA and PGI to provide a new open parallel programming standard.

The Application Program Interface provides a collection of compiler directives that specify loops and regions of code in standard C, C++ and Fortran to be offloaded from a host CPU to an attached accelerator.

The motivation behind OpenACC is to provide an easy, powerful, portable and open standard to developers who need a higher-level approach to GPU acceleration [1]. Note that, in similarity to OpenMP, OpenACC provides portability across operating systems, host CPUs and accelerator resources.

Effort required

To get a rough overview of the effort required to port an existing code onto a GPU-enabled infrastructure using OpenACC directives, we have used a simple matrix-matrix multiplication algorithm as a basis.

As shown in Listing 1, the changes required to parallelize the for-loop that performs the matrix-matrix multiplication are rather minimal (as is also the case when using OpenMP). Note that the input and output chunks of memory to be transferred to and from the GPU card must be explicitly defined within the pragma statement. (A similar syntax is used in the Fortran case.)

 /* a, b and c are SIZE x SIZE matrices; i, j, k are loop indices. */
 /* c is zeroed on the device because copyout transfers it back to */
 /* the host without first copying the host values in.             */
 #pragma acc kernels copyin(a,b) copyout(c)
 for (i = 0; i < SIZE; i++) {
   for (j = 0; j < SIZE; j++) {
     c[i][j] = 0.0;
     for (k = 0; k < SIZE; k++) {
       c[i][j] += a[i][k] * b[k][j];
     }
   }
 }

Listing 1 - Simple matrix-matrix multiplication with OpenACC directives
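For reference, a file annotated in this way can be built with any OpenACC-capable compiler; a minimal sketch, assuming the PGI compiler and a hypothetical source file name:

 pgcc -acc -Minfo=accel mm_acc.c -o mm_acc

The -Minfo=accel flag makes the compiler report how each loop and data transfer was mapped onto the accelerator, which is helpful when adding the explicit clauses discussed next.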


Further modifications and fine-tuning of the parallelization strategy implemented on top of the GPU resources require the addition of more pragmas. In the example given in Listing 2 we explicitly mark the innermost loop as sequential, distribute the outer loop across 512 gangs, and execute the middle loop with a vector length of 512, so that many iterations run on the GPU concurrently.

 /* Same matrices as in Listing 1; here the mapping of each loop */
 /* onto the GPU is specified explicitly.                        */
 #pragma acc kernels copyin(a,b) copyout(c)
 {
   #pragma acc loop independent gang(512)
   for (i = 0; i < SIZE; i++) {
     #pragma acc loop independent vector(512)
     for (j = 0; j < SIZE; j++) {
       c[i][j] = 0.0;
       #pragma acc loop seq
       for (k = 0; k < SIZE; k++) {
         c[i][j] += a[i][k] * b[k][j];
       }
     }
   }
 }

Listing 2 - Matrix-matrix multiplication with explicit gang, vector and seq clauses


Benchmark tests

To get a rough overview of the performance gain (in comparison to the plain CPU algorithm) we have used a Tesla M2090 GPU card at the HPCG site. Our CPU results have been obtained at the same site, specifically on an Intel Xeon E5649 (2.53 GHz) CPU.
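The reported execution times are wall-clock timings of the multiplication itself. A minimal sketch of such a timing harness, assuming gettimeofday; the multiplication region is a placeholder:

 #include <stdio.h>
 #include <sys/time.h>

 /* Wall-clock time in seconds, via gettimeofday. */
 static double wtime(void)
 {
   struct timeval tv;
   gettimeofday(&tv, NULL);
   return tv.tv_sec + tv.tv_usec * 1.0e-6;
 }

 int main(void)
 {
   double t0 = wtime();
   /* ... OpenACC multiplication region of Listing 1 or 2 ... */
   double t1 = wtime();
   printf("Multiplication took %.3f s\n", t1 - t0);
   return 0;
 }

Note that the first accelerated region in a program also pays a one-off device initialization cost, so a warm-up run before the timed one gives more representative figures.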

In Figure 1 we compare our results for the non-OpenACC code, compiled with gcc and PGI without any optimization options, and with PGI using the “-fast” optimization option, which vectorizes the matrix product on the CPU.


Figure 1 - Comparison of execution times for simple matrix-matrix multiplication

As can be seen, the simple OpenACC approach (given in Listing 1) outperforms the serial execution times measured on the CPU, even when the compiler performs some vectorization of the product behind the scenes.

Going into further detail, we have measured the execution time of the OpenACC code given in Listing 2 for various gang and vector sizes. Our results for a fixed size of the square matrices being multiplied (4096) are shown in Figure 2 below.


Figure 2 - Comparison of different configurations of gang and vector size (square matrices of size 4096)

The best timing results for this multiplication size were obtained for gang and vector sizes of 512.
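Since the gang and vector clause arguments in Listing 2 are fixed at compile time, a sweep like the one behind Figure 2 can be produced by taking them from preprocessor macros and rebuilding once per configuration; a sketch under that assumption (the macro names GANG_SIZE and VEC_SIZE are illustrative):

 /* As Listing 2, with the clause arguments supplied at build time, */
 /* e.g.: pgcc -acc -DGANG_SIZE=512 -DVEC_SIZE=512 mm_acc.c         */
 #ifndef GANG_SIZE
 #define GANG_SIZE 512
 #endif
 #ifndef VEC_SIZE
 #define VEC_SIZE 512
 #endif

 #pragma acc kernels copyin(a,b) copyout(c)
 {
   #pragma acc loop independent gang(GANG_SIZE)
   for (i = 0; i < SIZE; i++) {
     #pragma acc loop independent vector(VEC_SIZE)
     for (j = 0; j < SIZE; j++) {
       c[i][j] = 0.0;
       #pragma acc loop seq
       for (k = 0; k < SIZE; k++)
         c[i][j] += a[i][k] * b[k][j];
     }
   }
 }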

These results, however, are only partial, as there are several existing approaches to implementing an efficient matrix-matrix multiplication that should be preferred to writing one's own code. Thus, in Figures 3 and 4 we provide timing and performance results comparing our more advanced OpenACC implementation (Listing 2) to:

  1. a (similarly simple) CUDA implementation, and
  2. a CuBLAS implementation (sketched below).
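For reference, the library route amounts to a single GEMM call. What follows is a minimal sketch of a CuBLAS-based multiplication using the cuBLAS v2 API, not the exact code benchmarked here; the matrix size, data layout and omission of error checking are assumptions:

 #include <stdio.h>
 #include <stdlib.h>
 #include <cuda_runtime.h>
 #include <cublas_v2.h>

 #define SIZE 4096

 int main(void)
 {
   size_t bytes = (size_t)SIZE * SIZE * sizeof(float);
   float *a = malloc(bytes), *b = malloc(bytes), *c = malloc(bytes);
   float *d_a, *d_b, *d_c;
   float alpha = 1.0f, beta = 0.0f;

   /* ... fill a and b on the host ... */

   cudaMalloc((void **)&d_a, bytes);
   cudaMalloc((void **)&d_b, bytes);
   cudaMalloc((void **)&d_c, bytes);
   cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
   cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

   cublasHandle_t handle;
   cublasCreate(&handle);

   /* C = alpha*A*B + beta*C. cuBLAS assumes column-major storage; */
   /* for square row-major matrices, computing B*A in column-major */
   /* order is one common way to obtain the row-major product.     */
   cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
               SIZE, SIZE, SIZE,
               &alpha, d_b, SIZE, d_a, SIZE,
               &beta, d_c, SIZE);

   cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);

   cublasDestroy(handle);
   cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
   free(a); free(b); free(c);
   return 0;
 }

Such a file can be built with, e.g., nvcc and the cuBLAS library (nvcc mm_cublas.c -lcublas); the file name is again hypothetical.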



Figure 3 - Comparison of execution times


Figure 4 - Comparison of performance achieved


As can be seen from these results, the plain CUDA example code and the CuBLAS implementation clearly outperform the simple OpenACC example code we have implemented, even after fine-tuning the parameters of the problem to their optimal values.

Conclusions

In conclusion, OpenACC proves very helpful in achieving improved performance on a GPU-enabled infrastructure, as it requires a minimal amount of coding and implementation effort. The documentation is quite extensive, and even a beginner in parallel programming can start off and obtain remarkable results in a very short time. However, to achieve better performance, CUDA and the libraries supported and implemented so far (where such libraries are applicable to a given problem) should be preferred, as these make far better use of the underlying resources.


References

[1] http://www.openacc-standard.org/OpenACC
