Examples for application scalability description
From HP-SEE Wiki
Template
Code author(s): XY
Application areas: Computational Chemistry
Language: C++ | Estimated lines of code: 200
URL: http://wiki.hp-see.eu/index.php/XY
Implemented scalability actions
Hints: these actions are taken from the previous deliverable (D8.2). Please describe here how each of the following actions has been implemented (a one-line comment per action). D8.2: http://www.hp-see.eu/files/public/HPSEE-WP8-HU-020-D8.2-f-2011-08-29.pdf (Table 18 - Summary of scalability and interoperability actions)
Actions:
- Usage of different scalability model (hints: which scalability model, parallel paradigm)
- Profiling (hints: which profilers have been used and how)
- Efficient usage of parallel libraries (hints: which libraries are used and how)
- Efficient usage of compilers (hints: which compiler flags are used and how)
Benchmark dataset
Hardware platforms
Hints: which HPC sites have you used?
Execution times
Important: please also provide a graph here showing how your application scales
Memory Usage
Profiling
Communication
I/O
CPU and cache
Derived metrics
Analysis
Hints: Please provide here a summary of how the scalability has been improved. Explain what you did and what the improvements are.
PRACE document about application scalability
http://www.prace-ri.eu/IMG/pdf/D6-2-2.pdf
You do not need to follow its format; it is provided only as a reference.
Examples
SET
Code author(s): Team leader Emanouil Atanassov
Application areas: Computational Physics
Language: C/C++ | Estimated lines of code: 6000
URL: http://wiki.hp-see.eu/index.php/SET
Implemented scalability actions
- Our focus in this application was to get the best possible performance from the hardware platforms available to us. Good scalability depends mostly on avoiding bottlenecks and on using good parallel pseudorandom number generators and generators for low-discrepancy sequences. Because the application requires a large amount of computing time, we took several actions to that end.
- The parallelization was performed with MPI. Different versions of MPI were tested, and we found that the particular choice of MPI has little effect on the scalability results. This was a fortunate outcome, as it allowed porting to the Blue Gene/P architecture without substantial changes.
- Once we had ensured that our MPI parallelization model achieves good parallel efficiency, we concentrated on achieving the best possible results on a single CPU core.
- We performed profiling and benchmarking, and we tested and compared different pseudorandom number generators and low-discrepancy sequences.
- We tested various compilers and concluded that the Intel compiler currently provides the best results for the CPU version running on our Intel Xeon cluster. For the IBM Blue Gene/P architecture the obvious choice was the IBM XL compiler suite, since it has the advantage over the GNU Compiler Collection that it supports the Double Hummer mode of the CPUs, achieving twice the floating-point calculation speed. For the GPU-based version that we developed recently we rely on the C++ compiler supplied by NVIDIA.
- For all the chosen compilers we performed tests to select the best possible compiler and linker options. For the Intel-based cluster one important source of ideas for the options was the website of the SPEC tests, where one can see which options were used for each particular sub-test of the SPEC suite. From there we also took the idea of two-pass compilation, where the profiling results from a build made in the first pass are fed to the second pass of the compilation to optimise further.
- For the HPCG cluster we also measured the performance of the parallel code with and without hyperthreading. It is well known that hyperthreading does not always improve the overall speed of calculations: the floating-point units of the processor are shared between the threads, so if the code is dominated by such computations there is no gain to be made from hyperthreading. Our experience with other applications of the HP-SEE project provides such examples. For the SET application, however, we found about 30% improvement when hyperthreading is turned on. This should be considered a good result. It also shows that our overall code is efficient in the sense that most of it now consists of floating-point computations, unlike some earlier versions where the gain from hyperthreading was larger.
- For the NVIDIA-based version we found much better performance with the newer M2090 cards than with the old GTX 295. This was to be expected: the integer performance of the GTX 295 is comparable to that of the M2090, but its floating-point performance is many times lower.
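To illustrate what a generator for a low-discrepancy sequence looks like, here is a minimal sketch of the van der Corput radical inverse and a two-dimensional Halton point built from it. This is a textbook construction chosen for illustration; it is an assumption, not the generator used in SET, and the names `radical_inverse`, `Point2` and `halton2` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Radical inverse of index i in the given base: the digits of i are
// mirrored around the radix point (van der Corput sequence).
// Illustrative only; not the generator used in SET.
double radical_inverse(std::uint64_t i, unsigned base) {
    assert(base >= 2);
    const double inv_base = 1.0 / static_cast<double>(base);
    double scale = inv_base;   // weight of the current digit
    double value = 0.0;
    while (i > 0) {
        value += static_cast<double>(i % base) * scale;
        i /= base;
        scale *= inv_base;
    }
    return value;
}

// Hypothetical helper: point i of a 2-D Halton sequence (bases 2 and 3).
struct Point2 {
    double x;
    double y;
};

Point2 halton2(std::uint64_t i) {
    return Point2{radical_inverse(i, 2), radical_inverse(i, 3)};
}
```

Because each point depends only on its index, one simple way to use such a sequence in parallel is to give every process a disjoint range of indices; whether SET does this is not stated here.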
Benchmark dataset
For the benchmarking we fixed a particular division of the domain into 800 by 260 points, an electric field of 15 and an evolution time of 180 femtoseconds. In this case the computational time becomes proportional to the number of Markov Chain Monte Carlo trajectories. In most tests we used 1 billion (10^9) trajectories, but for some tests we decreased that number in order to shorten the overall testing time.
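Since the run time is proportional to the number of trajectories, an even split of the trajectories over the processes gives a balanced load. The following is a minimal sketch of such a split, assuming the trajectories are independent and can be assigned by count alone; the function name `trajectories_for_rank` is hypothetical and the actual distribution scheme in SET may differ.

```cpp
#include <cassert>
#include <cstdint>

// Number of trajectories assigned to process `rank` when `total`
// independent trajectories are divided as evenly as possible among
// `nranks` processes. The first (total % nranks) ranks get one extra.
// Hypothetical helper for illustration; not taken from the SET code.
std::uint64_t trajectories_for_rank(std::uint64_t total, int nranks, int rank) {
    assert(nranks > 0 && rank >= 0 && rank < nranks);
    const std::uint64_t n = static_cast<std::uint64_t>(nranks);
    const std::uint64_t base = total / n;
    const std::uint64_t remainder = total % n;
    return base + (static_cast<std::uint64_t>(rank) < remainder ? 1 : 0);
}
```

In an MPI program the `rank` and `nranks` arguments would typically come from `MPI_Comm_rank` and `MPI_Comm_size`.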
Hardware platforms
HPCG cluster and Blue Gene/P supercomputer.
Four distinct hardware platforms were used:
- the HPCG cluster with Intel Xeon X5560 CPUs @ 2.8 GHz,
- Blue Gene/P with PowerPC CPUs,
- our GTX 295-based GPU cluster (with Intel Core i7 920 processors),
- our new M2090-based resource with Intel Xeon X5650 processors.
Execution times
The execution time and parallel efficiency of the SET application are compared for HPCG (table below) and Blue Gene/P (table above).
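For readers filling in their own tables, speedup and parallel efficiency can be derived from measured wall-clock times as in the sketch below. It uses the usual definitions relative to a reference run, S = T_ref / T_p and E = (T_ref * p_ref) / (T_p * p); the function names are hypothetical, and no timings from the SET measurements are reproduced here.

```cpp
#include <cassert>

// Speedup of a run taking t_p seconds relative to a reference run
// taking t_ref seconds: S = t_ref / t_p.
double speedup(double t_ref, double t_p) {
    assert(t_ref > 0.0 && t_p > 0.0);
    return t_ref / t_p;
}

// Parallel efficiency of a run on p cores (time t_p) relative to a
// reference run on p_ref cores (time t_ref):
// E = (t_ref * p_ref) / (t_p * p). With p_ref = 1 this is S / p.
double parallel_efficiency(double t_ref, int p_ref, double t_p, int p) {
    assert(t_ref > 0.0 && t_p > 0.0 && p_ref > 0 && p > 0);
    return (t_ref * static_cast<double>(p_ref)) /
           (t_p * static_cast<double>(p));
}
```

As a made-up example, a reference run of 100 s on 1 core and a run of 25 s on 8 cores give a speedup of 4 and an efficiency of 0.5.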
Memory Usage
The maximum memory usage of a single computational thread is relatively small, on the order of 100 MB. On the GPUs there are several different kinds of memory, some of them rather limited. The available registers and the shared memory are especially problematic: if all available registers are used, there is a risk that some local variables will be spilled to global memory, incurring high latency and other issues. Still, we found reasonable performance using 256 GPU threads, which is an acceptable number.
Profiling
Profiling was performed in order to improve the compiler optimisation during the second pass and also to understand what kinds of issues the application may have. As expected, we found that most of the computational time is spent in computing transcendental functions such as sin, cos and exp, and in generating pseudorandom numbers. In the GPU version we attempted to replace the regular sin, cos, etc. with the less accurate but more efficient versions, but we found that the gain is relatively small and not worth the loss of accuracy. For the GPU-based version we observed a relatively high percentage of divergence within warps, which means that some conditional statements are resolved differently by threads of the same warp, causing a substantial loss of performance. So far we have not been able to reorder the computation so as to avoid it.
Communication
Communication is not critical for this application: it takes less than 10% of the execution time.
I/O
The input for the application is small, containing only the parameters of the problem at hand. The output is written at the end of the computation and its size depends on the parameters. For a reasonable size of the domain the output is on the order of several megabytes. A finer mesh is reasonable only for smaller evolution times, and the output size will be proportional to the size of the mesh.
CPU and cache
We believe that on the Intel-based platforms most of the computations of the CPU-based version fit in the cache. For the PowerPC processors of the Blue Gene/P some lookup operations when sampling the random variables use the main memory and thus incur higher latency. For the GPU-based version the situation is similar, since some of the tables are larger than the so-called shared memory. In both cases, the overall significance of these operations is less than 5%.
Analysis
From our testing we concluded that hyperthreading should be used when available, that production Tesla cards have much higher performance than what are essentially gaming cards like the GTX 295, that two passes of compilation should be used for the Intel compiler targeting Intel CPUs, and that the application is scalable to the maximum number of cores/threads at our disposal. For future work it remains to find an efficient strategy for reordering the computations on the GPUs in order to avoid warp divergence. For the CPU-based version we have also developed an MPI meta-program that measures the variation and uses a genetic algorithm (from the GAlib library) to optimise the transition density. This step will be added as a pre-processing stage of the program in order to provide a speedup on the order of 20% to the overall computations, but to do so we need to find the right balance between this stage and the main computational stage.