CMSLTM/Scalability

From HP-SEE Wiki

= Examples =
 
== SET ==
 
{| style="width: 100%; border-collapse: separate; border-spacing: 0; border-width: 1px; border-style: solid; border-color: #000; padding: 0"
|-
|colspan="2"  align="left" style="border-style: solid; border-width: 1px" | Code author(s): Team leader Emanouil Atanassov
|-
|colspan="2"  align="left" style="border-style: solid; border-width: 1px"| Application areas: Computational Physics
|-
|align="left" style="border-style: solid; border-width: 1px"|Language: C/C++
|align="left" style="border-style: solid; border-width: 1px"|Estimated lines of code: 6000
|-
|colspan="2"  align="left" style="border-style: solid; border-width: 1px"| URL: http://wiki.hp-see.eu/index.php/SET
|-
|}
 
=== Implemented scalability actions ===
 
* Our focus in this application was to get the best possible performance out of the hardware platforms available to us. Good scalability depends mostly on avoiding bottlenecks and on using good parallel pseudorandom number generators and generators for low-discrepancy sequences. Because the application needs a lot of computing time, we took several actions to achieve this.
* The parallelization has been performed with MPI. Different versions of MPI were tested, and we found that the particular choice of MPI implementation has little effect on the scalability results. This was a fortunate outcome, as it allowed porting to the Blue Gene/P architecture without substantial changes.
* Once we had ensured that the MPI parallelization model we implemented achieves good parallel efficiency, we concentrated on getting the best possible results from a single CPU core.
* We performed profiling and benchmarking, and we tested and compared different pseudorandom number generators and low-discrepancy sequences.
* We tested various compilers and concluded that the Intel compiler currently provides the best results for the CPU version running on our Intel Xeon cluster. For the IBM Blue Gene/P architecture the obvious choice was the IBM XL compiler suite, since it has the advantage over the GNU Compiler Collection that it supports the Double Hummer (dual floating-point unit) mode of the CPUs, achieving twice the floating-point calculation speed. For the GPU-based version that we developed recently we rely on the C++ compiler supplied by NVIDIA.
* For all the chosen compilers we performed tests to select the best possible compiler and linker options. For the Intel-based cluster one important source of ideas for the options was the website of the SPEC tests, where one can see which options were used for each particular sub-test of the SPEC suite. From there we also took the idea of performing two-pass compilation, where the profiling results from the first pass are fed to the second pass of the compilation to optimise further.
* For the HPCG cluster we also measured the performance of the parallel code with and without hyperthreading. It is well known that hyperthreading does not always improve the overall speed of calculations: the floating-point units of the processor are shared between the threads, so code that is highly intensive in such computations gains nothing from hyperthreading. Our experience with other applications of the HP-SEE project provides such examples. For the SET application, however, we found about a 30% improvement when hyperthreading is turned on. This should be considered a good result, and it also shows that our overall code is efficient in the sense that most of it now consists of floating-point computations, unlike some earlier versions where the gain from hyperthreading was larger.
* For the NVIDIA-based version we found much better performance on the newer M2090 cards than on the old GTX 295. This was to be expected: the integer performance of the GTX 295 is comparable to that of the M2090, but its floating-point performance is many times lower.
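The list above mentions generators for low-discrepancy sequences without naming them. As an illustration only, the following minimal sketch builds a Halton sequence (a standard low-discrepancy sequence) and splits it between parallel processes by leapfrogging. It is not the generator used in SET; the function names and the leapfrog scheme are assumptions made for this example.

```python
def van_der_corput(index: int, base: int) -> float:
    """Radical inverse of `index` in `base`: mirror the digits around the radix point."""
    result = 0.0
    fraction = 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        result += digit * fraction
        fraction /= base
    return result


def halton_point(index: int, bases=(2, 3)) -> tuple:
    """One point of the Halton sequence: one coprime base per dimension."""
    return tuple(van_der_corput(index, base) for base in bases)


def rank_points(rank: int, num_ranks: int, count: int) -> list:
    """Hypothetical leapfrog split: rank r takes indices r+1, r+1+num_ranks, ..."""
    return [halton_point(rank + 1 + k * num_ranks) for k in range(count)]


if __name__ == "__main__":
    # Two processes, two points each: together they cover indices 1..4 exactly once.
    for rank in range(2):
        print(rank, rank_points(rank, num_ranks=2, count=2))
```

With leapfrogging every process draws from the same sequence without overlap, so the union of all points is the same set that a sequential run would use.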
 
=== Benchmark dataset ===
 
For the benchmarking we fixed a particular division of the domain into 800 by 260 points, an electric field of 15 and an evolution time of 180 femtoseconds. The computational time in this case becomes proportional to the number of Markov Chain Monte Carlo trajectories. In most tests we used 1 billion (10^9) trajectories, but for some tests we decreased that number in order to shorten the overall testing time.
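Because the run time is proportional to the number of trajectories, a shortened test can be extrapolated to the full 10^9 trajectories. The sketch below shows this, together with one simple way of dividing the trajectories evenly between MPI processes. The even block split and the 12-second timing are assumptions made for illustration, not measurements or code from SET.

```python
def trajectories_for_rank(total: int, num_ranks: int, rank: int) -> int:
    """Even split: the first (total % num_ranks) ranks take one extra trajectory."""
    base, extra = divmod(total, num_ranks)
    return base + (1 if rank < extra else 0)


def estimate_runtime(measured_seconds: float, measured_trajectories: int,
                     target_trajectories: int) -> float:
    """Linear extrapolation, valid only while time is proportional to trajectories."""
    return measured_seconds * target_trajectories / measured_trajectories


if __name__ == "__main__":
    total = 10**9
    print([trajectories_for_rank(total, 3, r) for r in range(3)])
    # Hypothetical: a 10^7-trajectory test that took 12 s, scaled to 10^9 trajectories.
    print(estimate_runtime(12.0, 10**7, total))
```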
 
=== Hardware platforms ===
 
HPCG cluster and Blue Gene/P supercomputer.
 
Four distinct hardware platforms were used:
 
* the HPCG cluster with Intel Xeon X5560 CPUs @ 2.8 GHz,
* Blue Gene/P with PowerPC CPUs,
* our GTX 295-based GPU cluster (with Intel Core i7 920 processors),
* our new M2090-based resource with Intel Xeon X5650 processors.
 
=== Execution times ===
 
[[File:Scalability_example1.png|center|500px]]
 
The execution time and parallel efficiency of the SET application are shown for Blue Gene/P (table above) and for HPCG (table below).
 
[[File:Scalability_example2.png|center|500px]]
 
[[File:SET-scalability-graph-BG-P.jpg|center|500px]]
 
=== Memory Usage ===
 
The maximum memory usage of a single computational thread is relatively small, on the order of 100 MB. On the GPUs there are several different kinds of memory, some of them rather limited. The available registers and the shared memory are especially problematic: if all available registers are used, some local variables are spilled to slow off-chip (global) memory, incurring high latency and other issues. Still, we found reasonable performance using 256 GPU threads, which is an acceptable number.
 
=== Profiling ===
 
Profiling was performed in order to improve the compiler optimisation during the second pass and also to understand what kinds of issues the application may have. As expected, we found that most of the computational time is spent computing transcendental functions such as sin, cos and exp, and generating pseudorandom numbers. In the GPU version we attempted to replace the regular sin, cos, etc. with the less accurate but more efficient versions, but we found that the gain is relatively small and not worth the loss of accuracy. For the GPU-based version we observed a relatively high percentage of divergence within warps, which means that some logical statements are resolved differently by threads of the same warp, with a substantial loss of performance. So far we have not been able to re-order the computation so as to avoid it.
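To make the notion of warp divergence concrete, the following small sketch counts the fraction of warps whose threads disagree on a branch. It is a CPU-side model for illustration only, not profiler output or SET code. The warp size of 32 is the NVIDIA value; the branch pattern is hypothetical.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def divergent_warp_fraction(branch_taken: list) -> float:
    """Fraction of warps in which threads disagree on a branch.

    `branch_taken[i]` is True if thread i takes the branch. A warp diverges
    when it contains both outcomes, so both paths are executed one after the other.
    """
    warps = [branch_taken[i:i + WARP_SIZE]
             for i in range(0, len(branch_taken), WARP_SIZE)]
    if not warps:
        return 0.0
    divergent = sum(1 for warp in warps if any(warp) and not all(warp))
    return divergent / len(warps)


if __name__ == "__main__":
    # Hypothetical pattern: every other thread takes the branch.
    alternating = [i % 2 == 0 for i in range(64)]
    print(divergent_warp_fraction(alternating))          # every warp diverges
    # Grouping threads by outcome removes the divergence in this toy case.
    print(divergent_warp_fraction(sorted(alternating)))
```

The second call shows why re-ordering the computation helps in principle: once threads with the same outcome share a warp, no warp has to execute both paths.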
 
=== Communication ===
 
Communication is not critical for this application: it takes less than 10% of the execution time.
 
=== I/O ===
 
The input for the application is small, containing the parameters of the problem at hand. The output is written at the end of the computation and its size depends on the parameters. For a reasonable size of the domain the output is on the order of several megabytes. A finer mesh is reasonable only for smaller evolution times, and the output size will be proportional to the size of the mesh.
 
=== CPU and cache ===
 
We believe that, on the Intel-based systems, most of the data used by the computations of the CPU-based version fits in the cache. For the PowerPC processors of the Blue Gene/P, some lookup operations performed when sampling the random variables use the main memory and thus incur higher latency. For the GPU-based version the situation is similar, since some of the tables are larger than the so-called shared memory. In both cases, the overall significance of these operations is less than 5%.
 
=== Analysis ===
 
From our testing we concluded that hyperthreading should be used when available, that production Tesla cards perform much better than what are essentially gaming cards such as the GTX 295, that two-pass compilation should be used with the Intel compiler targeting Intel CPUs, and that the application scales to the maximum number of cores/threads at our disposal. Future work is to find an efficient strategy for reordering the computations on the GPUs in order to avoid warp divergence. For the CPU-based version we have also developed an MPI meta-program that measures the variation and uses a genetic algorithm (from the GAlib library) to optimise the transition density. This step will be added as a pre-processing stage of the program and should speed up the overall computation by about 20%, but to do so we need to find the right balance between this stage and the main computational stage.
 

Latest revision as of 14:21, 16 January 2013


Code author(s): George Kastellakis
Application areas: Life Sciences
Language: NEURON
Estimated lines of code: 2000
URL: http://wiki.hp-see.eu/index.php/CMSLTM

Implemented scalability actions

Actions:

  • Estimation of performance by running a smaller network on different numbers of processors.
  • Our simulations were implemented in the NEURON simulator, which uses MPI for parallelization. We were not able to evaluate different MPI implementations, because it was only possible to compile NEURON with OpenMPI.
  • Different compilers: We attempted to compile NEURON using the Intel compilers, but this was not possible due to incompatibilities. Although a precompiled binary is provided on the system, NEURON needs to recompile its own modules every time a change is made, so it is not possible to run the Intel-compiled binary.

Benchmark dataset

The simulation consisted of a network configuration that was smaller than the one used in production.

Hardware platforms

Simulations were run on HPCG/BG in Sofia, Bulgaria.

Execution times

Simulation times for different numbers of simulated neurons and of computational nodes are shown in the figure below.

(Figure: Cmsltm B1.jpg)

Memory Usage

Memory usage remains stable below 2 GB, as our application is not memory-intensive.

Profiling

We could not perform profiling because the underlying platform (NEURON) lacks profiling support.

Communication

Interprocess communication is done through MPI. OpenMPI was the MPI implementation we used. We were unable to compile NEURON to work correctly with other MPI implementations.

I/O

We did not perform I/O benchmarks, as our application is not I/O-intensive.

CPU and cache

No data

Derived metrics

Analysis

Our simulations show sub-linear scaling as new computational nodes are added. We identified optimal configurations that compromise between simulation size and number of nodes.
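One way to pick such a compromise configuration is to take the largest node count whose parallel efficiency stays above a chosen threshold. The sketch below illustrates the idea. The timings and the 70% threshold are hypothetical, and this is not necessarily the criterion we applied.

```python
def largest_efficient_node_count(timings: dict, min_efficiency: float = 0.7) -> int:
    """Largest node count whose efficiency, relative to the smallest run, meets the threshold.

    `timings` maps a node count to the simulation wall-clock time in seconds.
    """
    base_nodes = min(timings)
    base_time = timings[base_nodes]
    best = base_nodes
    for nodes in sorted(timings):
        efficiency = (base_time * base_nodes) / (timings[nodes] * nodes)
        if efficiency >= min_efficiency:
            best = nodes
    return best


if __name__ == "__main__":
    # Hypothetical sub-linear scaling: each doubling of nodes helps less.
    timings = {1: 100.0, 2: 55.0, 4: 32.0, 8: 24.0}
    print(largest_efficient_node_count(timings))
```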