Tools and libraries

From HP-SEE Wiki


Revision as of 13:38, 2 October 2012


Parallel Programming

MPICH2

MPICH2 1.5rc1, a feature-preview release, was made available for download on Aug. 15, 2012. It contains many new features but is not recommended for production systems at this time.

The current stable release for MPICH2 is 1.4.1p1. It was released on Sep. 2, 2011.

OpenMP

OpenMP, the de facto standard for parallel programming on shared-memory systems, continues to extend its reach beyond pure HPC to include embedded, multicore, and real-time systems. A new version is being developed that will include support for accelerators, error handling, thread affinity, tasking extensions, and Fortran 2003. The OpenMP consortium welcomes feedback from all interested parties and will use this feedback to improve the next version of OpenMP.

OpenMP aims to provide high-level parallel language support for a wide range of applications, from automotive and aeronautics to biotech, automation, robotics and financial analysis.

The key features that the OpenMP consortium is working on include:

Support for accelerators. A mechanism will be provided to describe regions of code where data and/or computation should be moved to any of a wide variety of computing devices. User experiences with the OpenACC directives will provide important information to the OpenMP effort. The minimum feature core required for an initial release has been defined.

Error handling. Error handling capabilities of OpenMP will be defined to improve the resiliency and stability of OpenMP applications in the presence of system-level, runtime-level, and user-defined errors. Features to cleanly abort parallel OpenMP execution have been defined, based on conditional cancellation and user-defined cancellation points.

Thread affinity. Users will be given a way to define where to execute OpenMP threads. Platform-specific data and algorithm-specific properties are separated, offering a deterministic behavior and simplicity in use. The advantages for the user are better locality, less false sharing and more memory bandwidth.

Tasking extensions. The new tasking extensions being considered are task deep synchronization, dependent tasks, reduction support for tasks, and task-only threads. Task-only threads are threads that do not take part in worksharing constructs, but just wait for tasks to be executed.

Support for Fortran 2003. The Fortran 2003 standard adds many modern computer language features. Having these features in the specification allows users to take advantage of OpenMP directives to parallelize Fortran 2003-compliant programs. This includes interoperability of Fortran and C, one of the most popular features in Fortran 2003.

“OpenACC contributes to OpenMP by providing real-world exposure to programming concepts embraced by both groups,” said Duncan Poole, President of the OpenACC standards group. “The founding OpenACC members are all members of the OpenMP Working Group on Accelerators. Even now as OpenACC compilers are released to the market, we look forward to a time when our experiences bring additional validation to the standard we are all developing within the OpenMP process. Developers will benefit from early access to OpenACC tools, and can be assured that these same companies are working to support OpenMP in a future version.”

“This year will be the fifteenth anniversary of OpenMP. Not many parallel languages can say that and we are still growing”, said Michael Wong, CEO of the OpenMP ARB, “OpenMP will be made more robust, and will cover more types of systems.”

Source: OpenMP consortium site

CUDA

CUDA 4.1 offers a trifecta of features to make parallel programming with GPUs easier and faster:

- New LLVM-based compiler delivering instant performance speed-up
- Re-designed Visual Profiler with automated performance analysis
- Hundreds of new imaging and signal processing functions

CUDA 5 Release Candidate offers:

Nsight Eclipse Edition - Develop, debug and optimize, all in one IDE

RDMA for GPUDirect - Direct communication between GPUs and other PCIe devices

GPU Library Object Linking - Libraries and plug-ins for GPU code

Dynamic Parallelism - Easily accelerate parallel nested loops starting with Tesla K20 Kepler GPUs

Pthreads

If any...

General Scientific and Numeric Libraries

CULA

CULA is a set of GPU-accelerated linear algebra libraries utilizing the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics. CULA provides a wide set of LAPACK and BLAS capabilities. CULA offers:

  • Supercomputing performance. The CULA libraries are over 10x faster than the competition.
  • Simplicity. CULA libraries require no GPU programming experience.
  • Advanced interfaces for use in C, C++, Fortran, MATLAB, and Python.
  • Cross-platform support. Available for Linux, Windows, and Mac OS X.

CULA consists of two core libraries: CULA Dense and CULA Sparse. CULA Dense provides accelerated implementations of the LAPACK and BLAS libraries for dense linear algebra, with routines for linear system solvers, singular value decompositions, and eigenproblems. CULA Sparse provides the tools necessary to rapidly solve large sparse systems using iterative methods; multiple algorithms, preconditioners, and data storage formats are supported.

CULA is available in a variety of interfaces that integrate directly into existing code. Programmers can easily call GPU-accelerated CULA from their C/C++, Fortran, MATLAB, or Python codes. This can be done with no GPU programming experience, simply by replacing existing function calls with CULA function calls. CULA takes care of all GPU memory management, but for more experienced GPU programmers there is also a "device" interface for working with GPU memory directly.

There are both free and commercial versions.

Link: http://www.culatools.com/

NAG Numerical Library

The NAG Library for SMP and Multicore, Mark 23, has been extended with an additional group of algorithms, specifically engineered for the current generation of computer systems. To take advantage of the latest processor and memory configurations, many mathematical algorithms have been re-implemented so that they are more efficient when running on multiple cores. New parallel routines, such as Particle Swarm Optimization, chosen for their special characteristics when run on parallel architecture, can be powerful when solving important classes of problems on parallel hardware. The newly engineered routines are available now with OpenMP (Open Multi-Processing) support.

New features include:

  • Parallelism in the areas of pseudorandom number generators, two-dimensional wavelets, particle swarm optimization, four and five-dimensional data interpolation routines, hierarchical mixed effects regression routines and more sparse eigensolver routines.
  • Over 70 tuned LAPACK routines.
  • Over 250 routines enhanced through calling tuned LAPACK routines (including: nonlinear equations, matrix calculations, eigenproblems, Cholesky factorization).

Virtual Research Communities Specific Libraries

Tools

PGI Accelerator compilers

Using PGI Accelerator compilers, programmers can accelerate applications on x64+accelerator platforms by adding OpenACC compiler directives to existing high-level standard-compliant Fortran and C programs and then recompiling with appropriate compiler options. The PGI Accelerator compilers automatically analyze whole program structure and data, split portions of the application between the x64 host CPU and the accelerator device as specified by user directives, and define and generate an optimized mapping of loops to automatically use the parallel cores, hardware threading capabilities and SIMD vector capabilities of modern accelerators. In addition to directives and pragmas that specify regions of code or functions to be accelerated, other directives give the programmer fine-grained control over the mapping of loops, allocation of memory, and optimization for the accelerator memory hierarchy. The PGI Accelerator compilers generate unified object files and executables that manage all movement of data to and from the accelerator while leveraging all existing host-side utilities—linker, librarians, makefiles—and require no changes to the existing standard HPC Linux/x64 programming environment.

Link:

Mellanox Application Acceleration Software

Mellanox Application Acceleration Software is a software solution that reduces latency, increases throughput, and offloads CPU cycles, enhancing the performance of HPC applications while eliminating the need for large investments in hardware infrastructure. It increases the performance and scalability of parallel programs over InfiniBand. It consists of three components - Fabric Collective Accelerator (FCA), Storage Accelerator (VSA), and Messaging Accelerator (VMA) - plus the parallel programming libraries that use these accelerations, including the new ScalableSHMEM and ScalableUPC PGAS libraries that Mellanox has recently introduced to run over InfiniBand.

Slidecast:
