Tools and libraries

Parallel Programming

MPI-3

The new MPI-3 specification is expected by the end of 2012 or the beginning of 2013.

The new specification will include better support for scalability (performance and robustness), multi-core and cluster architectures, fault tolerance, and hybrid programming. Non-blocking collective operations and sparse collective operations on process topologies will be included in the standard. Updates to the one-sided programming model are also expected.
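
For illustration, a minimal C sketch of one of the new features, a non-blocking collective operation, is shown below; it assumes an MPI-3 capable implementation (e.g. MPICH 3.0 or later) and simply overlaps an MPI_Iallreduce with independent local work:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, size;
      double local = 1.0, global = 0.0;
      MPI_Request req;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* MPI-3 non-blocking collective: start the reduction ... */
      MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

      /* ... independent computation could be overlapped here ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);
      if (rank == 0)
          printf("sum over %d ranks = %f\n", size, global);

      MPI_Finalize();
      return 0;
  }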

More information:

MPICH-2

As of November 2012, the MPICH2 project renamed itself to simply "MPICH". MPICH v3.0 implements the MPI-3.0 standard.

A new stable release of MPICH, 3.0.3, is now available for download. This release adds several performance features for MPI-RMA and fixes several bugs present in 3.0.2.

MPICH2 1.5rc1, a feature preview release, was made available for download on Aug. 15, 2012. It contains many new features but is not recommended for production systems at this time.

The last stable release of the MPICH2 series is 1.4.1p1, released on Sep. 2, 2011.

OpenMP

OpenMP, the de facto standard for parallel programming on shared-memory systems, continues to extend its reach beyond pure HPC to include embedded, multicore and real-time systems. A new version is being developed that will include support for accelerators, error handling, thread affinity, tasking extensions and Fortran 2003. The OpenMP consortium welcomes feedback from all interested parties and will use this feedback to improve the next version of OpenMP.

OpenMP aims to provide high-level parallel language support for a wide range of applications, from automotive and aeronautics to biotech, automation, robotics and financial analysis.

The key features that the OpenMP consortium is working on include:

Support for accelerators. A mechanism will be provided to describe regions of code where data and/or computation should be moved to any of a wide variety of computing devices. User experiences with the OpenACC directives will provide important information to the OpenMP effort. The minimum feature core required for an initial release has been defined.

Error handling. Error handling capabilities of OpenMP will be defined to improve the resiliency and stability of OpenMP applications in the presence of system-level, runtime-level, and user-defined errors. Features to cleanly abort parallel OpenMP execution have been defined, based on conditional cancellation and user-defined cancellation points.
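
A minimal C sketch of this cancellation mechanism, as it was later standardized in OpenMP 4.0, is shown below; it assumes an OpenMP compiler with cancellation support and the environment variable OMP_CANCELLATION set to true at run time (the searched value is purely illustrative):

  #include <stdio.h>

  int main(void)
  {
      const int n = 1000000;
      int found = -1;

      #pragma omp parallel for shared(found)
      for (int i = 0; i < n; i++) {
          if (i == 12345) {                  /* hypothetical per-element test */
              #pragma omp critical
              found = i;
              #pragma omp cancel for         /* request cancellation of the loop */
          }
          #pragma omp cancellation point for /* threads check for cancellation here */
      }

      printf("found = %d\n", found);
      return 0;
  }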

Thread affinity. Users will be given a way to define where to execute OpenMP threads. Platform-specific data and algorithm-specific properties are separated, offering a deterministic behavior and simplicity in use. The advantages for the user are better locality, less false sharing and more memory bandwidth.

Tasking extensions. The new tasking extensions being considered are task deep synchronization, dependent tasks, reduction support for tasks, and task-only threads. Task-only threads are threads that do not take part in worksharing constructs, but just wait for tasks to be executed.

Support for Fortran 2003. The Fortran 2003 standard adds many modern computer language features. Having these features in the specification allows users to take advantage of OpenMP directives to parallelize Fortran 2003-compliant programs. This includes interoperability of Fortran and C, which is one of the most popular features in Fortran 2003.

“OpenACC contributes to OpenMP by providing real-world exposure to programming concepts embraced by both groups,” said Duncan Poole, President of the OpenACC standards group. “The founding OpenACC members are all members of the OpenMP Working Group on Accelerators. Even now as OpenACC compilers are released to the market, we look forward to a time when our experiences bring additional validation to the standard we are all developing within the OpenMP process. Developers will benefit from early access to OpenACC tools, and can be assured that these same companies are working to support OpenMP in a future version.”

“This year will be the fifteenth anniversary of OpenMP. Not many parallel languages can say that and we are still growing”, said Michael Wong, CEO of the OpenMP ARB, “OpenMP will be made more robust, and will cover more types of systems.”

Source: OpenMP consortium site

OpenMP 4.0

The OpenMP consortium has released a new version of the OpenMP API specification, version 4.0. The new standard extends the reach of OpenMP beyond shared-memory systems, with support for DSPs, real-time systems, and accelerators (such as GPUs and Xeon Phi). The OpenMP API aims to provide high-level parallel language support for a wide range of applications, from automotive and aeronautics to biotech, automation, robotics and financial analysis.

New features in the OpenMP 4.0 API include:

  • Support for accelerators. The OpenMP 4.0 API specification effort included significant participation by all the major vendors in order to support a wide variety of compute devices. The OpenMP API provides mechanisms to describe regions of code where data and/or computation should be moved to another computing device. Several prototypes of the accelerator proposal have already been implemented (see the sketch after this list).
  • SIMD constructs to vectorize both serial and parallelized loops. With the advent of SIMD units in all major processor chips, portable support for accessing them is essential. The OpenMP 4.0 API provides mechanisms to describe when multiple iterations of a loop can be executed concurrently using SIMD instructions and to describe how to create versions of functions that can be invoked across SIMD lanes.
  • Error handling. The OpenMP 4.0 API defines error handling capabilities to improve the resiliency and stability of OpenMP applications in the presence of system-level, runtime-level, and user-defined errors. Features to abort parallel OpenMP execution cleanly have been defined, based on conditional cancellation and user-defined cancellation points.
  • Thread affinity. The OpenMP 4.0 API provides mechanisms to define where to execute OpenMP threads. Platform-specific data and algorithm-specific properties are separated, offering deterministic behavior and simplicity in use. The advantages for the user are better locality, less false sharing and more memory bandwidth.
  • Tasking extensions. The OpenMP 4.0 API provides several extensions to its task-based parallelism support. Tasks can be grouped to support deep task synchronization, and task groups can be aborted to reflect completion of cooperative tasking activities such as search. Task-to-task synchronization is now supported through the specification of task dependencies.
  • Support for Fortran 2003. The Fortran 2003 standard adds many modern computer language features. Having these features in the specification allows users to parallelize Fortran 2003-compliant programs. This includes interoperability of Fortran and C, which is one of the most popular features in Fortran 2003.
  • User-defined reductions. Previously, the OpenMP API only supported reductions with base language operators and intrinsic procedures. With the OpenMP 4.0 API, user-defined reductions are now also supported.
  • Sequentially consistent atomics. A clause has been added to allow a programmer to enforce sequential consistency when a specific storage location is accessed atomically.
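
The sketch below (plain C, with a hypothetical array size and values) illustrates a few of these features: a target region offloading data and work to an accelerator, combined with a teams/distribute/parallel for simd loop and a reduction. If no device is available, execution falls back to the host:

  #include <stdio.h>

  #define N (1 << 20)

  int main(void)
  {
      static float x[N];
      double sum = 0.0;

      for (int i = 0; i < N; i++)
          x[i] = 0.5f;

      /* Offload the data and the loop to an available accelerator device;
         iterations are spread over teams, threads and SIMD lanes. */
      #pragma omp target map(to: x[0:N]) map(tofrom: sum)
      #pragma omp teams distribute parallel for simd reduction(+: sum)
      for (int i = 0; i < N; i++)
          sum += x[i];

      printf("sum = %f\n", sum);
      return 0;
  }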

Links:

CUDA

CUDA 5.5 has been released. Major improvements include native compilation on ARM-based platforms, enhanced Hyper-Q support and MPI workload prioritization, and new guided performance analysis. Changes have also been made to allow single-GPU debugging on Linux and to support a static CUDA runtime library.

The NVIDIA CUDA parallel computing platform and programming model delivers, for the first time, support for ARM-based platforms. Available as a free download, the CUDA 5.5 release brings the power of GPU-accelerated computing to ARM platforms, a rapidly growing processor ecosystem approximately 10 times larger than the x86 CPU-based market. The release provides programmers with a robust, easy-to-use platform to develop advanced science, engineering, mobile and high performance computing (HPC) applications on ARM and x86 CPU-based systems. In addition to providing native support for ARM platforms, the release delivers a number of new advanced performance and productivity features. It also offers a full suite of programming tools, GPU-accelerated math libraries and documentation for both x86- and ARM-based platforms.

CUDA 5 provides a host of new features and capabilities to make the development of GPU-accelerated applications faster and easier. Advanced features and tools in this release include GPU object linking, which allows developers to create libraries that run entirely on the GPU, and NVIDIA Nsight Eclipse Edition. CUDA 5 also includes a new integrated documentation system and all-in-one installers that make it easier for developers to get started with parallel programming. Support for the Kepler computing architecture is also included.

CUDA 5 offers:

  • NVIDIA Nsight Eclipse Edition for Linux and Mac
  • RDMA for GPUDirect - direct communication between GPUs and other PCIe devices
  • GPU Library Object Linking - libraries and plug-ins for GPU code
  • Dynamic Parallelism - an easy way to accelerate parallel nested loops, starting with Tesla K20 Kepler GPUs (see the sketch after this list)
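
A minimal sketch of Dynamic Parallelism is shown below; it assumes a compute capability 3.5+ device (e.g. Tesla K20) and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true:

  #include <cstdio>

  __global__ void child_kernel(const float *data, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)
          printf("child: element %d = %f\n", i, data[i]);
  }

  __global__ void parent_kernel(const float *data, int n)
  {
      /* With Dynamic Parallelism a kernel can launch nested grids directly
         from the GPU, without returning control to the host. */
      if (threadIdx.x == 0)
          child_kernel<<<(n + 31) / 32, 32>>>(data, n);
  }

  int main()
  {
      const int n = 8;
      float host[8] = {0, 1, 2, 3, 4, 5, 6, 7};
      float *dev;

      cudaMalloc(&dev, n * sizeof(float));
      cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

      parent_kernel<<<1, 1>>>(dev, n);
      cudaDeviceSynchronize();

      cudaFree(dev);
      return 0;
  }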

CUDA 4.1 offers a trifecta of features to make parallel programming with GPUs easier and faster:

  • New LLVM-based compiler delivering instant performance speed-up
  • Re-designed Visual Profiler with automated performance analysis
  • Hundreds of new imaging and signal processing functions

CUDA in Python

Together with Continuum Analytics, NVIDIA announced Python support for CUDA. Python's extensive libraries and advanced features make it ideal for a broad range of HPC science, engineering and big data analytics applications. Programmers who use the Python language can now speed up their HPC and big data analytics applications with GPU acceleration by using the NVIDIA CUDA parallel programming model. Continuum Analytics provides a new Python CUDA compiler, NumbaPro, as part of its high-performance Python suite, Anaconda Accelerate. Continuum Analytics' Python development environment uses LLVM and the NVIDIA CUDA compiler software development kit to deliver GPU-accelerated application capabilities to Python programmers.

Links:

CUDA and MPI

There are many reasons to combine the MPI and CUDA parallel programming approaches. The combination makes it possible to solve problems whose data size is too large to fit into the memory of a single GPU, or that would require an unreasonably long compute time on a single node. Other reasons are to accelerate an existing MPI application with GPUs or to enable an existing single-node multi-GPU application to scale across multiple nodes. The main problem in this combined approach used to be the handling of memory transfers, since a transfer from a GPU on one node to a GPU on another node had to pass through CPU memory. CUDA Toolkit 4.0 and later introduced Unified Virtual Addressing, and CUDA-aware MPI implementations take advantage of it to perform these memory transfers easily and efficiently.

The following MPI implementations support CUDA:

  • MVAPICH2 1.8/1.9b
  • OpenMPI 1.7 (beta)
  • CRAY MPI (MPT 5.6.2)
  • IBM Platform MPI (8.3)

Applications that use CUDA-aware MPI benefit for two reasons:

  • all operations that are required to carry out the message transfer can be pipelined
  • acceleration technologies like GPUDirect can be utilized by the MPI library transparently to the user

NVIDIA GPUDirect technologies provide high-bandwidth, low-latency communications with NVIDIA GPUs. GPUDirect is an umbrella name used to refer to several specific technologies. In the context of MPI the GPUDirect technologies cover all kinds of inter-rank communication: intra-node, inter-node, and RDMA inter-node communication.
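
A minimal sketch of such a transfer is shown below; it assumes an MPI library built with CUDA support (one of the implementations listed above), so that device pointers can be passed directly to the MPI calls:

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      const int n = 1 << 20;
      int rank;
      float *d_buf;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      cudaMalloc((void **)&d_buf, n * sizeof(float));

      if (rank == 0) {
          /* The GPU buffer is handed straight to MPI; a CUDA-aware library
             pipelines the transfer and may use GPUDirect internally. */
          MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      }

      cudaFree(d_buf);
      MPI_Finalize();
      return 0;
  }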

Links:

Pthreads

If any...

AMD CodeXL toolkit

AMD presented the AMD APP SDK 2.8 and the AMD CodeXL unified tool suite to provide developers with the tools and resources needed to accelerate applications with AMD accelerated processing units (APUs) and graphics processing units (GPUs). The APP SDK 2.8 and CodeXL tool suite provide access to code samples, white papers, libraries and tools to leverage the processing power of heterogeneous compute with OpenCL, C++ and DirectCompute.

AMD CodeXL is a comprehensive tool suite that enables users to develop, debug and tune GPU applications. AMD CodeXL is available both as a Visual Studio plugin and as a standalone user interface application for Windows and Linux. It includes powerful GPU debugging, comprehensive GPU and CPU profiling, and static OpenCL kernel analysis capabilities. AMD CodeXL increases developer productivity by helping developers identify programming errors and performance issues in their applications quickly and easily. Developers can now debug, profile and analyze their applications with a full system-wide view across AMD APUs, GPUs and CPUs.

AMD CodeXL enables:

  • CPU profiling
  • GPU debugging
  • GPU profiling
  • Static kernel analysis

Xcelerit SDK

The Xcelerit SDK is a software toolkit that boosts the performance of compute-intensive applications while preserving programmer productivity.

From a simple sequential source code – free from any parallel constructs, low-level compiler directives, etc. – the Xcelerit SDK can generate highly optimised code for a variety of target processors. These include multi-core CPUs, GPUs, and combinations of these in a grid. The programming interface is intuitive and suitable for many classes of algorithms, e.g. Monte-Carlo, linear algebra, or spectral analysis.

The Xcelerit SDK is targeted at domain specialists using heavy computations in their day-to-day work, such as mathematicians, researchers, engineers, or scientists. Users can focus on the core of their algorithm, rather than addressing parallelism and low-level hardware details, and yet they can generate high performance programs.

General Scientific and Numeric Libraries

CULA

CULA is a set of GPU-accelerated linear algebra libraries utilizing the NVIDIA CUDA parallel computing architecture to dramatically improve the computation speed of sophisticated mathematics. CULA provides a wide set of LAPACK and BLAS capabilities. CULA offers:

  • Supercomputing performance. The CULA libraries are over 10x faster than the competition.
  • Simplicity. CULA libraries require no GPU programming experience.
  • Advanced interfaces for use from C, C++, Fortran, MATLAB, and Python.
  • Cross-platform support. Available for Linux, Windows, and Mac OS X.

CULA consists of two core libraries: CULA Dense and CULA Sparse. CULA Dense provides accelerated implementations of the LAPACK and BLAS libraries for dense linear algebra, containing routines for system solvers, singular value decompositions, and eigenproblems. CULA Sparse provides the tools necessary to rapidly solve large sparse systems using iterative methods; multiple algorithms, preconditioners, and data storage formats are supported.

CULA is available in a variety of interfaces to integrate directly into existing code. Programmers can easily call GPU-accelerated CULA from their C/C++, Fortran, MATLAB, or Python codes. This can all be done with no GPU programming experience, simply by replacing existing function calls with CULA function calls, as in the sketch below. CULA takes care of all GPU memory management, but for more experienced GPU programmers there is also a "device" interface for working with GPU memory directly.
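
A minimal, hedged sketch of such a replacement is shown below: it solves a small dense system with the GPU-accelerated equivalent of LAPACK's sgesv. The routine and header names used here (cula_lapack.h, culaInitialize, culaSgesv, culaShutdown) follow the CULA documentation but should be checked against the installed release:

  #include <stdio.h>
  #include <cula_lapack.h>   /* header name may differ in older CULA releases */

  int main(void)
  {
      const int n = 3, nrhs = 1;
      /* Column-major storage, as in LAPACK. */
      culaFloat a[9] = { 4, 2, 1,  2, 5, 3,  1, 3, 6 };
      culaFloat b[3] = { 1, 2, 3 };
      culaInt ipiv[3];

      if (culaInitialize() != culaNoError)
          return 1;

      /* GPU-accelerated drop-in for LAPACK's sgesv: solves A * x = b. */
      culaStatus status = culaSgesv(n, nrhs, a, n, ipiv, b, n);
      if (status != culaNoError)
          printf("culaSgesv failed with status %d\n", (int)status);
      else
          printf("x = %f %f %f\n", b[0], b[1], b[2]);

      culaShutdown();
      return 0;
  }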

There are both free and commercial versions.

Link: http://www.culatools.com/

NAG Numerical Library

The NAG Library for SMP and Multicore, Mark 23, has been extended with an additional group of algorithms, specifically engineered for the current generation of computer systems. To take advantage of the latest processor and memory configurations, many mathematical algorithms have been re-implemented so that they are more efficient when running on multiple cores. New parallel routines, such as Particle Swarm Optimization, chosen for their special characteristics when run on parallel architecture, can be powerful when solving important classes of problems on parallel hardware. The newly engineered routines are available now with OpenMP (Open Multi-Processing) support.

New features include:

  • Parallelism in the areas of pseudorandom number generators, two-dimensional wavelets, particle swarm optimization, four and five-dimensional data interpolation routines, hierarchical mixed effects regression routines and more sparse eigensolver routines.
  • Over 70 tuned LAPACK routines
  • Over 250 routines enhanced through calling tuned LAPACK routines (including: nonlinear equations, matrix calculations, eigenproblems, Cholesky factorization).

Charm++

A new version, 6.4, of Charm++ has been released. New features include:

  • Added new attribute to enable parameter-marshaled recipients of reduction messages (reductiontarget attribute)
  • Enabled pipelining of large messages in CkMulticast by default
  • New load balancers - TreeMatch, Zoltan, Scotch graph partitioning based (ScotchLB and Refine and Topo variants), RefineSwap
  • Load balancing improvements:
   * Allow reduced load database size using floats instead of doubles
   * Improved hierarchical balancer
   * Periodic balancing adapts its interval dynamically
   * User code can request a callback when migration is complete
   * Instrumentation records multicasts
  • Chare arrays support options that can enable some optimizations
  • New 'completion detection' library for parallel process termination detection, when the need for modularity excludes full quiescence detection
  • New 'mesh streamer' library for fine-grain many-to-many collectives, handling message bundling and network topology
  • Memory pooling allocator performance and resource usage improved substantially

Link:

Odeint

Odeint is a modern C++ library for numerically solving ordinary differential equations (ODEs). The library is developed in a generic way using template metaprogramming, which leads to extraordinarily high flexibility at top performance. Odeint can solve ODEs on CUDA-enabled GPUs by using the Thrust library: the programmer uses thrust::device_vector as the state type and thrust_algebra/thrust_operations when defining the stepper, and odeint then runs all computations on the GPU.
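
A minimal sketch of this setup is shown below (C++/CUDA with Thrust, compiled with nvcc; exact header and template-parameter details may differ between odeint versions); it integrates the element-wise decay dx/dt = -x entirely on the GPU:

  #include <thrust/device_vector.h>
  #include <thrust/transform.h>
  #include <thrust/functional.h>
  #include <boost/numeric/odeint.hpp>
  #include <boost/numeric/odeint/external/thrust/thrust.hpp>

  namespace odeint = boost::numeric::odeint;
  typedef thrust::device_vector<double> state_type;

  /* Element-wise decay dx/dt = -x, evaluated on the device. */
  struct decay_system {
      void operator()(const state_type &x, state_type &dxdt, double /*t*/) const {
          thrust::transform(x.begin(), x.end(), dxdt.begin(), thrust::negate<double>());
      }
  };

  int main()
  {
      state_type x(1024, 1.0);   /* initial condition, stored on the GPU */

      /* The stepper is parameterized with thrust_algebra/thrust_operations so
         that all vector arithmetic inside the integrator also runs on the GPU. */
      odeint::runge_kutta4<state_type, double, state_type, double,
                           odeint::thrust_algebra, odeint::thrust_operations> stepper;

      odeint::integrate_const(stepper, decay_system(), x, 0.0, 1.0, 0.01);
      return 0;
  }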

Virtual Research Communities Specific Libraries

OpenCV

OpenCV (Open Source Computer Vision Library) is an open-source BSD-licensed library that includes several hundred computer vision algorithms. A new version, OpenCV 2.4.3, is finally out and available at SourceForge (http://sourceforge.net/projects/opencvlibrary), or alternatively at https://github.com/Itseez/opencv/tree/2.4

Main improvements include:

  • Significantly improved and optimized Android and iOS ports
  • Greatly extended GPU (i.e. CUDA-based) module
  • The brand new ocl (OpenCL-based) module that unleashes GPU power also for AMD and Intel GPU users
  • Much better performance on many-core systems out of the box
  • About 130 bugs have been fixed since 2.4.2

OpenCL support is not included in the binary package, since there are different SDKs, and it is not turned on by default. The OpenCL module may not be stable and fully functional yet.

SPECIAL NOTE for Ubuntu x86 12.04 users: By default, OpenCV is now built with the -O2 optimization flag instead of -O3 on 32-bit Linux. The compiler in the 32-bit version of Ubuntu 12.04 produces incorrect code with -O3, so it is strongly recommended not to use this flag.

OpenFOAM

A new version, 2.2.0, of the OpenFOAM open source CFD toolbox has been released. Version 2.2.0 is a major new version containing significant developments, which will be presented at the OpenFOAM User Conference 2013. New features include:

  • Improved meshing and mesh tools - new options added to existing meshing tools
  • snappyHexMesh updated
  • Pre-processing improvement - extensions to macro expansion capability, capability to group patches, and support for reading VTK format files
  • New numerical methods - run-time selectable bounded time and convection schemes, improved cell value reconstruction, a new framework for coupling solution over multiple regions
  • Improved matrix solvers - block-matrix and solver framework, coupled solution of vector, tensor and other multi-component entities
  • Better run-time control during simulation
  • Runtime-selectable physics - a new framework has been introduced to allow users to select any physics that can be represented as sources or constraints on the governing equations
  • Major changes in thermophysical modelling - improved handling of multiple materials and fluid phases, and others.
  • A number of changes in physical modelling - modelling of surface films, porosity, turbulence, combustion and particle tracking
  • New boundary conditions
  • Better post-processing
  • Better documentation - a new system of documentation has been introduced within the HTML source documentation, generated by Doxygen

Links:

ROOT

Current version: 5.34/05

A new ROOT 6 is being planned; the "ROOT 6 - The Next Generation" workshop took place on 11-14 March in Saas-Fee, Switzerland. For almost two decades, ROOT has established itself as the framework for HENP data processing and analysis. The LHC upgrade program and the new experiments being designed at CERN and elsewhere will pose even more formidable challenges in terms of data complexity and size. The new parallel and heterogeneous computing architectures that are either announced or already available will call for a deep rethinking of the code and the data structures to exploit them efficiently.

Links:

Gaussian

A new revision, C.01, of the Gaussian 09 software has been released. The complete list of new features, changes and bug fixes can be found here:

GROMACS

GROMACS has a new version, 4.6, and a minor update, 4.6.1. Major improvements in the new version include:

  • New Verlet non-bonded scheme which, by default, uses exact cut-offs and a buffered pair list.
  • Multi-level hybrid parallelization (MPI + OpenMP + CUDA)
  • New x86 SIMD non-bonded kernels for both the usual cut-off scheme (the group scheme) and the new Verlet scheme, using x86 SIMD intrinsics (no more assembly code)
  • Various improvements in CPU affinity setting and load balancing
  • Improved PME spread, gather, solve and FFT communication, total improvement ~25% on x86.
  • New, advanced free energy sampling techniques.
  • AdResS adaptive resolution simulation support.
  • Enforced rotation ("rotational pulling")
  • Build configuration now uses CMake, configure+autoconf/make no longer supported.
  • Improved regression tests; these can now be run directly from the build tree using make check

AMBER

Amber12 is available. It includes various improvements in the force field domain, improved models, simplified installation and automatic update support. Various kernels have been offloaded to the GPU, including Temperature Replica Exchange, Isotropic Periodic Sum, Accelerated Molecular Dynamics and support for various harmonic restraints based on the use of NMRopt. GPU support is extended to the NVIDIA Kepler architecture.

AmberTools13 was released in April 2013. AmberTools consists of several independently developed packages that work well by themselves, and with Amber itself. The suite can also be used to carry out complete molecular dynamics simulations (using NAB or mdgx), with either explicit water or generalized Born solvent models.

Links:

NAMD

NAMD 2.9 has been released. The main advances in the new version are:

  • Improved (temperature/Hamiltonian) replica-exchange implementation
  • Replica-based umbrella sampling via collective variables module
  • Optimized shared-memory single-node and multiple-node CUDA builds
  • CUDA GPU-accelerated generalized Born implicit solvent (GBIS) model
  • CUDA GPU-accelerated energy evaluation and minimization
  • Native CRAY XE/XK uGNI network layer implementation
  • Faster grid forces and lower-accuracy "lite" implementation
  • Hybrid MD with knowledge-based Go forces to drive folding
  • Linear combination of pairwise overlaps (LCPO) SASA for GBIS model
  • Weeks-Chandler-Andersen decomposition for alchemical FEP simulations
  • Collective variables module improvements
  • Updates to CUDA 4.0 and Tcl 8.5.9, plus option to build with FFTW 3
  • Enhanced performance and scalability

Tools

PGI Accelerator compilers

Using PGI Accelerator compilers, programmers can accelerate applications on x64+accelerator platforms by adding OpenACC compiler directives to existing high-level standard-compliant Fortran and C programs and then recompiling with appropriate compiler options. The PGI Accelerator compilers automatically analyze whole program structure and data, split portions of the application between the x64 host CPU and the accelerator device as specified by user directives, and define and generate an optimized mapping of loops to automatically use the parallel cores, hardware threading capabilities and SIMD vector capabilities of modern accelerators. In addition to directives and pragmas that specify regions of code or functions to be accelerated, other directives give the programmer fine-grained control over the mapping of loops, allocation of memory, and optimization for the accelerator memory hierarchy. The PGI Accelerator compilers generate unified object files and executables that manage all movement of data to and from the accelerator while leveraging all existing host-side utilities—linker, librarians, makefiles—and require no changes to the existing standard HPC Linux/x64 programming environment.
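
For illustration, a minimal C sketch of this directive-based approach is shown below (hypothetical array size; compiled with something like pgcc -acc -Minfo=accel):

  #include <stdio.h>

  #define N 1000000

  int main(void)
  {
      static float a[N], b[N], c[N];

      for (int i = 0; i < N; i++) {
          a[i] = 1.0f;
          b[i] = 2.0f;
      }

      /* The compiler moves the data to the accelerator, generates device code
         for the loop and copies the result back, guided only by the directive. */
      #pragma acc kernels copyin(a, b) copyout(c)
      for (int i = 0; i < N; i++)
          c[i] = a[i] + b[i];

      printf("c[0] = %f\n", c[0]);
      return 0;
  }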

Link:

Mellanox Application Acceleration Software

Mellanox Application Acceleration Software is a software solution that reduces latency, increases throughput, and offloads CPU cycles, enhancing the performance of HPC applications while eliminating the need for large investments in hardware infrastructure. It increases the performance and scalability of parallel programs over InfiniBand. It consists of three components - the Fabric Collective Accelerator (FCA), the Storage Accelerator (VSA) and the Messaging Accelerator (VMA) - and the parallel programming libraries that use these accelerations, including the new ScalableSHMEM and ScalableUPC PGAS libraries that Mellanox has recently introduced to run over InfiniBand.

Slidecast:

Portable Hardware Locality (hwloc)

The Portable Hardware Locality (hwloc) software tool provides a portable abstraction (across OSes, versions, architectures, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. With the extensive development of multicore architectures, most cluster nodes have complex hardware topologies. Nowadays, cluster nodes consist of several CPU cores, hierarchical caches, and multiple threads per core, making their topology far from flat. Such complex and hierarchical topologies have a strong impact on application performance, and the developer must take hardware affinities into account when trying to exploit the actual hardware performance.

Hwloc builds a hierarchical tree that the application may walk to retrieve information about the hardware or to bind tasks properly. It also gathers various system attributes such as cache and memory information, as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs. It primarily aims at helping applications gather information about modern computing hardware so as to exploit it accordingly and efficiently. It enables better I/O data transfers thanks to processes and data being properly placed on the part of the host that is closest to the devices.

Hwloc supports several operating systems, such as Linux, Solaris, AIX, Darwin and Windows. Since it uses standard operating system information, hwloc's support is almost always independent of the processor type (x86, powerpc, ia64, ...) and relies only on operating system support.
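
A minimal C sketch of the hwloc API is shown below; it loads the topology, reports the number of cores and binds the current process to the first core:

  #include <stdio.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topology;
      hwloc_topology_init(&topology);
      hwloc_topology_load(topology);

      int ncores = hwloc_get_nbobjs_by_type(topology, HWLOC_OBJ_CORE);
      printf("detected %d cores\n", ncores);

      /* Bind this process to the first core; the core's cpuset is duplicated
         and singlified so only one hardware thread of that core is used. */
      hwloc_obj_t core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, 0);
      if (core) {
          hwloc_cpuset_t set = hwloc_bitmap_dup(core->cpuset);
          hwloc_bitmap_singlify(set);
          hwloc_set_cpubind(topology, set, 0);
          hwloc_bitmap_free(set);
      }

      hwloc_topology_destroy(topology);
      return 0;
  }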

Links:

Allinea DDT

Allinea DDT is a commercial debugger produced by Allinea Software, primarily for debugging parallel MPI or OpenMP programs, including those running on clusters of Linux machines, but it is also used by many for scalar code in C, C++, Fortran 90 and CUDA. As of June 2011 it is used on 36 of the top 100 supercomputers on the TOP500 list.

Key features include:

  • Spots memory and logic errors instantly
  • Lightning-fast performance even at extreme scale
  • Debug thousands of processes as easily as one
  • Native Mac, Windows and Linux clients
  • One tool for the HPC architectures of today and tomorrow
  • Scalable printf and powerful command-line modes
  • Visualize huge data sets
  • Navigate source and data effortlessly

Supported architectures:

  • Intel Xeon Phi, NVIDIA CUDA, IBM BlueGene, ARM 7, 32-bit and 64-bit x86

Programming models:

  • MPI, OpenMP, CUDA, OpenACC, UPC, CoArray Fortran, PGAS Languages, pthread-based multithreading

Programming languages:

  • Fortran, C++, C, PGAS Languages, CoArray Fortran

Links:

Allinea MAP

Allinea MAP is an MPI profiler and performance-analysis tool that can be used to rapidly spot performance problems in complex codes.

Allinea MAP allows the programmer to:

  • Check memory usage, floating-point calculations and MPI usage at a glance
  • Flick to the CPU view to see the percentage of vectorized SIMD instructions, including AVX extensions used in each part of the code
  • See how the amount of time spent in memory operations varies over time and processes - are you making efficient use of the cache?
  • Zoom in to any part of the timeline, isolate a single iteration and explore its behaviour in detail
  • See aggregated data everywhere, with distributions and outlying ranks labelled instead of endless lists of processes and threads, so the display remains visually scalable

Supported Languages:

  • C/C++, Fortran, UPC, CUDA

Links:
