Profilers

From HP-SEE Wiki

Revision as of 23:06, 20 April 2012 by Compphys (Talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to: navigation, search

Contents

gprof

Section contributed by IPB

gprof is the profiler from the GNU (www.gnu.org) development suite. It is free software, with the source code freely available and licensed under the terms of the GNU Public License. gprof is not a separate package, but is rather included in the binutils (http://www.gnu.org/software/binutils/) package, which is one of the one of the basic packages of the GNU development suite, making gprof available on almost each Linux system.

gprof is available on the PARADOX cluster as a part of the installed binutils package, version 2.17.50. GNU, Intel and PGI compilers (which are all available on PARADOX) can all generate a gprof-compatible profiling outputs.

Users can perform profiling of their programs using the following steps:

Compiling and linking program with profiling enabled

To compile a source file and build an executable that will enable comfortable profiling, a user should specify the -pg compile option when invoking the compiler from the command line. The compiling with -pg flag instruments the code so that gprof reports detailed information about program execution in a separate, report file, during the application execution. The binary obtained this way contains additional information on functions and a table of symbols, allowing logging of run-time information that gprof can subsequently analyze. Furthermore, gprof can be used interactively, when the binary is run through gprof, which then fully controls execution flow.

As an example, we show how to enable gprof profiling of NUQG 3D BEC imag- and real-time propagation source codes. To compile the code, the following lines are used:

 gcc -pg -O5 -c src/util/diffint.c -o diffint.o
 gcc -pg -O5 -c src/util/mem.c -o mem.o
 gcc -pg -O5 -c src/util/cfg.c -o cfg.o
 gcc -pg -O5 -c src/util/inout.c -o inout.o
 gcc -pg -O5 -c src/util/time.c -o time.o
 gcc -pg -O5 -c src/imagtime3d/imagtime3d.c -o imagtime3d.o    
 gcc -pg -O5 -o imagtime3d imagtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm 
 gcc -pg -O5 -c src/realtime3d/realtime3d.c -o realtime3d.o    
 gcc -pg -O5 -o realtime3d realtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm

If only some of the modules of the program are compiled with -pg flag, a user can still profile the program, but will get complete information only about the modules that were compiled with the -pg. The only information user will get for functions from other modules is the total time spent in them. There is no record of how many times they were called, or from where. If user wants to perform line-by-line profiling, the -g option should be also used when compiling the code, instructing the compiler to insert debugging symbols into the program that match program addresses to source code lines.

Executing program to generate a profile data file

Once the program is compiled for profiling, a user runs it in order to generate the information gprof needs in a report file. Program should be run as usual, using the normal arguments, file names, etc. It will, however, run somewhat slower than normal due to the time spent collecting and writing the profile data. Here is the standard output of NUQG 3D BEC imaginary-time propagation:

 [dusan@ui gp3d-1.0.5]$ ./imagtime3d -i imagtime3d-input 
 opt = 2
 Nitr =         10
 g  = 2.245400e+01
 dt = 2.000000e-03
 dx = 1.000000e-01     dy = 1.000000e-01     dz = 1.000000e-01
 Nx =          200     Ny =          200     Nz =          200
 kappa  = 1.414214e+00
 lambda = 2.000000e+00
 
          itr              t                    norm                      mu                      en                     rms              psi(0,0,0)
            0   0.000000e+00   9.999999999999998e-01   2.207033403947117e+00   2.207033403947117e+00   1.050501494807729e+00   5.495710458544626e-01
            1   2.000000e-03   9.818055800278778e-01   4.556898534719066e+00   3.382066456099940e+00   1.055477226914399e+00   5.402400908819136e-01
            2   4.000000e-03   9.819891188296052e-01   4.512857588540425e+00   3.360320847314966e+00   1.060277268987461e+00   5.316691994790188e-01
            3   6.000000e-03   9.821582930151911e-01   4.472166301378423e+00   3.340394508413489e+00   1.064914161625991e+00   5.237639995493827e-01
            4   8.000000e-03   9.823147871685903e-01   4.434443467945163e+00   3.322072346920306e+00   1.069399002721710e+00   5.164458244260830e-01
            5   1.000000e-02   9.824600230497559e-01   4.399365605398038e+00   3.305172433784271e+00   1.073741672048799e+00   5.096484796708504e-01
            6   1.200000e-02   9.825952101779827e-01   4.366656166504124e+00   3.289539740661047e+00   1.077951012536146e+00   5.033157930514318e-01
            7   1.400000e-02   9.827213849606330e-01   4.336077125915407e+00   3.275041264828938e+00   1.082034978066712e+00   4.973997317850793e-01
            8   1.600000e-02   9.828394413238462e-01   4.307422344085718e+00   3.261562191657742e+00   1.086000755097427e+00   4.918589373123249e-01
            9   1.800000e-02   9.829501549515177e-01   4.280512279960870e+00   3.249002842935008e+00   1.089854863578449e+00   4.866575718777318e-01
           10   2.000000e-02   9.830542026564079e-01   4.255189739075672e+00   3.237276227445227e+00   1.093603241339709e+00   4.817644010486062e-01
 
 Execution time: 1.957879e+01 seconds.

In addition to the standard output, the executed application will write the profile data into a file called gmon.out just before exiting. If there is already a file called gmon.out, it's content will be overwritten. The gmon.out file is written in the program's current working directory at the time it exits.

Running the gprof to analyze the profile data

After the profile data file gmon.out was generated, a user can run gprof to analyze the information collected during the application execution. The gprof program prints profiling information (flat profile, call graph) on the standard output or into a file (with redirection). In the case of NUQG 3D BEC imaginary-time propagation, the result is illustrated below:

  [dusan@ui gp3d-1.0.5]$ gprof ./imagtime3d gmon.out > gmon.txt

Interpreting gprof's output

gprof can produce several different output styles, the most important being Flat Profile and the Call-graph.

The flat profile shows the distribution of time the program spent executing each function, as well as the distribution of number of calls to each function. This is part of the NUQG 3D BEC flat profile:

 [dusan@ui gp3d-1.0.5]$ head -n20 gmon.txt 
 Flat profile:
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total           
  time   seconds   seconds    calls  ms/call  ms/call  name    
  37.10      6.27     6.27       11   570.30   918.18  calcmuen
  21.74      9.95     3.68  1320000     0.00     0.00  diff
  10.24     11.68     1.73       10   173.09   173.09  calclux
   9.44     13.28     1.60                             main
   8.22     14.67     1.39       10   139.07   139.07  calcluy
   7.16     15.88     1.21       10   121.06   121.06  calcluz
   2.01     16.22     0.34  2008844     0.00     0.00  simpint
   1.42     16.46     0.24       11    21.83    28.64  calcrms
   1.30     16.68     0.22       11    20.01    26.82  calcnorm
   0.83     16.82     0.14        1   140.07   140.07  initpsi
   0.59     16.92     0.10        6    16.68    23.45  outpsixy
   0.00     16.92     0.00       17     0.00     0.00  cfg_read
   0.00     16.92     0.00       12     0.00     0.00  alloc_double_vector
   0.00     16.92     0.00       12     0.00     0.00  free_double_vector
   0.00     16.92     0.00        7     0.00     0.00  alloc_double_matrix

The functions are sorted by their run times and by their number of calls (both in decreasing order), and then alphabetically by name.

The call-graph shows, for each function, which other functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function. This can suggest places where one might try optimize the code by eliminating function calls that use a lot of time (by in-lining them). Bellow is a sample call-graph from a small example program:

                    Call graph (explanation follows)
                    
 
 granularity: each sample hit covers 2 byte(s) for 0.06% of 16.92 seconds
 
 index % time    self  children    called     name
                                                  <spontaneous>
 [1]    100.0    1.60   15.32                 main [1]
                 6.27    3.83      11/11          calcmuen [2]
                 1.73    0.00      10/10          calclux [4]
                 1.39    0.00      10/10          calcluy [5]
                 1.21    0.00      10/10          calcluz [6]
                 0.24    0.07      11/11          calcrms [8]
                 0.22    0.07      11/11          calcnorm [9]
                 0.10    0.04       6/6           outpsixy [10]
                 0.14    0.00       1/1           initpsi [11]
                 0.00    0.00      12/12          alloc_double_vector [13]
                 0.00    0.00      12/12          free_double_vector [14]
                 0.00    0.00       7/7           alloc_double_matrix [15]
                 0.00    0.00       7/7           free_double_matrix [16]
                 0.00    0.00       4/4           alloc_double_tensor [17]
                 0.00    0.00       4/4           free_double_tensor [18]
                 0.00    0.00       2/2           gettime [19]
                 0.00    0.00       1/1           cfg_init [20]
                 0.00    0.00       1/1           readpar [22]
                 0.00    0.00       1/1           gencoef [21]
                 0.00    0.00       1/1           writepsi [23]
 -----------------------------------------------
                 6.27    3.83      11/11          main [1]
 [2]     59.7    6.27    3.83      11         calcmuen [2]
                 3.68    0.00 1320000/1320000     diff [3]
                 0.15    0.00  884422/2008844     simpint [7]
 -----------------------------------------------
                 3.68    0.00 1320000/1320000     calcmuen [2]
 [3]     21.7    3.68    0.00 1320000         diff [3]
 ...

Sun Studio performance tools

Section contributed by UPB

The Sun Sampling Collector and the Performance Analyzer are a pair of tools that you can use to collect and analyze performance data for your serial or parallel application. The Collector gathers performance data by sampling at regular time intervals and by tracing function calls. The performance information is gathered in so called experiment files, which can then be displayed with the analyzer GUI or the er print line command after the program has nished. Since the collector is part of the Sun compiler suite the studio compiler module has to be loaded. However programs to be analyzed do not have to be compiled with the Sun compiler, the GNU or Intel compiler for example work as well.

Collecting experiment data

The first step in profiling with the Sun Analyzer is to obtain experiement data. For this you must compile your code with the -g option. After that you can either run collect like this:

 $ collect a.out

To use the GUI to collect experiment data, start the analyzer (X11 forwarding must be enabled - $ analyzer), go to Collect Experiment under the File menu and select the Target, Working Directory and add Arguments if you need to. Click on Preview Command to view the command for collecting experiment data only. Now you can submit the command to a queue. Some example of scripts used to submit the command (the path to collect might be different):

 $ cat script.sh
 #!/bin/bash
 qsub -q [queue] -pe [pe] [np] -cwd -b y "/opt/sun/sunstudio12.1/prod/bin/collect -p high -M CT8.1 -S on -A on -L none mpirun -np 4 -- /path/to/file/test"
 $ cat script_OMP_ONLY.sh
 #!/bin/bash
 qsub -q [queue] [pe] [np] -v OMP_NUM_THREADS=8 -cwd -b y  "/opt/sun/sunstudio12.1/prod/bin/collect -L none -p high -S on -A on /path/to/file/test"
 $ cat scriptOMP.sh
 #!/bin/bash
 qsub -q [queue] [pe] [np] -v OMP_NUM_THREADS=8 -cwd -b y "/opt/sun/sunstudio12.1/prod/bin/collect -p high -M CT8.1 -S on -A on -L none mpirun -np 4 -- /path/to/file/test"

The first one uses MPI tracing for testing MPI programs, the second one is intended for OpenMP programs (that’s why it sets the OMP NUM THREADS variable) and the last one is for hybrid programs (they use both MPI and OpenMP). Some of the parameters used are explained in the following. You can find more information in the manual ($ man collect).

  • -M CT8.1; Specify collection of an MPI experiment. CT8.1 is the MPI version installed.
  • -L size; Limit the amount of profiling and tracing data recorded to size megabytes. None means no limit.
  • -S interval; Collect periodic samples at the interval specified (in seconds). on defaults to 1 second.
  • -A option; Control whether or not load objects used by the target process should be archived or copied into the recorded experiment. on archive load objects into the experiment.
  • -p option; Collect clock-based profiling data. high turn on clock-based profiling with the default profiling interval of approximately 1 millisecond.

Viewing the experiment results

To view the results, open the analyzer and go to File - Open Experiment and select the experiment you want to view. A very good tutorial for analyzing the data can be found here: http://www.oracle.com/technetwork/server-storage/solarisstudio/documentation/o11-151-perf-analyzer-brief-1405338.pdf. The Performance Analyzer MPI Tutorial is a good place to start.

Intel VTune

Section contributed by IICT-BAS

Intel VTune Amplifier is a performance analysis application for both 32-bit and 64-bit x86 based machines. It can profile C, C++ and Fortran code and can be used for both serial and threaded application. Additionaly VTune is available as an MPI-enabled version.

To use VTune you must compile your code using the -g compiler flag in order to generate the debug information needed for VTune.

VTune provides several tools for analysis:

Software sampling
A feature limited only for x86 compatible processors. It can be used to generate the time spent on different locations along with the call stack, thus identifying hotspots that which require optimization.
Locks and waits analysis
Used for locating long synchronization waits. Useful for detecting underutilization.
Threading timeline
A feature that allows the indetification of synchronization issues.
Source view
Visualizes the sampling times results mapped to the the source code. Can also be drilled-down to assembly code.
Hardware event sampling
A feature limited to Intel processors, because it uses the chip performance monitoring unit. Quite useful for minimizing cache misses and branch mispredictions.
PTU (Performance Tuning Utility)
An additional application, that grants access to experimental performance analysis tools.

D8.2 related materials

http://hpseewiki.ipb.ac.rs/index.php/Optimization_techniques_for_scalability#Profilers