Profilers

From HP-SEE Wiki

Revision as of 08:06, 19 April 2012 by Antun (Talk | contribs)
Jump to: navigation, search

Contents

gprof

Section contributed by IPB

gprof is the profiler from the GNU (www.gnu.org) development suite. It is free software, with the source code freely available and licensed under the terms of the GNU Public License. gprof is not a separate package, but is rather included in the binutils (http://www.gnu.org/software/binutils/) package, which is one of the one of the basic packages of the GNU development suite, making gprof available on almost each Linux system.

gprof is available on the PARADOX cluster as a part of the installed binutils package, version 2.17.50. GNU, Intel and PGI compilers (which are all available on PARADOX) can all generate a gprof-compatible profiling outputs.

Users can perform profiling of their programs using the following steps:

Compiling and linking program with profiling enabled

To compile a source file and build an executable that will enable comfortable profiling, a user should specify the -pg compile option when invoking the compiler from the command line. The compiling with -pg flag instruments the code so that gprof reports detailed information about program execution in a separate, report file, during the application execution. The binary obtained this way contains additional information on functions and a table of symbols, allowing logging of run-time information that gprof can subsequently analyze. Furthermore, gprof can be used interactively, when the binary is run through gprof, which then fully controls execution flow.

As an example, we show how to enable gprof profiling of NUQG 3D BEC imag- and real-time propagation source codes. To compile the code, the following lines are used:

 gcc -pg -O5 -c src/util/diffint.c -o diffint.o
 gcc -pg -O5 -c src/util/mem.c -o mem.o
 gcc -pg -O5 -c src/util/cfg.c -o cfg.o
 gcc -pg -O5 -c src/util/inout.c -o inout.o
 gcc -pg -O5 -c src/util/time.c -o time.o
 gcc -pg -O5 -c src/imagtime3d/imagtime3d.c -o imagtime3d.o    
 gcc -pg -O5 -o imagtime3d imagtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm 
 gcc -pg -O5 -c src/realtime3d/realtime3d.c -o realtime3d.o    
 gcc -pg -O5 -o realtime3d realtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm

If only some of the modules of the program are compiled with -pg flag, a user can still profile the program, but will get complete information only about the modules that were compiled with the -pg. The only information user will get for functions from other modules is the total time spent in them. There is no record of how many times they were called, or from where. If user wants to perform line-by-line profiling, the -g option should be also used when compiling the code, instructing the compiler to insert debugging symbols into the program that match program addresses to source code lines.

Executing program to generate a profile data file

Once the program is compiled for profiling, a user runs it in order to generate the information gprof needs in a report file. Program should be run as usual, using the normal arguments, file names, etc. It will, however, run somewhat slower than normal due to the time spent collecting and writing the profile data. Here is the standard output of NUQG 3D BEC imaginary-time propagation:

 [dusan@ui gp3d-1.0.5]$ ./imagtime3d -i imagtime3d-input 
 opt = 2
 Nitr =         10
 g  = 2.245400e+01
 dt = 2.000000e-03
 dx = 1.000000e-01     dy = 1.000000e-01     dz = 1.000000e-01
 Nx =          200     Ny =          200     Nz =          200
 kappa  = 1.414214e+00
 lambda = 2.000000e+00
 
          itr              t                    norm                      mu                      en                     rms              psi(0,0,0)
            0   0.000000e+00   9.999999999999998e-01   2.207033403947117e+00   2.207033403947117e+00   1.050501494807729e+00   5.495710458544626e-01
            1   2.000000e-03   9.818055800278778e-01   4.556898534719066e+00   3.382066456099940e+00   1.055477226914399e+00   5.402400908819136e-01
            2   4.000000e-03   9.819891188296052e-01   4.512857588540425e+00   3.360320847314966e+00   1.060277268987461e+00   5.316691994790188e-01
            3   6.000000e-03   9.821582930151911e-01   4.472166301378423e+00   3.340394508413489e+00   1.064914161625991e+00   5.237639995493827e-01
            4   8.000000e-03   9.823147871685903e-01   4.434443467945163e+00   3.322072346920306e+00   1.069399002721710e+00   5.164458244260830e-01
            5   1.000000e-02   9.824600230497559e-01   4.399365605398038e+00   3.305172433784271e+00   1.073741672048799e+00   5.096484796708504e-01
            6   1.200000e-02   9.825952101779827e-01   4.366656166504124e+00   3.289539740661047e+00   1.077951012536146e+00   5.033157930514318e-01
            7   1.400000e-02   9.827213849606330e-01   4.336077125915407e+00   3.275041264828938e+00   1.082034978066712e+00   4.973997317850793e-01
            8   1.600000e-02   9.828394413238462e-01   4.307422344085718e+00   3.261562191657742e+00   1.086000755097427e+00   4.918589373123249e-01
            9   1.800000e-02   9.829501549515177e-01   4.280512279960870e+00   3.249002842935008e+00   1.089854863578449e+00   4.866575718777318e-01
           10   2.000000e-02   9.830542026564079e-01   4.255189739075672e+00   3.237276227445227e+00   1.093603241339709e+00   4.817644010486062e-01
 
 Execution time: 1.957879e+01 seconds.

In addition to the standard output, the executed application will write the profile data into a file called gmon.out just before exiting. If there is already a file called gmon.out, it's content will be overwritten. The gmon.out file is written in the program's current working directory at the time it exits.

Running the gprof to analyze the profile data

After the profile data file gmon.out was generated, a user can run gprof to analyze the information collected during the application execution. The gprof program prints profiling information (flat profile, call graph) on the standard output or into a file (with redirection). In the case of NUQG 3D BEC imaginary-time propagation, the result is illustrated below:

  [dusan@ui gp3d-1.0.5]$ gprof ./imagtime3d gmon.out > gmon.txt

Interpreting gprof's output

gprof can produce several different output styles, the most important being Flat Profile and the Call-graph.

The flat profile shows the distribution of time the program spent executing each function, as well as the distribution of number of calls to each function. This is part of the NUQG 3D BEC flat profile:

 [dusan@ui gp3d-1.0.5]$ head -n20 gmon.txt 
 Flat profile:
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total           
  time   seconds   seconds    calls  ms/call  ms/call  name    
  37.10      6.27     6.27       11   570.30   918.18  calcmuen
  21.74      9.95     3.68  1320000     0.00     0.00  diff
  10.24     11.68     1.73       10   173.09   173.09  calclux
   9.44     13.28     1.60                             main
   8.22     14.67     1.39       10   139.07   139.07  calcluy
   7.16     15.88     1.21       10   121.06   121.06  calcluz
   2.01     16.22     0.34  2008844     0.00     0.00  simpint
   1.42     16.46     0.24       11    21.83    28.64  calcrms
   1.30     16.68     0.22       11    20.01    26.82  calcnorm
   0.83     16.82     0.14        1   140.07   140.07  initpsi
   0.59     16.92     0.10        6    16.68    23.45  outpsixy
   0.00     16.92     0.00       17     0.00     0.00  cfg_read
   0.00     16.92     0.00       12     0.00     0.00  alloc_double_vector
   0.00     16.92     0.00       12     0.00     0.00  free_double_vector
   0.00     16.92     0.00        7     0.00     0.00  alloc_double_matrix

The functions are sorted by their run times and by their number of calls (both in decreasing order), and then alphabetically by name.

The call-graph shows, for each function, which other functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function. This can suggest places where one might try optimize the code by eliminating function calls that use a lot of time (by in-lining them). Bellow is a sample call-graph from a small example program:

                    Call graph (explanation follows)
                    
 
 granularity: each sample hit covers 2 byte(s) for 0.06% of 16.92 seconds
 
 index % time    self  children    called     name
                                                  <spontaneous>
 [1]    100.0    1.60   15.32                 main [1]
                 6.27    3.83      11/11          calcmuen [2]
                 1.73    0.00      10/10          calclux [4]
                 1.39    0.00      10/10          calcluy [5]
                 1.21    0.00      10/10          calcluz [6]
                 0.24    0.07      11/11          calcrms [8]
                 0.22    0.07      11/11          calcnorm [9]
                 0.10    0.04       6/6           outpsixy [10]
                 0.14    0.00       1/1           initpsi [11]
                 0.00    0.00      12/12          alloc_double_vector [13]
                 0.00    0.00      12/12          free_double_vector [14]
                 0.00    0.00       7/7           alloc_double_matrix [15]
                 0.00    0.00       7/7           free_double_matrix [16]
                 0.00    0.00       4/4           alloc_double_tensor [17]
                 0.00    0.00       4/4           free_double_tensor [18]
                 0.00    0.00       2/2           gettime [19]
                 0.00    0.00       1/1           cfg_init [20]
                 0.00    0.00       1/1           readpar [22]
                 0.00    0.00       1/1           gencoef [21]
                 0.00    0.00       1/1           writepsi [23]
 -----------------------------------------------
                 6.27    3.83      11/11          main [1]
 [2]     59.7    6.27    3.83      11         calcmuen [2]
                 3.68    0.00 1320000/1320000     diff [3]
                 0.15    0.00  884422/2008844     simpint [7]
 -----------------------------------------------
                 3.68    0.00 1320000/1320000     calcmuen [2]
 [3]     21.7    3.68    0.00 1320000         diff [3]
 ...

Sun Studio performance tools

Intel VTune

Section contributed by IICT-BAS

Intel VTune Amplifier is a performance analysis application for both 32-bit and 64-bit x86 based machines. It can profile C, C++ and Fortran code and can be used for both serial and threaded application. Additionaly VTune is available as an MPI-enabled version.

To use VTune you must compile your code using the -g compiler flag in order to generate the debug information needed for VTune.

VTune provides several tools for analysis:

Software sampling
A feature limited only for x86 compatible processors. It can be used to generate the time spent on different locations along with the call stack, thus identifying hotspots that which require optimization.
Locks and waits analysis
Used for locating long synchronization waits. Useful for detecting underutilization.
Threading timeline
A feature that allows the indetification of synchronization issues.
Source view
Visualizes the sampling times results mapped to the the source code. Can also be drilled-down to assembly code.
Hardware event sampling
A feature limited to Intel processors, because it uses the chip performance monitoring unit. Quite useful for minimizing cache misses and branch mispredictions.
PTU (Performance Tuning Utility)
An additional application, that grants access to experimental performance analysis tools.

D8.2 related materials

http://hpseewiki.ipb.ac.rs/index.php/Optimization_techniques_for_scalability#Profilers

Personal tools