Profilers

From HP-SEE Wiki

Revision as of 09:37, 18 April 2012 by Dusan (Talk | contribs)
Jump to: navigation, search

Contents

gprof

Section contributed by IPB

gprof is the profiler of the GNU (www.gnu.org) project. It is free software, with the source code freely available and it is licensed under the terms of the GNU Public License. Gprof is not a separate package and it is included in the binutils (http://www.gnu.org/software/binutils/) package which is one of the one of the basic packages of the GNU development suite which makes gprof available on almost every Linux system.

Gprof is available on the PARADOX Cluster as a part of the installed binutils package, version 2.17.50. Beside GNU compilers, Intel and PGI compilers that are available on PARADOX can also generate a gprof compatible profiling output.

Users can perform profiling of their programs using the following steps:

Compiling and linking program with profiling enabled

To compile a source file for profiling, user should specify the -pg option when run the compiler. Compiling with -pg instruments the code so that gprof reports detailed information about program execution. Profiling in gprof works by changing how every function in program is compiled so that when it is called, it will stash away some information about where it was called from. From this, the profiler can figure out what function called it, and can count how many times it was called. For example, to enable gprof profiling of NUQG 3D BEC imag- and real-time propagation source codes, the following lines are used:

 gcc -pg -O5 -c src/util/diffint.c -o diffint.o
 gcc -pg -O5 -c src/util/mem.c -o mem.o
 gcc -pg -O5 -c src/util/cfg.c -o cfg.o
 gcc -pg -O5 -c src/util/inout.c -o inout.o
 gcc -pg -O5 -c src/util/time.c -o time.o
 gcc -pg -O5 -c src/imagtime3d/imagtime3d.c -o imagtime3d.o    
 gcc -pg -O5 -o imagtime3d imagtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm 
 gcc -pg -O5 -c src/realtime3d/realtime3d.c -o realtime3d.o    
 gcc -pg -O5 -o realtime3d realtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm

If only some of the modules of the program are compiled with -pg flag, user can still profile the program, but one won't get complete information about the modules that were compiled without -pg. The only information user will get for the functions in those modules is the total time spent in them. There is no record of how many times they were called, or from where. If user wants to perform line-by-line profiling, one should also specify the -g option, instructing the compiler to insert debugging symbols into the program that match program addresses to source code lines.

Executing program to generate a profile data file

Once the program is compiled for profiling, user must run it in order to generate the information that gprof needs. Program should be run as usual, using the normal arguments, file names, etc. It will, however, run somewhat slower than normal because of the time spent collecting and the writing the profile data. Here is the standard output of NUQG 3D BEC imaginary-time propagation:

 [dusan@ui gp3d-1.0.5]$ ./imagtime3d -i imagtime3d-input 
 opt = 2
 Nitr =         10
 g  = 2.245400e+01
 dt = 2.000000e-03
 dx = 1.000000e-01     dy = 1.000000e-01     dz = 1.000000e-01
 Nx =          200     Ny =          200     Nz =          200
 kappa  = 1.414214e+00
 lambda = 2.000000e+00
 
          itr              t                    norm                      mu                      en                     rms              psi(0,0,0)
            0   0.000000e+00   9.999999999999998e-01   2.207033403947117e+00   2.207033403947117e+00   1.050501494807729e+00   5.495710458544626e-01
            1   2.000000e-03   9.818055800278778e-01   4.556898534719066e+00   3.382066456099940e+00   1.055477226914399e+00   5.402400908819136e-01
            2   4.000000e-03   9.819891188296052e-01   4.512857588540425e+00   3.360320847314966e+00   1.060277268987461e+00   5.316691994790188e-01
            3   6.000000e-03   9.821582930151911e-01   4.472166301378423e+00   3.340394508413489e+00   1.064914161625991e+00   5.237639995493827e-01
            4   8.000000e-03   9.823147871685903e-01   4.434443467945163e+00   3.322072346920306e+00   1.069399002721710e+00   5.164458244260830e-01
            5   1.000000e-02   9.824600230497559e-01   4.399365605398038e+00   3.305172433784271e+00   1.073741672048799e+00   5.096484796708504e-01
            6   1.200000e-02   9.825952101779827e-01   4.366656166504124e+00   3.289539740661047e+00   1.077951012536146e+00   5.033157930514318e-01
            7   1.400000e-02   9.827213849606330e-01   4.336077125915407e+00   3.275041264828938e+00   1.082034978066712e+00   4.973997317850793e-01
            8   1.600000e-02   9.828394413238462e-01   4.307422344085718e+00   3.261562191657742e+00   1.086000755097427e+00   4.918589373123249e-01
            9   1.800000e-02   9.829501549515177e-01   4.280512279960870e+00   3.249002842935008e+00   1.089854863578449e+00   4.866575718777318e-01
           10   2.000000e-02   9.830542026564079e-01   4.255189739075672e+00   3.237276227445227e+00   1.093603241339709e+00   4.817644010486062e-01
 
 Execution time: 1.957879e+01 seconds.

In addition to standard output, executed application will write the profile data into a file called gmon.out just before exiting. If there is already a file called gmon.out, its contents will be overwritten. The gmon.out file is written in the program's current working directory at the time it exits.

Running the gprof to analyze the profile data

After the profile data file gmon.out was generated, user can run gprof to interpret the information in it. The gprof program prints profiling information (flat profile, call graph) on the standard output or into a file with redirection. In the case of NUQG 3D BEC imaginary-time propagation the result in illustrated below:

  [dusan@ui gp3d-1.0.5]$ gprof ./imagtime3d gmon.out > gmon.txt

Interpreting gprof's output

gprof can produce several different output styles and the most important ones are Flat Profile and the Call-graph.

The flat profile shows the total amount of time user’s program spent executing each function and how many times that function was called. This is part of NUQG 3D BEC flat profile:

 [dusan@ui gp3d-1.0.5]$ head -n20 gmon.txt 
 Flat profile:
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total           
  time   seconds   seconds    calls  ms/call  ms/call  name    
  37.10      6.27     6.27       11   570.30   918.18  calcmuen
  21.74      9.95     3.68  1320000     0.00     0.00  diff
  10.24     11.68     1.73       10   173.09   173.09  calclux
   9.44     13.28     1.60                             main
   8.22     14.67     1.39       10   139.07   139.07  calcluy
   7.16     15.88     1.21       10   121.06   121.06  calcluz
   2.01     16.22     0.34  2008844     0.00     0.00  simpint
   1.42     16.46     0.24       11    21.83    28.64  calcrms
   1.30     16.68     0.22       11    20.01    26.82  calcnorm
   0.83     16.82     0.14        1   140.07   140.07  initpsi
   0.59     16.92     0.10        6    16.68    23.45  outpsixy
   0.00     16.92     0.00       17     0.00     0.00  cfg_read
   0.00     16.92     0.00       12     0.00     0.00  alloc_double_vector
   0.00     16.92     0.00       12     0.00     0.00  free_double_vector
   0.00     16.92     0.00        7     0.00     0.00  alloc_double_matrix

The functions are sorted by first by decreasing run-time spent in them, then by decreasing number of calls, then alphabetically by name.

The call graph shows, for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function. This can suggest places where one might try to eliminate function calls that use a lot of time. Bellow is a sample call from a small example program:

                    Call graph (explanation follows)
                    
 
 granularity: each sample hit covers 2 byte(s) for 0.06% of 16.92 seconds
 
 index % time    self  children    called     name
                                                  <spontaneous>
 [1]    100.0    1.60   15.32                 main [1]
                 6.27    3.83      11/11          calcmuen [2]
                 1.73    0.00      10/10          calclux [4]
                 1.39    0.00      10/10          calcluy [5]
                 1.21    0.00      10/10          calcluz [6]
                 0.24    0.07      11/11          calcrms [8]
                 0.22    0.07      11/11          calcnorm [9]
                 0.10    0.04       6/6           outpsixy [10]
                 0.14    0.00       1/1           initpsi [11]
                 0.00    0.00      12/12          alloc_double_vector [13]
                 0.00    0.00      12/12          free_double_vector [14]
                 0.00    0.00       7/7           alloc_double_matrix [15]
                 0.00    0.00       7/7           free_double_matrix [16]
                 0.00    0.00       4/4           alloc_double_tensor [17]
                 0.00    0.00       4/4           free_double_tensor [18]
                 0.00    0.00       2/2           gettime [19]
                 0.00    0.00       1/1           cfg_init [20]
                 0.00    0.00       1/1           readpar [22]
                 0.00    0.00       1/1           gencoef [21]
                 0.00    0.00       1/1           writepsi [23]
 -----------------------------------------------
                 6.27    3.83      11/11          main [1]
 [2]     59.7    6.27    3.83      11         calcmuen [2]
                 3.68    0.00 1320000/1320000     diff [3]
                 0.15    0.00  884422/2008844     simpint [7]
 -----------------------------------------------
                 3.68    0.00 1320000/1320000     calcmuen [2]
 [3]     21.7    3.68    0.00 1320000         diff [3]
 ...

Sun Studio performance tools

Intel VTune

Section contributed by IICT-BAS

Intel VTune Amplifier is a performance analysis application for both 32-bit and 64-bit x86 based machines. It can profile C, C++ and Fortran code and can be used for both serial and threaded application. Additionaly VTune is available as an MPI-enabled version.

To use VTune you must compile your code using the -g compiler flag in order to generate the debug information needed for VTune.

VTune provides several tools for analysis:

Software sampling
A feature limited only for x86 compatible processors. It can be used to generate the time spent on different locations along with the call stack, thus identifying hotspots that which require optimization.
Locks and waits analysis
Used for locating long synchronization waits. Useful for detecting underutilization.
Threading timeline
A feature that allows the indetification of synchronization issues.
Source view
Visualizes the sampling times results mapped to the the source code. Can also be drilled-down to assembly code.
Hardware event sampling
A feature limited to Intel processors, because it uses the chip performance monitoring unit. Quite useful for minimizing cache misses and branch mispredictions.
PTU (Performance Tuning Utility)
An additional application, that grants access to experimental performance analysis tools.

D8.2 related materials

http://hpseewiki.ipb.ac.rs/index.php/Optimization_techniques_for_scalability#Profilers

Personal tools