Profilers

From HP-SEE Wiki

(Difference between revisions)
Jump to: navigation, search
(gprof)
(gprof)
Line 3: Line 3:
'' Section contributed by IPB ''
'' Section contributed by IPB ''
-
gprof is the profiler of the GNU ([http://www.gnu.org/ www.gnu.org]) project. It is free software, with the source code freely available and it is licensed under the terms of the GNU Public License. Gprof is not a separate package and it is included in the binutils ([http://www.gnu.org/software/binutils/ http://www.gnu.org/software/binutils/]) package which is one of the one of the basic packages of the GNU development suite which makes gprof available on almost every Linux system.
+
gprof is the profiler from the GNU ([http://www.gnu.org/ www.gnu.org]) development suite. It is free software, with the source code freely available and licensed under the terms of the GNU Public License. gprof is not a separate package, but is rather included in the binutils ([http://www.gnu.org/software/binutils/ http://www.gnu.org/software/binutils/]) package, which is one of the one of the basic packages of the GNU development suite, making gprof available on almost each Linux system.
-
Gprof is available on the PARADOX Cluster as a part of the installed binutils package, version 2.17.50. Beside GNU compilers, Intel and PGI compilers that are available on PARADOX can also generate a gprof compatible profiling output.
+
gprof is available on the PARADOX cluster as a part of the installed binutils package, version 2.17.50. GNU, Intel and PGI compilers (which are all available on PARADOX) can all generate a gprof-compatible profiling outputs.
Users can perform profiling of their programs using the following steps:
Users can perform profiling of their programs using the following steps:
Line 11: Line 11:
'''Compiling and linking program with profiling enabled'''
'''Compiling and linking program with profiling enabled'''
-
To compile a source file for profiling, user should specify the <code>-pg</code> option when run the compiler. Compiling with <code>-pg</code> instruments the code so that gprof reports detailed information about program execution. Profiling in gprof works by changing how every function in program is compiled so that when it is called, it will stash away some information about where it was called from. From this, the profiler can figure out what function called it, and can count how many times it was called. For example, to enable gprof profiling of NUQG 3D BEC imag- and real-time propagation source codes, the following lines are used:
+
To compile a source file and build an executable that will enable comfortable profiling, a user should specify the <code>-pg</code> compile option when invoking the compiler from the command line. The compiling with <code>-pg</code> flag instruments the code so that gprof reports detailed information about program execution in a separate, report file, during the application execution. The binary obtained this way contains additional information on functions and a table of symbols, allowing logging of run-time information that gprof can subsequently analyze. Furthermore, gprof can be used interactively, when the binary is run through gprof, which then fully controls execution flow.
 +
 
 +
As an example, we show how to enable gprof profiling of NUQG 3D BEC imag- and real-time propagation source codes. To compile the code, the following lines are used:
   gcc -pg -O5 -c src/util/diffint.c -o diffint.o
   gcc -pg -O5 -c src/util/diffint.c -o diffint.o
Line 23: Line 25:
   gcc -pg -O5 -o realtime3d realtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm
   gcc -pg -O5 -o realtime3d realtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm
-
If only some of the modules of the program are compiled with <code>-pg</code> flag, user can still profile the program, but one won't get complete information about the modules that were compiled without <code>-pg</code>. The only information user will get for the functions in those modules is the total time spent in them. There is no record of how many times they were called, or from where. If user wants to perform line-by-line profiling, one should also specify the <code>-g</code> option, instructing the compiler to insert debugging symbols into the program that match program addresses to source code lines.
+
If only some of the modules of the program are compiled with <code>-pg</code> flag, a user can still profile the program, but will get complete information only about the modules that were compiled with the <code>-pg</code>. The only information user will get for functions from other modules is the total time spent in them. There is no record of how many times they were called, or from where. If user wants to perform line-by-line profiling, the <code>-g</code> option should be also used when compiling the code, instructing the compiler to insert debugging symbols into the program that match program addresses to source code lines.
'''Executing program to generate a profile data file'''
'''Executing program to generate a profile data file'''
-
Once the program is compiled for profiling, user must run it in order to generate the information that gprof needs. Program should be run as usual, using the normal arguments, file names, etc.  It will, however, run somewhat slower than normal because of the time spent collecting and the writing the profile data. Here is the standard output of NUQG 3D BEC imaginary-time propagation:
+
Once the program is compiled for profiling, a user runs it in order to generate the information gprof needs in a report file. Program should be run as usual, using the normal arguments, file names, etc.  It will, however, run somewhat slower than normal due to the time spent collecting and writing the profile data. Here is the standard output of NUQG 3D BEC imaginary-time propagation:
   [dusan@ui gp3d-1.0.5]$ ./imagtime3d -i imagtime3d-input  
   [dusan@ui gp3d-1.0.5]$ ./imagtime3d -i imagtime3d-input  
Line 54: Line 56:
   Execution time: 1.957879e+01 seconds.
   Execution time: 1.957879e+01 seconds.
-
In addition to standard output, executed application will write the profile data into a file called <code>gmon.out</code> just before exiting. If there is already a file called gmon.out, its contents will be overwritten. The <code>gmon.out</code> file is written in the program's current working directory at the time it exits.
+
In addition to the standard output, the executed application will write the profile data into a file called <code>gmon.out</code> just before exiting. If there is already a file called gmon.out, it's content will be overwritten. The <code>gmon.out</code> file is written in the program's current working directory at the time it exits.
'''Running the gprof to analyze the profile data'''
'''Running the gprof to analyze the profile data'''
-
After the profile data file <code>gmon.out</code> was generated, user can run gprof to interpret the information in it. The gprof program prints profiling information (flat profile, call graph) on the standard output or into a file with redirection. In the case of NUQG 3D BEC imaginary-time propagation the result in illustrated below:
+
After the profile data file <code>gmon.out</code> was generated, a user can run gprof to analyze the information collected during the application execution. The gprof program prints profiling information (flat profile, call graph) on the standard output or into a file (with redirection). In the case of NUQG 3D BEC imaginary-time propagation, the result is illustrated below:
   [dusan@ui gp3d-1.0.5]$ gprof ./imagtime3d gmon.out > gmon.txt
   [dusan@ui gp3d-1.0.5]$ gprof ./imagtime3d gmon.out > gmon.txt
Line 64: Line 66:
'''Interpreting gprof's output'''
'''Interpreting gprof's output'''
-
gprof can produce several different output styles and the most important ones are Flat Profile and the Call-graph.
+
gprof can produce several different output styles, the most important being Flat Profile and the Call-graph.
-
The flat profile shows the total amount of time user’s program spent executing each function and how many times that function was called. This is part of NUQG 3D BEC flat profile:  
+
The flat profile shows the distribution of time the program spent executing each function, as well as the distribution of number of calls to each function. This is part of the NUQG 3D BEC flat profile:  
   [dusan@ui gp3d-1.0.5]$ head -n20 gmon.txt  
   [dusan@ui gp3d-1.0.5]$ head -n20 gmon.txt  
Line 90: Line 92:
     0.00    16.92    0.00        7    0.00    0.00  alloc_double_matrix
     0.00    16.92    0.00        7    0.00    0.00  alloc_double_matrix
-
The functions are sorted by first by decreasing run-time spent in them, then by decreasing number of calls, then alphabetically by name.
+
The functions are sorted by their run times and by their number of calls (both in decreasing order), and then alphabetically by name.
-
The call graph shows, for each function, which functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function. This can suggest places where one might try to eliminate function calls that use a lot of time. Bellow is a sample call from a small example program:
+
The call-graph shows, for each function, which other functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function. This can suggest places where one might try optimize the code by eliminating function calls that use a lot of time (by in-lining them). Bellow is a sample call-graph from a small example program:
                     Call graph (explanation follows)
                     Call graph (explanation follows)

Revision as of 08:06, 19 April 2012

Contents

gprof

Section contributed by IPB

gprof is the profiler from the GNU (www.gnu.org) development suite. It is free software, with the source code freely available and licensed under the terms of the GNU Public License. gprof is not a separate package, but is rather included in the binutils (http://www.gnu.org/software/binutils/) package, which is one of the one of the basic packages of the GNU development suite, making gprof available on almost each Linux system.

gprof is available on the PARADOX cluster as a part of the installed binutils package, version 2.17.50. GNU, Intel and PGI compilers (which are all available on PARADOX) can all generate a gprof-compatible profiling outputs.

Users can perform profiling of their programs using the following steps:

Compiling and linking program with profiling enabled

To compile a source file and build an executable that will enable comfortable profiling, a user should specify the -pg compile option when invoking the compiler from the command line. The compiling with -pg flag instruments the code so that gprof reports detailed information about program execution in a separate, report file, during the application execution. The binary obtained this way contains additional information on functions and a table of symbols, allowing logging of run-time information that gprof can subsequently analyze. Furthermore, gprof can be used interactively, when the binary is run through gprof, which then fully controls execution flow.

As an example, we show how to enable gprof profiling of NUQG 3D BEC imag- and real-time propagation source codes. To compile the code, the following lines are used:

 gcc -pg -O5 -c src/util/diffint.c -o diffint.o
 gcc -pg -O5 -c src/util/mem.c -o mem.o
 gcc -pg -O5 -c src/util/cfg.c -o cfg.o
 gcc -pg -O5 -c src/util/inout.c -o inout.o
 gcc -pg -O5 -c src/util/time.c -o time.o
 gcc -pg -O5 -c src/imagtime3d/imagtime3d.c -o imagtime3d.o    
 gcc -pg -O5 -o imagtime3d imagtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm 
 gcc -pg -O5 -c src/realtime3d/realtime3d.c -o realtime3d.o    
 gcc -pg -O5 -o realtime3d realtime3d.o diffint.o mem.o cfg.o inout.o time.o -lm

If only some of the modules of the program are compiled with -pg flag, a user can still profile the program, but will get complete information only about the modules that were compiled with the -pg. The only information user will get for functions from other modules is the total time spent in them. There is no record of how many times they were called, or from where. If user wants to perform line-by-line profiling, the -g option should be also used when compiling the code, instructing the compiler to insert debugging symbols into the program that match program addresses to source code lines.

Executing program to generate a profile data file

Once the program is compiled for profiling, a user runs it in order to generate the information gprof needs in a report file. Program should be run as usual, using the normal arguments, file names, etc. It will, however, run somewhat slower than normal due to the time spent collecting and writing the profile data. Here is the standard output of NUQG 3D BEC imaginary-time propagation:

 [dusan@ui gp3d-1.0.5]$ ./imagtime3d -i imagtime3d-input 
 opt = 2
 Nitr =         10
 g  = 2.245400e+01
 dt = 2.000000e-03
 dx = 1.000000e-01     dy = 1.000000e-01     dz = 1.000000e-01
 Nx =          200     Ny =          200     Nz =          200
 kappa  = 1.414214e+00
 lambda = 2.000000e+00
 
          itr              t                    norm                      mu                      en                     rms              psi(0,0,0)
            0   0.000000e+00   9.999999999999998e-01   2.207033403947117e+00   2.207033403947117e+00   1.050501494807729e+00   5.495710458544626e-01
            1   2.000000e-03   9.818055800278778e-01   4.556898534719066e+00   3.382066456099940e+00   1.055477226914399e+00   5.402400908819136e-01
            2   4.000000e-03   9.819891188296052e-01   4.512857588540425e+00   3.360320847314966e+00   1.060277268987461e+00   5.316691994790188e-01
            3   6.000000e-03   9.821582930151911e-01   4.472166301378423e+00   3.340394508413489e+00   1.064914161625991e+00   5.237639995493827e-01
            4   8.000000e-03   9.823147871685903e-01   4.434443467945163e+00   3.322072346920306e+00   1.069399002721710e+00   5.164458244260830e-01
            5   1.000000e-02   9.824600230497559e-01   4.399365605398038e+00   3.305172433784271e+00   1.073741672048799e+00   5.096484796708504e-01
            6   1.200000e-02   9.825952101779827e-01   4.366656166504124e+00   3.289539740661047e+00   1.077951012536146e+00   5.033157930514318e-01
            7   1.400000e-02   9.827213849606330e-01   4.336077125915407e+00   3.275041264828938e+00   1.082034978066712e+00   4.973997317850793e-01
            8   1.600000e-02   9.828394413238462e-01   4.307422344085718e+00   3.261562191657742e+00   1.086000755097427e+00   4.918589373123249e-01
            9   1.800000e-02   9.829501549515177e-01   4.280512279960870e+00   3.249002842935008e+00   1.089854863578449e+00   4.866575718777318e-01
           10   2.000000e-02   9.830542026564079e-01   4.255189739075672e+00   3.237276227445227e+00   1.093603241339709e+00   4.817644010486062e-01
 
 Execution time: 1.957879e+01 seconds.

In addition to the standard output, the executed application will write the profile data into a file called gmon.out just before exiting. If there is already a file called gmon.out, it's content will be overwritten. The gmon.out file is written in the program's current working directory at the time it exits.

Running the gprof to analyze the profile data

After the profile data file gmon.out was generated, a user can run gprof to analyze the information collected during the application execution. The gprof program prints profiling information (flat profile, call graph) on the standard output or into a file (with redirection). In the case of NUQG 3D BEC imaginary-time propagation, the result is illustrated below:

  [dusan@ui gp3d-1.0.5]$ gprof ./imagtime3d gmon.out > gmon.txt

Interpreting gprof's output

gprof can produce several different output styles, the most important being Flat Profile and the Call-graph.

The flat profile shows the distribution of time the program spent executing each function, as well as the distribution of number of calls to each function. This is part of the NUQG 3D BEC flat profile:

 [dusan@ui gp3d-1.0.5]$ head -n20 gmon.txt 
 Flat profile:
 
 Each sample counts as 0.01 seconds.
   %   cumulative   self              self     total           
  time   seconds   seconds    calls  ms/call  ms/call  name    
  37.10      6.27     6.27       11   570.30   918.18  calcmuen
  21.74      9.95     3.68  1320000     0.00     0.00  diff
  10.24     11.68     1.73       10   173.09   173.09  calclux
   9.44     13.28     1.60                             main
   8.22     14.67     1.39       10   139.07   139.07  calcluy
   7.16     15.88     1.21       10   121.06   121.06  calcluz
   2.01     16.22     0.34  2008844     0.00     0.00  simpint
   1.42     16.46     0.24       11    21.83    28.64  calcrms
   1.30     16.68     0.22       11    20.01    26.82  calcnorm
   0.83     16.82     0.14        1   140.07   140.07  initpsi
   0.59     16.92     0.10        6    16.68    23.45  outpsixy
   0.00     16.92     0.00       17     0.00     0.00  cfg_read
   0.00     16.92     0.00       12     0.00     0.00  alloc_double_vector
   0.00     16.92     0.00       12     0.00     0.00  free_double_vector
   0.00     16.92     0.00        7     0.00     0.00  alloc_double_matrix

The functions are sorted by their run times and by their number of calls (both in decreasing order), and then alphabetically by name.

The call-graph shows, for each function, which other functions called it, which other functions it called, and how many times. There is also an estimate of how much time was spent in the subroutines of each function. This can suggest places where one might try optimize the code by eliminating function calls that use a lot of time (by in-lining them). Bellow is a sample call-graph from a small example program:

                    Call graph (explanation follows)
                    
 
 granularity: each sample hit covers 2 byte(s) for 0.06% of 16.92 seconds
 
 index % time    self  children    called     name
                                                  <spontaneous>
 [1]    100.0    1.60   15.32                 main [1]
                 6.27    3.83      11/11          calcmuen [2]
                 1.73    0.00      10/10          calclux [4]
                 1.39    0.00      10/10          calcluy [5]
                 1.21    0.00      10/10          calcluz [6]
                 0.24    0.07      11/11          calcrms [8]
                 0.22    0.07      11/11          calcnorm [9]
                 0.10    0.04       6/6           outpsixy [10]
                 0.14    0.00       1/1           initpsi [11]
                 0.00    0.00      12/12          alloc_double_vector [13]
                 0.00    0.00      12/12          free_double_vector [14]
                 0.00    0.00       7/7           alloc_double_matrix [15]
                 0.00    0.00       7/7           free_double_matrix [16]
                 0.00    0.00       4/4           alloc_double_tensor [17]
                 0.00    0.00       4/4           free_double_tensor [18]
                 0.00    0.00       2/2           gettime [19]
                 0.00    0.00       1/1           cfg_init [20]
                 0.00    0.00       1/1           readpar [22]
                 0.00    0.00       1/1           gencoef [21]
                 0.00    0.00       1/1           writepsi [23]
 -----------------------------------------------
                 6.27    3.83      11/11          main [1]
 [2]     59.7    6.27    3.83      11         calcmuen [2]
                 3.68    0.00 1320000/1320000     diff [3]
                 0.15    0.00  884422/2008844     simpint [7]
 -----------------------------------------------
                 3.68    0.00 1320000/1320000     calcmuen [2]
 [3]     21.7    3.68    0.00 1320000         diff [3]
 ...

Sun Studio performance tools

Intel VTune

Section contributed by IICT-BAS

Intel VTune Amplifier is a performance analysis application for both 32-bit and 64-bit x86 based machines. It can profile C, C++ and Fortran code and can be used for both serial and threaded application. Additionaly VTune is available as an MPI-enabled version.

To use VTune you must compile your code using the -g compiler flag in order to generate the debug information needed for VTune.

VTune provides several tools for analysis:

Software sampling
A feature limited only for x86 compatible processors. It can be used to generate the time spent on different locations along with the call stack, thus identifying hotspots that which require optimization.
Locks and waits analysis
Used for locating long synchronization waits. Useful for detecting underutilization.
Threading timeline
A feature that allows the indetification of synchronization issues.
Source view
Visualizes the sampling times results mapped to the the source code. Can also be drilled-down to assembly code.
Hardware event sampling
A feature limited to Intel processors, because it uses the chip performance monitoring unit. Quite useful for minimizing cache misses and branch mispredictions.
PTU (Performance Tuning Utility)
An additional application, that grants access to experimental performance analysis tools.

D8.2 related materials

http://hpseewiki.ipb.ac.rs/index.php/Optimization_techniques_for_scalability#Profilers

Personal tools