Performance Analysis

  • Why is my code slow?
  • Where are the 'hot-spots'?


Tools

  • Linux 'time' command -- overall program run time
  • gprof (automatic instrumentation of compiled code)
  • tracers (strace, ltrace, etc.)
  • library built-in measurements (e.g. MPI libraries)
  • oprofile
  • perf
  • Open Speed Shop (the subject of this tutorial)
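As a quick sketch of the first tool above: the shell's built-in time keyword gives a one-line breakdown of wall clock, user, and system time. Here sleep stands in for a real workload; substitute your own program.

```shell
# Time a stand-in workload (replace `sleep 0.2` with your own command line).
# `real` is wall clock time; `user`/`sys` are CPU time in user and kernel mode.
time sleep 0.2
```

If user + sys is far below real, the program is probably waiting (on I/O, the network, or other processes) rather than burning CPU -- a useful first clue before reaching for a profiler.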

Measurement Strategies

  • Statistical Sampling
    • periodically check where execution is and record location
    • aggregate locations
  • Event Sampling
    • sample based on some 'event', such as function calls
    • events are timestamped
    • provides a more detailed level of information
    • higher overhead, dependent upon program activity
  • Hybrid of both

Open Speed Shop

  • Supports both statistical and event sampling
  • Results easily viewed in GUI
  • Run data is stored in sqlite db files
  • Can record 'what changed' between runs -- stated in the tutorial but not demonstrated. This is an important feature, since it enables a methodical, measure-and-compare approach to optimization.
  • Each run is known as an "experiment".
  • Remember to compile code with -g, otherwise trace information cannot be mapped back to source lines.

Measurement Modes

The "Flat Profile"

  • Answers the question "Where does the code consume the most time?"
  • Periodic sampling of program counter and stack trace
  • Usage: osspcsamp "myprog myargs" # Quotes required
  • Alternate: osspcsamp "myprog myargs" [ low | high | default | samplesPerSecond ] -- controls the sampling rate (default is 100 samples per second)
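A sketch of a full pcsamp session, guarded so it degrades gracefully where Open Speed Shop is not installed. The program name and database file name are placeholders; the actual .openss file name is printed when the experiment finishes.

```shell
if command -v osspcsamp >/dev/null 2>&1; then
  # Run the flat-profile experiment; the quoted string is the full command line.
  osspcsamp "./myprog myargs"
  # Results land in an SQLite database file (e.g. myprog-pcsamp.openss),
  # which can be reopened later in the GUI:
  #   openss -f myprog-pcsamp.openss
else
  echo "osspcsamp not found; see the install notes below"
fi
```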

Inclusive/Exclusive Timings

  • Adds stack traces to the flat profile information, but is still sampling-based (unlike gprof, which instruments the code)
  • Exclusive timing -- time spent inside function, but not children
  • Inclusive timing -- time spent inside function AND functions it calls
  • Usage: ossusertime "myprog myargs"

Hardware Counters

The Intel/AMD ia32 processors provide a number of hardware-supported counters that can be used to analyse program performance. A few examples:
  • Data cache miss counts (separate L1/L2 counters)
  • Data cache access counter (L1)
  • Cycles where floating point units (FPU) are idle
  • Cycles where no instruction is issued
  • Mis-predicted branches
  • Number of FPU instructions
  • Number of load instructions
  • Number of SIMD instructions
  • Number of hardware interrupts
  • Number of translation lookaside buffer misses
  • even more ...
  • For time-periodic sampling:
    • Usage: osshwcsamp "myprog myargs" countername, countername [ rate ]
  • For event (i.e. counter reaches threshold) sampling:
    • Usage: osshwc "myprog myargs" countername, counterthreshold [ rate ]
  • For event sampling with inclusive/exclusive data:
    • Usage: osshwctime "myprog myargs" countername, counterthreshold [ rate ]
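A sketch of a hardware-counter sampling run, assuming the PAPI preset names listed at the end of this page are supported on the machine (guarded in case Open Speed Shop is absent; the program name is a placeholder).

```shell
if command -v osshwcsamp >/dev/null 2>&1; then
  # Sample L1 data cache misses and total cycles together,
  # at the default sampling rate:
  osshwcsamp "./myprog myargs" PAPI_L1_DCM,PAPI_TOT_CYC
else
  echo "osshwcsamp not found; command shown for reference only"
fi
```

Pairing a miss counter with PAPI_TOT_CYC lets you judge whether the misses are frequent enough to matter relative to the total work done.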

Putting it all together

  • Where do I spend my time?
    • Flat profiles (pcsamp)
    • Get inclusive/exclusive times (usertime)
    • Identify hot call paths (usertime)
  • How do I analyze cache performance?
    • Measure memory performance using hardware counters (hwc)
    • Compare to flat profiles (custom comparison)
    • Compare multiple hardware counters (hwc, hwcsamp)
  • How to identify I/O issues?
    • Look for time spent in I/O routines (io)
    • Compare runs under different I/O scenarios (custom comparison)
  • How do I identify parallel inefficiencies?
    • Study time spent in MPI or OpenMP routines
    • Look for load imbalance (LB view) and outliers (CA view)
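The methodology above can be strung together as one session. This is a hedged sketch: the program name and database file names are placeholders, and osscompare (Open Speed Shop's convenience script for the 'custom comparison' steps) should be checked against your installed version.

```shell
if command -v osspcsamp >/dev/null 2>&1; then
  osspcsamp "./myprog myargs"               # step 1: where does the time go?
  ossusertime "./myprog myargs"             # step 2: hot call paths, incl./excl. times
  osshwcsamp "./myprog myargs" PAPI_L1_DCM  # step 3: is it the cache?
  # After a code change, compare the before/after database files
  # (file names are illustrative; osscompare takes a comma-separated pair):
  #   osscompare "myprog-pcsamp.openss,myprog-pcsamp-1.openss"
else
  echo "Open Speed Shop tools not found; session shown for reference only"
fi
```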


I have installed Open Speed Shop locally. To use it, add the following to your .bash_profile:
      export PATH=/home/sandboxes/jbrandt/Openss-install/bin:$PATH
      export LD_LIBRARY_PATH=/home/sandboxes/jbrandt/Openss-install/lib64:$LD_LIBRARY_PATH

Then, to use it, just type useopenss (or place that command after the definition of useopenss in your .bash_profile).

Summary of Possible Experiment Types

Note: to run from command line, prefix with 'oss'.

For each experiment type: the clues that suggest using it, the data collected, and the underlying measurement technique.

pcsamp
  Clues: High user CPU time.
  Data collected: Actual CPU time at the source line, machine instruction, and function levels, by sampling the program counter at 100 samples per second.
  Technique: Program Counter Sampling

usertime
  Clues: Slow program, nothing else known; not CPU-bound.
  Data collected: Inclusive and exclusive CPU time for each function, by sampling the call stack at 35 samples per second.
  Technique: Call Stack Sampling

hwc
  Clues: High user CPU time.
  Data collected: Counts, at the source line, machine instruction, and function levels, of various hardware events, including clock cycles, graduated instructions, primary and secondary instruction cache misses, primary and secondary data cache misses, translation lookaside buffer (TLB) misses, and graduated floating-point instructions. Counters are read when a predefined count threshold is reached (overflow).
  Technique: Hardware Counter Overflow

hwcsamp
  Clues: High user CPU time.
  Data collected: Similar to hwc, except that periodic sampling is used instead of the overflow mechanism.
  Technique: Hardware Counter Sampling

hwctime
  Clues: High user CPU time.
  Data collected: Similar to hwc, except that call stack sampling is used, so call paths are available along with the event counts.
  Technique: Hardware Counter Sampling

io
  Clues: I/O-bound.
  Data collected: Times the following I/O system calls: read, readv, write, writev, open, close, dup, pipe, creat. The time reported is wall clock time.
  Technique: I/O Function Tracing

iot
  Clues: I/O-bound.
  Data collected: Traces and times the same I/O system calls as io, recording per-call detail.
  Technique: I/O Function Tracing

fpe
  Clues: High system time; presence of floating point operations.
  Data collected: All floating-point exceptions, with the exception type and the call stack at the time of the exception.
  Technique: Hardware Counter Event Trigger

mpi ... (MPI variants)
  Similar to the above, but for multi-node MPI processes.

Performance Counters

This is a list of the x86 performance counters (PAPI preset events) available with the hwc, hwcsamp, and hwctime experiments. Additional counters and information are available in the PAPI documentation.

Name Meaning
PAPI_L1_DCM Level 1 data cache misses
PAPI_L1_ICM Level 1 instruction cache misses
PAPI_L2_ICM Level 2 instruction cache misses
PAPI_L2_TCM Level 2 cache misses
PAPI_L3_TCM Level 3 cache misses
PAPI_L3_LDM Level 3 load misses
PAPI_TLB_DM Data translation lookaside buffer misses
PAPI_TLB_IM Instruction translation lookaside buffer misses
PAPI_L1_LDM Level 1 load misses
PAPI_L1_STM Level 1 store misses
PAPI_L2_LDM Level 2 load misses
PAPI_L2_STM Level 2 store misses
PAPI_BR_UCN Unconditional branch instructions
PAPI_BR_CN Conditional branch instructions
PAPI_BR_TKN Conditional branch instructions taken
PAPI_BR_MSP Conditional branch instructions mispredicted
PAPI_TOT_IIS Instructions issued
PAPI_TOT_INS Instructions completed
PAPI_FP_INS Floating point instructions
PAPI_LD_INS Load instructions
PAPI_SR_INS Store instructions
PAPI_BR_INS Branch instructions
PAPI_RES_STL Cycles stalled on any resource
PAPI_TOT_CYC Total cycles
PAPI_L2_DCA Level 2 data cache accesses
PAPI_L2_DCR Level 2 data cache reads
PAPI_L3_DCR Level 3 data cache reads
PAPI_L2_DCW Level 2 data cache writes
PAPI_L3_DCW Level 3 data cache writes
PAPI_L1_ICH Level 1 instruction cache hits
PAPI_L2_ICH Level 2 instruction cache hits
PAPI_L1_ICA Level 1 instruction cache accesses
PAPI_L2_ICA Level 2 instruction cache accesses
PAPI_L3_ICA Level 3 instruction cache accesses
PAPI_L1_ICR Level 1 instruction cache reads
PAPI_L2_ICR Level 2 instruction cache reads
PAPI_L3_ICR Level 3 instruction cache reads
PAPI_L2_TCA Level 2 total cache accesses
PAPI_L3_TCA Level 3 total cache accesses
PAPI_L2_TCW Level 2 total cache writes
PAPI_L3_TCW Level 3 total cache writes
PAPI_VEC_SP Single precision vector/SIMD instructions
PAPI_VEC_DP Double precision vector/SIMD instructions
PAPI_REF_CYC Reference clock cycles
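Not every preset in the table is implemented on every CPU. PAPI's papi_avail utility (shipped with PAPI, which the hwc experiments are built on) reports which presets actually work locally -- a sketch, guarded in case PAPI's command-line tools are not on your PATH.

```shell
if command -v papi_avail >/dev/null 2>&1; then
  # List only the preset events that are actually available on this machine.
  papi_avail -a
else
  echo "papi_avail not found; install PAPI's utilities to check counter support"
fi
```

Checking availability first avoids confusing experiment failures when a counter name from the table is simply unsupported on the local hardware.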

-- JoeBrandt - 2012-11-27
Topic revision: r3 - 2013-01-03, JoeBrandt