- Why is my code slow?
- Where are the 'hot-spots'?
- Linux 'time' command -- overall program run time (see the example after this list)
- gprof (automatic instrumentation of compiled code)
- tracers (strace, ltrace, etc.)
- library built-in measurements (e.g. MPI library)
- oprofile
- perf
- Open|SpeedShop (the subject of this tutorial)
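Before reaching for a full profiler, the 'time' command gives a coarse first cut. A minimal example (program name and timings are illustrative):

  time ./myprog input.dat
  # real  0m12.48s   <- wall-clock time
  # user  0m11.90s   <- CPU time spent in user space
  # sys   0m0.31s    <- CPU time spent in the kernel
  # user+sys well below real suggests waiting on I/O or contention, not CPU work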
Measurement Strategies
- Statistical Sampling
- periodically check where execution is and record location
- aggregate locations
- Event Sampling
- sample based on some 'event', such as function calls
- events are timestamped
- provides a more detailed level of information
- higher overhead, dependent upon program activity
- Hybrid of both
Open|SpeedShop
- Supports both statistical and event sampling
- Results easily viewed in GUI
- Run data is stored in sqlite db files
- Can record 'what changed' between runs -- stated but not demonstrated. This is an important feature, since it enables a methodical, measure-after-every-change optimization workflow.
- Each run is known as "an experiment".
- Remember to compile the code with -g; otherwise samples cannot be mapped back to source lines and the trace information will not be usable.
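For example, a typical build that keeps optimization while adding debug symbols (file and program names are placeholders):

  # -g adds the debug symbols that let samples map back to source lines;
  # keep your usual -O2 so you profile the code you actually run
  gcc -g -O2 -o myprog myprog.c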
Measurement Modes
The "Flat Profile"
- Answers the question "Where does the code consume the most time?"
- Periodic sampling of program counter and stack trace
- Usage: osspcsamp "myprog myargs" # Quotes required
- Alternate: osspcsamp "myprog myargs" [low | high | default | samplesPerSecond ]
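A minimal sketch of a flat-profile session (program name is a placeholder; the convenience script prints the actual name of the .openss database it writes, assumed here to be myprog-pcsamp.openss):

  osspcsamp "./myprog input.dat"        # default rate, 100 samples/sec
  osspcsamp "./myprog input.dat" high   # denser sampling for short runs
  openss -f myprog-pcsamp.openss        # inspect the results in the GUI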
Inclusive/Exclusive Timings
- Adds stack traces to the flat-profile information, but is still sampling-based (unlike gprof, which instruments the code)
- Exclusive timing -- time spent inside function, but not children
- Inclusive timing -- time spent inside function AND functions it calls
- Usage: ossusertime "myprog myargs"
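A sketch of a call-path run (same placeholder program and database naming as above). A function with high inclusive but low exclusive time is spending its time in its callees, which tells you where to descend next:

  ossusertime "./myprog input.dat"
  openss -f myprog-usertime.openss   # GUI shows exclusive/inclusive columns and hot call paths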
Hardware Counters
Intel/AMD x86 processors contain a number of hardware-supported counters that can be used to analyse program performance. A few examples are:
- Data cache miss counts (separate L1/L2 counters)
- Data cache access counter (L1)
- Cycles where floating point units (FPU) are idle
- Cycles where no instruction is issued
- Mis-predicted branches
- Number of FPU instructions
- Number of load instructions
- Number of SIMD instructions
- Number of hardware interrupts
- Number of translation lookaside buffer misses
- even more ...
- For time-periodic sampling:
- Usage: osshwcsamp "myprog myargs" countername, countername [ rate ]
- For event (i.e. counter reaches threshold) sampling:
- Usage: osshwc "myprog myargs" countername, counterthreshold [ rate ]
- For event sampling with inclusive/exclusive data:
- Usage: osshwctime "myprog myargs" countername, counterthreshold [ rate ]
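For example, to look at L1 data-cache behaviour (the counter names are standard PAPI presets; the rate and threshold values are arbitrary starting points, not recommendations):

  # periodic sampling of two counters at once
  osshwcsamp "./myprog input.dat" PAPI_L1_DCM,PAPI_TOT_CYC 100

  # overflow-based: attribute every 10000th L1 data-cache miss to a source line
  osshwc "./myprog input.dat" PAPI_L1_DCM 10000

  # the same, but with inclusive/exclusive call-path attribution
  osshwctime "./myprog input.dat" PAPI_L1_DCM 10000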
Putting it all together
- Where do I spend my time?
- Flat profiles (pcsamp)
- Get inclusive/exclusive times (usertime)
- Identify hot call paths (usertime)
- How do I analyze cache performance?
- Measure memory performance using hardware counters (hwc)
- Compare to flat profiles (custom comparison)
- Compare multiple hardware counters (hwc, hwcsamp)
- How do I identify I/O issues?
- Look for time spent in I/O routines (io)
- Compare runs under different I/O scenarios (custom comparison)
- How do I identify parallel inefficiencies?
- Study time spent in MPI or OpenMP routines
- Look for load imbalance (LB view) and outliers (CA view)
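As a sketch, this workflow from a shell (filenames are assumptions; osscompare is the convenience script behind the 'custom comparison' steps above):

  osspcsamp "./myprog input.dat"                             # 1. find the hot functions
  ossusertime "./myprog input.dat"                           # 2. find the hot call paths
  osshwcsamp "./myprog input.dat" PAPI_L1_DCM,PAPI_TOT_CYC   # 3. check cache behaviour
  # ...edit, rebuild, rerun the experiment...
  osscompare "myprog-pcsamp.openss,myprog-pcsamp-1.openss"   # 4. did the change help?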
Environment
I have installed Open|SpeedShop locally. To use it, add the following to your .bash_profile:
useopenss()
{
    # prepend the local Open|SpeedShop install to the search paths
    export PATH=/home/sandboxes/jbrandt/Openss-install/bin:$PATH
    export LD_LIBRARY_PATH=/home/sandboxes/jbrandt/Openss-install/lib64:$LD_LIBRARY_PATH
}
Then, to enable it, just type useopenss (or place a call to useopenss after its definition in your .bash_profile).
Summary of Possible Experiment Types
Note: to run an experiment from the command line, prefix its name with 'oss' (e.g. pcsamp becomes osspcsamp).
| Experiment | Clues | Data Collected | Technique |
| pcsamp | High user CPU time. | Actual CPU time at the source line, machine instruction, and function levels, gathered by sampling the program counter at 100 samples per second. | Program Counter Sampling |
| usertime | Slow program, nothing else known. Not CPU-bound. | Inclusive and exclusive CPU time for each function, gathered by sampling the call stack at 35 samples per second. | Call Stack Sampling |
| hwc | High user CPU time. | Counts at the source line, machine instruction, and function levels of various hardware events, including: clock cycles, graduated instructions, primary instruction cache misses, secondary instruction cache misses, primary data cache misses, secondary data cache misses, translation lookaside buffer (TLB) misses, and graduated floating-point instructions. Hardware counters are read when a predefined count threshold is reached (overflows). | Hardware Counter Overflow |
| hwcsamp | High user CPU time. | Similar to the hwc experiment, except that periodic sampling is used instead of the overflow mechanism. | Hardware Counter Sampling |
| hwctime | High user CPU time. | Similar to the hwc experiment, except that call stack sampling is used, so call paths are available along with the event counts. | Hardware Counter Sampling |
| io | I/O-bound. | Times the following I/O system calls: read, readv, write, writev, open, close, dup, pipe, creat. The time reported is wall-clock time. | I/O Function Tracing |
| iot | I/O-bound. | Traces and times the same I/O system calls as io, recording the details of each individual call. The time reported is wall-clock time. | I/O Function Tracing |
| fpe | High system time. Presence of floating-point operations. | All floating-point exceptions, with the exception type and the call stack at the time of the exception. | Hardware Counter Event Trigger |
| mpi ... | MPI variants | Similar to the above, but for MPI multi-node processes. | |
This is a list of the x86 performance counters available with the hwc, hwcsamp, and hwctime experiments. These are standard PAPI preset events; additional counters and information are available in the PAPI documentation.
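Not every preset is implemented on every CPU. If the PAPI utilities are installed, papi_avail (part of PAPI itself, not Open|SpeedShop) reports what the hardware supports:

  papi_avail      # list all PAPI preset events with their availability
  papi_avail -a   # show only the presets this CPU actually supports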
| Name | Meaning |
| PAPI_L1_DCM | Level 1 data cache misses |
| PAPI_L1_ICM | Level 1 instruction cache misses |
| PAPI_L2_ICM | Level 2 instruction cache misses |
| PAPI_L2_TCM | Level 2 cache misses |
| PAPI_L3_TCM | Level 3 cache misses |
| PAPI_L3_LDM | Level 3 load misses |
| PAPI_TLB_DM | Data translation lookaside buffer misses |
| PAPI_TLB_IM | Instruction translation lookaside buffer misses |
| PAPI_L1_LDM | Level 1 load misses |
| PAPI_L1_STM | Level 1 store misses |
| PAPI_L2_LDM | Level 2 load misses |
| PAPI_L2_STM | Level 2 store misses |
| PAPI_BR_UCN | Unconditional branch instructions |
| PAPI_BR_CN | Conditional branch instructions |
| PAPI_BR_TKN | Conditional branch instructions taken |
| PAPI_BR_MSP | Conditional branch instructions mispredicted |
| PAPI_TOT_IIS | Instructions issued |
| PAPI_TOT_INS | Instructions completed |
| PAPI_FP_INS | Floating point instructions |
| PAPI_LD_INS | Load instructions |
| PAPI_SR_INS | Store instructions |
| PAPI_BR_INS | Branch instructions |
| PAPI_RES_STL | Cycles stalled on any resource |
| PAPI_TOT_CYC | Total cycles |
| PAPI_L2_DCA | Level 2 data cache accesses |
| PAPI_L2_DCR | Level 2 data cache reads |
| PAPI_L3_DCR | Level 3 data cache reads |
| PAPI_L2_DCW | Level 2 data cache writes |
| PAPI_L3_DCW | Level 3 data cache writes |
| PAPI_L1_ICH | Level 1 instruction cache hits |
| PAPI_L2_ICH | Level 2 instruction cache hits |
| PAPI_L1_ICA | Level 1 instruction cache accesses |
| PAPI_L2_ICA | Level 2 instruction cache accesses |
| PAPI_L3_ICA | Level 3 instruction cache accesses |
| PAPI_L1_ICR | Level 1 instruction cache reads |
| PAPI_L2_ICR | Level 2 instruction cache reads |
| PAPI_L3_ICR | Level 3 instruction cache reads |
| PAPI_L2_TCA | Level 2 total cache accesses |
| PAPI_L3_TCA | Level 3 total cache accesses |
| PAPI_L2_TCW | Level 2 total cache writes |
| PAPI_L3_TCW | Level 3 total cache writes |
| PAPI_VEC_SP | Single precision vector/SIMD instructions |
| PAPI_VEC_DP | Double precision vector/SIMD instructions |
| PAPI_REF_CYC | Reference clock cycles |
--
JoeBrandt - 2012-11-27