Benchmarking

difxspeed: a new benchmarking tool

A new tool, difxspeed (attached to this wiki page), has been developed to facilitate benchmarking and parameter optimization of DiFX. The program is given a file describing the tests to perform, which looks like the following:

datastreams = node-2,node-2,node-2,node-2,node-2
cores = node-4,node-5
antennas = BR,FD,HN,KP

nAnt = 4
nCore = 2
nThread = 3
tInt = 1
specRes = 0.125
fftSpecRes = 0.125
xmacLength = 16,32,64,128
strideLength = 8,16,32
numBufferedFFTs = 1,2,4,8

vex = dq224.vex

This file must end with the extension .difxspeed . The program takes the given .vex file, creates a series of .v2d files (one for each configuration described by the parameter options), creates the DiFX file sets, and executes them. For the above example, DiFX will be invoked 49 times. The first is a "dummy" run, used to make sure all of the equipment is "warmed up" (this can make a difference); it is followed by one run for each of the 48 combinations (4 × 3 × 4) of xmacLength, strideLength, and numBufferedFFTs listed in the file. The jobs are always run in "fake" mode, in which no data is actually read; instead, randomly generated data is pushed through the system. Most of the parameters in this file match those of a .v2d file.
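The run count can be checked with a minimal sketch that enumerates the swept parameters. This is not difxspeed's own code; the variable and function names are hypothetical, and the ordering (xmacLength varying fastest, numBufferedFFTs slowest) is simply chosen to match the row order of the sample output on this page.

```python
from itertools import product

# Swept values copied from the example .difxspeed file above.
xmac_lengths = [16, 32, 64, 128]
stride_lengths = [8, 16, 32]
num_buffered_ffts = [1, 2, 4, 8]


def enumerate_runs():
    """Return (xmacLength, strideLength, numBufferedFFTs) tuples.

    itertools.product varies its last argument fastest, so the slowest
    parameter is listed first and the tuple is reordered afterwards.
    """
    return [
        (x, s, n)
        for n, s, x in product(num_buffered_ffts, stride_lengths, xmac_lengths)
    ]


runs = enumerate_runs()
total_invocations = 1 + len(runs)  # one extra "dummy" warm-up run
print(len(runs), "combinations,", total_invocations, "DiFX invocations")
```

Adding a value to any of the three lists multiplies the number of runs accordingly, which is worth keeping in mind given that each run here takes over two minutes.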

Running this set of benchmarks will produce an output file (ending in .out ) which may resemble:

# Cluster definition:
# datastreams = node-2 node-2 node-2 node-2 node-2
# cores = node-4 node-5
# antennas = BR FD HN KP

# Fixed parameters:
# tInt = 1
# nCore = 2
# nThread = 3
# vex = dq224.vex
# fftSpecRes = 0.125
# nAnt = 4
# specRes = 0.125

# Table columns:
# 1  xmacLength
# 2  strideLength
# 3  numBufferedFFTs
# 4  Average execute time (seconds)
# 5  RMS of execute times (seconds)
# 6...  Individual execute times (seconds)
16 8 1  173.316333 11.533935  167.477000 170.374000 198.990000 168.518000 167.000000 167.539000  # dq2-dummy
16 8 1  168.994833 1.587027  168.248000 169.449000 166.009000 170.317000 169.032000 170.914000  # dq2-001
32 8 1  153.121833 1.754966  152.973000 150.302000 153.099000 155.327000 151.882000 155.148000  # dq2-002
64 8 1  144.864333 0.820851  145.432000 144.675000 145.414000 143.498000 144.241000 145.926000  # dq2-003
128 8 1  140.450833 1.595531  141.689000 138.555000 142.672000 138.522000 141.478000 139.789000  # dq2-004
16 16 1  161.438333 1.277240  162.031000 159.841000 162.904000 162.690000 159.659000 161.505000  # dq2-005
32 16 1  147.192833 1.641706  147.618000 147.069000 149.917000 145.065000 148.090000 145.398000  # dq2-006
64 16 1  139.531167 0.702910  138.658000 139.798000 140.628000 138.639000 139.572000 139.892000  # dq2-007
128 16 1  135.296833 2.163581  139.098000 132.751000 136.408000 135.570000 135.064000 132.890000  # dq2-008
16 32 1  162.137833 2.020091  165.909000 160.574000 160.380000 160.250000 163.111000 162.603000  # dq2-009
32 32 1  146.173000 0.869731  145.583000 146.854000 146.342000 144.651000 147.345000 146.263000  # dq2-010
64 32 1  138.041000 1.054704  138.733000 137.355000 137.407000 140.074000 137.020000 137.657000  # dq2-011
128 32 1  135.441333 1.175489  136.056000 135.924000 134.921000 134.397000 133.913000 137.437000  # dq2-012
16 8 2  162.960333 1.872134  160.249000 165.464000 164.852000 161.855000 163.776000 161.566000  # dq2-013
32 8 2  147.695500 1.031172  148.084000 147.786000 149.308000 146.001000 148.076000 146.918000  # dq2-014
64 8 2  140.857667 1.283726  139.261000 140.713000 142.267000 142.081000 139.101000 141.723000  # dq2-015
128 8 2  137.169333 2.135411  138.614000 135.088000 141.125000 135.113000 136.904000 136.172000  # dq2-016
16 16 2  155.763167 0.947386  155.550000 156.038000 155.843000 154.792000 157.587000 154.769000  # dq2-017
32 16 2  141.416167 0.743320  142.142000 140.400000 142.456000 140.687000 141.673000 141.139000  # dq2-018
64 16 2  134.592333 1.162151  135.721000 135.289000 132.636000 133.523000 134.621000 135.764000  # dq2-019
128 16 2  129.878000 1.714953  131.755000 127.032000 131.530000 130.647000 130.060000 128.244000  # dq2-020
16 32 2  154.156167 0.701789  153.809000 153.519000 154.926000 154.784000 153.126000 154.773000  # dq2-021
32 32 2  142.544500 0.924184  142.413000 141.210000 141.808000 143.850000 142.411000 143.575000  # dq2-022
64 32 2  134.336167 1.294484  132.663000 132.550000 134.623000 135.440000 135.929000 134.812000  # dq2-023
128 32 2  129.712667 1.545513  129.996000 129.518000 131.904000 126.714000 130.356000 129.788000  # dq2-024
16 8 4  159.267000 1.079145  157.869000 158.827000 158.232000 159.909000 159.724000 161.041000  # dq2-025
32 8 4  145.291667 0.853220  145.022000 146.220000 145.935000 145.178000 145.770000 143.625000  # dq2-026
64 8 4  137.506167 1.635699  135.677000 138.278000 135.558000 138.338000 140.215000 136.971000  # dq2-027
128 8 4  132.648500 1.874152  136.052000 132.965000 131.729000 130.824000 130.621000 133.700000  # dq2-028
16 16 4  152.580833 1.330866  151.177000 153.808000 154.512000 152.321000 150.776000 152.891000  # dq2-029
32 16 4  142.534833 5.768529  138.042000 154.916000 139.289000 141.397000 142.808000 138.757000  # dq2-030
64 16 4  131.260500 1.768340  133.423000 133.130000 131.366000 131.395000 129.972000 128.277000  # dq2-031
128 16 4  127.708833 1.688246  124.451000 127.947000 126.891000 129.732000 128.723000 128.509000  # dq2-032
16 32 4  153.663333 0.864594  153.347000 154.816000 152.001000 153.753000 154.115000 153.948000  # dq2-033
32 32 4  139.584833 1.800902  137.971000 141.756000 137.390000 140.782000 138.103000 141.507000  # dq2-034
64 32 4  131.773500 1.885981  132.505000 130.985000 132.531000 131.846000 128.280000 134.494000  # dq2-035
128 32 4  127.477500 1.145064  129.405000 127.688000 126.900000 125.946000 126.618000 128.308000  # dq2-036
16 8 8  157.735500 0.889988  157.684000 156.673000 158.945000 158.361000 156.507000 158.243000  # dq2-037
32 8 8  143.670333 1.359781  145.601000 142.081000 141.747000 143.766000 144.531000 144.296000  # dq2-038
64 8 8  137.096833 1.405056  135.903000 139.766000 136.591000 135.516000 137.823000 136.982000  # dq2-039
128 8 8  131.898000 0.909325  132.055000 132.183000 130.984000 132.989000 130.440000 132.737000  # dq2-040
16 16 8  152.148000 0.789127  152.105000 153.515000 151.217000 151.503000 151.749000 152.799000  # dq2-041
32 16 8  138.139167 1.617144  139.700000 136.463000 139.164000 140.204000 137.247000 136.057000  # dq2-042
64 16 8  131.431000 1.262607  130.196000 132.366000 130.705000 133.447000 130.441000  # dq2-043
128 16 8  126.720167 1.586434  124.873000 127.076000 124.452000 127.404000 127.477000 129.039000  # dq2-044
16 32 8  151.147500 1.748858  152.798000 152.331000 149.240000 149.135000 153.415000 149.966000  # dq2-045
32 32 8  138.268167 0.945924  138.183000 138.132000 137.222000 139.771000 137.160000 139.141000  # dq2-046
64 32 8  130.166500 1.286886  128.630000 131.164000 130.814000 131.897000 130.106000 128.388000  # dq2-047
128 32 8  125.817333 1.249397  124.964000 125.225000 125.255000 124.508000 128.090000 126.862000  # dq2-048

The above file is actually the result of running the same benchmarking script 6 times. The run times are extracted from the .difxlog files, which are not erased after each run, so this system can be used to build up a history of benchmarking results. The first set of columns in this file holds the three parameters that are allowed to vary in this case. The next two are the average and RMS of the run times, followed by the individual trial run times. A comment at the end of each line indicates the DiFX file set associated with each run.
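For post-processing, a row of this table can be parsed with a few lines of Python. The sketch below is an illustration, not part of difxspeed, and its function names are hypothetical. It assumes exactly three varied parameters, as in this example. Recomputing the statistics from the individual times suggests that the "RMS" column is the scatter about the mean with division by the number of trials (a population standard deviation); that is an inference from the numbers above, not something the tool documents here.

```python
import math


def parse_result_line(line):
    """Split one table row into parameters, reported stats, times, and label."""
    data, _, comment = line.partition("#")
    fields = data.split()
    params = [int(v) for v in fields[:3]]        # assumes three varied parameters
    reported_mean = float(fields[3])
    reported_rms = float(fields[4])
    times = [float(v) for v in fields[5:]]       # may hold fewer than 6 entries
    return params, reported_mean, reported_rms, times, comment.strip()


def mean_and_rms(times):
    """Mean and RMS deviation about the mean (divide by N, not N - 1)."""
    mean = sum(times) / len(times)
    rms = math.sqrt(sum((t - mean) ** 2 for t in times) / len(times))
    return mean, rms


line = ("64 16 8  131.431000 1.262607  130.196000 132.366000 "
        "130.705000 133.447000 130.441000  # dq2-043")
params, rep_mean, rep_rms, times, label = parse_result_line(line)
mean, rms = mean_and_rms(times)
print(label, params, len(times), "trials", round(mean, 6), round(rms, 6))
```

Counting the entries in `times` is also a simple way to spot rows where a trial is missing.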

Some notes of interest:
  • Once in a while (e.g., dq2-043 in the above output) the full runtime seems not to be captured. This is being investigated. The problem appears to be in the capture of the multicast messages by the difxlog program.
  • The files produced invoke "singleScan=false", so a multiple scan .vex file can be used.
  • Things will work poorly (if at all) if more than 1 setup is used within the .vex file.

Test cluster benchmarking 2012 May 30

Test cluster

The test cluster consists of:
  • 2x "service nodes" with access to the outside world
  • 5x "DiFX nodes"
    • 3x 2.6 GHz Sandy Bridge dual 8-core CPUs + 16 GB RAM (node1,node2,node3)
    • 2x 2.9 GHz Sandy Bridge dual 8-core CPUs + 16 GB RAM (node4,node5)
  • 1 Gb network for service & multicast
  • 10 Gb network for inter-process communication

Methodology:

Tests were carried out with the following plan:
  1. schedule a real observation with sched
  2. append fake module, clock, eop information to end of the .vex file
    • note: this is not strictly necessary; the .v2d file could do this
  3. generate DiFX jobs
  4. change MODULE to FAKE as datastream source in the .input file
    • this exercises the new "fake" mode
  5. manually construct .threads and .machines files
    • many stations assigned to some nodes to simulate Mark5 output
  6. spawn DiFX with startdifx
  7. use runtime as reported by mpifxcorr
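Step 4 was done by hand, but it could be scripted along the following lines. This is a sketch under an assumption: that each datastream entry in the .input file carries a line of the form "DATA SOURCE:" followed by the source type. Check this against your own .input files before relying on it. The function name is hypothetical.

```python
def use_fake_datastreams(input_text):
    """Return input_text with MODULE replaced by FAKE on DATA SOURCE lines.

    Assumes "KEY: value" lines and a key named "DATA SOURCE"; all other
    lines are passed through unchanged.
    """
    out = []
    for line in input_text.splitlines(keepends=True):
        key, sep, value = line.partition(":")
        if sep and key.strip() == "DATA SOURCE" and value.strip() == "MODULE":
            line = line.replace("MODULE", "FAKE", 1)
        out.append(line)
    return "".join(out)


# Hypothetical excerpt, for illustration only.
sample = (
    "DATA FORMAT:        MARK5B\n"
    "DATA SOURCE:        MODULE\n"
    "FILTERBANK USED:    FALSE\n"
)
converted = use_fake_datastreams(sample)
print(converted)
```

Working on the text in memory and writing it back only after inspection keeps the original .input file recoverable.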

10 station tests:

Used the DQ129 observation as the basis for a 10-station, 256 Mbps test case. A single node was used for processing. Correlated with 2-second averages, full polarization, and 0.25 MHz spectral resolution. Some optimization was attempted.
Test    node    processing rate         cpu speed       ratio (Mbps/GHz)
1       1       201.54 Mbps             2.6 GHz         77.5
2       5       225.25 Mbps             2.9 GHz         77.7

Note the nice scaling with processor speed. My extrapolation to the 30-node system for 10 and 15 stations:
CPU speed       30-node 10-stn          30-node 15-stn
2.6 GHz         6.05 Gbps               2.69 Gbps
2.9 GHz         6.76 Gbps               3.00 Gbps
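The extrapolated figures can be reproduced with the sketch below. The scaling rule is an assumption inferred from the numbers, not stated in the notes: throughput is taken to grow linearly with node count and to fall as the square of the station count (roughly the growth in the number of baselines). The function name is hypothetical.

```python
def extrapolate_gbps(single_node_mbps, n_nodes=30, n_stations=10, ref_stations=10):
    """Scale a measured single-node rate to a larger cluster.

    Assumed model: linear in the number of nodes, inverse-square in the
    number of stations relative to the measured reference case.
    """
    rate_mbps = single_node_mbps * n_nodes * (ref_stations / n_stations) ** 2
    return rate_mbps / 1000.0


# Measured 10-station single-node rates from the table above.
measurements = [(2.6, 201.54), (2.9, 225.25)]  # (CPU GHz, Mbps)

for ghz, mbps in measurements:
    ten = extrapolate_gbps(mbps, n_stations=10)
    fifteen = extrapolate_gbps(mbps, n_stations=15)
    print(f"{ghz} GHz: {mbps / ghz:.1f} Mbps/GHz, "
          f"30-node 10-stn {ten:.2f} Gbps, 15-stn {fifteen:.2f} Gbps")
```

Real clusters rarely scale perfectly linearly, so these numbers are best read as upper estimates.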

15 station tests:

Made a synthetic observation using the PFB personality on 15 VLBA stations. Used 2 processing nodes (node4, node5). 2-second averages, full polarization, and 0.25 MHz spectral resolution.
Test            2-node rate     30-node extrap.
Unoptimized     149 Mbps        2240 Mbps
Optimized       180 Mbps        2700 Mbps

The main change that achieved this was increasing the number of bufferedFFTs to 20 (from 10). Increasing it further gave no more improvement.
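As a quick check of the 15-station table, the sketch below scales the 2-node rates to 30 nodes and computes the gain from tuning. It assumes purely linear scaling with node count; the table values appear to be rounded to three significant figures. The function name is hypothetical.

```python
def scale_to_nodes(rate_mbps, measured_nodes, target_nodes):
    """Linearly extrapolate a measured rate to a different node count."""
    return rate_mbps * target_nodes / measured_nodes


unoptimized_mbps = 149.0  # 2-node rate before tuning
optimized_mbps = 180.0    # 2-node rate after tuning

unopt_30 = scale_to_nodes(unoptimized_mbps, 2, 30)
opt_30 = scale_to_nodes(optimized_mbps, 2, 30)
gain = optimized_mbps / unoptimized_mbps - 1.0

print(f"30-node extrapolation: {unopt_30:.0f} -> {opt_30:.0f} Mbps")
print(f"gain from tuning: {gain:.1%}")
```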

Other experiments

Other tests were tried with surprisingly little impact on run time:
  1. Use of gcc 4.7 rather than gcc 4.4.6 which comes with RHEL6
  2. Turning on Hyperthreading (if anything this made it worse)
  3. Changing subintegration times
  4. Changing from 128 to 100 channels

-- WalterBrisken - 2012-06-19
Topic revision: r3 - 2014-08-28, JamesRobnett