GPU Cluster Network Performance

Introduction

Although the hardware that hosts the GPU cards was not specifically chosen for its network/CPU performance, I decided to do some experiments to quantify it anyway. Each host system has dual 1 Gigabit Ethernet links.

Basic Network Configuration

The default configuration uses only one of the two links, connected to the HP Procurve 3400 switch. This works nicely because it matches the typical Linux workstation configuration. DHCP is used to assign IP addresses and to convey routing and DNS information.
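
For reference, the unbonded default on each host is just a single DHCP-configured interface. A RHEL-style sketch follows; the actual interface-script layout on these hosts is an assumption, not confirmed.

# /etc/sysconfig/network-scripts/ifcfg-eth0 -- single NIC, address via DHCP
DEVICE=eth0
BOOTPROTO=dhcp
ONBOOT=yes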

The HP Procurve switch supports a number of different forms of link aggregation, or 'channel bonding'. The Linux 'bonding' module supports the following modes (a configuration sketch follows the list):
  • Mode 0 - balance-rr: round-robin transmission (for load balancing)
  • Mode 1 - active-backup (for redundancy)
  • Mode 2 - balance-xor
  • Mode 3 - broadcast
  • Mode 4 - 802.3ad dynamic link aggregation (LACP)
  • Mode 5 - balance-tlb (adaptive transmit load balancing)
  • Mode 6 - balance-alb (adaptive load balancing)
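
As a concrete example, a minimal RHEL-style setup for mode 4 looks roughly like the following. This is a sketch based on the stock modprobe.conf/ifcfg conventions of that era, not the actual files used on these hosts, so treat the device names and option values as assumptions. Note that 802.3ad also requires the corresponding switch ports to be configured for LACP.

# /etc/modprobe.conf -- load the bonding driver for bond0 in 802.3ad mode,
# checking link state every 100 ms
alias bond0 bonding
options bond0 mode=4 miimon=100

# /etc/sysconfig/network-scripts/ifcfg-bond0 -- the bond carries the IP config
DEVICE=bond0
BOOTPROTO=dhcp
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 -- each slave just points at the bond
# (ifcfg-eth1 is identical apart from DEVICE)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes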

I decided to try modes 6, 4 and 0. Mode 6 didn't seem to work at all (though I recall seeing different Ethernet addresses on eth0/eth1, so it may have been a configuration error). Modes 4 and 0 worked just fine. I tested throughput using the ttcp program; the results are summarized in the table below, and typical invocations are sketched after it.

Mode Host 1      Mode Host 2      TCP Throughput    UDP Throughput
unbonded         unbonded         114.8 MB/sec      116.8 MB/sec [1]
bonded mode 4    unbonded         105.0 MB/sec      not tested
bonded mode 4    bonded mode 4    110.0 MB/sec      200 MB/sec [2]
bonded mode 0    bonded mode 0    106.1 MB/sec      200 MB/sec [2]

  • [1] About a 1% packet loss
  • [2] A huge number of lost packets -- yikes!
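
For reference, the throughput tests used invocations roughly like the following. This is a sketch of typical classic BSD ttcp usage; the exact buffer sizes and counts behind the numbers above are not recorded here, and 'host1' is a placeholder hostname.

# on the receiving host: accept the test stream and discard the data (-s = sink)
ttcp -r -s

# on the sending host: generate and send 16384 buffers of 64 KB (-s = source a test pattern)
ttcp -t -s -l 65536 -n 16384 host1

# for the UDP runs, add -u on both ends
ttcp -r -s -u
ttcp -t -s -u -l 65536 -n 16384 host1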

Discussion

Not having access to the switch meant I had no real way to check its configuration and confirm that things were working correctly. I suspect the switch was handling the link aggregation correctly, but other factors were limiting the overall bandwidth. I've read that aggregating 1-Gbit links is not for the faint of heart: at these wire speeds, packet reassembly can fail in the kernel, at least with the usual defaults.

Here are some kernel (sysctl) tweaks to try:
### IPV4 specific settings
# turn TCP timestamp support off, default 1, reduces CPU use
net.ipv4.tcp_timestamps = 0
# turn SACK support off, default on
# on systems with a VERY fast bus -> memory interface this is the big gainer
net.ipv4.tcp_sack = 0
# set min/default/max TCP read buffer, default 4096 87380 174760
net.ipv4.tcp_rmem = 10000000 10000000 10000000
# set min/default/max TCP write buffer, default 4096 16384 131072
net.ipv4.tcp_wmem = 10000000 10000000 10000000
# set min/pressure/max TCP buffer space, default 31744 32256 32768
net.ipv4.tcp_mem = 10000000 10000000 10000000

### CORE settings (mostly for socket and UDP effect)
# Set maximum receive socket buffer size, default 131071
net.core.rmem_max = 524287
# Set maximum send socket buffer size, default 131071
net.core.wmem_max = 524287
# Set default receive socket buffer size, default 65535
net.core.rmem_default = 524287
# Set default send socket buffer size, default 65535
net.core.wmem_default = 524287
# Set maximum amount of option memory buffers, default 10240
net.core.optmem_max = 524287
# Set number of unprocessed input packets before kernel starts dropping them, default 300
net.core.netdev_max_backlog = 300000 
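
These settings would normally go in /etc/sysctl.conf and be loaded with the sysctl utility; the values above are starting points to experiment with rather than measured optima. For example:

# load everything from /etc/sysctl.conf
sysctl -p

# or test a single setting at runtime without editing the file
sysctl -w net.core.netdev_max_backlog=300000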

-- JoeBrandt - 04 Nov 2007