Green Bank GPU Cluster Description
This page describes a GPU cluster currently being assembled to test the application of GPU technology for backend data processing.
Over the past ten years or so, CPU vendors have roughly doubled single-core CPU performance every 18 months, a trend popularly lumped in with 'Moore's Law' (strictly, Moore's observation concerns transistor counts). This exponential progression held for quite some time, but in recent years it has become clear that single-core performance has its limits. Today, multi-core CPU chipsets are common.
Where do Graphics Processing Units (GPUs) fit in? First, it should be stated that drawing three-dimensional scenes is an inherently parallel problem. The gaming industry has recognized this for some time and has been developing hardware to meet a huge demand for 'gaming hardware'. Fortunately, this commodity development has opened a door to high-performance computing in areas where the problem lends itself to a parallel solution.
The Nvidia corporation manufactures the G80 processor, which contains 128 SPMD (single program, multiple data) processing elements. Cards hosting this processor are sold as the GeForce 8800 GTX by a number of vendors, such as BFG, EVGA, and XFX. Similarly powerful cards are available from ATI, Nvidia's main competitor.
The cluster consists of four identical nodes. Each node hosts two 8800 GTX cards, 1 TB of disk, 4 GB of RAM, and an Intel quad-core processor. Communications are via two 1 Gbit Ethernet links. Using marketing-style numbers, the aggregate GPU computing power should be on the order of 4x10^12 floating-point operations per second (4 Tera-FLOPS). Realistic numbers are probably around half to one quarter of this figure; even that level of performance is still rather impressive. The diagram below outlines the cluster arrangement:
A schematic of the cluster; each node comprises:
- CPU: one quad-core 2.4 GHz (2x4 MB cache)
- RAM: 4 GB
- Motherboard: ASUS P5N32-SLI Plus (nVidia 680 chipset)
- Disk: dual 500 GB SATA (3 Gbit/s)
- Network: dual 10/100/1000 Ethernet
- GPUs: dual eVGA 8800GTX Ultra
- Power supply: 1000 W, SLI certified

[Annotated picture of our first node]
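Taken at face value, the marketing arithmetic is a simple product over nodes and cards. In the sketch below, the 0.5 TFLOPS per-card rating is an assumption, implied by dividing the quoted 4 TFLOPS aggregate across the eight GPUs (the helper names are illustrative):

```c
/* Marketing-style aggregate: nodes x cards-per-node x per-card rating.
   The 0.5 TFLOPS per-card rating is an assumed figure, implied by the
   4 TFLOPS aggregate quoted above. */
static double aggregate_tflops(int nodes, int gpus_per_node,
                               double tflops_per_gpu) {
    return nodes * gpus_per_node * tflops_per_gpu;
}

/* Realistic expectation: somewhere between a quarter and half of peak. */
static double realistic_low(double peak)  { return peak / 4.0; }
static double realistic_high(double peak) { return peak / 2.0; }
```

For this cluster, aggregate_tflops(4, 2, 0.5) gives 4.0, bracketing realistic throughput between 1 and 2 TFLOPS.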
A number of languages are available for GPUs: Brook (Stanford University); Cg (Nvidia); CUDA (Nvidia); HLSL (Microsoft); and PeakStream (PeakStream Inc.). Each environment has its own advantages. We are concentrating on CUDA, since it seems the most flexible for our purposes.
The CUDA processing model centers on a large number of parallel threads. The threads are organized into groups, called blocks, which can share data through shared memory. Blocks of threads are further organized into a matrix called a grid. This geometric organization is useful because each thread must determine which portion of the data to operate upon; this self-identification is a convenient way to coordinate processing amongst threads.
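To make the self-identification concrete, here is a plain-C sketch that emulates a one-dimensional CUDA launch; the scale_kernel example and its parameters are made up for illustration, and on the GPU the kernel instances would run concurrently rather than in a loop:

```c
/* CPU emulation of CUDA's 1-D thread indexing: each kernel instance
   identifies its element as blockIdx.x * blockDim.x + threadIdx.x.
   The scale-by-gain operation is a made-up example. */
static void scale_kernel(float *data, int n, float gain,
                         int blockIdx, int blockDim, int threadIdx) {
    int i = blockIdx * blockDim + threadIdx;  /* thread's global index */
    if (i < n)                /* guard: the last block may overhang the data */
        data[i] *= gain;
}

/* A "launch" of gridDim blocks, each of blockDim threads.  On a GPU
   these instances execute in parallel; here we simply iterate. */
static void launch(float *data, int n, float gain,
                   int gridDim, int blockDim) {
    for (int b = 0; b < gridDim; b++)
        for (int t = 0; t < blockDim; t++)
            scale_kernel(data, n, gain, b, blockDim, t);
}
```

Note the bounds guard: the grid is sized to cover the data, so the final block may have threads with nothing to do.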
To exploit parallelism, the dependencies between data must be identified, and processing localized. This sometimes requires a re-casting of the problem in a non-traditional way.
As with many other clusters, Linux is the OS of choice. We currently run Red Hat Enterprise Linux 4 (SMP kernel).
OS Tweaks Required on 32-bit Kernels
After powering up the cluster with Red Hat Linux (RHEL-4), I found that CUDA was unable to recognize either card. It turns out that a 32-bit kernel has some limitations with respect to virtual address space: with 4 GB of RAM (896 MB of low memory), there isn't enough kernel virtual address space left to map the hardware! (A 64-bit kernel wouldn't have this problem, of course.)
Fortunately there is a solution. By adding a few incantations to the /boot/grub/grub.conf file, we can give up a little bit of RAM in low memory in exchange for some virtual address space for mapping the cards. The modified grub entry looks like this (the additions are the vmalloc and pci options, plus the uppermem command discussed below):
title Red Hat Enterprise Linux AS (2.6.9-55.0.9.ELsmp Dual GPU)
kernel /boot/vmlinuz-2.6.9-55.0.9.ELsmp ro root=LABEL=/ selinux=0 vmalloc=256MB pci=nommconf
The uppermem command directs grub to load the initrd into an area that shortens low memory enough for the vmalloc=256MB setting to work. Without the uppermem entry, the system will panic at boot. For those interested in the full story, search for "allocation failed: out of vmalloc space".
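For reference, a complete stanza would look something like the one below. The uppermem value shown is illustrative only (524288 KB = 512 MB, forcing grub to place the initrd lower), not taken from our running config, and the initrd filename simply follows the usual RHEL naming:

```
title Red Hat Enterprise Linux AS (2.6.9-55.0.9.ELsmp Dual GPU)
        # uppermem value below is illustrative: 512 MB expressed in KB
        uppermem 524288
        kernel /boot/vmlinuz-2.6.9-55.0.9.ELsmp ro root=LABEL=/ selinux=0 vmalloc=256MB pci=nommconf
        initrd /boot/initrd-2.6.9-55.0.9.ELsmp.img
```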
As a quick start for monitoring some key parameters, I installed the Ganglia package. This is fine for host-related items, but I feel a strong need to monitor GPU temperature as well. Adding this is easy if and only if I can find a way to read the temperature (current methods require a connection to an X server, but I don't plan on having one running). Publishing the temperature is straightforward using the gmetric utility.
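Once a readout exists, publishing through Ganglia is essentially a one-liner. In the sketch below, the metric name and the read_gpu_temp helper (with its stand-in value) are hypothetical, since the actual readout method is still an open question:

```shell
#!/bin/sh
# Hypothetical readout -- the real method is still an open question,
# since current tools want a running X server.
read_gpu_temp() {
    echo 65  # stand-in value, degrees C
}

# Push the reading into Ganglia as a named metric.
publish_gpu_temp() {
    gmetric --name=gpu0_temp --value="$(read_gpu_temp)" \
            --type=int16 --units=Celsius
}
```

Run periodically (from cron, say), each node's GPU temperatures would then show up alongside Ganglia's built-in host metrics.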
GPU Programming Language/Environment
The two main programming languages for general purpose applications on nVidia GPUs are:
- CUDA (supported by nVidia)
- Brook (From the BrookGPU project)
To date, I've tried both. CUDA is a bit more involved to learn, but is much more flexible. Unfortunately, CUDA only runs on the G80 series of nVidia GPUs, so that awesome G70 in my laptop can't be used for development. Brook, on the other hand, uses OpenGL or DirectX calls to implement the GPU code. This means Brook applications have a good chance of running on a wide variety of platforms, whereas CUDA is limited.
For message passing there are a number of choices, but Open MPI seems to work well. Other options have not yet been evaluated.
Nothing spectacular here (yet). Currently each node connects into a switch with a single 1 Gbit Ethernet link. There are two 1 Gbit links available per system, and one system will have an additional 10 Gbit card.
Since the cluster is not yet fully operational, performance data is not yet available. However, Paul Demorest has put together an excellent page examining GPU use for pulsar data analysis.
So what nail can be hit with this huge sledge hammer? Possible candidate applications may include:
- Beam forming
- GASP Augmentation
- RFI Applications
- Pulsar data processing
With a typical idling temperature of around 65 C, cooling a number of dual-GPU systems can be a challenge. Monitoring GPU temperatures during extended processing is a must.
On the question of power, I estimate the following (based on past experience and recent measurements of GPU current):
- 2 GPUs
- North/south-bridge chipsets
- 625 W (at 80% supply efficiency) per node

At 625 W and 80% efficiency, that works out to about 780 W/node, which places a proposed 8-10 node cluster at between 6.2 and 7.8 kW.
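The arithmetic behind that estimate can be sketched as follows (wall_watts and cluster_kw are just illustrative helpers):

```c
/* Wall-plug draw for a given DC load behind a power supply of the
   given efficiency, e.g. a 625 W load through an 80%-efficient PSU. */
static double wall_watts(double load_w, double efficiency) {
    return load_w / efficiency;
}

/* Total draw in kW for a proposed cluster of n such nodes. */
static double cluster_kw(int nodes, double per_node_w) {
    return nodes * per_node_w / 1000.0;
}
```

wall_watts(625, 0.80) is just over 781 W, so 8 and 10 nodes land at roughly 6.25 and 7.8 kW, matching the range above.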
- 12 Sep 2007