Astro GPU 2007 Notes

The first Astro-GPU conference was held at the Institute for Advanced Study in Princeton, NJ, on November 9-10, 2007. The agenda is available online at the conference site.

The session began with a number of presentations from people currently working on GPU-accelerated solutions to data processing, followed by CUDA course presentations. The CUDA presentations are available from the SC07 conference.

The Democratization of Parallel Computing

  • David Luebke (Nvidia)
  • 1980-1990 was known as the "golden age" of supercomputing, dominated by big iron and expensive accelerator boards
  • This was overtaken by low-cost cluster technology and grid computing
  • Commodity technologies made it possible
  • There are now 40 million GPUs installed
  • CPUs are no longer getting faster, but they are getting 'wider' (i.e. multi-core, wider buses)
  • While CPUs mask latency with cache, GPUs mask latency with computation and a large number of threads
  • Talked about the GPU's 'data-parallel cache' (i.e. shared memory)
  • A GPU may process 12,000 threads at a time
  • Covered the G80 thread/block/grid threading model
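
As a rough illustration of the thread/block/grid model, the sketch below emulates in plain Python the index arithmetic each thread performs to find its own element. The function names (`global_thread_ids`, `blocks_needed`, `scale`) and the block size are hypothetical choices for this example, not part of CUDA; on real hardware the inner loops run concurrently rather than one after another.

```python
# Plain-Python emulation of CUDA-style index arithmetic (1-D grid of 1-D blocks).
# All names here are illustrative; none of them is a CUDA API.

def global_thread_ids(grid_dim, block_dim):
    """Yield (block_idx, thread_idx, global_id) for every emulated thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Same formula a kernel uses: blockIdx.x * blockDim.x + threadIdx.x
            yield block_idx, thread_idx, block_idx * block_dim + thread_idx


def blocks_needed(n_elements, block_dim):
    """Round up so that every element is covered by some thread."""
    return (n_elements + block_dim - 1) // block_dim


def scale(data, factor, block_dim=4):
    """Each emulated thread scales one element; surplus threads do nothing."""
    out = list(data)
    grid_dim = blocks_needed(len(data), block_dim)
    for _, _, gid in global_thread_ids(grid_dim, block_dim):
        if gid < len(data):  # guard: the last block may overrun the array
            out[gid] = data[gid] * factor
    return out
```

For example, `scale([1, 2, 3, 4, 5], 2)` launches two emulated blocks of four threads; the last three threads fall past the end of the data and are masked off by the guard.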

GPU Acceleration of Scientific Applications

  • John Stone (UIUC)
  • Noted Amdahl's law on performance
  • Noted that it is relatively easy to achieve 12x-100x speedups, even with naive implementations
  • CUDA is not hard, but re-posing the algorithm may be. (On the plus side, once the algorithm is fine-grain parallel, porting it to the Cell or multicore is also relatively easy.)
  • Lessons learned:
    • Always use 'pinned' buffers (unpinned performance is almost half that of pinned buffers)
    • Understand the GPU (or other) memory hierarchy. The problem data size and the algorithm affect memory choices
    • Remove branching by data padding. Showed an example of complicated branching at boundary layers, removed simply by moving the boundary and zero padding
    • Be aware that application speedups are often limited by non-GPU code and I/O
    • Even with inefficient code, a GPU usually outperforms the CPU
    • With large data set sizes, GPU matches or exceeds CPU performance
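
The Amdahl's law point, and the related warning that non-GPU code and I/O limit the overall speedup, can be made concrete with a small calculation. This is a minimal sketch; the function name and the example fractions are my own assumptions, not figures from the talk.

```python
def amdahl_speedup(parallel_fraction, accelerator_speedup):
    """Overall speedup when only part of the run time is accelerated.

    parallel_fraction:   share of the original run time that is offloaded (0..1)
    accelerator_speedup: how much faster that share runs on the accelerator
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if accelerator_speedup <= 0:
        raise ValueError("accelerator_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / accelerator_speedup)


if __name__ == "__main__":
    # Hypothetical case: 90% of the run time is offloaded and runs 100x faster.
    print(round(amdahl_speedup(0.9, 100.0), 2))
```

With those hypothetical numbers the application as a whole speeds up by only about 9x, because the remaining 10% of serial work dominates.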

GPU Applications at University of Maryland

  • William Dorland, Physics Dept.
  • N-body simulations on GPU, Turbulence, particles inside cells
  • Audio/visual computing (I saw a demo of this, very nice)
  • They have produced a middleware layer between the applications and CUDA
    • Seemed willing to share it, but not to support it
  • UMIACS (University of Maryland Institute for Advanced Computer Studies)
    • They have weekly meetings with discussions and talks (5-20 attendees)
  • Funding from NSF, DOE, NASA, and DARPA
  • They are participating in a DARPA study on measuring development time
  • Found acceleration typically about 25x on simulation codes


  • The CUda N-BODY version 1 library is a C/C++/Fortran library that accelerates N-body interactions using NVIDIA GPUs (GeForce 8800, etc.).
  • The library is implemented using an optimized algorithm called the Chamomile Scheme.
  • Noted that reductions are not yet implemented on the GPU
  • Found I/O to be the limiting factor in using the GPU
  • Working on GPU reduction to lessen I/O
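
My reading of the reduction point is that summing on the device means only a small result, rather than the whole array, has to be copied back to the host. The sketch below emulates the usual pairwise-halving (tree) reduction in plain Python. It is a generic illustration under that assumption, not the speakers' implementation, and `tree_reduce_sum` is a hypothetical name.

```python
def tree_reduce_sum(values):
    """Sum a list by repeated pairwise halving, as a thread block would."""
    # Pad to a power of two with 0, the identity element for addition.
    n = 1
    while n < len(values):
        n *= 2
    buf = list(values) + [0] * (n - len(values))

    stride = n // 2
    while stride > 0:
        # On a GPU these 'stride' additions would run in parallel, one per
        # thread, with a barrier between passes.
        for i in range(stride):
            buf[i] += buf[i + stride]
        stride //= 2
    return buf[0]
```

The loop makes log2(n) passes instead of n - 1 sequential additions, which is what makes the pattern a good fit for many threads.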

High Performance N-body Simulation using GPU

  • Said "simulation of the universe is the ultimate holy grail."
  • Compared GPU performance with the specialized GRAPE series of processors
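
For readers unfamiliar with the workload, the core of a direct N-body code is the all-pairs force loop sketched below. It is a minimal plain-Python sketch of the textbook O(N^2) summation, not the method used by any of the speakers' codes; the function name, the `softening` parameter, and the unit gravitational constant are assumptions for illustration.

```python
import math


def accelerations(positions, masses, softening=1e-3, g=1.0):
    """Gravitational acceleration on every body from every other body.

    positions: list of (x, y, z) tuples
    masses:    list of masses, same length as positions
    softening: small length added in quadrature to avoid division by zero
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):          # on a GPU, typically one thread per body i
        xi, yi, zi = positions[i]
        for j in range(n):      # loop over all the other bodies
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + softening * softening
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            acc[i][0] += g * masses[j] * dx * inv_r3
            acc[i][1] += g * masses[j] * dy * inv_r3
            acc[i][2] += g * masses[j] * dz * inv_r3
    return acc
```

Every body's inner loop is independent of the others, which is why this computation maps well onto both GPUs and special-purpose hardware such as GRAPE.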

Real-time Hardware Accelerated Pipeline for Radio Astronomy

  • Kevin Dale, David Mitchell
  • MWA widefield array in the Australian outback
  • No commercial power available at sites
  • 512 tiles of 4x4 dipoles
  • 1k frequency channels from the correlator, ~1.7 GB/sec
  • Problem is 'frequency parallel'
  • Implemented a 'tracer bullet' on limited hardware (Quadro 5600)
  • Array calibration
    • Ionosphere
    • Refraction
    • Instrument effects
  • Image formation
    • Gridding
    • Inverse FFT (1600x1600, 2D) 11x speedup
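
The image-formation steps above (gridding, then an inverse 2D transform) can be sketched at toy scale. This is a minimal sketch under strong simplifying assumptions: samples already sit on integer grid cells (nearest-cell gridding with no convolution kernel), and a naive inverse DFT stands in for the FFT. The function names are hypothetical and do not come from the MWA pipeline.

```python
import cmath


def grid_visibilities(samples, n):
    """Accumulate (u, v, visibility) samples onto an n x n grid.

    u and v are assumed to be integer cell offsets in [-n//2, n//2);
    negative offsets wrap around, matching the usual DFT layout.
    """
    grid = [[0j] * n for _ in range(n)]
    for u, v, vis in samples:
        grid[v % n][u % n] += vis
    return grid


def inverse_dft_2d(grid):
    """Naive O(n^4) inverse 2D DFT; an FFT computes the same result faster."""
    n = len(grid)
    image = [[0j] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            total = 0j
            for v in range(n):
                for u in range(n):
                    phase = 2j * cmath.pi * (u * x + v * y) / n
                    total += grid[v][u] * cmath.exp(phase)
            image[y][x] = total / (n * n)
    return image
```

Because each frequency channel would be gridded and transformed independently, the work is 'frequency parallel' in the sense noted in the talk: one such pipeline per channel, with no communication between them.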

Real-time Digital Signal Processing for Radio Astronomy

  • Walter Brisken gave Paul's talk
  • Described the VLA upgrade, bandwidth
  • Software correlator
  • Pulsar timing applications
  • Talk was well received (in my opinion)

Fast Summation of Potentials

  • University of Maryland (Nail Gumerov)
  • Fast multipole method
    • Matrix vector product
  • Tree approach to force evaluation
    • hierarchical "rings" treated as net force or displacement of enclosing area
  • Described sparse matrices
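
The idea behind treating a distant group of sources as a single net contribution can be shown with the lowest-order (monopole) approximation. The sketch below is only that first step, assuming a 1/r kernel and points in a plane; it is not the fast multipole method presented in the talk, and the function names are hypothetical.

```python
import math


def direct_potential(target, sources):
    """Exact sum of m / r over all sources; sources are (mass, x, y) tuples."""
    tx, ty = target
    return sum(m / math.hypot(tx - x, ty - y) for m, x, y in sources)


def monopole_potential(target, sources):
    """Replace the whole cluster by its total mass at its centre of mass."""
    total_mass = sum(m for m, _, _ in sources)
    cx = sum(m * x for m, x, _ in sources) / total_mass
    cy = sum(m * y for m, _, y in sources) / total_mass
    return total_mass / math.hypot(target[0] - cx, target[1] - cy)


if __name__ == "__main__":
    cluster = [(1.0, 0.0, 0.0), (2.0, 1.0, 0.0), (1.0, 0.0, 1.0)]
    far_point = (100.0, 0.0)
    exact = direct_potential(far_point, cluster)
    approx = monopole_potential(far_point, cluster)
    print(abs(exact - approx) / exact)  # small when the cluster is far away
```

Evaluating every target against every source directly is a dense matrix-vector product; grouping distant sources this way, applied hierarchically, is what lets tree and multipole methods avoid forming that full product.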

GRAPE Processor

  • FPGA vs GPU
    • Better transistor utilization (most gates active per clock, due to spatial parallelism)
    • Cost (GPU's cost less)
    • Easier to code, no VHDL required
    • ASICs have long development times (~5 years)
    • GRAPE still requires assembly code to program
  • Pointed out that offloading a task from the host to a GPU, FPGA, or ASIC is all the same (i.e. the hard work is re-posing the algorithm to be parallel; the implementation on the specific accelerator is then straightforward).

CUDA Course Slides

Rather than try to summarize the tutorials, I would refer interested parties to the CUDA tutorials from Supercomputing 2007. Note: To my knowledge these are not in the public domain; please do not redistribute!

Round-table/Closing Discussion:

  • CUDA 1.1 Announcement
    • Language improvements
    • Asynchronous API (but requires G92 hardware)
    • Visual profiler (may not be part of 1.1)
    • Performance improvements
    • Full C++ subset support
    • unroll pragma supported
    • noinline keyword
    • Stream API
      • stream abstraction
      • will work on G80, but I/O and compute will be serialized
    • Profiler demonstrated on Fedora system
  • Talked about the Professor Partnership program
  • Nvidia is recruiting - all levels
    • Openings posted on website
  • Announced discounts of 50% on Tesla systems at Supercomputing 2007
  • Requested feedback on CUDA libraries
  • General interest in GPU clustering, and tools to scale-up for large problem sets (like MPI + CUDA)

-- JoeBrandt - 23 Nov 2007
Topic revision: r2 - 2007-11-26, JoeBrandt