GPUs for RFI Excision Page

FRSC Program

This program was originally entirely single-threaded, CPU-based, and written in 'C'. It consumes a configuration 'job' file, which specifies the input file as well as various parameters. GUPPI test data is used as example input. The program appears to be very I/O bound.

'RA_Analyze' Routine

The core processing takes place in the ra_analyze() function, so this was the first item to focus upon. To maintain my sanity and permit further enhancements, I refactored the code to (a) live in separate compilation units and (b) build with the C++ and CUDA compilers.

Examining the general form of the algorithms used, the flow is something like:
for(each_channel=0; each_channel<nchan; ++each_channel)
{
    for(sample=0; sample<nsamp; ++sample)
    {
         // compute nsamp per-sample items
    }
    // perform reductions over the nsamp results
    // generate statistics from the nsamp results and the reductions above
}
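
As a concrete (and purely illustrative) example, a single channel's CPU pass might look like the following; the power, mean, and variance below are placeholder statistics, not the actual ra_analyze() computations.

/* Hypothetical per-channel pass: the power, mean and variance are
   placeholder statistics, not the real ra_analyze() quantities. */
void analyze_channel(const float *re, const float *im, int nsamp,
                     float *mean_out, float *var_out)
{
    float sum = 0.0f, sumsq = 0.0f;
    for (int sample = 0; sample < nsamp; ++sample)
    {
        float power = re[sample]*re[sample] + im[sample]*im[sample];
        sum   += power;          /* first reduction over the nsamp results */
        sumsq += power * power;  /* second reduction, for the variance     */
    }
    *mean_out = sum / nsamp;
    *var_out  = sumsq / nsamp - (*mean_out) * (*mean_out);
}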

CPU vs. CUDA vs. Thrust

CUDA has an STL-like library called Thrust, which abstracts many of the details of GPU kernel launching and provides facilities for parallel reductions. Kernels become functors, and launches are performed with the for_each(...), transform(...), and reduce(...) operations.
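
As a minimal sketch (not the FRSC code), a kernel that computes the power of each sample becomes a functor handed to thrust::transform, and summing the results is a single thrust::reduce call:

#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/reduce.h>

// Functor standing in for a kernel body: square one sample.
struct square_sample
{
    __host__ __device__ float operator()(float x) const { return x * x; }
};

float sum_of_squares(const thrust::device_vector<float>& d_in)
{
    thrust::device_vector<float> d_out(d_in.size());
    thrust::transform(d_in.begin(), d_in.end(), d_out.begin(),
                      square_sample());                       // the 'kernel launch'
    return thrust::reduce(d_out.begin(), d_out.end(), 0.0f);  // parallel reduction
}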

Normally, when porting an application to the GPU, the first things to look for are for loops. A CUDA kernel is essentially the body of a for loop, and the cuda_kernel<<<blocks, threads>>>(...) launch syntax specifies how the 'loops' are mapped onto CUDA cores/threads. The geometry can be 1-D (typical for DSP applications), 2-D, or 3-D. Inside a CUDA kernel, the geometry can be queried to determine which data element the kernel should operate upon.
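
A minimal 1-D example (the kernel and variable names are illustrative):

// Each thread handles one sample; the grid geometry replaces the for loop.
__global__ void scale_samples(float *data, int nsamp, float gain)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element is mine?
    if (i < nsamp)
        data[i] *= gain;
}

// Launch with one thread per sample, 256 threads per block:
//   int blocks = (nsamp + 255) / 256;
//   scale_samples<<<blocks, 256>>>(d_data, nsamp, gain);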

In Thrust, kernels are functors, written in the same manner as a normal STL functor. If the geometry needs to be accessed, the best way to do this is to store the dimensions in the functor object, use a counting iterator to pass the 'global' (1-D) index, and compute the 2-D or 3-D coordinates from it, as in the sketch below. This model works well for map patterns. Stencil operations may be more difficult. (I've not thought about this yet.)
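
A sketch of that pattern, with an illustrative functor and a 2-D (channel x sample) layout:

#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>

// The functor carries the 2-D geometry; the counting iterator supplies the
// flat 1-D index, from which the channel/sample coordinates are recovered.
struct process_element
{
    float *data;
    int    nsamp;   // samples per channel

    __host__ __device__ void operator()(int idx) const
    {
        int chan   = idx / nsamp;
        int sample = idx % nsamp;
        data[chan * nsamp + sample] *= 2.0f;   // placeholder operation
    }
};

// Usage (d_data is a raw device pointer to nchan*nsamp floats):
//   thrust::for_each(thrust::device,
//                    thrust::counting_iterator<int>(0),
//                    thrust::counting_iterator<int>(nchan * nsamp),
//                    process_element{d_data, nsamp});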

Applying the same general form with Thrust, the flow becomes something like:
for(each_channel=0; each_channel<nchan; ++each_channel)
{
    // copy this channel's data onto the GPU
    // use a zip iterator and a functor to compute over nsamp samples
    thrust::for_each(make_zip_iterator(...), make_zip_iterator(...), functor);
    // perform reductions over the nsamp results
    thrust::reduce(...);
    thrust::reduce(...);
    // copy results back to the CPU
    // generate statistics from the nsamp results and the reductions above
}
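
Concretely, one channel's pass might look like the sketch below; the power/mean/variance statistics and the vector names are illustrative, not the actual FRSC quantities.

#include <thrust/device_vector.h>
#include <thrust/for_each.h>
#include <thrust/reduce.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>

// Functor over a (re, im, power) tuple: write |x|^2 into the third slot.
struct compute_power
{
    template <typename Tuple>
    __host__ __device__ void operator()(Tuple t) const
    {
        float re = thrust::get<0>(t);
        float im = thrust::get<1>(t);
        thrust::get<2>(t) = re * re + im * im;
    }
};

// One channel: d_re, d_im, d_pow are device vectors of length nsamp.
void analyze_channel_gpu(thrust::device_vector<float>& d_re,
                         thrust::device_vector<float>& d_im,
                         thrust::device_vector<float>& d_pow,
                         float& mean, float& var)
{
    float nsamp = static_cast<float>(d_re.size());

    thrust::for_each(
        thrust::make_zip_iterator(thrust::make_tuple(d_re.begin(), d_im.begin(), d_pow.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(d_re.end(),   d_im.end(),   d_pow.end())),
        compute_power());

    float sum   = thrust::reduce(d_pow.begin(), d_pow.end(), 0.0f);           // reduction #1
    float sumsq = thrust::transform_reduce(d_pow.begin(), d_pow.end(),
                                           thrust::square<float>(), 0.0f,
                                           thrust::plus<float>());            // reduction #2
    mean = sum / nsamp;
    var  = sumsq / nsamp - mean * mean;
}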

The good news is that this is very clean. The bad news is that while the for_each(...) call is a single, fairly dense kernel (which is okay), the reductions are not: each reduce(...) is a separate pass over the data. More on this later. What is needed, instead of 10-20 reduce calls, is a single functor that performs all the reductions in one parallel pass; one way to do this is sketched below.
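
One common way to fuse the reductions in Thrust is a single transform_reduce with a struct accumulator and a custom binary operator; a sketch follows (the statistics chosen are illustrative, not the actual FRSC set).

#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>

// Accumulator holding several statistics at once (illustrative set).
struct stats
{
    float sum, sumsq, maxval;
};

// Map one sample into a stats record.
struct make_stats
{
    __host__ __device__ stats operator()(float x) const
    {
        stats s; s.sum = x; s.sumsq = x * x; s.maxval = x; return s;
    }
};

// Combine two stats records: the binary reduction operator.
struct combine_stats
{
    __host__ __device__ stats operator()(const stats& a, const stats& b) const
    {
        stats s;
        s.sum    = a.sum + b.sum;
        s.sumsq  = a.sumsq + b.sumsq;
        s.maxval = a.maxval > b.maxval ? a.maxval : b.maxval;
        return s;
    }
};

// Usage (d_pow holds one channel's nsamp power values on the device):
//   stats init = {0.0f, 0.0f, -1.0e30f};
//   stats result = thrust::transform_reduce(d_pow.begin(), d_pow.end(),
//                                           make_stats(), init, combine_stats());

All of the accumulated statistics then come back from a single reduction pass instead of one reduce(...) call per statistic.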

Older Documents From Rick Fisher

(Unrelated to algorithms/code above)

-- JoeBrandt - 20 Dec 2007