GUPPI Support Guide

Introduction

GUPPI is the Green Bank Ultimate Pulsar Processing Instrument, the backend system used for nearly all pulsar observing at the GBT. This wiki page is mainly intended for GB support staff to learn about how GUPPI works and some basic troubleshooting tips. More info about GUPPI can be found on these wiki pages:

Some background info about pulsar observing in general can be found at:

If this page does not contain information that you would like to see, please let us (Paul Demorest or Scott Ransom) know!

IF Paths

Basic info about the IF path to GUPPI:
  • Two analog inputs, one per polarization.
  • These split off from the Analog Filter Rack high-speed sampler outputs SG4 and SG8.
  • In the AFR, either the 0.8-1.6 GHz filter or 0.8-1.0 GHz filter should be in use, depending on GUPPI BW mode.
  • The lower edge of the GUPPI IF band is always at 800 MHz. So the IF3 center freq is 800 MHz + BW/2.
  • The Converter Modules used are either CM4/8 (usually) or CM12/16 (less often).
  • Balance() sets the GUPPI signal levels by adjusting the CM attenuators and reading power at the AFR output.
    • Target is 20 counts RMS (out of 256 levels total) for the GUPPI ADCs.
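
The IF3 center-frequency rule above can be sketched as a quick calculation (the function name is illustrative, not part of any GUPPI software):

```python
# The lower edge of the GUPPI IF band is fixed at 800 MHz, so the IF3
# center frequency is 800 MHz + BW/2 for each bandwidth mode.

def if3_center_mhz(bw_mhz):
    """Return the IF3 center frequency in MHz for a GUPPI bandwidth mode."""
    if bw_mhz not in (100, 200, 800):
        raise ValueError("GUPPI bandwidths are 100, 200, or 800 MHz")
    return 800.0 + bw_mhz / 2.0

for bw in (100, 200, 800):
    print(bw, "MHz BW ->", if3_center_mhz(bw), "MHz IF3 center")
```

For the standard 800 MHz mode this gives a 1200 MHz IF3 center.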

GUPPI Hardware

Block Diagram

  • Overview of GUPPI hardware components:
    BlockDiagramALL.jpg

Analog Components

  • IF conditioner - The initial signal path input to GUPPI. This contains amplifiers and attenuators to set input power levels appropriately. There are no software-controllable settings.
  • Clock synthesizer - The synthesizer is used for the main GUPPI ADC clock. It should be set to one of the allowed GUPPI bandwidths (800, 200, or 100 MHz). The synth is locked to the site 10 MHz maser reference frequency. The synth frequency is set via software control.

Digital Components

  • ADCs - Two CASPER iADC boards, one per polarization. The ADCs are run in interleaved mode (so clock rate = bandwidth), and sample with 8 bits. Each ADC board is attached to an IBOB.
  • IBOBs - CASPER single-FPGA boards. For GUPPI, the IBOBs simply read the ADC output and send it to the BEE2 over XAUI connections.
  • BEE2 - CASPER FPGA board containing 5 FPGAs. The BEE2 reads sampled data from the IBOB, and splits it into frequency channels using a digital (polyphase) filter bank. In some modes, the spectra are detected and accumulated in the FPGA, while in others the channelized voltage data are output directly. The output is over four 10-gigabit ethernet lines to the GUPPI 10GbE switch.
  • Power strip - A software-controllable power strip that can be used to power-cycle the IBOBs and/or BEE2.
    • Power-cycling the IBOBs is done routinely whenever the clock frequency is changed (for different bandwidth modes). This takes ~5 seconds.

Computer Hardware

  • 10GbE switch - 24-port Fujitsu 10 gigabit network switch. The BEE2, beef, and the gpu cluster are all connected to this. It is also connected to a few other 10gig-enabled machines at GB.
  • beef - The main computer used to record GUPPI data and monitor observations.
    • beef is visible on the main GB network, and login is via your normal NRAO account.
    • beef contains two 7 TB RAIDs, /data1 and /data2.
      • During observing, GUPPI data are recorded here for incoherent modes, in /data?/(username)/(project_id)/(date).
      • Observers select /data1 or /data2 for output in their Configure() block.
  • gpu1 through gpu9 - The GPU cluster is used to process and record data for coherent dedispersion mode.
    • Each node contains an NVIDIA GTX 285 graphics card, and has CUDA installed for programming it.
    • Each also has a 7 TB RAID for data recording, mounted as /data.
    • There is only a single user account, called 'gpu'. Access is controlled via SSH keys.
    • These machines are not visible on the main GB network, and can only be logged into via beef.
    • Several shared directories (eg. /home/gpu) are NFS-mounted from beef.
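
The incoherent-mode output location described above can be sketched as a small path helper (a hypothetical function, not part of guppi_daq; the example username, project id, and date are made up):

```python
# Incoherent-mode data on beef land in /data?/(username)/(project_id)/(date),
# where /data1 or /data2 is chosen by the observer in the Configure() block.
import os

def guppi_data_dir(disk, username, project_id, date):
    """Build the directory where incoherent-mode GUPPI data are recorded."""
    assert disk in ("/data1", "/data2")  # the two 7 TB RAIDs on beef
    return os.path.join(disk, username, project_id, date)

print(guppi_data_dir("/data1", "observer", "AGBT12A_001", "20120730"))
```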

Software Components

  • Manager - A standard GBT manager which sets GUPPI parameters and stops/starts observations.
    • There is no GUPPI CLEO screen associated with this; the manager can be accessed via Device Explorer.
    • The manager currently only does 'control'; there is no monitoring of the GUPPI state via the manager.
  • Controller - Internal software component responsible for getting/setting HW and SW parameters for GUPPI.
    • Python-based, written mainly by R. DuPlain and P. Brandt.
    • Main server runs on beef. Also has processes running on GPU nodes, and the BEE2.
    • Client programs connect to the controller to get/set parameters.
      • The manager is one such client.
      • A python-based interpreter (the "guppi prompt") is another, useful for debugging.
    • This code loads/configures FPGA personalities on the BEE2.
    • Code available at github - http://github.com/nrao/guppi-controller
  • DAQ software - Also called guppi_daq, this code is responsible for receiving, processing, and recording the data.
    • C-based, written mainly by P. Demorest and S. Ransom.
    • Runs on beef and on gpu nodes.
    • Uses shared memory segments for:
      • Storing observing parameters ("status shared memory")
      • Ring buffers for data processing ("databuf shared memory")
    • Also available on github - http://github.com/demorest/guppi_daq
  • In addition to these major components, several smaller scripts are always running in the background on beef:
    • supervisord - This daemon is responsible for (re)starting all the other software components.
    • gbtstatus_loop - Periodically reads info from the GBT status database and fills it into the guppi_daq status shared memory.
    • guppi_gpu_push - Distributes configuration info from beef to the GPU nodes.

Observing Modes

GUPPI has two main modes of operation: incoherent filterbank mode and coherent dedispersion. Each has a large amount of flexibility in terms of number of channels, etc.

Incoherent Mode

This mode is sometimes also called search mode or filterbank mode. It is just like a normal spectrometer, except that the integration times are very short (typically tens of us). In this mode, the spectra are accumulated in the FPGA hardware, then output over the network and recorded to disk on beef.

The available parameters are:
  • Total bandwidth: 100, 200, or 800 MHz
  • Number of channels: Any power of two from 32 to 4096.
  • Output: 8 bits only.
  • Polarizations: Total intensity (summed) or full-Stokes. 4096-channel mode is total intensity only.
  • Minimum spectrum integration times:
    • For 100 and 200 MHz bandwidth the minimum integration is given by 4 * (# channels) / BW.
      • This is a FPGA hardware limit.
      • e.g., for 200 MHz, 2048 channels, minimum integration time is 40.96 us.
      • Except for "Fast4K" 4096-channel mode, the minimum is 2 * 4096 / BW.
    • For 800 MHz, the minimum is 16 * (# channels) / BW.
      • This is a disk write speed limit (max 200 MB/s to the RAIDs on beef).
  • Maximum spectrum integration time:
    • The accumulation length (T_int * BW / N_chan) should not exceed ~1000 (typically ~ms).
    • Beyond this, the 8-bit output values do not have enough dynamic range.
  • Data format: Search mode PSRFITS.
  • There is a real-time folding mode available, but it is hardly ever used now due to the availability of coherent dedispersion mode.
    • An exception is when observing the pulsed noise cal, the data are often folded in real time at the 25 Hz standard cal period.
  • Note that in this mode, the true achievable time resolution depends not just on GUPPI setup but also on dispersion smearing. This in turn depends on the pulsar's DM and the observing frequency.
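
The integration-time limits listed above can be sketched as simple formulas (the helper names are ours, not guppi_daq functions):

```python
# Minimum spectrum integration times for incoherent mode:
#   100/200 MHz BW: 4 * N_chan / BW (FPGA limit), except Fast4K: 2 * 4096 / BW
#   800 MHz BW:    16 * N_chan / BW (disk write speed limit)
# The accumulation length T_int * BW / N_chan should stay below ~1000.

def min_tint_us(bw_mhz, nchan, fast4k=False):
    """Minimum spectrum integration time in microseconds."""
    bw_hz = bw_mhz * 1e6
    if fast4k:                              # "Fast4K" 4096-channel mode
        return 2 * 4096 / bw_hz * 1e6
    factor = 16 if bw_mhz == 800 else 4     # 800 MHz limited by disk speed
    return factor * nchan / bw_hz * 1e6

def acc_len(tint_us, bw_mhz, nchan):
    """Accumulation length T_int * BW / N_chan; keep below ~1000."""
    return tint_us * 1e-6 * bw_mhz * 1e6 / nchan

# The two common search setups from this guide:
print(min_tint_us(800, 2048))               # 40.96 us
print(min_tint_us(100, 4096, fast4k=True))  # 81.92 us
```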

Coherent Mode

For observing known pulsars, a dispersion measure-specific filter can be applied to completely remove dispersion smearing. This is often done for MSP timing. In this mode, voltage data are output from the FPGA hardware to the gpu cluster computers. The GPUs apply the dedispersion filter and either fold or integrate in real time.
  • Total bandwidth: 100, 200, or 800 MHz
  • Number of channels: Powers of two from 32 to 2048.
  • Polarization: Total intensity or full-Stokes.
  • Fold mode:
    • Folds the data modulo the current pulse period in real time.
    • Outputs time-integrated profiles, minimum dump time is ~1 second.
    • Output is floating point profiles in PSRFITS format.
    • The time (or pulse phase) resolution is given by N_chan / BW.
  • Cal mode:
    • Special case of fold mode, folds at a constant 25 Hz period for the cal signal.
  • Coherent search mode:
    • Coherently-dedispersed spectra are accumulated in time rather than folded.
    • Spectrum integration time is subject to the same total data rate limit (200 MB/s max) as incoherent mode.
    • Data are re-quantized to 8 bits and output in search-mode PSRFITS format.
  • Note, for all types of coherent dedispersion observations, each gpu node processes and independently records 1/8 of the total BW.
    • The 8 sub-bands are assembled into a single output file after the observation is over.
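
Two of the relations above, the fold-mode time resolution and the per-node sub-band width, can be sketched as (illustrative helper names):

```python
# Fold-mode time (pulse-phase) resolution is N_chan / BW, and each of the
# 8 GPU nodes processes and records 1/8 of the total bandwidth.

def fold_time_res_us(nchan, bw_mhz):
    """Fold-mode time resolution in microseconds: N_chan / BW."""
    return nchan / (bw_mhz * 1e6) * 1e6

def node_subband_mhz(bw_mhz, n_nodes=8):
    """Bandwidth handled by each GPU node."""
    return bw_mhz / n_nodes

# Typical L-band MSP timing setup (800 MHz BW, 512 channels):
print(fold_time_res_us(512, 800))   # 0.64 us phase resolution
print(node_subband_mhz(800))        # 100 MHz per GPU node
```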

Typical setups

While there are a potentially large number of combinations of BW, # channels, etc available, a few setups get used much more than others:
  • 800 MHz BW, 2048 channels, 40.96 us, incoherent. This is used for searches at higher frequencies (L-band, S-band).
  • "Fast4k" - 100 MHz BW, 4096 channels, 81.92 us, total intensity, incoherent. This is used for low-freq searches (350 MHz; GBNCC, etc).
  • 800 MHz BW, 512 channel, coherent. For MSP timing at L-band and higher.
    • Fold mode is usually used for single pulsars (eg NANOGrav timing).
    • Search mode is used for multiple pulsars in the same beam (double pulsar, globular clusters).
  • 200 MHz BW, 128 channel, coherent. For MSP timing at 820 MHz.

There are also a set of standard frequencies and bandwidths for the commonly used pulsar receivers:
  • Rcvr_342 - 350 MHz center, 100 MHz BW.
    • Note: since there is no 100 MHz-wide AFR filter, the 80 MHz IF filter must be selected.
  • Rcvr_800 - 820 MHz center, 200 MHz BW.
  • Rcvr1_2 - 1500 MHz center, 800 MHz BW.
  • Rcvr2_3 - 2000 MHz center, 800 MHz BW.

Software tools for GUPPI

  • GUPPI environment - When logging in to beef, first do:
    source /home/pulsar64/guppi/guppi.bash
    or:
    source /home/pulsar64/guppi/guppi.csh
    to get access to the programs listed here.

  • guppi_status - Run on beef or gpu nodes, this shows a gbtstatus-style text window of the current status shared memory. Check that the observer's name, project id, source name, etc. are set correctly on the screen. guppi_status updates approximately once per second, so quantities that change at that cadence (such as LST, AZ, and ZA) should be visibly updating. If you only see about half of this information, or if the information is stale, see the troubleshooting section below.

  • guppi_gpu_status - Run on beef, shows an overview of the status of all gpu nodes. Note how all 8 of the GPU nodes have consecutive frequencies listed as well as similar statuses (in this case "exiting"). The logs can differ a bit, though, due to filesystem caching and other issues.
guppi_gpu_status_screenshot.png

  • guppi_adc_hist - Run on beef, shows a histogram of count values from the GUPPI ADCs:

There will also be text output listing how far away from the target values the levels are:
ADC Power level info:
CM4 (FPGA3): Mean=-1.255 RMS=4.611 Min=-20 Max=16
       Remove 12.7 dB attenuation (for target RMS 20.0)
CM8 (FPGA1): Mean=-0.683 RMS=5.907 Min=-26 Max=22
       Remove 10.6 dB attenuation (for target RMS 20.0)
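
The quoted attenuation adjustments follow from the ratio of the target RMS (20 counts) to the measured RMS, expressed in dB. A sketch of the arithmetic (not the actual Balance() code):

```python
# dB of attenuation to remove so the measured ADC RMS reaches the target
# of 20 counts: 20 * log10(target / measured).
import math

def atten_change_db(measured_rms, target_rms=20.0):
    """dB of attenuation to remove (positive) or add (negative)."""
    return 20.0 * math.log10(target_rms / measured_rms)

print(round(atten_change_db(4.611), 1))  # 12.7 dB, matching CM4 above
print(round(atten_change_db(5.907), 1))  # 10.6 dB, matching CM8 above
```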
   

  • guppi_monitor - Run on beef, shows a real-time bandpass display only for incoherent modes.

The L-band bandpass should look something like this:

The S-band bandpass should look like this:

  • pav and psrplot - Can be run on beef or gpu nodes to view data file contents. These are data display programs that are part of the PSRCHIVE data analysis software. These will only work on fold-mode or cal-mode files.

  • Log files - guppi_daq logs status information on beef in /tmp/guppi_daq_status.log

Troubleshooting

If GUPPI appears not to be working correctly when a scheduling block is run, there are several things to check, depending on what exactly the issue is.

guppi_status is not updating every second or has old or incorrect looking data in it

There could be a problem with the GBT Status database. Check the GBT Status tab in Astrid and see if things are correct and updating there. If they are not, ask the operator to re-start GBT status. If they are updating there, but not in the guppi_status window, then the guppi_status daemon on beef needs to be restarted. This can only be done by root on beef via the supervisorctl command. Call Paul or Scott.

The bandpass in guppi_monitor looks terrible and not like a bandpass

More than likely, "guppi.scale" in your configuration file is set incorrectly. Correct values are usually in the range 0.1-1.0, depending on the BW and integration time you are using. If the values in guppi_monitor seem to be wrapping wildly, try decreasing "guppi.scale" by a factor of 3-10. If the bandpass is flatlined, try increasing it by a similar amount. Remember that you need to re-configure and then re-start a scan for the change to take effect.

GUPPI is dropping a lot of packets in an incoherent mode

If guppi_status or the guppi log files show that a significant fraction of the data are being consistently dropped, there can be several issues.
  • You may be running low on disk space on /data1 or /data2. Use "df -h" to check; if needed, re-configure with the other data directory via "guppi.datadisk" and re-start the scan.
  • There could be data transfers in progress by one or more people (typically using "rsync") that are affecting the available I/O rate. Use "ps aux | grep rsync" and/or "top" to see if there are any active rsync users, especially those with an associated sshd process as well. Accidentally using the ssh protocol over the 1gig (as opposed to 10GbE) network for internal GB transfers can easily cause many dropped packets. Also check whether the processes are "nice"d. If you find this, either call Paul and/or Scott, or, even better, contact the person running the rsync and have them stop.
  • There may be people doing significant data analysis on the same /data[12] that you are writing to. You can contact the person and have them "nice" their processes, change to the alternative /data[12], or (as a final option) ask the operator to kill the processes. Note that data processing by 1 or 2 properly niced processes does not usually cause many dropped packets.

GUPPI will not take data at all in either an incoherent or coherent mode

If guppi_status contains incorrect information, or if it (and also guppi_gpu_status for coherent modes) has basically correct-looking information but a scan won't start or record data, what has probably happened is that the GUPPI manager has somehow gotten out of sync with the GUPPI hardware. This can often happen after software work on the GBT system. The solution is to ask the operator to open the GUPPI Manager via Utilities/Device Explorer and do a "Conform Parameters" (via the Manager menu option) and a "Prepare" (button at the bottom of the screen). If you are in the gateway, you can perform this yourself as an observer. Once the prepare has finished and the status in the Manager is "Ready", try re-configuring the system via Astrid and taking a scan again. This simple procedure fixes many of the issues we see with GUPPI.

If the GUPPI Manager shows itself as "NotConnected" or in some kind of error state, there are a couple of things that can be tried with the manager.
  1. Make sure that the manager is "On". If it isn't, turn it on via the Manager menu option.
  2. Make sure that the Parameter called "test_mode" has a value of "production" rather than "test".
  3. If both of the above are OK but there are still issues with the GUPPI manager Status or State, try turning the manager "Off", waiting a few seconds, and then turning it back "On" again. If the Status goes back to "Ready", try a "Conform Parameters" and a "Prepare".
  4. If the Manager is still not talking to GUPPI properly, there is probably an issue with the GUPPI Controller on beef. To reset that, you need to be root on beef and re-start the GUPPI controller via supervisorctl.

In a coherent observation mode, one or more of the GPU nodes is not working

The first thing you should do is decide whether you want to abort your observation in order to fix the problem (which may or may not be fixable). If you decide to proceed, here are a few steps to take:
  1. Can you ping the node in question? If so, ssh to it as gpu (e.g. "ssh -l gpu gpu1") and run guppi_status on the node. If the status looks good and is updating, there is probably a problem with GUPPI itself or another hung GPU node. You can check for the latter by checking the uptimes (for instance) on all the GPU nodes:
    • Try "gpu_broadcast uptime" or equivalently "gpu_run uptime". If any of the uptimes are very short, it could indicate that a node crashed and re-started incorrectly. If that has happened, the guppi_status screen will not be complete or updating properly. You can restart the GUPPI daemons on a node using the command "source /home/gpu/gpu_startup" as root. Note that this requires Paul or Scott (or someone else with root permissions on the cluster).
  2. If you can't ssh into the node, the node needs to be (properly) rebooted. You can ask the operator to hit the reset switch on the GPU node. When it reboots, ssh to it as gpu and check guppi_status to see if it looks OK.
  3. If guppi_gpu_status shows one of the nodes in a strange mode, perhaps with one of the statuses as "Unknown" or in a state very different from the other nodes (like "Exiting" when all the others are running), the GPU node may be in a strange hung state. If this happens, the node may need to be rebooted. Call Scott or Paul, or in an emergency, ask the operator to simply hit the reset switch.

Checking that GUPPI is OK following M&C version tests (or any other non-standard activities)

Follow the Conform Parameters instructions given in the "GUPPI will not take data at all ..." section above. If this succeeds with no errors reported by the manager, GUPPI is working correctly. Running a test scan is not usually necessary. Conform parameters is a safe way to put GUPPI back into a known-good state, and should be run after any maintenance/testing activity that may have affected the state of GUPPI or the M&C system.

How to do a complete restart of all GUPPI software

This is how to completely reset everything, if necessary
  1. (as root on beef) "/etc/init.d/supervisord stop" -- this stops all the currently running processes.
  2. (as root on beef) "/etc/init.d/guppi restart" -- clears out and reallocates shared memory on beef.
  3. (as a user account with ssh access to root on the gpu nodes) "gpu_run -r source /home/gpu/gpu_startup" -- restarts all gpu cluster software
  4. (as root on beef) "/etc/init.d/supervisord start" -- restarts all software on beef

How to change the list of GPU nodes for coherent GUPPI

The GUPPI GPU compute cluster has 9 nodes, but only uses 8 at a time for observing, so there is one spare. If a node dies, it can be replaced with the spare using the following procedure:

  1. (requires monctrl access) Edit the file /home/gbt/etc/config/Guppi.conf to remove the bad node and put the new node's IP address into the list. These parameters are IP0, IP1, ... IP7 in the COHERENT_MODE section of the file. The IP addresses are entered as hex values. Valid numbers for each node are listed in the config file comments.
  2. On beef, edit the file /opt/64bit/guppi/guppi_daq/gpu_nodes.cfg to remove the bad node and add the new node. The ordering of entries is important and should match the ordering used in Guppi.conf. It is a good idea, although not strictly required, to keep the nodes in increasing order of their names.
  3. On beef, make sure that none of the nodes are commented out in /opt/64bit/guppi/lib/python2.7/site-packages/guppi-2011.01.26-py2.7.egg/guppi/demux.py. This controls which nodes receive a start command from the GUPPI manager.
  4. Follow all the instructions in the "How to do a complete restart..." section above.
  5. The GUPPI manager needs to be restarted (turned off then on). Use Device Explorer in CLEO, or ask the operator to do this for you.
  6. After the manager is restarted, run Conform Parameters.
  7. Make sure the auto-plotting script displays the right nodes by checking that the new node is listed in the puppi*.shtml files in /home/www.gb.nrao.edu/content/guppi/plots. You can use sed to replace the out-of-service node with the new one.
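
For step 1, the hex encoding of the IP addresses can be sketched as a dotted-quad-to-hex conversion (the example address below is made up; the valid values for the real nodes are listed in the Guppi.conf comments):

```python
# Guppi.conf's IP0..IP7 entries in the COHERENT_MODE section are IP
# addresses packed into single hex words, one octet per byte.

def ip_to_hex(dotted):
    """Pack a dotted-quad IPv4 address into a single hex word."""
    octets = [int(o) for o in dotted.split(".")]
    assert len(octets) == 4 and all(0 <= o <= 255 for o in octets)
    value = 0
    for o in octets:
        value = (value << 8) | o
    return "0x%08X" % value

print(ip_to_hex("10.0.0.1"))  # 0x0A000001 (illustrative address only)
```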

-- PaulDemorest and ScottRansom - 2012-07-30
Topic attachments
  • BlockDiagramALL.jpg - 61.3 K - 2014-04-28 - PaulDemorest - GUPPI hardware block diagram
  • guppi_gpu_status_screenshot.png - 74.3 K - 2012-07-30 - ScottRansom
Topic revision: r13 - 2017-07-13, RyanLynch
 
