Mustang 1.5 Data Transmission

1. Intro

The Mustang 1.5 firmware is producing data that needs to get off the Roach board somehow and onto another machine. How to do this?

2. Background

2.1 VEGAS Data Transmission

  • VEGAS Data Transmission diagram:
    VegasDataTx.jpg

2.2 VEGAS vs. Mustang:

  • VEGAS uses a ROACH2 board, Mustang uses a ROACH1 board
  • ROACH1 FPGA only has access to 10 GbE ports
    • too expensive so we must use 1 GbE port on PPC
    • 1 GbE will be sufficient for data tx rates
    • FPGA cannot send data out, so we may need process on PPC to do it
  • Don't need HPC server since lower data rates, so packets will be sent to manager

3. Overview

The other Green Bank FPGAs get their data off the Roach via a 10 GbE connection. This is done directly in the firmware, and there are Casper tutorials on how to use these blocks. However, for various reasons, this is not an option for us. We have to use the 1 GbE port on the Roach, even though in practice it is limited to about 10 MB/s.

Here are the elements of how we are trying to do this:
  • firmware writes data products to two simple shared memory blocks on the Roach, 'ping ponging' between the two: a software register indicates which memory block is currently being written to, while the other is safe to be read.
  • a process on the Power PC (the linux end of the Roach) uses the software register to know which shared memory block is free to read.
  • after a successful read of this memory is completed, the data can be processed and then transmitted via UDP.
  • a separate host receives the transmitted data.
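
The reader side of this handshake can be sketched as follows. This is a minimal Python simulation under stated assumptions, not the PPC code itself: the FakeFirmware class, its attribute names, and the 16-byte block size are hypothetical stand-ins for the software register and the two shared memory blocks.

```python
class FakeFirmware:
    """Hypothetical stand-in for the two shared BRAMs and the switch register."""

    def __init__(self):
        self.brams = [bytearray(16), bytearray(16)]
        self.switch_reg = 0   # index of the BRAM currently being written
        self.counter = 0

    def write_block(self):
        """Fill the BRAM under write with a new value, then toggle the switch."""
        self.counter += 1
        self.brams[self.switch_reg][:] = bytes([self.counter % 256]) * 16
        self.switch_reg ^= 1  # the writer moves on to the other BRAM


def read_free_bram(fw):
    """Read the BRAM that the firmware is NOT currently writing."""
    free = fw.switch_reg ^ 1
    return free, bytes(fw.brams[free])


fw = FakeFirmware()
fw.write_block()              # firmware fills BRAM 0, then switches to BRAM 1
index, data = read_free_bram(fw)
print(index, data[0])
```

The reader never touches the block named by the switch register, so a read can only be corrupted if it takes longer than one write period.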

3a. Data rate requirements

According to DaqReadoutMrM2, the required data rate for Mustang 1.5 for full demodulated data is 41 Mbps for the instrument, which will likely include 2 ROACH boards. Therefore the required rate per ROACH is about 21 Mbps (41 / 2 = 20.5).
  • Comments in "ROACH gigabit Ethernet status" emails in the CASPER mail archive:
    • Dan Werthimer: i think all roach1's PPC ethernet ports need to be used at 100Mb/sec port speed, independent of when the boards were made, and not at 1Gbit/sec. It's tricky to write PPC software that gets more than 20Mbit/sec out of the port, but i'm not sure about this 20Mbit/sec number.
    • Jason Manley: I bank on 10Mb/s on ROACH1, which is achievable using the normal katcp and tcpborphserver2 without any trickery. You can get better if you write your own PPC code.
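
To compare the per-ROACH requirement with the figures quoted in these emails, it helps to put it in both bits and bytes per second. The short calculation below uses only the numbers stated above and assumes the load is split evenly across the two boards.

```python
instrument_mbps = 41.0   # megabits per second for the whole instrument
n_roaches = 2            # assumed number of ROACH boards, split evenly

per_roach_mbps = instrument_mbps / n_roaches   # megabits per second
per_roach_MBps = per_roach_mbps / 8.0          # megabytes per second (8 bits per byte)

print(per_roach_mbps)    # 20.5
print(per_roach_MBps)    # 2.5625
```

So the requirement sits right at the 20 Mbit/sec figure mentioned by Dan Werthimer, and at about twice the 10 Mb/s that Jason Manley reports for the stock katcp and tcpborphserver2 setup.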

4. Details

4.1 Data packet description

As a result of a data transmission meeting held May 30, 2013, we have defined our data packet to be in this format:
  • frame counter (for detecting packet loss) - 32 bits/4 bytes
  • clock counter (for timestamping) - 32 bits/4 bytes
  • up to 256 channels of data, 16 bits/2 bytes each = maximum 512 bytes
    • Note the number of channels is now variable and defined in a config file and set in a software register
  • Maximum packet size for max channels = 520 bytes
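
As an illustration of this layout, the sketch below packs and unpacks one packet with Python's struct module. The big-endian byte order and the treatment of each channel as an unsigned 16-bit value are assumptions made for the example; the function names are hypothetical.

```python
import struct

# Header: frame counter and clock counter, 32 bits each (big-endian assumed).
HEADER = struct.Struct(">II")


def pack_packet(frame, clock, samples):
    """Build one packet: 8-byte header followed by 16-bit channel values."""
    body = struct.pack(">%dH" % len(samples), *samples)  # unsigned assumed
    return HEADER.pack(frame, clock) + body


def unpack_packet(buf):
    """Split a packet back into (frame counter, clock counter, channel list)."""
    frame, clock = HEADER.unpack_from(buf, 0)
    n_chans = (len(buf) - HEADER.size) // 2
    samples = struct.unpack_from(">%dH" % n_chans, buf, HEADER.size)
    return frame, clock, list(samples)


pkt = pack_packet(7, 1234, list(range(256)))
print(len(pkt))   # 520 bytes for the maximum of 256 channels
```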

4.2 Timestamping

Timestamping will be done with a frame counter, as it is for GUPPI, VEGAS, and DIBAS. The frame counter is zeroed out on the first 1 PPS tick after the Arm() command is issued (which sets an "ARM" software register in the firmware) when the scan starts. Between knowing the start time, how fast the instrument is running, and the frame counter you can generate time stamps. The counter gives you the fractional part of the second, in units of the FPGA clock, then the computer adds the integer number of seconds to this calculated fractional part.

In more detail, the main computer controlling the observation would know the time that it armed the instrument to within a few milliseconds. It would know that the 1 PPS happened after the arm() command, and for packet counter == 0, the time at start of the scan is:

floor(Tarm) + 1 second

For each packet, then, the time on the packet is

Tp = floor(Tarm) + 1 sec + (Packet Counter * resolution of Packet Counter)
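
The formula above translates directly into a small helper. This is only a sketch: the function name is hypothetical, and the 50 microsecond resolution in the usage line is an assumed example value, not a number taken from the firmware.

```python
import math


def packet_time(t_arm, packet_counter, counter_resolution):
    """Timestamp of a packet, in seconds.

    t_arm              -- time the arm command was issued (seconds)
    packet_counter     -- counter value carried in the packet
    counter_resolution -- seconds per counter tick (assumed known)
    """
    scan_start = math.floor(t_arm) + 1.0   # first 1 PPS tick after arming
    return scan_start + packet_counter * counter_resolution


print(packet_time(1000.4, 0, 50e-6))       # 1001.0, the start of the scan
print(packet_time(1000.4, 20000, 50e-6))   # roughly one second later
```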

The main computer starts saving the packets when the packet counter starts over at zero, but the packets are streamed all the time to the computer so you can see the data between scans.

Another use of the frame counter is to detect dropped packets, so that you can handle it accordingly in the data capturing software.
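
A minimal sketch of that check is shown below. It assumes the frame counter increments by one per frame and wraps at 32 bits (the width given in 4.1); the function name is hypothetical. A counter reset caused by arming would look like a large gap to this function and has to be handled separately.

```python
def count_dropped(prev_frame, frame, modulus=2 ** 32):
    """Number of frames missing between two consecutively received counters."""
    return (frame - prev_frame - 1) % modulus


print(count_dropped(5, 6))             # 0, nothing lost
print(count_dropped(5, 9))             # 3, frames 6, 7 and 8 are missing
print(count_dropped(2 ** 32 - 1, 0))   # 0, a normal wrap-around
```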

mba15Timestamping.jpg

4.3 Firmware - writing data to shared memory

The 'ping pong' method described in the overview has not been implemented in the real model yet. Instead we are using separate models that simulate our output. Models are in /export/home/ptcs/scratch/models/mba15/mdls.

4.3.1 Mock Ping Pong Bram #1

Here is a brief description of ping_pong_bram_counter.mdl, our first attempt at a mockup of what we want the firmware to do:

  • inputs:
    • ttl_rate - sets the rate at which data is written. The system clock is set to the ADC's 256 MHz clock, so data will be written at a rate of (256 MHz / ttl_rate). Ex: (256 MHz / 256000000) = 1 Hz.
    • reset - toggle this from zero to one and back again for the system to reset all its counters and pick up the newly set ttl_rate.
  • outputs:
    • Shared_BRAM, Shared_BRAM1: these are the two shared memory blocks, each with a size of 4096 B. The data in them should look like this: the first 8 bytes contain the value of the switch_cntr_reg (this is effectively our counter), and the next 254 4-byte words hold a value that increments from 2 up to 254. This is just dummy data to make sure we're reading it correctly on the other end.
    • switch_reg - a software register that toggles between one and zero. A value of 1 means that BRAM1 is being written to (then it's safe to read the other BRAM), and vice versa.
    • switch_cntr_reg - this gets incremented each time switch_reg toggles. If a subsequent read of this register is more than 1 greater than the previous read, you aren't reading it fast enough. This gets reset to 0 when the reset software register is toggled.
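
The relationship between ttl_rate and the write rate is simple division by the 256 MHz clock, and can be worked in either direction. The helper names below are hypothetical.

```python
ADC_CLOCK_HZ = 256000000   # 256 MHz system clock, as stated above


def write_rate_hz(ttl_rate):
    """Rate at which data is written for a given ttl_rate register value."""
    return ADC_CLOCK_HZ / ttl_rate


def ttl_rate_for(rate_hz):
    """ttl_rate register value that gives (approximately) the wanted rate."""
    return int(round(ADC_CLOCK_HZ / rate_hz))


print(write_rate_hz(256000000))   # 1.0 Hz, the example given above
print(write_rate_hz(12800))       # 20000.0 Hz
print(ttl_rate_for(20000))        # 12800
```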

So, this is a pretty good simulation of how we want to implement things in our real model, except for the following:
  • Here, the BRAMs are basically treating data points as 32 bits wide (4 B) instead of 16 bits wide (2 B).
  • Here, after the switch reg. gets toggled, we simply write the counter then the 254 data points then wait for the switch to toggle again. In the real implementation, we will keep writing to the BRAM till it's full, then toggle.

So, if you set the ttl_rate to write out data at 1 Hz, you can use 'hd' to watch the software registers toggle and increment, and watch the first 8 bytes of the BRAMs increment appropriately as well.

ping pong.jpeg

4.3.2 Mock Ping Pong Bram #2

We have created another set of models that comes closer to what we will need to implement in the firmware. See the series of models entitled: ping_pong_bram_vn.mdl.

These models use a lot of the same concepts (and blocks) as the model mentioned previously in 4.3.1, but improvements include:
  • PPS signal coming out of the ADC block
  • Proper data headers included in data (i.e. frame counters)
  • Proper use of BRAM (16-bit data only using up 2 bytes, for example)
  • On the fly configuration of memory usage (number of channels to simulate, and number of data packets, or frames to write to each BRAM before switching over to the next one)
  • 'Arm' mechanism, which resets the counter found in the header

For testing, there is also a corresponding Python class in /home/sandboxes/pmargani/mba15/scripts/ChannelizerControls: PingPongBram.py. Here's some example usage:

$ source /home/gbt7/newt/McPython.bash
$ source /home/gbt7/newt/Mustang1.5/mustang.bash
$ export YGOR_TELESCOPE=/home/sim   #needed for config file path, else uses '.'
$ python
>>> from PingPongBram import PingPongBram
>>> p = PingPongBram()                           #or for different roach: p = PingPongBram('vegas-r1')
Connecting to roach mustang-r1 on port 7147                           
Setting firmware num_channels to 64                                   
Setting firmware num_packets to 60                                    
Setting firmware ttl_rate to 20000
>>> # To override defaults:
>>> p.setTtlRate(10000)
Setting firmware ttl_rate to 10000                                    
>>> p.setNumChannels(128)                                             
Setting firmware num_channels to 128                                  
Setting firmware num_packets to 30
>>> p.run()     # note counters are already messed up in this example and don't increment!
read:  Shared_BRAM                                                    
  i: 0; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 1; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 2; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 3; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 4; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 5; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 6; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 7; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 8; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 9; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 10; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 11; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 12; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 13; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 14; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 15; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 16; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 17; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 18; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 19; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 20; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 21; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 22; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 23; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 24; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 25; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 26; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)     
  i: 27; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 28; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
  i: 29; frame: 3.997758e+06; clk: 4.128832e+06 (65, 66, 67, 68) .. (57, 58, 59, 60)
elapsed switches: -1767291960; elapsed seconds: 0.003395
>>> # To reset system:
>>> p.loadBof("")    # stops loaded bof file
>>> p.loadBof()       # loads last file in listbof()
>>> # now reload config values
>>> p.setTtlRate(20000)
>>> p.setUpFirmware()     # loads num_chans and num_packets
>>> p.run()
>>> # now counters increment and it keeps going

TBD: - the firmware doesn't always work when you reconfigure with different # of channels and packets. You may have to restart the bof file and this python class.

4.3.2.1 Mock Ping Pong Bram #2 testing strategy

1. User sets up firmware using PingPongBram.py

   * gets default values for roach name, roach port, bof file, number of channels, and ttl rate from config file Rcvr_MBA1_5_Xmit.conf; number of packets per read/write is computed based on: floor(8192/packet_size)
   * user can override any of these with set methods
   * user starts bof with loadBof() using default bof file or specifies another one (I think at some point we'll have tcpborphserver load the real bof file at startup, or we could do this now)
   * user runs setUpFirmware() to set ttl_rate, num_packets, and num_channels registers

2. User starts dataTransmit on roach
   * default receiver host is colossus, or use -r argument
   * default port number is 50000, or use -p argument
   * default bof file is ping_pong_bram_v4, or use -b argument for another boffile
      * bof file is already running at this point
      * could we use a popen command to determine what bof file is running?
   * dataTransmit reads num_packets and num_channels registers to determine read size, then reads bram and transmits data
   * Runs continuously

3. User starts dataClient on receiver host
   * command-line args: data dir (optional, default "/tmp"), scan number (optional, default 1), udp port (optional, default in config file), scan length (required)
   * config file input: ROACH name and port, udp port
   * reads num_packets and num_channels from software registers in firmware using RoachInterface
   * sets arm register at beginning of scan and resets at end
   * receives data and writes to FITS then exits

4. To change UDP port:  use -p option on dataTransmit then (a) change config file and run dataClient or (b) run dataClient with -p option; possibility for inconsistency here but can't be helped

To change the basic firmware setup between "scans":
   a. manually stop dataTransmit on roach
   b. use PingPongBram.py to change whatever (system is disarmed at this point)
   c. restart dataTransmit on roach
   d. run dataClient as usual
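
The packets-per-read rule quoted in step 1, floor(8192/packet_size), can be sketched as follows. The sketch assumes the packet size is the 8-byte header from 4.1 plus 2 bytes per channel; the function names are hypothetical.

```python
HEADER_BYTES = 8        # frame counter + clock counter, 4 bytes each
BYTES_PER_CHANNEL = 2   # 16-bit data
BRAM_BYTES = 8192       # BRAM size used in the formula above


def packet_size(num_channels):
    """Bytes in one packet (frame) for the given number of channels."""
    return HEADER_BYTES + BYTES_PER_CHANNEL * num_channels


def packets_per_bram(num_channels):
    """floor(8192 / packet_size): whole packets that fit in one BRAM."""
    return BRAM_BYTES // packet_size(num_channels)


print(packets_per_bram(64))   # 60
print(packets_per_bram(32))   # 113
```

These values match the 60 and 113 frames per packet reported elsewhere on this page for 64 and 32 channels. For 128 channels the expression gives 31, whereas the session shown in 4.3.2 reports 30, so the real script presumably applies a further constraint that is not described here.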

4.4 Power PC - reading and transmitting data

4.4.1 tcpborphserver2 - mba mode

4.4.1.1 tcpborphserver2

  • See also: BorphDevelopment
  • This is a daemon started at PowerPC startup which makes commands for interacting with the firmware available to a telnet session.
  • Normally, the tcpborphserver daemon is run from /usr/local/sbin and is a link to an executable in the same directory; so you would copy your executable here and redo the link
  • For testing, you could stop the running tcpborphserver and run your own from the command line
  • C code must be compiled on the roach board
  • Debug: there is a DEBUG flag in the Makefiles if you need better borphserver output! Usually commented out.

4.4.1.2 'mba' mode in tcpborphserver2

  • Details: MbaMode
  • We created a new 'mode' for the tcpborphserver2 on the PPC. This allows us to create new KATCP commands that can be issued via telnet. In this new 'mode' we read the appropriate software registers and shared memory, and transmit the data.
  • We started developing the mba mode in root@roach:/boffiles/mba_devel/tcpborphserver2. Then we decided we needed our own development areas while developing at the same time: /boffiles/paul_devel and /boffiles/pam_devel
  • Pam created tcpborphserver commands to use with the ping_pong_bram model; run /boffiles/pam_devel/tcpborphserver2/tcpborphserver2 on roach
    • starts in 'mba' mode and runs /boffiles/ping_pong_bram_counter_2013_Jun_11_1350.bof (note that some modes need a certain boffile because the register names to read/write are hard-coded in the commands)
    • Note: mode will not start if boffile is already running, make sure you kill it before restarting tcpborphserver!

4.4.1.3 Results

  • Commands added to mba mode of tcpborphserver2:
    • katcp scheduler: capture_start and capture_stop
      • starts/stops capture of 1024 B using the katcp scheduler to run a function on an interval. Minimum interval is 1 msec. Set ttl_rate to 12800.
        • interval is too slow; for a flux ramp of 20 kHz, need to read the bram every 0.05 msec
      • bram read & transmit ranged from 0.4 - 0.9 msec
      • messages seemed to indicate that process was not meeting schedule:
        • #log info 159212205198 roach.mba detected\_time\_warp\_or\_stall,\_rescheduling\_periodic\_task\_for\_159212205.197787s
      • on destination (tofu), run udp_recv:
        • $ udp_recv -s 1024 -p 7000 -e -a mustang-r1 (packet size 1024, port 7000, endian (byte-swap), print packet nums, sent from mustang-r1)
        • As expected, the client receiving the data reported many dropped packets, about a 99% drop rate.
      • Conclusion: this will not work. That's too bad, because the katcp session returns control to the client while the capture keeps running.
    • single read/transmit: mba_read_bram
      • reading a 1024 B block of data from the bram and transmitting it takes a total of ~0.623 msec (average of 10 trials), similar to the standalone process reported below.
  • Integrating standalone process into tcpborphserver2
    • At startup into mba mode, loads bof file specified in config file, sets defaults, starts dataTransmit thread which immediately starts reading BRAMs and sending the data over UDP
    • To change number of channels selected, use command mba-set-chans, which recalculates number of frames per packet and packet size, then restarts dataTransmit thread with this information.

4.4.2 stand alone process

We also implemented some tests in a stand alone C program, which took a lot of its motivation from the tcpborphserver2 code mentioned above and from the udp demonstration code Paul Demorest shared with us via John Ford.

This stand alone code can currently be found at /export/home/tofu/cicadaroots/vegasbof/paul_devel/c_test AND paul_devel/dataTransmit (TBD - check this code in somewhere?). The latest stable working version is in dataTransmit. An executable can be built on the Roach PPC using: gcc -I. -g -Wall 'files.c' -o 'executable_name'

4.4.2.1 Reading Data

Conceptually, the method for reading the data is simple: treat the software registers and shared memory like files and use open, close, read and lseek to manage this data, which exists in /proc//hw/ioreg. As described above, the software registers are read in order to determine which BRAM to read, and counters are checked to see if anything is getting skipped.
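
The file-style access pattern can be illustrated in Python with the same open, lseek, read and close calls. This is a sketch only: the real process is written in C, the helper names are hypothetical, big-endian register contents are an assumption, and ordinary files in a temporary directory stand in for the register and BRAM entries so the example can run anywhere.

```python
import os
import struct
import tempfile


def read_register(path):
    """Read a 4-byte software register exposed as a file (big-endian assumed)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, 0, os.SEEK_SET)
        raw = os.read(fd, 4)
    finally:
        os.close(fd)
    return struct.unpack(">I", raw)[0]


def read_bram(path, offset, nbytes):
    """Read nbytes from a shared-memory file, starting at the given offset."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        return os.read(fd, nbytes)
    finally:
        os.close(fd)


# Stand-in files; on the Roach these would be the ioreg entries instead.
tmp = tempfile.mkdtemp()
reg_path = os.path.join(tmp, "switch_reg")
bram_path = os.path.join(tmp, "Shared_BRAM")
with open(reg_path, "wb") as f:
    f.write(struct.pack(">I", 1))
with open(bram_path, "wb") as f:
    f.write(bytes(range(16)))

print(read_register(reg_path))
print(read_bram(bram_path, 8, 8))
```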

4.4.2.2 Transmitting Data

The udp demonstration code is available at /home/sandboxes/pmargani/udpCode and /export/home/tofu/cicadaroots/vegasbof/paul_devel/udp_test (TBD - check this code in somewhere?). The two relevant files are udp_send.c and udp_recv.c.

We tested the data transmission rates between various hosts using this code. With the exception of host 'trent', the following hosts all demonstrated a maximum data transfer rate of about 11 MB/s: vegas-r1, nereid, tofu, arcturus, colossus.

For our tests, we simply copied the relevant code from udp_send.c into our PPC process for transmitting the data.
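
For readers without access to udp_send.c and udp_recv.c, the basic send/receive pattern looks like the following Python sketch. It is not a port of that code: it sends a single 1024-byte datagram over the loopback interface, and the port is chosen by the operating system rather than taken from any configuration.

```python
import socket

# Receiver: bind to loopback and let the OS pick a free port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)            # do not block forever if the packet is lost
addr = receiver.getsockname()

# Sender: one datagram carrying a 1024-byte block of dummy data.
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
payload = bytes(1024)
sent = sender.sendto(payload, addr)

received, _ = receiver.recvfrom(65535)
sender.close()
receiver.close()

print(sent, len(received))
```

UDP gives no delivery guarantee, which is why the packets carry a frame counter for loss detection.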

4.4.3 Using Memory Mapping

On the Casper mailing list, there was a discussion started by Ross Williams at Caltech entitled 'bram and katcp'. This thread discusses the same data transmission problem we have. The solution presented seems to enable much faster read rates of the BRAMs by the PPC process. We've investigated using this method ourselves, but as of this writing (Dec. 2013), it seems that our current performance is good enough. Therefore, this section provides notes in case we need to improve our performance on the PPC.

There seem to be two main steps in implementing the performance upgrade described in the Casper email thread: updating the Roach's PPC kernel to support memory mapping, and then modifying Ross Williams's code to take advantage of memory mapping to read our BRAMs and transmit their data.

4.4.3.1 Upgrade PPC Kernel

Why do we have to upgrade the PPC Kernel? The current kernel does not support loadable device drivers, and we need a new device driver to map the FPGA's memory.

It seems that we successfully completed most of these steps. I don't understand all of them, but I'll attempt to list what we did (skipping my mistakes along the way):
  • https://casper.berkeley.edu/wiki/FPGA_Device_Driver_Memo
  • under 'Kernel Source', clone the git repo described there.
  • we won't be building the kernel from this source, but instead Wolfgang can point the system to the uImage-roach-mmap binary at the top of the repo.
  • Reboot the roach, and follow the steps on the wiki page entitled 'Steps to follow':
    • For step 2, you must use a laptop to connect to the serial port of the roach, and run the hyper terminal program so that you can interrupt the autoboot process using the serial console (Jason Ray showed me all this).
    • After hitting 'any key', enter the 'setenv' commands listed in step 2 (this is necessary in part due to the size of the uImage-roach-mmap).
    • In addition, we also have to 'setenv bootcmd roachboot'; saveenv (use printenv to view these vars in uboot)
  • After the roach boots back up, double check you have the right kernel: uname -r should give: 3.10.0-saska-08613-gf750886
  • step 3 creates what we'll read from (/dev/roach)
  • step 4: we haven't done this yet, but you need to build 'tcpborphserver3', which will run on roach1 despite the name. 'tcpborphserver2' will not work with this new kernel (it fails to load a bof file).

4.4.3.2 Using Memory Mapping

Okay, so now, how to use memory mapping and make a huge improvement in performance? Ross Williams pointed me to his code on github: https://github.com/tweekzilla/ccat-wfs-software.

Until we finish the step above and get everything working (including the new kernel and the new tcpborphserver3), we can't really run this code, not to mention the fact that it's reading memory addresses for a boffile we don't have. However, I tried to understand it by trimming it down to run in a test environment on a linux box like colossus. This has been pretty successful so far: I've got it reading a test file and transmitting the data. The next step would be to make sure I know how to use the boost tcp libraries, then write a client to make sure the data is being properly transmitted. My code so far is at /home/sandboxes/pmargani/mba15/dataTransit2 (not committed to that repo).

Presumably, if we revisit this and I get it working, I'll add more here.

4.5 Client - receiving data

For simply receiving the data, we did not need to make any modifications to udp_recv.c, apart from some code to check that the entire packet's data looks uncorrupted. Note this code also has an option for writing the data to disk.

We have been developing a C++ client for a) communicating with the Roach, b) receiving the transmitted data c) writing that data to disk. This code can be found in /home/sandboxes/pmargani/mba15/dataClient. The purpose of this application is to enable the writing to disk of timestamped MBA 1.5 data for engineering tests (before the M&C manager is ready).

The current design attempts to reuse code borrowed from elsewhere, and aims to produce building blocks that can then get passed on for M&C Manager development. Classes include:
  • RoachInterface - this is stolen directly from M&C Vegas code. It enables communication with the Roach via katcp. Note that building this requires linking to the M&C ygor libraries.
  • ReadData - this is a class responsible for capturing the transmitted data, parsing it into structures, and interpreting the counters into a timestamp. It also writes data to disk (There are byte ordering issues with this).

4.5.1 Current status example

As of this writing, here's an example of what the dataClient does:

  • opens a katcp connection to the roach
  • fires off the thread that uses ReadData to capture the data stream
  • sends an 'arm' command to the roach
  • when the 'zero' frame counter is found in the data stream, the clock counter is used to apply a timestamp to this frame.
  • when the specified 'scan' duration has elapsed, the data stream (and disk writing) is stopped
  • program exits.

5. Performance

5.0 Theory

>>> nChans = 32*2.
>>> dumpSize = 2*nChans # bytes
>>> fRamp = 20000. # Hz
>>> dumpPeriod = 1/fRamp # secs
>>> bramSize = 4096. # bytes
>>> nDumpsPerBram = bramSize/dumpSize
>>> fBram = fRamp / nDumpsPerBram # Hz
>>> bramPeriod = 1 / fBram # secs

So, for a flux ramp frequency (fRamp) of 20 kHz, the bram would have to be read/transmitted every bramPeriod of 1.6 ms for 64 channels. This leaves room for the 0.60 ms that one read/transmit seems to take. For a flux ramp frequency of 40 kHz, bramPeriod becomes 0.8 ms; this still seems doable (see below).

For 64 channels:
  • dumpSize = 128 bytes
  • nDumpsPerBram = 32
  • bramPeriod = 32/20000 (20 kHz flux ramp) = 1.6 msec
For 256 channels:
  • dumpSize = 512 bytes (really 516 with frame counter data header)
  • nDumpsPerBram = 7
  • bramPeriod = 7/20000 (20 kHz flux ramp) = 0.35 msec

5.1 Network Performance

How fast can we send data across the network?

It seems that between the roach 1 boards and our linux machines we are limited to about 10 MB/s.

We tested using the udp code John Ford shared with us (from Paul Demorest) on different hosts and with different packet sizes. The server was configured to transmit data as fast as possible (no delays). I was a little surprised by some of the results: the data rate, with few exceptions, hovered around 11 MB/s.

Hosts used (both as servers and clients):
  • vegas-r1
  • nereid
  • tofu
  • arcturus
  • colossus

Switching hosts caused changes in CPU load, but not in the data rate. I assume that this means we are bound by the network between the server and the client.

Tweaking the packet size only had a significant impact on the data rate as the packet size shrank from 8192 to 256. At that point, the data rate was reduced slightly (to 9 MB/s), but the load went way up (from about 0.10 to 0.40). I assume this makes sense, since we're getting hit by the overhead of each data transmission.

Some hosts are connected with 1 Gb links, and some with 100 Mb links. Tofu has a 1 Gb link.

With 1040 byte packets, I got 112 MB/sec between tofu and tank, which is consistent with 1 Gb links. Packet drop was high at 1e-4.

5.2 Reading/Transmitting BRAMs

We performed two types of tests. The first involved simply seeing how quickly a BRAM could be completely read and all its data transmitted. The code in /export/home/tofu/cicadaroots/vegasbof/paul_devel/c_test/bram_udp_test.c simply reads and transmits data in a loop, measures the real and user time elapsed, and calculates the average read/transmit time for an entire BRAM's worth of data (4080 B). Results:

  • Real time per read: 0.602382 ms
  • User time per read: 0.000602 sec

5.3 Reading/Transmitting ping ponging BRAMs

5.3.1 Trial #1: (Summer 2013?)

Our second kind of tests involved the full stand alone process described above (i.e. ping_pong.v4.c). The variables we changed included:
  • amount of data read/transmitted ('Size')
  • flux ramp rate (ttl_rate register, 'Switch Freq')

The table below shows our results. As an example, when the 'switch' register toggled at 5 kHz, and a quarter of the BRAM was read/transmitted at a time (256*4 = 1024 B), it was found that 15% of the time, the 'switch_cntr_reg' showed that we were missing toggles (in other words, we weren't keeping up); also, under these conditions, we measured 3.8 MB/s of data arriving at the receiving end.

Switch Freq (Hz)   Size (B)    Data Rate (MB/s)   % Missed
1                  8           ?                  0
1                  256*4       ?                  0
1000               256*4       0.76               0
5000               256*4       3.8                15
7500               256*4       3.3                72
5000               256*4*2     4.4                71
3500               256*4*2     4.4                20
2500               256*4*2     3.8                0
1000               256*4*2     1.56               0
2                  256*4*4     0                  0
1000               256*4*4     3.0                0
1500               256*4*4     4.5                0
2500               256*4*4     5.4                20

Notes:
  • 256*4*4 = 4096, the full size of the BRAMs, but our read failed unless we limited it to just 4080 B.
  • None of the examples with 0 % missed counters approaches our maximum transmission rate of 11 MB/s.

Looking at the highest data rate with no missed counters, we have a frequency of 1500 Hz and 4096 B data size. At this rate we are reading/transmitting a BRAM every 1/1500 Hz = 0.67 ms. This roughly agrees with the results of our first test, which showed that these BRAMs can be read/transmitted in 0.60 ms real time.

5.3.1.1 Reading vs. Transmitting

We did some comparisons of performance for reading/transmitting vs. just reading. Reading the entire BRAM (256*4*4 B), here's what we got:

Freq (kHz)   % Missed, Reading/Transmitting   % Missed, Just Reading
2.0          11                               2
2.5          40                               27
3.0          66                               53

So, this makes it clear that the transmission of the data is not negligible in terms of performance, but is it bad enough to warrant trying to buffer the data and then transmit it, perhaps in a different thread?

5.3.2 Trial #2: Nov. 21, 2013

We decided to revisit the performance of just the data transmission process running on the PPC. At this point, this was code from mba15/dataTransmit, and the firmware was ping_pong_bram_v4. We examined the theoretical data rate needed to be read and the actual rate read, and compared performance for just reading vs. reading and transmitting. Data rates were calculated by:

>>> packetBytes = 8160
>>> nFrames = 60
>>> fRamp = 20e3
>>> packetRate = fRamp / nFrames
>>> dataRate = packetRate * packetBytes
>>> dataRate
2720000.0
>>> dataRateMB = dataRate / (1024*1024.)
>>> dataRateMB
2.593994140625

  • Only Reading:

Flux Ramp (kHz)   Data Rate (MB/s)   Read Rate (MB/s)   % Reads Missed
20                2.59               2.56               0
30                3.89               3.85               0
35                4.54               4.50               0
40                5.18               5.13               0
45                5.83               5.77               0
50                6.48               6.40               0.31
55                7.13               6.40               10
60                7.78               6.40               20

  • Reading & Transmitting:

Flux Ramp (kHz)   Data Rate (MB/s)   Read Rate (MB/s)   % Reads Missed
45                5.83               5.75               0
50                6.48               5.8                10
55                7.13               5.8                21
65                7.78               5.8                32

  • Conclusions:
    • Just reading, we hit a wall of 6.40 MB/s
    • Just reading, at 6.4 MB/s, w/ each read at 7.78e-3 MB, that's about 1.2 ms per read. Hmm, that's 2 * 0.6 ms per read when our BRAM was 1/2 the size, 4080.
    • Reading and Transmitting, we hit a wall of 5.8 MB/s

5.4 Performance of final Data Transmission system

5.4.1 Trial #1: Sep. 10, 2013

Sep. 10, 2013: Here are some results of running the whole system: mock-up firmware, dataTransmit, dataClient.

Flux Rate (kHz)   # chans   # frames/packet   MB/s Received   performance   notes
1                 64        60                -               DONE          -
10                64        60                -               DONE          -
15                64        60                1.946           DONE          -
20                64        60                -               DONE          -
30                64        60                -               DONE          -
40                64        60                5.189           DONE          < 1% dropped packets
50                64        60                -               ALERT!        < 10% dropped packets, FITS missing writing all data!
60                64        60                5.876           ALERT!        lots of dropped packets

5.4.2 Trial #2: Oct. 31, 2013

Trial #2 was conducted completely on the UPenn system:
  • roach 'roach1'
  • host 'egret'
  • complete firmware (not a mock-up): umux_demod_v17q.
    • firmware configured by a modified version of PingPongBram.py
  • dataTransmit and dataClient copies from our repository; needed minor modifications for things like software register names
  • 'load': as measured both by the diagnostic code in dataClient, and by watching 'top'
  • 'MB/s Received': as measured by diagnostic code in dataClient
  • '# frames/packet': this is a number calculated and set in the firmware simply to take full advantage of the BRAM size (8192 B)
  • scan length : 10 seconds

Flux Rate (kHz)   # chans   # frames/packet   coadd   load (%)   MB/s Received   FITS MB   performance   notes
20                128       30                1       25         5.0             92        DONE          data client claimed it dropped 60 frames, but FITS file checks out (?)
20                64        60                1       20         2.6             50        DONE          -
20                64        60                2       11         1.2             24        DONE          -
20                64        60                4       5          0.6             13        DONE          -
10                64        60                1       11         1.3             24        DONE          -
5                 64        60                1       5          .65             12        DONE          -
30                64        60                1       20         3.9             71        DONE          -
35                64        60                1       24         4.5             83        DONE          -
40                64        60                1       25         5.1             95        ALERT!        client dropped 120 frames, because transmitter missed sending 2 packets
45                64        60                1       25         5.8             107       ALERT!        client dropped 240 frames, because transmitter missed sending 4 packets
50                64        60                1       25         5.4             100       ALERT!        everything goes to hell: transmitter misses tons of packets, client drops 123,000 frames - that's why the FITS file size actually drops; note also that after the scan ends it takes a while for the FITS file to finish writing.
20                32        113               1       14         1.3             26        DONE          it seems like the firmware gets confused when changing # chans and # frames/packet, and bof has to get restarted. TBF

6. Optimization Ideas

We don't even know if we need to optimize (improve performance) yet, but, in case we do, here are some ideas:

  • on the PPC, don't read the /proc files, but look into using 'mmap'; that would require understanding what's going on under the hood there
  • find out what's causing the bottleneck on the PPC: the reading of the /proc files, or the data transmit
  • on the client side, there are tricks that can be done when parsing the memory buffer, like rolling through 16 bytes at a time, but they probably won't buy us much (how we are parsing the data now is already pretty good)
  • send data over a 10 Gig connection instead of a 1 Gig ???
  • if the PPC has more than one core, use one for reading and another for transmitting the data. But it looks like it has only one.
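
To find out which step is the bottleneck, the read and the transmit can be timed separately inside the same loop. The sketch below shows the idea only: time_stages is a hypothetical helper, and read_block and send_block are stand-ins for whatever actually reads the BRAM and sends the UDP packet; none of these names come from dataTransmit.

```python
import time


def time_stages(read_block, send_block, n_iterations):
    """Accumulate wall-clock time spent reading vs. sending.

    read_block() must return the data for one packet;
    send_block(data) must transmit it. Both are supplied by the caller.
    """
    read_s = 0.0
    send_s = 0.0
    for _ in range(n_iterations):
        t0 = time.perf_counter()
        data = read_block()
        t1 = time.perf_counter()
        send_block(data)
        t2 = time.perf_counter()
        read_s += t1 - t0
        send_s += t2 - t1
    return {"read_s": read_s, "send_s": send_s}


if __name__ == "__main__":
    # Dummy stages, just to show the call; replace with the real read/send.
    totals = time_stages(lambda: bytes(8192), lambda data: None, 1000)
    print(totals)
```

Whichever total dominates is the place to look first; if the read dominates, the mmap idea is worth pursuing, otherwise the transmit side is.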

7. To Do

  • General
    • DONE - PF - writing to software register (e.g. "arm") more than once from katcp
      • had to cast value and offset to uint32_t
    • DONE - PF - global control of variables (ex: num channels, num frames / packet & BRAM) which need to be known by data transmitter and client
      • use firmware software registers which can be read by both transmitter and client
    • DONE - PF - remove "variable constants" from code (constants in code but should be settable)
      • still being implemented in software but will use config file and/or software registers to get needed values; details below
  • DONE config file /home/sim/etc/config/Rcvr_MBA1_5_Xmit.conf
    • contents so far: roach name, roach port, bof file name, number of channels, ttl rate, udp port
  • firmware
    • DONE - PF - modify PingPongBram.py (python setup script) to use defaults from config file and add any 'set' methods needed for user configuration
      • Note that the bof file must be started by the tcpborphserver (at tcpborphserver startup or using the 'progdev' command in a katcp client) in order to get and set registers and shared memory
    • ALERT! - mock-up firmware: reset issues? shouldn't have to restart bof when changing ttl rate, num channels, etc.
      • There is currently no way to reset the counters, which interferes with the bram switching and the frame and clock counters
    • first frame counter should be zero (not one)
  • dataTransmit
    • DONE - PF - change to read number of channels & frames (packets) from firmware registers at startup
    • DONE - PF - add command line arguments to set UDP port (default 50000) and bof file (default ping_pong_bram_v4)
      • TBD: use popen command to determine what bof file is running?
    • DONE - PF - transmitting all the time? but then how to change number of channels?
      • main thread shares global var 'chans' with data_xmit thread using mutex lock to change value; data_xmit checks if value changed between bram reads and then recomputes the packet size
      • At ttl_rate of 20 kHz, max number of channels without data_xmit missing packets is 123.
    • DONE - PM - read more than 4080 bytes from an 8192 BRAM?
    • investigate using /dev instead of /proc for reading BRAM.
  • dataClient
    • DONE - PF - building with Makefile (using Ygor build system)
    • DONE - PF - user input (command line args): data dir (optional, default "/tmp"), scan number (optional, default 1), scan length (required)
    • DONE - PF - config file input: ROACH name and port, udp port
    • DONE - PF - read num_channels and num_packets from firmware
    • DONE - PM - get rid of redundant frame.h structure and simply use Rcvr_MBA1_5Data struct.
    • ReadData
      • DONE - PM - make more OO; first pass!
      • DONE - PM - watch for missed frame counters
      • DONE - PM - after scan, report on missed frames
      • DONE - PM - add error checking for missed frame counters in parsed data.
      • DONE - PM - write these missed frames and other diagnostics to FITS file
    • FITS file
      • config values (config file, user args to dataClient) written to FITS header
      • gathering meta-data for writing in FITS headers (see also OutputDataSpec)
      • DONE - PM - making FITS name unique timestamp (using getTimeOfDay and generateDatetimeName)
      • DONE - PM - finish writing timestamps for each packet
      • DONE - PM - finish writing out data columns
      • DONE - PM - create script for identifying problems with FITS files (ex: missed frame counters, etc.)
      • DONE - PF - how to dynamically define binary table (columns hard coded for now at 64 channels)
        • dataClient knows this from reading software registers in firmware - num_channels argument added to Rcvr_MBA1_5FitsIO constructor - not tested
      • DONE - PM - make the data columns (the actual data, not the counters and timestamps) into a single multi-dimensional column (see also ProjectNotes.mustangfits.pdf)
      • DONE - PM - why does frame counter dip every 2**16? plot it in fv to see what I mean. Turns out this was a minor bug in the firmware!
  • Performance
    • DONE - PM - measure performance (first pass, anyway; see section above) while varying many parameters (flux ttl, num channels, ...)
  • Test setup offsite
    • DONE - PM - working at UPenn on 'egret' by copying executables there and setting YGOR_TELESCOPE for config file for dataClient
    • JB - get working at NIST, should just need location of libcfitsio.so in library path
  • And FINALLY:
    • add client code (firmware setup, udp client, fits writer) to Rcvr_MBA manager
    • DONE - run dataTransmit automatically
      • done in tcpborphserver2 in mba mode (in pam_devel) BUT starts bof file automatically so this needs more work
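
For the ReadData and FITS-checking items above that deal with missed frame counters, the check amounts to something like the sketch below. This is not the dataClient code; find_missed_frames is a hypothetical helper, and it assumes the frame counter goes up by exactly one per frame and does not wrap during a scan.

```python
def find_missed_frames(counters):
    """Scan a sequence of frame counters for gaps.

    Returns (total_missed, gaps, out_of_order), where gaps is a list of
    (previous_counter, next_counter, n_missing) tuples and out_of_order
    lists (previous_counter, next_counter) pairs that did not increase.
    """
    gaps = []
    out_of_order = []
    total_missed = 0
    prev = None
    for current in counters:
        if prev is not None:
            if current > prev + 1:
                n_missing = current - prev - 1
                gaps.append((prev, current, n_missing))
                total_missed += n_missing
            elif current <= prev:
                # A repeat or a dip in the counter; report it separately.
                out_of_order.append((prev, current))
        prev = current
    return total_missed, gaps, out_of_order


if __name__ == "__main__":
    # One packet of 60 frames received, two packets lost, one more received.
    received = list(range(0, 60)) + list(range(180, 240))
    print(find_missed_frames(received))
```

With 60 frames per packet, two missed packets show up as a 120 frame gap, which is the situation noted in the 40 kHz row of Trial #2.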

-- PaulMarganian - 2013-06-26

Topic revision: 2013-12-05, PaulMarganian