Implementing turboSETI in Near-Real-Time at the GBT
In the control room for the GBT, there are currently 64 BL compute nodes in use for data collection and processing. Typically, 8 compute nodes run at a time during an observing session, and each node gathers data from a different chunk of the spectrum, so the data products from the nodes must be spliced together after processing to form a continuous spectrum.
The first part of the project I am proposing here is to add turboSETI to the data reduction pipeline at the GBT, so that the SETI candidate information it provides is stored immediately alongside the spectral products already saved by the Breakthrough Listen pipeline. This requires running turboSETI on the data in each compute node directly after the observation, then writing a program to splice the turboSETI data products from the nodes together once that initial processing is complete.
This project will streamline and speed up the search for technosignatures in the BL pipeline. It could also allow possible detections of narrowband, Doppler-drifting technosignatures to be identified closer to the time of observation, which could improve the likelihood of a successful follow-up observation of a promising SETI candidate.
We are primarily using Trello to track to-do items. However, here's a list for quick reference:
- Compare output from running turboSETI on single nodes to running on spliced files
- Write program to splice UFUDs
- Deploy as part of post-reduction scripts -- needs to not run during observing
- Analyze output
Abbreviations for the turboSETI output files discussed on this page:
- spliced-filterbank dat (SFD) - output from turboSETI run on the spliced filterbanks
- unspliced-filterbank unspliced-dats (UFUD) - output from turboSETI run on the unspliced filterbanks
- unspliced-filterbank spliced-dat (UFSD) - UFUDs combined with the script that you'll be writing
Definitions courtesy of Steve Croft.
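As a rough illustration of what producing a UFSD could involve, here is a minimal sketch that merges the hit rows from several per-node .dat files into one table. It is not the project's splicing script. It assumes that header lines start with "#", that hit rows are whitespace-separated, that the first column is the hit number, and that the frequency sits in 0-based column 3. The names `read_hits`, `splice_dats`, and `FREQ_COL` are hypothetical.

```python
FREQ_COL = 3  # assumption: 0-based column holding the hit frequency in MHz


def read_hits(lines):
    """Split the lines of one .dat file into (header_lines, hit_rows)."""
    header, rows = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            header.append(line)
        else:
            rows.append(line.split())
    return header, rows


def splice_dats(per_node_lines, descending=True):
    """Merge hits from several nodes into one frequency-sorted table.

    The first node's header is kept as a placeholder; a real UFSD would
    need a header describing the full spliced frequency range.
    """
    header, merged = [], []
    for lines in per_node_lines:
        node_header, rows = read_hits(lines)
        if not header:
            header = node_header
        merged.extend(rows)
    merged.sort(key=lambda row: float(row[FREQ_COL]), reverse=descending)
    for number, row in enumerate(merged, start=1):
        row[0] = str(number)  # renumber so hit numbers are unique after merging
    return header, merged


if __name__ == "__main__":
    # Made-up rows standing in for two single-node UFUDs.
    node_a = ["# header A", "1\t0.1\t12.0\t8420.5", "2\t0.0\t30.0\t8419.9"]
    node_b = ["# header B", "1\t-0.2\t15.0\t8423.1"]
    header, hits = splice_dats([node_a, node_b])
    for hit in hits:
        print("\t".join(hit))
```

The sketch only concatenates and renumbers hits. Per-node fields such as coarse channel numbers and the header metadata would still need to be reconciled before a UFSD could be compared directly with an SFD.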
Software / Data Notes
Prior to starting analysis on our test data, I worked through this tutorial by Elan Lavie on my home laptop to get familiar with turboSETI and its associated tools: https://github.com/elanlavie/VoyagerTutorialRepository
I then installed turboSETI version 22.214.171.124 on the blpc1 machine in my own Anaconda environment (ewhite).
For the near-real-time turboSETI project, we wanted to look at some test data and compare the resulting SFDs with the results of combining UFUDs. We decided to use the Voyager 2020 X-Band data (located on the BL cluster), and performed the following steps to generate the files we need:
To create the UFUDs:
- Ran rawspec (https://github.com/UCBerkeleySETI/rawspec) on the .raw files for each compute node to create .fil (filterbank) files -- one for each of the 3 data products (high freq., high time, and mid resolution) for each of the 6 scans in the ABACAD cadence for each node.
- Ran turboSETI on each high-res .fil file to produce a separate .dat for each of the 6 scans in the cadence for each node (i.e., generated UFUDs).
To create the SFDs:
- Ran splice2 to create the 6 spliced filterbank files (one for each scan in the cadence) from the single-node filterbank files created by rawspec.
- Ran turboSETI on the 6 spliced filterbanks to create the SFDs.
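The turboSETI step on one node's files might be scripted roughly as follows. This is a sketch under assumptions, not a record of the commands actually run. It assumes the high-frequency-resolution product from rawspec ends in ".0000.fil", and the `max_drift` and `snr` values are placeholders. `high_res_products` and `search_file` are hypothetical helper names. `FindDoppler` is turboSETI's Python search class, imported lazily so the file-selection helper can be tried without turboSETI installed.

```python
from pathlib import Path


def high_res_products(directory):
    """Return the high-frequency-resolution filterbanks in a directory.

    Assumption: these are the files whose names end in '.0000.fil'.
    """
    return sorted(Path(directory).glob("*.0000.fil"))


def search_file(fil_path, out_dir, max_drift=4.0, snr=10.0):
    """Run a turboSETI Doppler search on one filterbank file.

    max_drift (Hz/s) and snr are illustrative placeholder values.
    """
    # Imported here so the rest of the sketch works without turboSETI.
    from turbo_seti.find_doppler.find_doppler import FindDoppler

    searcher = FindDoppler(
        str(fil_path), max_drift=max_drift, snr=snr, out_dir=str(out_dir)
    )
    searcher.search()  # writes the .dat (and log) output into out_dir


if __name__ == "__main__":
    # List what would be searched in the current directory; call
    # search_file(path, out_dir) on each entry to run the search.
    for path in high_res_products("."):
        print(path)
```

Running the same helper over the spliced filterbanks instead of the single-node files would give the SFDs, so one script could produce both sets of .dat files for comparison.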
After creating the UFUDs and SFDs as described in the section above, I created some histograms to compare the files' contents, using the IPython notebook plotting_ufuds.ipynb in this repository: https://github.com/ewhite42/bl-gbt-ewhite. A few brief notes on what I found:
- The histograms' data is from the first scan in the 6-scan cadence. I have the .dat files for the remaining 5 scans in the cadence on blpc1 and can create separate plots for them as well if needed.
- The UFUD plots are created by plotting the data from all of the UFUDs together on one diagram. For details of how this is done, see the first cell of the notebook mentioned above. The SFD plots are created by plotting the results from the spliced filterbank on a single diagram. In the unbinned plots, the number of bins equals the number of data points in the respective files.
- There seem to be more data points for the plots composed of the data from the UFUDs (3300 rows) than for the spliced .dats (1408 rows). Not sure why this is or how this will affect things.
- To check that there were no overlapping regions in the individual UFUDs, I created an array of the frequencies in order and took the differences between successive entries. No negative values were returned, which seems to indicate there is no frequency overlap.
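A variant of the overlap check in the last bullet can be sketched as below. This is not the notebook's code: instead of differencing one concatenated array, it compares the lowest and highest hit frequency from each node. The node labels and frequencies are made up, and `overlapping_pairs` is a hypothetical helper name.

```python
def overlapping_pairs(freqs_by_node):
    """Return pairs of nodes whose hit-frequency ranges overlap.

    freqs_by_node maps a node label to a list of hit frequencies (MHz).
    Nodes are ordered by their lowest frequency and each is compared with
    the next one, so an empty result means no two ranges overlap.
    """
    ranges = sorted(
        ((node, min(freqs), max(freqs)) for node, freqs in freqs_by_node.items() if freqs),
        key=lambda item: item[1],
    )
    overlaps = []
    for (node_a, _, high_a), (node_b, low_b, _) in zip(ranges, ranges[1:]):
        if low_b <= high_a:  # next range starts before this one ends
            overlaps.append((node_a, node_b))
    return overlaps


if __name__ == "__main__":
    # Made-up hit frequencies for two neighbouring nodes.
    example = {
        "node_a": [8400.0, 8401.7, 8402.9],
        "node_b": [8403.0, 8404.2, 8405.5],
    }
    print(overlapping_pairs(example))
```

Because this looks only at hit frequencies, it can show that the detected hits do not overlap between nodes, but it cannot rule out overlap between the nodes' full frequency coverage. That would require the band edges from the filterbank headers.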
I've pasted in the histogram plots below; UFUDs are on the left, SFDs on the right (note that you can see more zoomed-in versions by clicking on the files in the table at the bottom of the page).
Histograms -- No Binning
Histograms -- 500 Bins
Note the x-axis of the SNR plots should be labelled "log10(SNR)"; I'll correct this later.
Histograms -- 100 Bins