Basic data flow from science data to gridded images - parallelization is by feed output unless otherwise indicated
Backend produces data output
Individual M&C FITS files gathered together to produce a coherent data set with properly labeled data (sdfits does this now)
TP calibration and Tsys spectrum determined
output is a time series of TP spectra with associated Tsys spectra. (These could instead be weight spectra, which are essentially equivalent for identical integration times; but if integration time is folded in here to produce a true weight, it might be easier to combine differently sampled data after this step if that were deemed necessary. Since something like that is probably useful later, it might be best to just do that at this step.)
output format could be SDFITS, or ASAP (CASA) - this output should be optional
If this is a calibration scan whose goal is to measure Scal (Tcal determined astronomically), then that result will be saved to a calibration database for later use
If this is not a calibration scan, then the most appropriate Scal spectrum is retrieved from the calibration database and used here to produce the Tsys spectrum.
User feedback may be necessary here to indicate whether an Scal is appropriate. Does it look reasonable? Should it be used?
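As a sketch of what the Tsys determination above might look like (the usual diode-switched form; the function and variable names are illustrative, not an existing interface):

```python
import numpy as np

def tsys_spectrum(cal_on, cal_off, tcal):
    """Per-channel Tsys from a cal-switched total-power pair.

    cal_on/cal_off are raw TP spectra with the noise diode firing
    and idle; tcal is the diode temperature (the Scal/Tcal spectrum
    retrieved from the calibration database, or a scalar lab value).
    """
    cal_on = np.asarray(cal_on, dtype=float)
    cal_off = np.asarray(cal_off, dtype=float)
    # Tsys = Tcal * P_caloff / (P_calon - P_caloff), per channel
    return tcal * cal_off / (cal_on - cal_off)

# A flat 10 K diode raising the total power by 10% implies Tsys ~ 100 K:
tsys = tsys_spectrum(np.full(4, 1.1), np.full(4, 1.0), 10.0)
```

For real data the same arithmetic would be applied integration by integration to build the TP time series.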
The TP data and Tsys spectra are examined for RFI and other statistical flagging.
This is a potential bottleneck.
Parallelization here is by blocks of time since this needs to compare values between feeds and polarizations.
output is the same as in the previous step. This output is not optional, since user interaction in the next few steps will likely refine the calibration applied to this data for use in those steps (i.e. an iterative process starts here).
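One simple channel-wise test of the kind this statistical flagging step might apply, sketched as a robust median/MAD threshold (purely illustrative; the real flagger will need to compare across feeds and polarizations as noted above):

```python
import numpy as np

def mad_flag(spectra, nsigma=5.0):
    """Flag outlier samples channel-by-channel with a median/MAD
    statistic; returns a boolean mask (True = flagged).

    spectra: 2-D array, time x channel, of TP or Tsys values.
    """
    spectra = np.asarray(spectra, dtype=float)
    med = np.median(spectra, axis=0)                 # per-channel median over time
    mad = np.median(np.abs(spectra - med), axis=0)
    sigma = 1.4826 * mad                             # MAD -> Gaussian-equivalent sigma
    sigma = np.where(sigma == 0, np.inf, sigma)      # guard against zero spread
    return np.abs(spectra - med) > nsigma * sigma
```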
OFF scans are identified where appropriate (depends on observing mode)
designated off pointings
synthesized off scan from the data
there should be a default synthesis using captured user intent
interactive identification of appropriate off regions will be necessary to refine this step
OFFs may be processed before being used (e.g. smoothing, fitting a polynomial and using that for the off, etc)
OFFs are associated with the calibration database
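A minimal sketch of OFF synthesis covering the two processing options mentioned above, averaging designated integrations and optionally replacing the average with a polynomial fit (names and defaults are placeholders):

```python
import numpy as np

def synthesize_off(tp, off_rows, polyorder=None):
    """Build a reference OFF spectrum from designated integrations.

    tp: time x channel TP data; off_rows: indices of integrations
    judged emission-free (from captured user intent, or refined
    interactively). If polyorder is given, the averaged OFF is
    replaced by a polynomial fit across the band to suppress its noise.
    """
    off = np.median(np.asarray(tp, dtype=float)[off_rows], axis=0)
    if polyorder is not None:
        x = np.arange(off.size)
        off = np.polyval(np.polyfit(x, off, polyorder), x)
    return off
```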
OFFs are used, along with other calibration information (efficiencies and opacities), to produce a roughly calibrated time-series data set that is ready for gridding onto an image. Various ways of using the OFFs need to be explored: nearest, interpolated, other?
This step also covers the in-band frequency-switched case, where the names "off" and "on" are arbitrary but this step is functionally the same as using a sky OFF in position-switched data.
output format should be matched to the default gridder
Initially ACSIS format so that their gridder and related tools can be used
Translators to other formats for other gridders will likely be expected
AIPS, CASA, SDFITS, other
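The rough calibration described above might look like the following per integration (the standard (ON-OFF)/OFF form with an opacity/efficiency scaling; the exact scale and correction terms are still to be decided, and these names are illustrative):

```python
import numpy as np

def calibrate(on, off, tsys, tau=0.0, airmass=1.0, eta=1.0):
    """Roughly calibrated antenna temperature for one integration.

    Ta = Tsys * (ON - OFF) / OFF, scaled by the atmospheric opacity
    (tau * airmass) and an efficiency eta to put the result on an
    approximate corrected-Ta scale. Which OFF to pass in (nearest,
    interpolated, ...) is decided before this call.
    """
    ta = tsys * (np.asarray(on, dtype=float) - off) / off
    return ta * np.exp(tau * airmass) / eta
```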
More data editing may happen here
editing (flagging) of the calibrated data or even the TP data using the calibrated data (statistical and visual and command line for experts)
Using data crossing points to refine calibration (related to basketweaving although the exact pattern will likely be different)
this is likely to be computationally intensive, and it is unclear how best to parallelize it.
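As one possible starting point for the crossing-point refinement, a simplified stand-in for basketweaving: solve a small least-squares problem for one additive offset per scan that minimizes the disagreement where scans cross (real crossings would come from the actual scan pattern; this is a sketch, not a proposed algorithm):

```python
import numpy as np

def crossing_offsets(crossings, nscans):
    """Per-scan additive offsets that best reconcile crossing points.

    crossings: list of (scan_i, scan_j, value_i, value_j) tuples,
    one per crossing point.
    """
    rows, rhs = [], []
    for i, j, vi, vj in crossings:
        r = np.zeros(nscans)
        r[i], r[j] = 1.0, -1.0
        rows.append(r)
        rhs.append(vj - vi)            # want (vi + oi) == (vj + oj)
    rows.append(np.ones(nscans))       # pin the mean offset to zero
    rhs.append(0.0)
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return sol
```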
Grid the data
ACSIS gridder will be used here
Has the appropriate gridding function
Has a well-tested user interface
They have offered some support in understanding their data format and the conversion step and dealing with the learning curve
Other gridders could be substituted here by providing a data translation tool or modifying what the output format is of the previous step
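To illustrate the kind of convolutional gridding with an accumulated weight plane that this step performs (a Gaussian kernel stands in here; this is not the ACSIS implementation, which applies its own tuned kernel per spectral channel):

```python
import numpy as np

def grid(samples, shape, fwhm=1.0):
    """Grid (x, y, value, weight) samples onto a regular image,
    accumulating a data plane and a weight plane, then normalizing."""
    image = np.zeros(shape)
    wtimg = np.zeros(shape)
    sigma = fwhm / 2.355
    yy, xx = np.indices(shape)
    for x, y, v, w in samples:
        k = w * np.exp(-((xx - x) ** 2 + (yy - y) ** 2) / (2 * sigma ** 2))
        image += k * v      # data convolved onto the grid
        wtimg += k          # matching weight accumulation
    # normalize where any weight landed; empty pixels become NaN
    out = np.where(wtimg > 0, image / np.where(wtimg == 0, 1.0, wtimg), np.nan)
    return out, wtimg
```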
Iterate until happy with data editing, flagging, appropriate offs and calibration
Take the resulting WCS FITS cube to whatever analysis tool you want.
Broad summary
Sequence of fast samples of the region of interest leads to
Sequence of WCS FITS cubes with associated weight cubes (JCMT calls these variances, I'm pretty sure)
"coadd" these or mosaic them if they do not cover the same area
This might also be done by simply multiplying the ongoing image by the weights and continuing to grid to that same image and weights
The images from the individual fast samples should be kept because they will be useful for identifying problem data
Take the final image to your favorite image analysis tool(s)
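The weighted combination described above ("coadd" or continue gridding into the running image and weights) reduces to sum(w*im)/sum(w); a minimal sketch:

```python
import numpy as np

def coadd(images, weights):
    """Weighted coadd of per-fast-sample images/cubes.

    Returns the combined image and the summed weight plane, so
    further samples can be folded in later by the same arithmetic
    (the 'multiply by weights and keep gridding' scheme)."""
    images = np.asarray(images, dtype=float)
    weights = np.asarray(weights, dtype=float)
    wsum = weights.sum(axis=0)
    num = (weights * images).sum(axis=0)
    out = np.where(wsum > 0, num / np.where(wsum == 0, 1.0, wsum), np.nan)
    return out, wsum
```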
Issues needing more investigation, research
Capturing user intent
type of scan (map, calibration for Scal, off, etc.)
tie together related scans - possibly even between scheduling blocks and sessions
type of switching (frequency switched)
desired parameters of image
desired parameters of final image if this is one segment of a larger image
default off locations for synthesizing the first pass of the off
Perhaps a few other tunable parts of the default pipeline (e.g. smoothing or polynomial fitting to determine offs)
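The captured intent items above could amount to a simple per-scan record; a sketch, with every field name a placeholder for whatever the design settles on:

```python
from dataclasses import dataclass, field

@dataclass
class ScanIntent:
    """Illustrative record of user intent captured with each scan."""
    scan_type: str                                        # "map", "scal", "off", ...
    related_scans: list = field(default_factory=list)     # possibly across sessions
    switching: str = "position"                           # or "frequency"
    image_params: dict = field(default_factory=dict)      # desired image parameters
    mosaic_params: dict = field(default_factory=dict)     # final image, if a segment
    default_off_regions: list = field(default_factory=list)
    pipeline_overrides: dict = field(default_factory=dict)  # e.g. smoothing, poly order
```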
Calibration database
Needs to be designed
includes Scal (Tcal determined from astronomical sources)
includes lab Tcal values for case where Scal is not available
Efficiencies
opacities
default
determined from weather predictions
determined from tipping scans
Basketweaving calibration
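One open question for the calibration database is how entries (Scal, OFFs, opacities) are associated with data; nearest-in-time retrieval is the simplest strategy (interpolation being the obvious alternative). A toy sketch of that lookup:

```python
import bisect

class CalDB:
    """Toy calibration store keyed by timestamp; retrieval returns
    the entry nearest in time (illustrative, not a design)."""

    def __init__(self):
        self._times, self._values = [], []

    def add(self, t, value):
        i = bisect.bisect(self._times, t)
        self._times.insert(i, t)
        self._values.insert(i, value)

    def nearest(self, t):
        i = bisect.bisect(self._times, t)
        cands = [j for j in (i - 1, i) if 0 <= j < len(self._times)]
        best = min(cands, key=lambda j: abs(self._times[j] - t))
        return self._values[best]
```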
Interacting with large data, especially for real-time displays
displays
zooming
editing
It would be nice to be able to look at an image and select the data that contributed to a specific region of that image (also in frequency space), in order to locate and edit/flag that data so that an improved image can be generated
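That image-to-data back-mapping becomes tractable if the gridder records which integrations contributed to each pixel; a sketch of such a bookkeeping structure (all names illustrative, not an existing interface):

```python
from collections import defaultdict

class ContribMap:
    """Track which integrations contributed to each image pixel so a
    region selected in the image can be mapped back to the underlying
    data for flagging and regridding."""

    def __init__(self):
        self._pix = defaultdict(set)

    def record(self, pixel, sample_id):
        # called by the gridder for every sample landing on a pixel
        self._pix[pixel].add(sample_id)

    def samples_in(self, pixels):
        # union of contributors over a selected region
        out = set()
        for p in pixels:
            out |= self._pix.get(p, set())
        return out
```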
Other thoughts
GAIA is a promising visualization tool
Even if parallel processing isn't necessary to keep up with the data rate for the default processing, the iteration steps need to be as fast as possible, so they will benefit from any available parallel processing.
Parallel processing should be designed in from the start, although the initial pipeline will likely not need it to keep up with the data rates.
This pipeline is generally useful to all data coming from the GBT. Individual components should be developed with that in mind if possible.
"data capture" should be extended beyond it's current definition (raw data with no additional processing) to the step just upstream of the use of "offs". This is where the user-driven iterative loop is likely to start.
Every step should have a reasonable default so that it will be possible to go directly to the gridded quick-look images without user interaction.
The pipeline will be written in Python at least up to the gridding step.
We could use current K-band single-pixel data taken with the spectrometer to help with simulation during the development of the pipeline.