Scratch space for thoughts on the KFPA data pipeline and processing issues.

I am attempting to turn this into a KFPA data pipeline design document prior to the review.

Resources:

  • BobGarwood - some small fraction of time, more likely to be available if the work benefits more than just the KFPA
  • 0.5 FTE for M&C software KFPA needs
  • Possible postdoc from U. Calgary for research into interesting components of the pipeline for a specific observing mode
  • Possible postdoc from U. Maryland who might help with items in the pipeline closer to the hardware. His interest is in applying DSP to astronomy.
  • Possible summer student or co-op student for displays of and interaction with large datasets and other aspects of the pipeline

My attempt to capture the contents of the whiteboard on 11/29/2007.

Basic data flow from science data to gridded images - parallelization is by feed output unless otherwise indicated

  • Backend produces data output
  • Individual M&C FITS files gathered together to produce a coherent data set with properly labeled data (sdfits does this now)
  • TP calibration and Tsys spectrum determined
    • output is a time-series of TP spectra with associated Tsys spectra. These could instead be weight spectra: the two are essentially equivalent when integration times are identical, but folding the integration time in here to produce a true weight might make it easier to combine differently sampled data later, if that is deemed necessary. Since something like that is probably useful later, it might be best to do it at this step.
    • output format could be SDFITS, or ASAP (CASA) - this output should be optional
    • If this is a calibration scan where the goal is to measure Scal (Tcal determined astronomically), then that result will be saved to the calibration database for later use
    • If this is not a calibration scan, then the most appropriate Scal spectrum is retrieved from the calibration database and used here to produce the Tsys spectrum.
    • User feedback may be necessary here to indicate how appropriate an Scal is. Does it look reasonable, should it be used?
  • The TP data and Tsys spectra are examined for RFI and other statistical flagging.
    • This is a potential bottleneck.
    • Parallelization here is by blocks of time since this needs to compare values between feeds and polarizations.
    • output is the same as identified in the previous step. This output is not optional, since user interaction at the next few steps will likely refine the calibration applied to this data for use in those steps (i.e. an iterative process starts here).
  • OFF scans are identified where appropriate (depends on observing mode)
    • designated off pointings
    • synthesized off scan from the data
      • there should be a default synthesis using captured user intent
      • interactive identification of appropriate off regions will be necessary to refine this step
    • OFFs may be processed before being used (e.g. smoothing, fitting a polynomial and using that for the off, etc)
    • OFFs are associated with the calibration database
  • OFFs are used (various ways of using the offs need to be explored - nearest, interpolated, other?) along with other calibration information (efficiencies and opacities) to produce a roughly calibrated time-series data set that is ready for gridding onto an image
    • This step also covers the in-band frequency-switched case, where the names "off" and "on" are arbitrary but the step is functionally the same as using a sky OFF in position-switched data.
    • output format should be matched to the default gridder
      • Initially ACSIS format so that their gridder and related tools can be used
    • Translators to other formats for other gridders will likely be expected
      • AIPS, CASA, SDFITS, other
  • More data editing may happen here
    • editing (flagging) of the calibrated data or even the TP data using the calibrated data (statistical and visual and command line for experts)
    • Using data crossing points to refine calibration (related to basketweaving although the exact pattern will likely be different)
      • this is likely to be computationally intensive and it's unclear how best to parallelize this.
  • Grid the data
    • ACSIS gridder will be used here
      • Has the appropriate gridding function
      • Has a well-tested user interface
      • They have offered some support in understanding their data format and the conversion step and dealing with the learning curve
    • Other gridders could be substituted here by providing a data translation tool or by changing the output format of the previous step
  • Iterate until happy with data editing, flagging, appropriate offs and calibration
  • Take the resulting WCS FITS cube to whatever analysis tool you want.
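
A rough sketch of the "use the OFFs" step above, to make the nearest vs. interpolated choice concrete. It assumes the usual single-dish relation Ta = Tsys * (ON - OFF) / OFF applied per channel, and it leaves out efficiencies and opacities entirely. The function names and the plain-list spectrum representation are hypothetical, chosen only for illustration; a real implementation would presumably work on arrays.

```python
from bisect import bisect_left
from typing import List, Sequence

Spectrum = List[float]  # one value per channel (illustrative representation)


def nearest_off(t: float, off_times: Sequence[float],
                offs: Sequence[Spectrum]) -> Spectrum:
    """Return a copy of the OFF spectrum whose timestamp is closest to t."""
    i = min(range(len(off_times)), key=lambda k: abs(off_times[k] - t))
    return list(offs[i])


def interpolated_off(t: float, off_times: Sequence[float],
                     offs: Sequence[Spectrum]) -> Spectrum:
    """Linearly interpolate, channel by channel, between the OFFs bracketing t.

    off_times must be sorted ascending. Times outside the covered range
    fall back to the first or last OFF (no extrapolation).
    """
    if t <= off_times[0]:
        return list(offs[0])
    if t >= off_times[-1]:
        return list(offs[-1])
    hi = bisect_left(off_times, t)   # first OFF at or after t
    lo = hi - 1                      # last OFF strictly before t
    frac = (t - off_times[lo]) / (off_times[hi] - off_times[lo])
    return [a + frac * (b - a) for a, b in zip(offs[lo], offs[hi])]


def calibrate(on: Spectrum, off: Spectrum, tsys: Spectrum) -> Spectrum:
    """Roughly calibrated spectrum: Tsys * (ON - OFF) / OFF per channel."""
    return [ts * (s - r) / r for s, r, ts in zip(on, off, tsys)]
```

Keeping the OFF selection separate from `calibrate` means that a different strategy (a smoothed OFF, a polynomial fit, a synthesized OFF) can be swapped in during the iterative loop without touching the calibration arithmetic.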

Broad summary

  • Sequence of fast samples of the region of interest leads to
  • Sequence of WCS FITS cubes with associated weight cubes (JCMT calls these variances, I'm pretty sure)
  • "coadd" these or mosaic them if they do not cover the same area
    • This might also be done by simply multiplying the ongoing image by the weights and continuing to grid to that same image and weights
    • The images from the individual fast samples should be kept because they will be useful for identifying problem data
  • Take the final image to your favorite image analysis tool(s)
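
A minimal sketch of the weighted "coadd" idea above: multiply the ongoing image by its weights, add the new weighted data, and renormalize. It assumes both cubes are already on the same grid and have been flattened to lists, and that weights follow the radiometer equation (noise variance proportional to Tsys^2 / integration time, with the common channel-width factor dropped). The function names are hypothetical.

```python
from typing import List, Tuple


def integration_weight(t_int: float, tsys: float) -> float:
    """Weight for one sample: inverse of a variance that scales as Tsys^2 / t_int."""
    return t_int / (tsys * tsys)


def coadd(image: List[float], weight: List[float],
          new_image: List[float], new_weight: List[float]
          ) -> Tuple[List[float], List[float]]:
    """Combine two flattened cubes into a weighted average and a summed weight cube.

    Pixels with zero total weight (no data from either cube) are set to 0.0.
    """
    out_img: List[float] = []
    out_wt: List[float] = []
    for a, wa, b, wb in zip(image, weight, new_image, new_weight):
        w = wa + wb
        out_wt.append(w)
        out_img.append((a * wa + b * wb) / w if w > 0 else 0.0)
    return out_img, out_wt
```

Because the result carries its own summed weights, the same function can be applied repeatedly as each fast sample arrives, while the individual images are kept on the side for diagnosing problem data.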

Issues needing more investigation, research

  • Capturing user intent
    • type of scan (map, calibration for Scal, off, etc)
    • tie together related scans - possibly even between scheduling blocks and sessions
    • type of switching (frequency switched)
    • desired parameters of image
    • desired parameters of final image if this is one segment of a larger image
    • default off locations for synthesizing the first pass of the off
    • Perhaps a few other tunable parts of the default pipeline (e.g. smoothing or polynomial fitting to determine offs)
  • Calibration database
    • Needs to be designed
    • includes Scal (Tcal determined from astronomical sources)
    • includes lab Tcal values for case where Scal is not available
    • Efficiencies
    • opacities
      • default
      • determined from weather predictions
      • determined from tipping scans
  • Basketweaving calibration
  • Interacting with large data, especially for real-time displays
    • displays
    • zooming
    • editing
  • It would be nice to be able to look at an image and use it to select data that contributed to a specific region of that image (also in frequency space) to locate and edit/flag that data so that an improved image can be generated
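
Since the calibration database still needs to be designed, the following is only a sketch of one possible lookup: take the approved Scal entry nearest in time for a given feed and polarization, and fall back to the lab Tcal when nothing suitable exists. Every name here (`ScalEntry`, `best_scal`, `tcal_for`, the `approved` flag standing in for user feedback, the `max_age` cutoff) is an assumption made for illustration, and "most appropriate" is reduced to "nearest in time".

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ScalEntry:
    feed: int
    pol: str
    time: float            # observation time, e.g. in days
    scal: List[float]      # Scal spectrum, one value per channel
    approved: bool = True  # user feedback: does this Scal look reasonable?


def best_scal(db: List[ScalEntry], feed: int, pol: str,
              time: float, max_age: float) -> Optional[ScalEntry]:
    """Nearest-in-time approved entry for this feed/pol, or None if none qualifies."""
    candidates = [e for e in db
                  if e.feed == feed and e.pol == pol and e.approved
                  and abs(e.time - time) <= max_age]
    if not candidates:
        return None
    return min(candidates, key=lambda e: abs(e.time - time))


def tcal_for(db: List[ScalEntry],
             lab_tcal: Dict[Tuple[int, str], List[float]],
             feed: int, pol: str, time: float,
             max_age: float = 30.0) -> List[float]:
    """Use the astronomically determined Scal when available, else the lab Tcal."""
    entry = best_scal(db, feed, pol, time, max_age)
    return entry.scal if entry is not None else lab_tcal[(feed, pol)]
```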

Other thoughts

  • GAIA is a promising visualization tool
  • Even if parallel processing isn't necessary to keep up with the data rate for the default processing, the iteration steps need to be as fast as possible and so those will benefit from any available parallel processing.
  • Parallel processing should be designed in from the start, although the initial pipeline will likely not need it to keep up with the data rates.
  • This pipeline is generally useful to all data coming from the GBT. Individual components should be developed with that in mind if possible.
  • "data capture" should be extended beyond its current definition (raw data with no additional processing) to the step just upstream of the use of "offs". This is where the user-driven iterative loop is likely to start.
  • Every step should have a reasonable default so that it will be possible to go directly to the gridded quick-look images without user interaction.
  • The pipeline will be written in Python at least up to the gridding step.
  • We could use current K-band single-pixel data taken with the spectrometer to help with simulation during the development of the pipeline.
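
One way to design parallel processing in from the start without requiring it: write each per-feed step as a plain function and route it through a small dispatcher that runs serially by default and accepts an executor when one is available. This is only a sketch; `run_per_feed` and its arguments are hypothetical, and it covers only the steps that parallelize by feed (not the flagging step, which is by blocks of time).

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_per_feed(step: Callable[[T], R],
                 data_by_feed: Dict[int, T],
                 executor: Optional[Executor] = None) -> Dict[int, R]:
    """Apply one pipeline step to each feed's data independently.

    With no executor the feeds are processed serially; with one, each feed
    is submitted as a separate task and the results are gathered by feed.
    """
    if executor is None:
        return {feed: step(data) for feed, data in data_by_feed.items()}
    futures = {feed: executor.submit(step, data)
               for feed, data in data_by_feed.items()}
    return {feed: fut.result() for feed, fut in futures.items()}
```

Any `concurrent.futures` executor can be passed in, for example `ThreadPoolExecutor(max_workers=7)`, so the default quick-look path and a faster iterative path can share the same step functions.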

-- BobGarwood - 29 Nov 2007
Topic attachments

  • updated_kpfa.pdf (4 MB, 2007-11-29 13:52, PaulMarganian) - Picture of whiteboard that this discussion refers to
Topic revision: r8 - 2016-06-08, PatrickMurphy