KFPA Data Analysis Meeting - June 23-25, 2008
Purpose and Goals
This is the first face-to-face meeting concerning the KFPA data pipeline and data analysis needs since the KFPA workshop in November 2007. After that workshop, a small group of attendees sketched out a possible data pipeline design, which was turned into a wiki document for presentation at the CDR in February 2008. This meeting will focus on the following issues associated with the KFPA data pipeline (these are explained in more detail later in this document).
- Spectral line observing modes
- Pipeline and other data processing requirements
- Pipeline design
- Pipeline infrastructure
- "New" calibration strategies
- Parallel processing
- Current hardware needs
This is a working meeting, and as such there will be little formal agenda. The meeting will begin with a presentation by D.J. Pisano of his draft spectral line observing modes and calibration strategies. Much of the first day is expected to be taken up with discussions of observing modes and calibration, as well as expectations for the output(s) of the KFPA data pipeline (requirements). Once the requirements and expectations have been covered, the meeting's focus will shift to modifying the pipeline design where necessary and to filling out as many of the basic elements of the design as can be covered in a few days. The primary goal is to ensure that the pipeline requirements and expectations are well understood by all, and that the basic elements of the pipeline design are sufficiently defined that work can begin on the individual parts of that design.
Logistics
- 9:30 AM to 5 PM each day in the Green Bank Auditorium
- There are two talks scheduled for the Auditorium that we will need to break for. We will break about 30 minutes before the start of each talk so that the room can be prepared.
- Monday, 11 AM: Colloquium by Gerrit Verschuur, "On the Morphological Similarities between Galactic Neutral Hydrogen Structure and WMAP Data and What They Reveal About Interstellar Physics"
- Tuesday, 3:30 PM: Technical Seminar by Dr. Danielle Kettle from U of Manchester, "RF and microwave research in Manchester (UK) including SKA and a Ka band multi-beam array"
- Lunch breaks at the usual time
- Snacks and drinks will be provided for breaks in mid-morning and mid-afternoon each day.
Note that I'm still fleshing this section out
Observing Modes and Calibration Strategies
Pipeline and Other Data Processing Requirements
- Works off-site
  - Will likely be difficult to support off-site in the general case due to lack of staff.
  - Should not be excluded by the design.
  - Needs to work away from Green Bank because of the location of the developers (CV, Calgary).
  - e2e/archive would like this to be available off-site for their use.
- Pipeline needs to keep up with the data rate.
- Data can be easily re-processed with different parameter settings.
- Individual components work outside of the pipeline (possibly with more options exposed)
- Pipeline scripts are easy to configure and use (astrid model) - don't expose much of the scripting language itself.
- "VNC" compatible
- Pipeline should have a GUI to configure the options and turn on/off various components (e.g. livedata)
- Quick-look requirements
  - Take shortcuts for speed
    - no vanVleck correction
    - nearest cell when gridding
    - default calibration
  - Visualization - this follows the Arecibo/alpha model
    - raw bandpasses
    - calibrated bandpasses
    - raw waterfall raster images
    - calibrated waterfall raster images
    - Options to freeze the display in time, zoom, freeze the y-axis, set the number of time steps to keep in the display, etc.
    - Other suggestions included a plot of Tsys vs. time
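The "nearest cell when gridding" shortcut above can be sketched as follows. This is a minimal illustration, assuming a regular map grid in pixel-aligned coordinates; it is not the pipeline's actual gridder, and all names are placeholders.

```python
import numpy as np

def grid_nearest(x, y, values, cell_size, nx, ny):
    """Quick-look gridder: drop each sample into the nearest map cell
    and average duplicates, skipping any convolution kernel entirely."""
    image = np.zeros((ny, nx))
    weight = np.zeros((ny, nx))
    # Nearest-cell index: round each (x, y) position to the closest
    # cell center, then clip to the map edges.
    ix = np.clip(np.round(x / cell_size).astype(int), 0, nx - 1)
    iy = np.clip(np.round(y / cell_size).astype(int), 0, ny - 1)
    # Accumulate data and hit counts cell by cell.
    np.add.at(image, (iy, ix), values)
    np.add.at(weight, (iy, ix), 1.0)
    # Average where any samples landed; leave empty cells at zero.
    filled = weight > 0
    image[filled] /= weight[filled]
    return image, weight
```

The weight array doubles as the coverage map, which matches the idea of carrying weight images alongside the output images.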
- What does the pipeline produce?
  - Whatever the user wants it to produce; i.e., the user can stop the pipeline at any step and take away a product they can use elsewhere.
  - Default: raw M&C FITS files, calibration information, flags file, calibrated data just prior to imaging, and output images.
  - The KFPA pipeline does not include data analysis tools - e.g. tools to align (shift) and average multiple spectra, or to align and coadd or mosaic output images.
  - The output images will include weight images, and both will be written as FITS image cubes.
  - The internal data format has yet to be determined at each step, but it should be possible to produce an sdfits table at each step if necessary. Is there some other format that would be more suited to publishing to the VO? What other format conversions are likely to be requested? Which of those should we support?
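As a sketch of what "an sdfits table at each step" could look like in memory, the following builds an SDFITS-style row structure as a NumPy structured array. The column set is an illustrative assumption - the notes above say the internal format is still undetermined.

```python
import numpy as np

def make_sdfits_rows(n_rows, n_chan):
    """Sketch of an SDFITS-style row table that an intermediate
    pipeline step could materialize on request.  Column names loosely
    follow the single-dish FITS convention; the real set is TBD."""
    dtype = np.dtype([
        ("OBJECT", "U32"),          # source name
        ("FEED", "i4"),             # feed number
        ("PLNUM", "i4"),            # polarization number
        ("TSYS", "f8"),             # system temperature (K)
        ("DATA", "f4", (n_chan,)),  # one spectrum per row
    ])
    return np.zeros(n_rows, dtype=dtype)
```

A structured array like this maps directly onto a FITS binary table, so converting an intermediate product to sdfits on demand would be a thin layer over it.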
- What is archived?
  - Raw M&C FITS files.
  - Output FITS image cubes. Calibrated data (just prior to the gridding step)?
  - Need clarification from e2e on what they would like in an ideal world.
- We know what the data rate is going to be. Can we estimate the data volume over a typical observing season?
  - What fraction of the observing time is this instrument likely to get, and how much of that time will be spent taking data at the full rate?
  - Does the scientifically necessary data rate depend on the observing mode? Do you need to dump at the fastest rate for an Scal/Tcal scan as you do for a raster imaging scan? What about MX mode or other single-pointing modes? What about jiggle mode?
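The data-volume question above is simple arithmetic once the parameters are fixed. Every number in the example call below (feed count, channel count, sample size, dump rate, session length) is a placeholder for discussion, not an instrument specification.

```python
def data_volume_gb(n_feeds, n_pols, n_chan, bytes_per_sample,
                   dumps_per_sec, hours):
    """Raw spectral data volume in GB, ignoring headers and overhead."""
    samples_per_dump = n_feeds * n_pols * n_chan
    total_bytes = (samples_per_dump * bytes_per_sample
                   * dumps_per_sec * hours * 3600)
    return total_bytes / 1e9

# Illustrative only: 7 feeds, 2 polarizations, 4096 channels,
# 4-byte samples, 1 dump per second, a 6-hour session.
session = data_volume_gb(7, 2, 4096, 4, 1.0, 6)  # about 5 GB
```

Scaling the per-session figure by the assumed number of sessions per season, and by the fraction of time spent at the full dump rate, would answer the season-long question.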
Pipeline Design
- What needs to be tweaked?
- Any major components missing?
- Is there a use case that breaks the design?
Pipeline Infrastructure
- Python has been chosen; now what?
  - Are these just scripts, or is the pipeline smarter than that?
  - Simple script to invoke each component in turn?
  - How to allow tweaking of control parameters?
  - Complications necessary to support parallelization - or is that left up to each component?
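A minimal sketch of the "simple script to invoke each component in turn" option, assuming components are plain Python callables and control parameters come from one dictionary. The component names and their behavior are placeholders, not actual pipeline components.

```python
def run_pipeline(data, components, params):
    """Invoke each pipeline component in turn, passing it the current
    data plus its own parameter sub-dictionary.  Stopping early (the
    'take a product away at any step' requirement) is just a matter of
    truncating the component list."""
    for name, func in components:
        data = func(data, **params.get(name, {}))
    return data

# Placeholder components standing in for calibration and flagging.
def calibrate(data, scale=1.0):
    return [x * scale for x in data]

def flag(data, threshold=100.0):
    return [x for x in data if x < threshold]

result = run_pipeline([1.0, 2.0, 500.0],
                      [("calibrate", calibrate), ("flag", flag)],
                      {"calibrate": {"scale": 2.0}})
```

Keeping all tweakable parameters in one dictionary keyed by component name means a configuration file or GUI only has to populate that dictionary, without exposing the scripting language itself.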
- Very basic issues:
  - code management system: CVS, because sparrow is in CVS?
  - code tree - separate from sparrow because of the need to redistribute?
  - What if we need/want to reuse parts of sparrow (e.g. sdfits)?
  - Is sdfits part of the pipeline, or the tool that creates the data that is then used by the pipeline (i.e. just upstream of the pipeline)? Whatever the answer, do we need to be able to distribute sdfits to satisfy the pipeline requirements?
sdfits produces single FITS files, which is probably not ideal for parallel processing, and the output files would certainly be too large if used as-is on the expected output of the KFPA over the course of an observing session. Are there better alternatives for the middle of the pipeline? The output product is probably still sdfits (if the ungridded, calibrated data is desired) or a FITS image cube, but that shouldn't limit us to those formats for ongoing processing within the pipeline.
"New" Calibration Strategies
Parallel Processing
- Probable bottlenecks
  - When is the processing of each sampler (feed + polarization) not independent from all of the other samplers?
  - In those cases, how can the problem be divided so that the parts can be processed independently?
- Focus on simple, easy parallelization cases first - independent data paths that can be handled at a high (e.g. application or pipeline-component) level using threads and concurrent Python. Address more complicated cases (e.g. deep within some application, where a different programming language may be necessary or desirable) as needed.
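For the simple case above - independent data paths handled at the component level - Python's standard concurrent.futures module can process samplers concurrently. The per-sampler "calibration" below is a stand-in for illustration, not a real pipeline step.

```python
from concurrent.futures import ThreadPoolExecutor

def process_sampler(sampler_id, spectrum):
    """Stand-in per-sampler step: each (feed, pol) data stream is
    independent, so workers need no coordination with each other."""
    mean = sum(spectrum) / len(spectrum)
    # Illustrative "bandpass removal": divide by the mean level.
    return sampler_id, [x / mean for x in spectrum]

def process_all(samplers, n_workers=4):
    """Run every independent sampler concurrently at the component
    level, collecting results keyed by sampler id."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(process_sampler, sid, spec)
                   for sid, spec in samplers.items()]
        return dict(f.result() for f in futures)
```

Swapping the thread pool for a process pool (or farming samplers out as separate component invocations) would follow the same shape, which is why handling parallelism at the pipeline-component level looks like the easy first case.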
Participants
The following people participated in at least some of the discussions (please add your name here if you've been inadvertently left off this list):
- Joe Brandt
- Patrick Brandt
- Bob Garwood
- D.J. Pisano
- Ron Maddalena
- Paul Marganian
- Toney Minter
- Karen O'Neil
- Roberto Ricci
- Amy Shelton