Science Pipeline Workshop Notes (November 26-28, 2007)



Monday, November 26, 2007

Welcome/Introduction

Overview of the KFPA project

  • Program Personnel
  • Goals:
    • 7 Pixel array construction complete in 2 yrs.
    • Production use in 3 yrs.
  • Deliverables
    • Frontend - cryogenic package, modular downconverter, modular noise calibration, mechanical packaging.
    • Software - package for engineering and M&C; package for data analysis.
    • Cost per channel - reduce cost; know what it will cost to expand the array.
  • Milestones
  • baseline instrument specs
    • freq - 18 - 26.5 GHz
    • Trx - < 25K (75% of band), < 36K (entire band)
    • 7 beams
    • dual, circular polarization
  • Focal Plane coverage
    • beam spacing - 3 HPBWs
    • beam efficiency issues for the outermost elements.
    • compromises
  • key subsystems
  • Bandwidth Limitations
  • Future FPA Developments
    • Significantly enhanced spectrometer with this new array and the existing IF system.
    • New digital I.F. distribution system.
    • Expanded FPA.
    • Phased, expandable software development to match hardware capabilities.
  • IF System
    • Current analog IF system limits expansion.
    • Digitization at the antenna with filter/compression schemes is the probable solution for IF transmission.
  • Backend strawman
    • 3 GHz analog bandwidth
  • Software
    • similar systems do exist at other telescopes.
    • the challenge in software is to transfer those capabilities to the GBT.
  • Open Issues
  • Comments from UMass
    • hexagonal array - "plumbing issues" with bringing the IF in and out - modular and serviceable.

The GBT at K Band

  • Current Rx
  • Backends
    • DCR, Spectral Processor, Spectrometer (mostly used)
  • Observing Techniques
    • Doppler tracking, switching modes, observing types
  • Antenna
    • nearly perfect at K Band; aperture efficiency 65 - 58%, beam efficiency 89 - 79% across the band
  • Atmospheric Limitations
    • Opacity, winds, day/night time (not an issue at K band)
  • Precipitable Water/Cloud stats
    • precipitable water is under 5 mm 25% of the time
  • Atmospheric Conditions - Opacity
    • 15 - 30 K system contributions due to weather during the winter.
  • Forecasting
    • 4mm accuracy in 48 hrs prediction
    • 5K accuracy in 48 hr Tsys predictions
  • Relative Effective Tsys
    • Tsys normalized using the best Tsys at that frequency
    • Used to determine what percentage of the time conditions are favorable for observing at a given frequency (see the sketch after this list).
  • Winds cause a 5% loss of efficiency
  • RFI
    • 22.21 - 22.5 GHz: shared protected band
    • 17.7 - 20.2 GHz
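
A minimal sketch of the "relative effective Tsys" normalization described above (the exp(tau * airmass) weighting and all the numbers are illustrative assumptions, not measured GBT values):

    import numpy as np

    def relative_effective_tsys(tsys, tau, elevation_deg):
        """Tsys referenced above the atmosphere, normalized to the best value."""
        airmass = 1.0 / np.sin(np.radians(elevation_deg))
        tsys_eff = np.asarray(tsys) * np.exp(np.asarray(tau) * airmass)
        return tsys_eff / tsys_eff.min()   # 1.0 = best conditions seen

    # Fraction of time conditions are within a factor 2 of the best:
    rel = relative_effective_tsys([30, 35, 60, 45],         # Tsys, K
                                  [0.04, 0.05, 0.12, 0.08], # zenith opacity
                                  [60, 60, 60, 60])         # elevation, deg
    print((rel <= 2.0).mean())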

Science with the KFPA

  • Characteristics of the GBT
  • The high-frequency oversubscription factor is on the rise.
  • 5 large GBT projects (> 200 hrs)
  • 1.5% sampling with a 7 pixel array.
  • K Band had the highest demand for mapping in 2007.

Chemistry with the KFPA

  • Sources of large interstellar molecules
    • SgrB2 and TMC-1 are the primary sources for the Remijan et al. chemistry survey.
  • Chemistry of molecules discovered with the GBT
    • New molecules have opened new questions regarding their formation.
    • The GBT is "sensitive" to detecting "large" (>5 atoms) molecules.
    • It is important to understand the distribution and excitation of new molecules.
  • VLA/BIMA mapping projects.
  • Interstellar molecule "myths".
  • Why a GBT spectral line mapping array?
    • Weak Line Intensities
    • Low State Temperatures
    • Low-Energy Level Transitions
    • Information about Cloud Density, Kinematics, and Structure
    • No Interferometric Arrays

Continuum polarimetry

  • Scientific Motivation
    • Galactic Astrophysics - spinning dust emission in the Galactic ISM.
    • CMB
    • Extragalactic Sources
      • Not much known about extragalactic sources at 20 GHz.
  • Mapping Speed
    • for 50 MHz bandwidth - ~1 hour to reach 1 mJy (one sigma) over 1 sq degree
    • for 1800 MHz bandwidth - ~10 minutes
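
The scaling behind these mapping-speed numbers is the ideal radiometer equation, sigma = Tsys / sqrt(BW * t). A minimal sketch, working in temperature units (the Tsys value is an illustrative assumption, and the quoted times above also fold in overheads, gain, and the number of beams):

    def time_to_sensitivity(sigma_k, tsys_k, bw_hz):
        """Ideal radiometer equation: sigma = Tsys / sqrt(bw * t)."""
        return (tsys_k / sigma_k) ** 2 / bw_hz   # seconds per pointing

    t_narrow = time_to_sensitivity(1e-3, 30.0, 50e6)     # 50 MHz
    t_wide = time_to_sensitivity(1e-3, 30.0, 1800e6)     # 1800 MHz
    print(t_narrow / t_wide)   # 36: time per point drops linearly with BW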

High-redshift studies with the KFPA

Molecular clouds and star formation. I.

Molecular clouds and star formation. II.

Observing modes and array configuration

  • On-the-fly (OTF) observing
  • Why OTF?
    • efficiency in signal-to-noise
    • efficiency in telescope motion
    • better data consistency
  • OTF Sampling Issues
    • convolution functions; beam broadening/noise aliasing; sample at lambda/2D; the sampling interval only needs to be lambda/(d+D) (see the worked example after this list)
    • How much rope do you give to the observer?
  • OTF differential Doppler correction
    • spectra must be fully sampled in frequency so that the frequency axis can be resampled.
  • Observing Modes
    • Point-and-Shoot mapping
      • Frequency switching
    • Multi-beam nodding
    • OTF
  • OTF Scanning Patterns
    • Raster
    • Spiral
    • Hypocycloid
  • Array Feed configuration Issues
    • Array Rotation Capabilities?
      • Only necessary if point-and-shoot mapping with long integrations is a science requirement.
      • Array rotator on NRAO 8-Beam receiver considered a design flaw (mechanically unreliable).
      • APEX
        • No rotator on bolometer arrays.
        • 490 GHz spectral line array does have rotator.
    • Why have a feed rotator?
      • Point-and-Shoot while rotating to track along parallactic angle.
      • Attempting to find the edge of a region.
    • UMass prefers OTF over Point-and-shoot.
    • Proposal: Leave the rotator out for the prototype and consider it for the 61 element array.
  • Antenna Control Issues
    • Scanning efficiency
      • Minimize turnaround time; Multiple scanning patterns available.
  • Monitor and Control Issues
    • Data dump rate
      • Selectable to balance data volume with sampling necessities.
      • Selectable scanning patterns in variety of coordinate frames.
      • Tagging total power samples with "observing intent" information ( ON, OFF, etc.).
        • JCMT: Tagging is a major issue because tagging is done in the correlator. It's very hard to recover bad tags. There are also problems with tagging with a rotator.
      • Output to FITS format with (u,v) binary extension (formerly referred to as "uvdata" format) for portability to any imaging analysis package.

  • Feed rotator
    • Pro: full control of the array on the sky; don't have to make a complete map.
    • Con: expensive; complex; not required for some/most projects; can be added later; weight.
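
A worked example of the sampling criteria from the OTF Sampling Issues item above (D is the GBT aperture; the convolution-kernel aperture d is an illustrative assumption):

    import numpy as np

    RAD_TO_ARCSEC = 180.0 / np.pi * 3600.0

    freq_hz = 26.5e9              # top of the band (worst case)
    lam = 3.0e8 / freq_hz         # wavelength, ~1.1 cm
    D = 100.0                     # GBT aperture diameter, m
    d = 20.0                      # convolution-kernel aperture, m (assumed)

    print(lam / (2 * D) * RAD_TO_ARCSEC)   # lambda/2D    ~ 11.7 arcsec
    print(lam / (d + D) * RAD_TO_ARCSEC)   # lambda/(d+D) ~ 19.4 arcsec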

Discussion on configuration

Discussion on calibration

  • Calibration Issues
    • Tcal - Noise diode?
      • Number of diodes? Arrangement?
        • 1 per feed, shared by both polarizations? DONE
        • 1 for the array, shared by all feeds and polarizations?
      • Intensity?
        • 1 dB dynamic range for linearity: Tcal/Tsys < 0.25; <= 10 K
      • Feed, polarization, and frequency dependent
        • Varies by a few % on scales of a few MHz
        • Vector Tcal (freq, feed, pol)? -- wide bandwidth, multi-lines, continuum sources
      • Stable - on what time scale?
      • Usage
        • Blinking? Duty cycle?
      • How to determine? How often to measure? Stability?
        • Lab Hot-Cold loads
        • On-telescope hot-cold loads
        • Astronomical - standard astronomical calibrators
  • Opacity
    • Frequency and time dependent
    • Tippings
      • Requires eta_fss, Tatm, Tcal
      • Takes telescope time or ancillary radiometer
      • Measures...
      • Accuracy? 2-4 %
    • Forecasting
      • Accuracy? 2-4%
  • Efficiency
    • Elevation, frequency, feed dependent
    • Stable - Once modeled
    • Depends upon source size
      • eta_a - point sources
      • eta_beam - extended to first null
      • eta_fss - very extended
      • Deconvolution problem?
    • How to measure with sufficient accuracy?
      • Beam shape is important.
      • A specification on the dynamic range is needed to determine beam shape.
  • Frequency Calibration
    • Doppler track a single window
    • Not needed with current Spectrometer - single window
  • How to apply vector tau, eta, Tcal?
    • Use GBTIDL model (in development)?
      • Doppler track in software after the fact.
      • Current LO knows nothing about the antenna position.
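
A minimal sketch of applying a per-channel (vector) Tcal in the standard switched-power scheme (array names are assumptions; in practice Tsys is usually smoothed or averaged across the band before use, and the opacity/efficiency corrections discussed above are applied afterwards):

    import numpy as np

    def calibrate(sig_calon, sig_caloff, ref_calon, ref_caloff, tcal):
        """All inputs are vectors over frequency (one value per channel)."""
        tsys = tcal * ref_caloff / (ref_calon - ref_caloff) + tcal / 2.0
        ref = 0.5 * (ref_calon + ref_caloff)
        sig = 0.5 * (sig_calon + sig_caloff)
        ta = tsys * (sig - ref) / ref          # antenna temperature, K
        return ta, tsys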

Discussion on polarimetry

  • Calibration signal to go to both polarizations to determine phase changes.
  • Calculate a Mueller matrix for each element.
  • Off axis beams will have worse asymmetry than the center beams.
  • Cross polarization terms could get as high as 10%.
  • Proposal: Drop one of the polarizations to improve Tsys?
    • If the continuum performance would improve significantly, yes.
    • Will removing the OMT help Tsys at K-band?
    • VLBI experiments need dual polarization.
    • Lost polarimetry in Ka-Band.
    • The only science case for dual polarization is polarimetry.
    • Better frequency coverage without the OMT.
      • There is a bandwidth limitation that is not due to the OMT.
    • Polarization is needed to identify RFI.
      • Correlated signals across feeds could be used instead.
  • Will the prototype be thrown away?
    • Not a prototype, but phase A.
  • If dual polarization is dropped, could you have 16 elements?
    • No, more like 8 - 9 because of current backend limitations.
    • More integrations per pixel.
    • 2 more samplers.
  • Polarization vs Number of feeds.
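
A sketch of the per-element Mueller-matrix correction mentioned above (the matrix values are placeholders, not measurements of any KFPA element):

    import numpy as np

    measured = np.array([1.00, 0.08, 0.02, 0.00])   # [I, Q, U, V], one beam

    # Mueller matrix for this element, determined from calibrators;
    # the values here are placeholders only.
    M = np.array([[1.00,  0.02,  0.00, 0.00],
                  [0.03,  0.95,  0.05, 0.00],
                  [0.00, -0.05,  0.95, 0.02],
                  [0.00,  0.00, -0.02, 0.98]])

    true_stokes = np.linalg.solve(M, measured)   # invert S_meas = M @ S_true
    print(true_stokes)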

The KFPA and future GBT instrumentation

Tuesday, November 27th

Recap of previous day's science discussions

  • Frequency coverage should be pushed as high as possible: 27.5 - 28 GHz.
  • Dropping the feed rotator will compromise science (projects that require deep integrations where the array foot print has rotated on the sky). However, this might still be the way to go to avoid rotator complications for the prototype.
  • By dropping the OMT it might be possible to lower Tsys by ~15 K.
  • Calibration and observing modes are well understood.

State of GBT data reduction software

  • M&C produces "raw" FITS files
    • Set of files for each scan
    • "Static" files, fixed at start of scan
    • "Dynamic" files, grow as scans progress.
  • Backends
    • Spectral line: Spectrometer, Spectral Processor, Zpectrometer
    • Continuum: DCR, Mustang
    • Pulsar backends
  • Real-time monitoring
    • GFM - The GBT FITS Monitor
  • Data capture
    • sdfits
  • Data Reduction and Analysis
    • GBTIDL
      • IDL scripts
      • Modeled on package of Tom Bania
      • Heavily influenced by UniPOPS and the aips++ dish plotter
      • Primarily spectral line including raw zpectrometer data.
    • Aips++
      • Dish: Precursor to GBTIDL; has graphical flagging, statistical flagging, can use the aips++ imaging tool
      • GBT continuum tools: Can be used with DCR data, Calibration and imaging
    • Aips: Imaging spectral line GBT data
    • OBIT - continuum imaging
    • Pulsar - Scott Ransom's package. Others.

Pipeline talks

Crossley | Pipeline basics and VLA pipeline

  • Besides the data, you should think about outputting other things to help figure out what is going on in a pipeline. E.g. log files, etc.
  • standardized pipeline for the observatory (Tilanus). Tunability is great, but you need standardization for things like quick looks and for verifying data quality.
  • There are Python interfaces to AIPS. (Cotton)
  • It is important to gather user intent for a pipeline. So, you need to figure out what information is required of users.
  • What is a Pipeline?
    • links together programs that are otherwise used to reduce data interactively
    • it's implicit that there's something that came before the pipeline that was interactive.
  • Why make a Pipeline?
    • saves time!
    • it's consistent.
    • ease of use for non-experts; it's like a digital cookbook
  • how to build a Pipeline?
    • start simple
    • start interactive
    • find good default values
  • Outputs
    • consider, in addition to main data product, producing logs, etc.
    • users can then build on what pipeline did
  • how much control should be given to users?
    • depends on the user.
    • A compromise is to give lots of control, but with good initial defaults (see the sketch after this list).
  • Validating Outputs
    • necessarily interactive
    • provide easily viewable output with diagnostics
    • provide flagging capabilities
  • VLA Pipeline
    • distributed w/ Aips
    • used w/ survey
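
A minimal sketch of the advice above: start simple, chain the formerly interactive steps, expose knobs but ship good defaults, and log everything as a secondary data product (all stage names here are hypothetical):

    import logging

    def run_pipeline(data, stages, log_file="pipeline.log"):
        """Run each stage in order, logging parameters for later diagnosis."""
        logging.basicConfig(filename=log_file, level=logging.INFO)
        for name, func, params in stages:
            logging.info("stage %s, params %s", name, params)
            data = func(data, **params)
        return data

    # Good defaults, but every knob is still available to expert users:
    stages = [
        ("scale", lambda d, gain=1.0: [x * gain for x in d], {"gain": 2.0}),
        ("flag",  lambda d, clip=5.0: [x for x in d if abs(x) < clip], {}),
    ]
    print(run_pipeline([1.0, 2.0, 30.0], stages))   # -> [2.0, 4.0]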

Comment: a telescope should provide a standard output, and this may conflict with a configurable pipeline. These standard products are also important for Quick Looks.

Comment: this is a reduction script, not a pipeline!

Comment: (Cotton) POPS is horrible.

Debate: a pipeline tied to a control system as a black box vs. a data analysis tool.

comment: (Garwood) KFPA is the threshold where you need a pipeline to get Quick Looks; up to now, our instruments have produced data that can be reduced quickly enough and simply enough to get a quick look without a sophisticated pipeline.

Q: (Marganian) are the products of the VLA pipeline public? Yes, and there are plans to get them to the VO.

Cotton | OBIT

  • Built primarily for interferometry. Single dish OTF imaging package is experimental. If you want it to work, you have to ask Bill.
  • Support for DCR, CCB, and Mustang
  • Replaces POPS with a Python interface; POPS is very user-unfriendly.
  • Uses own version of SDFITS - OTF FITS data format
  • Reads GBT raw engineering FITS files.
  • This is a toolkit, that could be made into a pipeline
  • talks about reduction methods:
    • need to slew the telescope fast enough so that the sky is modulating faster than the background
      • need good servos
      • fast sampling
    • need to make redundant measurements of same sky along different trajectories, in order to separate sky from background
  • toolkit is C libraries bound to python
  • can invoke Aips tasks, but replaces POPS w/ Python
  • Obit creates its own version of a single SDFITS file w/ tables for original data, results, etc. (see the reading sketch after this list):
    • target
    • feed geometry
    • flagging
    • index (for fast searches)
    • data
    • total calib
    • incremented calib
  • Obit reads raw GBT FITS files
  • Single Dish component independent of Aips
  • C was chosen for the libraries because of a dislike of C++
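
A generic sketch of reading that kind of multi-table FITS file with astropy (the extension names below are assumptions based on the table list above, not Obit's actual names):

    from astropy.io import fits

    with fits.open("otf_data.fits") as hdul:
        hdul.info()                          # list all tables in the file
        data = hdul["OTFScanData"].data      # main data table (assumed name)
        flags = hdul["OTFFlag"].data         # flagging table (assumed name)
        print(data.columns.names)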

Gopal/Heyer | The FCRAO OTFTool

  • 32-pixel array (16 per polarization), 2 dewars
  • On the 14 m for several years, now headed for the LMT
  • Backends:
    • array of 64 spectrometers, BW: 50 - 25 MHz
    • array of 16 wide band filter banks
  • OTF Mapping was developed under the gun, while still doing "point & shoot" observations; "best thing they ever did".
  • OTF Raw Data
    • all data are taken between two 'bookend' reference scans
    • files have header and data sections
      • header: cal, position, etc.
      • data: 1024 * 32 * (ndump*2)
    • typical maps comprise tens of these files (hundreds of MB)
    • large maps are 10 - 100 GB
  • 14 m configuration
    • Linux server sends header info to the SQL DB, and data to a RAID 5 data server (500 GB)
    • every 24 hours, the RAID is cleaned up and backups are made to an LTO tape library (7 TB)
  • MySQL DB critical to OTF data collection. Allows users to query the acquired data via the OTFTOOL (GUI). OTFTOOL has basic data evaluation tools and produces FITS data cubes or GILDAS (CLASS) files (see the query sketch after this list).
  • 8.7 TB in 5 seasons of operation; fluctuates between ~1 GB and ~28 GB per day
  • OTFTOOL (GUI)
    • accesses DB to find data
    • includes basic evaluation tools
    • coadds and regrids into 1) FITS data cubes 2) Gildas files
  • OTFMAP (command line)
    • also coadds and regrids
    • embedded into scripts
  • Languages/Libraries:
    • Current: C, GTK+ 1.2, PGPLOT, cfitsio, Gildas libraries, Perl & python (cleanup job)
    • Future: Python, GTK+ 2, matplotlib, and PyFits
  • See slides for a graphic of how regridded data is produced from the DB, RAID 5, and tape library. Regridded data is saved for reuse, and users can use a number of packages to work on this regridded data.
  • Users use IDL (fcraoidl), AIPS, IRAF, CLASS, etc. for data reduction.
  • OTF Mapping is essential to acquisition of the highest quality data with focal plane arrays
    • Dewar rotation system less critical
  • Processing to final data cubes must be straightforward for users
    • Relational database facilitates management of large quantities of data
  • Constructing the data cubes and images is the first (and easiest) step to science goals
  • Both observers on a project and anyone else can currently access the project data. Passwords could be used to protect data during the proprietary period, but are not.
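
A sketch of the database-driven access pattern (sqlite stands in for MySQL so the snippet is self-contained; the schema is hypothetical, in the spirit of the header/data split described above):

    import sqlite3

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE scans (
        scan_id INTEGER PRIMARY KEY, source TEXT, utdate TEXT,
        ra_deg REAL, dec_deg REAL, data_file TEXT, byte_offset INTEGER)""")
    db.execute("INSERT INTO scans VALUES (1, 'TMC-1', '2007-11-26', "
               "70.4, 25.7, 'otf_0001.dat', 0)")

    # OTFTOOL-style query: all scans on a source since a given date.
    rows = db.execute("SELECT data_file, byte_offset FROM scans "
                      "WHERE source = ? AND utdate >= ?",
                      ("TMC-1", "2007-11-01")).fetchall()
    print(rows)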

Willis | Design and Implementation of the ACSIS Real-Time Data Reduction Pipeline

  • (Tilanus): what is being called a 'pipeline' we consider just a part of the system: a black-box the user doesn't see

  • Massive and complex data volume mandates pipeline for real time display and near-real time imaging products for visual inspection and quality control. Otherwise you are observing blind.
  • Initial concept - Beowulf cluster, produce gridded cubes with calibrated spectra which also feeds the real time displays
    • final system had 10 dual processor 2GHz PCs
    • parallel system needed because of
      • data rate vs available CPU speed
      • 32-bit memory limitations (image cube has to be spread over multiple machines)
  • ACSIS Implementation
    • use c++ class library provided by aips++
    • client c++ programs are connected by glish scripting and messaging system
    • communication takes place by means of events. Clients can send events directly to each other via 'links'; data doesn't go through the glish interpreter, so payloads can be huge.
    • See Aips++ Note 230 for the gory details
  • System can be configured to run on anything from a laptop to (at least) a 32 processor PC cluster
    • Upper limit is unknown
  • glish scripts read XML-based configuration files to load and link tasks together across the network
  • Can be used with any feed configuration from 1 to n where n has a maximum of ??
  • Reducer processes convert lags to spectra (see the sketch after this list), and write raw input data to disk in the form of aips++ Measurement Sets
  • Gridders write out (sub) data cubes as FITS files
  • See slide for overview of how Acsis, Ancillary data, Sync Tasks, Reducers, Gridders, Reduction Controls, and glish interact.
    • 16 Acsis correlators and ancillary data plug into sync tasks (using Drama?), one per machine
    • these sync tasks make glish links to the Reducers (programmable) that produce Aips++ measurement sets (one per machine, again)
    • the Reducers make glish links to Gridders, that produce FITS files.
    • Drama messages (XML?) interact with the Reduction Control, which, through glish, tell the Reducers how to reduce.
  • Glish scripts are 'fixed' but configurable w/ (large) XML files
  • Reducers are very complex
    • consist of nodes
    • they talk to each other using the Aips++ query language
    • they are spectra calculators (like GBTIDL?): they take in XML commands and perform RPN operations on 8K-channel vectors.
    • they're flexible:
      • can write new recipes
      • can create new nodes
  • Data displays
    • Image - continuously updates during observation and displays the spectrum through the entire image cube at current position of the mouse (kview)
    • Spectrum - shows all current spectra, can look at just one as well
    • Metadata - a tab for each piece of information, each tab has its own graph
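
A minimal sketch of the Reducers' first step, converting autocorrelation lags to a spectrum (Wiener-Khinchin theorem; a real system would also apply a lag window and quantization corrections):

    import numpy as np

    def lags_to_spectrum(lags):
        """Autocorrelation lags -> power spectrum via FFT.

        The ACF of a real signal is even, so mirror it before
        transforming; windowing and van Vleck correction omitted.
        """
        acf = np.concatenate([lags, lags[-2:0:-1]])   # even extension
        return np.fft.rfft(acf).real

    print(lags_to_spectrum(np.exp(-np.arange(64) / 10.0))[:5])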

Q: why Glish instead of NPI(?)? Q: isn't Glish dead? Willis claims to have email from Joe M. claiming that glish will be supported as long as someone is using it. He is informed that Joe is now hiding out in South America. He admits that, given the choice today, they'd probably be using a Python-based system.

Zwart/Tilanus | HARP Data Pipeline and LOFAR Real-Time Pipeline

  • Simulator is an important debugging tool. Creates configurable realistic spectra.
  • metadata is important
  • flexibility in the right places
  • data that is compressed or lost cannot be restored

Taylor/Dever | GALFACTS Pipeline

  • Processing software has to be fast - large data volume means lots of processing, re-processing
  • Parallelize as much as possible - multiple processes spawned on one or more machines by simple scripts (see the sketch after this list).
  • Aimed for automated, "push button" behavior for generation of data cubes. A single configuration file is carried through all stages of processing for creation of all data products.
  • Data products: output, intermediate, quality assessment
    • expressed as FITS files (CIMAFITS2 - Arecibo-specific variation of SDFITS), images, plots, and tables
    • Data dumped over Gigabit Ethernet from the spectrometer to a data server
  • Written in C/C++. Expected to port some of the routines into CASA as "tools" and "tasks"
  • Optimized for x86_64 and uses the AMD Core Math Library
  • They post their quick-look output to the web. Note: the quick look isn't intended as a real-time monitoring tool; there is other software for that. They also auto-generate RFI plots.
  • RFI: No software mitigation strategy. Detections are flagged and not used, no interpolation in time or frequency.
  • Imaging solely done in OTF mode. Uses basketweaving.
  • Data rate - 875 MB/sec with millisecond dumps
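
A sketch of the "multiple processes spawned by simple scripts" approach using Python's multiprocessing (the file names and the reduction step are hypothetical):

    from multiprocessing import Pool

    def reduce_chunk(filename):
        """Hypothetical per-file step (calibrate, flag, grid). Files are
        independent, so they parallelize trivially across cores."""
        return filename, "ok"

    if __name__ == "__main__":
        files = ["galfacts_%04d.fits" % i for i in range(8)]
        with Pool(processes=4) as pool:        # one worker per core
            for name, status in pool.map(reduce_chunk, files):
                print(name, status)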

Pipeline Demos | Same order as talks - not all pipelines will have a demo. Keep demos short. Will include a demo of Livedata if time allows.

VLA Pipeline

Obit

Joint Astronomy Centre

  • The pipeline is divorced from the data. Their pipeline is used for non-JCMT instruments. They call it a pipeline backbone.
  • Tour of pipeline recipes:
    • Recipes exist even for non-JCMT instruments.
    • recipes for a given instrument are very high-level commands
    • one can dig down further to view the primitives, and the commands get more specific: knowledge of the actual instrumentation is found here. All in Perl, with calls made to Starlink libraries (see the Python analogy after this list).
  • Comment: you shouldn't have to build your own Pipeline backbone.
  • Have tasks running all the time that are sent data for processing. Speeds up the pipeline by eliminating the need to start and stop transient tasks.
  • Input from the user is critical for the pipeline in order to determine what to do with the signal or even where to look for a signal.
  • The goal should be a pipeline with the minimum amount of user intervention necessary; ideally, it is a fully automated process. The more parameters you have to twiddle, the harder it is to capture the user's intent while doing the data reduction. If you can't capture intent, it is useless for an archive. (Jenness)
  • Get the experts involved early. This helps in easing the acceptance pains of using a new pipeline.
  • The pipeline is also available offline to users for data reduction.
  • The pipeline is written in Perl.
  • One pipeline exists for many instruments
    • All instrument data products are translated to a common format
  • Pipeline has no knobs available for Quick Look system. For data reduction, primitives and Perl scripts can be changed by user.
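
A Python analogy of the recipe/primitive structure described above (the JAC original is in Perl; all names here are hypothetical):

    import numpy as np

    # Primitives hold the instrument-specific knowledge; a recipe is
    # just a high-level ordered list of primitive names.
    PRIMITIVES = {
        "SUBTRACT_REFERENCE": lambda d: d - d.mean(),
        "REMOVE_BASELINE":    lambda d: d - np.median(d),
        "MAKE_CUBE":          lambda d: d.reshape(1, 1, -1),
    }
    RECIPE = ["SUBTRACT_REFERENCE", "REMOVE_BASELINE", "MAKE_CUBE"]

    data = np.array([1.0, 2.0, 3.0, 4.0])
    for step in RECIPE:              # the backbone just walks the recipe
        data = PRIMITIVES[step](data)
    print(data.shape)                # (1, 1, 4)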

Discussion

  • What should the data products be?
  • Long discussion ensues about Basket-Weaving.
  • Bob tries to get answers from group:
    • how fast does the pipeline have to be?
    • how big are the maps?
  • Discussion on polarization
  • Bob asks if there's any pipeline that's been discussed we can use
    • Remo suggests using theirs (can they read SDFITS files? Bob says it's not an issue)
    • do we want to build off GBTIDL? We lack imaging and handling of multiple pixels.
    • do we build off of CASA?

Dinner

  • Mashed Potatoes, Green Beans, Rolls, and Meat.

Wednesday, November 28th

Garwood | Review of yesterday's progress and unresolved issues

Technical Hurdles

  • Create a simulator for the 61-element array early to identify and resolve bottlenecks.
  • Ask Scott Ransom how he does distributed computing on his cluster.
  • ICE
  • Need to incorporate data quality checks in the pipeline, feeding commissioning and development experience back into these checks (see the sketch after this list).
  • Need to get observers to realize that a data reduction plan needs to be included in a proposal, especially if they need special resources (e.g. a cluster running custom software) in order to do it.
  • Need to carefully define the data products sent to the VO and taken home because we probably do not want to require observers to use a cluster in order to reduce their data. We also want to make it easy to transport the data, e.g. as small a data set as possible.
  • Tilanus strongly recommends that we have a single data reduction environment for KFPA observers. A slightly different alternative is to strongly support a single environment, but provide paths for observers to at least get their data into other packages.
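
A sketch of the kind of in-pipeline quality check meant here (the thresholds are illustrative assumptions that would be tuned from commissioning experience):

    import numpy as np

    def quality_checks(spectrum, tsys, max_tsys=80.0, spike_sigma=10.0):
        """Return a list of QA warnings for one calibrated spectrum."""
        problems = []
        if tsys > max_tsys:
            problems.append("Tsys %.0f K exceeds %.0f K" % (tsys, max_tsys))
        resid = spectrum - np.median(spectrum)
        mad = 1.4826 * np.median(np.abs(resid))   # robust sigma estimate
        if mad > 0 and np.max(np.abs(resid)) > spike_sigma * mad:
            problems.append("possible RFI spike or bad channel")
        return problems

    print(quality_checks(np.array([0.1, 0.2, 50.0, 0.1]), tsys=35.0))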

Non-technical Hurdles

The User Experience

Data Formats (throughout pipeline and final)

Archiving

Data Visualization

Tilanus | Demo of GAIA. Other unique visualization tools could be demoed here if there is interest.

Where do we go from here?

Lessons Learned and Advice from Tim Jenness

  • Disks are the limiting factor at the JCMT. Their I/O is not multithreaded.
  • Again, capturing user intent is key!
  • Pipelines get more complicated with capability, e.g. more observing modes.
  • It is good to have persistence in the pipeline to increase robustness and reliability.
  • Moving from files on disk to a database. This is possible because they abstracted the input and can easily change the input source without rewriting the code.
  • Quick start: sdfits -> acsis command (requires calibrated spectra). Then use ACSIS pipeline (plus add some of our instrument specific stuff) to produce data cube and then can view in Gaia.
    • Could start work right now on existing single pixel receiver spectrometer data as proof of concept and first steps toward a KFPA pipeline.
  • ALMA single dish pipeline - Chris Wilson
  • Have invitation to go to the JCMT and develop sdfits -> acsis
