CASA Parallelization Meeting Minutes

Thursday October 16th, DSOC 280, 8:00AM MT

  • Polycom Video: 192.33.117.12##8110
  • Voice: 434-817-6524

Attendees:

  • Socorro: James, Rob, Kumar, (Urvashi), (Sanjay)
  • Charlottesville: N/A
  • Garching: Justo, Sandra

  • Apologies: N/A

Discussion

  • Review of Action Items from last meeting (see table below).
    • JG: Testing of OpenMPI infrastructure on AOC cluster, including integration with Torque. Revisions to MPI framework documentation.
      • Built a package with CASA and the required third-party packages (MPI, mpi4py, etc.). Used imaging script to move system libs for a portable package.
      • Ran several tests using multiple nodes and cores. Integration between OpenMPI and Torque was transparent, as both use the same hostfile format. No modification of the hostfile is required: OpenMPI just has to be pointed to it.
      • Ran the ALMAM100 regression. Identified a few issues with third-party packages (almawvr).
      • Some MSTransform issues were identified and fixed, after which the ALMAM100 regression ran cleanly.
      • Satisfied with OpenMPI tests.
      • Memory issues were identified with large data dictionaries. The workaround is to allocate defined memory buffers. The buffer size is currently fixed at 100 MB, but could be adjusted or made configurable.
        • ACTION: Justo to describe the issue over e-mail. Discussion to continue after meeting.
      • Ready to continue to integration: updating repositories, updating CMake.
    • ACTION: Rob to work with Darrell on B&T integration for test/stable package production.
    • SC: MSTransform and MMS structure documentation in advance of PWG meeting.
      • Trying the Cycle II data reduction script, but having applycal issues. May be similar to Sanjay's issues reported below. Requires further investigation.
        • Using simple cluster for the Cycle II data.
        • KG: If applycal is using the syscal table, we get a lock since all the sub-MSes point back to it.
          • Solution is not clear. Could make copies of the sub-tables, but would then end up with too many open files. The pointing and syscal tables are single copies, and only read-only access is needed.
          • Opening all sub-tables as read-only may be a fix.
          • ACTION: Kumar to talk to Jim about table system implementation.
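
The OpenMPI/Torque hostfile integration reported by JG above can be illustrated with a minimal sketch. It is not the mpi4casa implementation. It assumes Torque's usual convention of one hostname line per allocated core in the file named by the PBS_NODEFILE environment variable. The helper names read_nodefile and build_mpirun_command, and the program being launched, are hypothetical. The mpirun options -hostfile and -n are standard Open MPI options.

```python
import os


def read_nodefile(path):
    """Return {hostname: slot_count} in first-seen order.

    Assumption: the file lists one hostname per allocated core,
    so a repeated hostname means more slots on that node.
    """
    slots = {}
    with open(path) as handle:
        for line in handle:
            host = line.strip()
            if host:  # skip blank lines
                slots[host] = slots.get(host, 0) + 1
    return slots


def build_mpirun_command(nodefile, program_args):
    """Build an mpirun argument list that points Open MPI at the node file.

    The node file is passed through unmodified; only the total process
    count is derived from it.
    """
    total_slots = sum(read_nodefile(nodefile).values())
    return ["mpirun", "-hostfile", nodefile, "-n", str(total_slots)] + list(program_args)


if __name__ == "__main__":
    # Inside a Torque job, PBS_NODEFILE names the node file.
    nodefile = os.environ.get("PBS_NODEFILE")
    if nodefile:
        # "my_parallel_program" is a placeholder, not a real CASA entry point.
        print(" ".join(build_mpirun_command(nodefile, ["my_parallel_program"])))
```

The command is only printed here, so the sketch can be inspected without launching anything on the cluster.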

  • Imager & Testing Updates
    • Trying to do full parallel processing of a full data set. Summary: not really ready for prime time, problems at almost any stage.
      • Locking vs too many files issues. Address this first.
      • Need a robust, deep solution to some of these issues before proceeding.
      • Get the testing group to work on mstransform.
        • UR: May not be ready to let testers loose.
      • Address issues methodically and ensure they are fixed robustly. Let's not go scattershot.
    • Testing for wide-band mosaic imaging using EVLA data at L-Band. The image being made is a 102-pointing mosaic image using WB A-Projection+MT-MFS. This is to account for instrumental time-, freq- and polarization-dependence (using WB A-Projection) as well as freq. dependence of the sky (using MT-MFS).
  • Parallel Imaging Tests:
    • So far touches imager framework, parallelization framework, calibration tool-chain, monolith-MS and MMS, plotms, mstransform
    • Normalization issues tested.
      • Finally works. These were numerical issues.
      • Tested to give the same result in parallel and serial execution (for natural weighting).
    • Imaging
      • Full deconvolution was done using read-only monolith MS.
        • Works and gives the expected performance.
          • Data splitting is done via selection along the time axis.
      • Had to debug several things; don't quite remember them all now.
        • Too many files open in the gather operation.
          • Addressed by careful splitting into pieces and by choosing the number of processes per node.
          • NOT FULLY SOLVED
        • Issues with the scatter operation, interactive masking, debugging code that produces conflicts in a parallel execution, data splitting without gaps, ...
    • SelfCal
      • Requires saving the MODEL_DATA. Virtual MODEL_DATA is not useful since computing is more expensive than disk i/o.
      • Multiple processes cannot write to the same MS. So the MS was split into sub-MSes.
        • Several issues with mstransform:
          • In general mstransform is not yet robust for everything it can be used for. Needs much more scientific testing (i.e. the CASA Testing Group needs to step in).
      • Selection on MMS
        • When selecting the sub-MSes, the VisIter wants to write the SORTED_TABLE to the disk. Found via gdb. This has lock contentions.
          • Fixed it by forcing it to make in-memory SORTED_TABLE only.
    • Imager issues with MMS:
      • Race condition between acquire and release locks.
        • Processes did not release their sub-MSes [Fixed]
      • Something else to do with the scatter operation; I don't recall precisely. [Fixed]
      • Re-start of imager from existing products [Fixed]
      • IPEngine starting/closing. Now follows the mstransform model.
      • Loads and/or computes avgPB and/or CFs more often than it needs to.
        • Requires work (not yet fixed)
    • Now using parallel framework, prediction takes 7 hrs. Serial run took 10 days.
    • Data looked OK, but the solutions did not. Finally realized that antsolver looks at the DATA column, not the CORRECTED_DATA column. The solutions, with pre-apply of the first calibration table, now look OK.
      • Need to figure out how to generate solutions automatically only for fields that have good SNR.
    • Splitting the CORRECTED_DATA and MODEL_DATA columns using mstransform on an MMS
      • Did not work. Reported the problem.
    • applycal()
      • applycal on sub-MSes throws various errors -- not sure if it works.
        • RuntimeError: Error in Calibrater::selectvis():MSSelectionNullSelection : The selected table has zero rows.
        • In a parallel run, this can happen and is a valid state. Reported the problem.
    • Parallel execution of tclean on MMS (which was required for selfcal) throws errors coming from parallel_go
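
Two points from the imaging notes above, splitting along the time axis without gaps and treating a zero-row selection as a valid state in a parallel run, can be illustrated with a small sketch. It is plain Python and does not use any CASA API. The function names split_time_range and assign_rows are hypothetical, and the use of half-open intervals is an assumption made so that adjacent chunks neither overlap nor leave gaps.

```python
def split_time_range(t_start, t_end, n_chunks):
    """Split [t_start, t_end] into n_chunks contiguous intervals.

    Each interval shares its upper edge with the next interval's lower
    edge, so there are no gaps between chunks.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    if t_end <= t_start:
        raise ValueError("t_end must be greater than t_start")
    span = t_end - t_start
    edges = [t_start + span * i / n_chunks for i in range(n_chunks)] + [t_end]
    return list(zip(edges[:-1], edges[1:]))


def assign_rows(times, chunks):
    """Return the row indices that fall in each chunk.

    Intervals are treated as half-open [lo, hi), except the last one,
    which also includes its upper edge. An empty list is a legitimate
    result: a worker may simply have no rows to process.
    """
    buckets = [[] for _ in chunks]
    last = len(chunks) - 1
    for row, t in enumerate(times):
        for i, (lo, hi) in enumerate(chunks):
            if lo <= t < hi or (i == last and t == hi):
                buckets[i].append(row)
                break
    return buckets


if __name__ == "__main__":
    chunks = split_time_range(0.0, 10.0, 4)
    for chunk, rows in zip(chunks, assign_rows([0.0, 1.0, 2.5, 9.0, 10.0], chunks)):
        # A worker with no rows reports that and moves on instead of failing.
        print(chunk, rows if rows else "no rows (valid in a parallel run)")
```

In this sketch an empty chunk is skipped rather than raised as an error, which is the behaviour the applycal report above asks for.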

  • Too many files vs locking
    • Option: bind files (to reduce the number of files).
      • Files are typically small. Use an in-memory storage manager?
      • ACTION: Rob - Make a jira ticket, track it, make progress (put on 4.4 target list). Ask Jim to try binding.
    • Sub-table locking: sub-tables are never edited while applying cal. There should be a mode to open the main table in write mode and the sub-tables in read mode so they can be shared. Ger is the keeper of the tables framework. Jim has the best understanding of it in our group.
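
The "too many files" side of this trade-off can be estimated up front. The sketch below is illustrative only: it reads the per-process open-file limit with Python's standard resource module (Unix only) and estimates how many pieces a single gathering process could hold open at once. The function name max_pieces_per_gather, the files_per_piece figure, and the reserve of descriptors kept for logs and libraries are all assumptions, not measurements from the imager.

```python
import resource


def max_pieces_per_gather(files_per_piece, soft_limit=None, reserve=64):
    """Estimate how many pieces one process can open simultaneously.

    files_per_piece: open files needed per piece (an assumed figure).
    soft_limit: per-process descriptor limit; read from the OS if None.
    reserve: descriptors held back for logs, libraries, etc. (assumed).
    Returns None when the limit is reported as unlimited.
    """
    if files_per_piece < 1:
        raise ValueError("files_per_piece must be at least 1")
    if soft_limit is None:
        soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return None
    usable = max(0, soft_limit - reserve)
    return usable // files_per_piece


if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)
    # 10 files per piece is a placeholder value for illustration.
    print("pieces per gather process:", max_pieces_per_gather(10))
```

A result of 0 means a single piece already exceeds the budget, which is the case where binding files or raising the limit becomes necessary.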

  • Preparations for Pipeline Working Group (Parallel Pipe Discussions)
    • (Skip since Lindsey is not available)

  • Anything else you would like to discuss (AOB)
    • UR: How to run multi-day runs of casapy on a cluster
      • We can't use casa-test or casa-stable as they may change unpredictably.
      • Run in a subshell: # (exec /home/casa/packages/RHEL5/test/casa-test-4.4.1/casa)
        • May not be as safe as prepending to PATH.
      • Prepend to PATH: # PATH=/home/casa/packages/RHEL5/test/casa-test-4.4.1:$PATH
      • Will work with the ipengines framework, which takes environment variables with it.
      • May not work with OpenMPI as implemented, but it can be addressed.
        • JG: Currently respecting .bashrc environment variables. ACTION: Will need to revisit - inherit from shell instead of .bashrc.
      • ACTION: Rob to follow up with K. Scott on EL5 vs EL6 paths to tars.
        • (submitted helpdesk ticket)
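
The "prepend to PATH" approach for pinning a specific CASA build during a multi-day run can be sketched in Python as well. This is a hedged illustration, not part of the ipengines or OpenMPI frameworks: the helper names pinned_environment and resolve_executable are hypothetical, and the directory is the one quoted in the discussion above. Only standard library calls (os.environ, shutil.which) are used.

```python
import os
import shutil


def pinned_environment(casa_dir, base_env=None):
    """Return a copy of the environment with casa_dir first on PATH.

    The caller's environment is not modified; the copy can be handed
    to child processes so they inherit the pinned version.
    """
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("PATH", "")
    env["PATH"] = casa_dir + os.pathsep + current if current else casa_dir
    return env


def resolve_executable(name, env):
    """Report which executable would be found first on the given PATH."""
    return shutil.which(name, path=env.get("PATH", ""))


if __name__ == "__main__":
    # Directory taken from the discussion above; adjust for the local install.
    env = pinned_environment("/home/casa/packages/RHEL5/test/casa-test-4.4.1")
    print("casa resolves to:", resolve_executable("casa", env))
```

Checking the resolved path before starting a long run confirms that the pinned build, rather than a moving casa-test or casa-stable link, is the one that will be picked up.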

  • Next Meeting
    • October 30th (two weeks)

Deferred Items for a future meeting

10/16/14:
  • Multiple xterms
  • Schedule for first OpenMPI enabled test/stable package.

08/14/14:
  • Logger
    • Single log file & concerns regarding ordering of entries.
    • Pipeline thinks of the log as a data product. Single file.
    • No synchronization at the C++ level amongst loggers, but no indication of log overlap in the current implementation. Overlap can happen with intense logging activity.
    • Improvements may not be urgent, revisit use cases/requirements when Sandra is getting ready to start work.

  • Imaging
    • Discuss requirement imposed on VI/VB2 to pass required information for MMS processing.

  • MSTransform
    • Discuss time separation axis for MMS creation.

Action Item List

| Item # | Date Opened | Description | Leads | Status | Status Notes |
| 02 | 8/07/14 | Review available documentation (on wiki), in particular the MPI document. | Lindsey | Partial | 9/11/14: Useful to complete before pipeline meeting. 8/21/14: Initial read, but still has questions. |
| 11 | 9/11/14 | Evaluate and characterize file descriptor limit issue in imager | James | Open | 9/11/14: Task for next week. |
| 12 | 9/11/14 | Basic test of mpi4casa on OSX 10.8 to characterize any multi-platform problems. | Justo | Open | 9/11/14: Don't need mpi to work, but just ensure changes we are considering do not create blockers for OSX builds. |
| 13 | 9/25/14 | Kumar to fix a bug in virtual model column not cleared at times | Kumar | Open |  |
| 14 | 9/25/14 | Sanjay (with help of James R) will get a parallel run with boosted limit on file descriptors | Sanjay, James | Open | 10/16/14: See notes from today's meeting. |
| 15 | 9/25/14 | Jim J to fix the open all files of all the columns after the release | Jim | Open |  |

Closed Item Record

| Item # | Date Opened | Description | Leads | Status | Status Notes |
| 01 | 8/07/14 | Update Issue Chart to reflect current status. | James | Closed | 8/14/14: Contributions from Kumar incorporated. Justo provided separate notes on tasks outstanding. 8/7/14: Contributions from others on the team also welcome. |
| 08 | 8/21/14 | Test mpi4casa integration on cluster. | Justo | Closed | 10/14/14: Completed and reported to group. Documentation updated based on experience. |
| 09 | 8/21/14 | Review openmpi features relative to requirements for preferred library. | Justo | Closed | 10/14/14: Completed. |
| 03 | 8/07/14 | Update reference doc links to point to MPI doc in SVN | Rob | Closed | 8/7/14: Complete. |
| 04 | 8/14/14 | Talk to James, Kumar, Justo and others and bring some resolution to preferred MPI library/implementation issue. | Rob | Closed | 8/21/14: Made MPI Implementation page. Will leave open until library selection finalized. 8/14/14: Have feedback from Justo, James, Kumar and Martin Pokorny. Will document and distribute. |
| 05 | 8/14/14 | Add new wiki pages for requirements capture, task list, and other project artifacts. Update based on recent meetings, then circulate for iteration by others. | Rob | Closed | 8/21/14: See main page. |
| 07 | 8/21/14 | Clarify MSTransform use cases with Jeff. | Rob | Closed |  |
| 06 | 8/14/14 | Evaluate feasibility of completing cvel2 for 4.3 release. | Sandra | Done | Committed to r31041, r31056 and r31057 |
| 10 | 9/11/14 | Documentation of MSTransform MMS functionality for users | Sandra | Closed | 10/15/14: Completed. 9/11/14: Target of 10/20/14 (pipeline meeting). |

-- RobSelina - 2014-09-10
Topic revision: r4 - 2014-10-16, RobSelina