TGBT15A_915_105 - Testing of "new manager, new HPC" software

Goals:

  • Run through all modes with VEGAS, L-band TP and SP, X-band TP only

Details

  • 7pm to ~ 8:50pm futzing around with scripts
  • 8:50pm - get started, but scripts are not quite right, online filler is not running
  • 9:00:19 - in Mode 1, get error "Abort due to scan terminating too early - Try cycling Vegas Off/On"
  • message clears itself, so try just running another scan?
  • Looks like it might have been caused by my aborting the scan - did this happen before? Either way, the error message is too alarming...
  • Try just mode 1, with no aborts.
  • That does seem to be the cause. But the scan should have finished and written its data - is it taking much longer to stop? Astrid says the scan was finished.
  • Getting lots of "Accessor on gbtdata.gbt.nrao.edu lost connection to Transporter on vegas-hpcN.gb.nrao.edu" messages - they clear themselves instantly.

  • 9:21pm start runRegress.sb, L-band total power 60 sec tracks
  • Mode 1 is scan 10
  • Converter Module Attenuator 16 is dodgy
  • Scan 14 didn't get filled - why not? Nor did scan 18
  • 01:36 UT - all VEGAS banks turned off, and then recovered? Not obvious why.
  • Note: "Mode" label in CLEO is no longer legible
  • 10:12pm - total power track tests complete.

  • 10:15pm - check X-band configs
  • 02:13:38 UT - All managers turn off and on again (while setting configurations, but not doing anything...)
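
These notes mix local clock times (e.g. 10:15pm) with UT timestamps from the message logs (e.g. 02:13:38 UT). A minimal sketch for lining the two up, assuming the local times are Eastern Daylight Time (UT minus 4 hours, which is consistent with the entries above); the function name local_to_ut is hypothetical and not part of any GBT software:

```python
from datetime import datetime, timedelta, timezone

# Assumption: local times in these notes are EDT (UT - 4 h) in late July.
EDT = timezone(timedelta(hours=-4))

def local_to_ut(date_str, time_str):
    """Convert a local date ('2017-07-25') and time ('9:36:00pm') to UT."""
    local = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %I:%M:%S%p")
    return local.replace(tzinfo=EDT).astimezone(timezone.utc)

if __name__ == "__main__":
    # 10:13:38pm local on the 25th lands on the 26th in UT.
    print(local_to_ut("2017-07-25", "10:13:38pm").strftime("%Y-%m-%d %H:%M:%S UT"))
```

Note that evening entries dated the 25th local time fall on the 26th in UT, which matters when searching the message logs by date.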

  • 10:21pm start runRegress.sb, L-band frequency-switched 120sec scans
  • VEGAS knows we are frequency switching, but Status screen shows TPWCAL... - seems to update at the end?
  • online filler not filling again.... after mode 2 - picks up at mode 5
  • 02:36:22 UT Valon frequency 1500 MHz is not equal to that set - cleared after 1min 1 sec
  • 02:44:30 UT All banks turn themselves off, that then clears
  • keep watching through mode 10.

  • 10:50pm Modify runRegress.sb to do X band, OnOffs, 5 mins per position. Ask Greg to run it...

  • 8:00am - still running (has made it to X-band Mode 20). Various VEGAS error messages (times now in UT):
    • 06:46:54 UT Something died in band D
    • 08:21:17 UT Bank A,B,E,H - Aborting, deadline missed, arm time is in the past
    • 08:33:50 UT Vegas turned off, Bank A turned off
    • 09:47:15 UT Abort due to scan terminating too early
    • 10:13:12 UT Banks B through H turned off
    • 11:08:15 UT hc8 Valon frequency 1500 MHz is not equal to that set
  • Note: the system was not magically recovering from aborts as I initially thought; Greg restarted the SB after each one. But it seems like he could just resubmit?
  • "The Accessor lost connection to transporter..." messages are because someone is running a version of CLEO connected to an obsolete set-up; we can forget about these.
  • Filler seems to have skipped a number of scans, apparently at random...
  • Bob diagnoses this as due to the fact that the actual filenames are off by one second from the Scan Log filenames.
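
The off-by-one-second mismatch Bob describes can be illustrated with a tolerant lookup: given the timestamp recorded in the Scan Log, accept an actual file whose timestamp is within one second of it. This is only a sketch for diagnosis, not the filler's actual logic; the timestamp layout in STAMP_FORMAT and the function name find_with_tolerance are assumptions, and the example timestamps are made up.

```python
from datetime import datetime

# Assumed timestamp layout for illustration; not confirmed by these notes.
STAMP_FORMAT = "%Y_%m_%d_%H:%M:%S"

def find_with_tolerance(logged_stamp, actual_stamps, tolerance_s=1):
    """Return the actual timestamp closest to logged_stamp, or None if
    nothing lies within tolerance_s seconds of it."""
    logged = datetime.strptime(logged_stamp, STAMP_FORMAT)
    best = None  # (offset in seconds, matching timestamp string)
    for candidate in actual_stamps:
        when = datetime.strptime(candidate, STAMP_FORMAT)
        offset = abs((when - logged).total_seconds())
        if offset <= tolerance_s and (best is None or offset < best[0]):
            best = (offset, candidate)
    return None if best is None else best[1]
```

Comparing parsed times rather than strings means a one-second slip across a minute, hour, or day boundary is still caught. Running this over a night's Scan Log would also show whether the one-second offsets are always in the same direction.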

Tests of aborts, morning of 26th with Joe and Ray

  • Make a version of the script, runRegressAbortCheck, to do 10 sec OnOff
  • Scans 156/157 - as normal
  • Scan 158/159 - abort as soon as 158 is finished.
    • It seems like all data for 158 does get written as expected, and the "hung in abort" message comes because VEGAS gets two aborts in a row.
    • This seems new behavior, but Ray and Joe understand it - they will investigate further.
  • Scan 159/160 - same behavior, we don't catch anything in the CLEO display. Bank A was not turned on...
  • Scan 160/161 - same behavior, Ray captures some information....
  • Same tests directly through CLEO manager screen, Joe and Ray isolate info they want.

Summary (as of 27th July noon)

  • This includes notes from 25/26/27th July
  • Generally, tests were extremely smooth, and very promising.
  • Data quality nominally looks good, on preliminary investigation
  • Approximately 25 percent of the time, the actual VEGAS file names differ by one second from the filenames in the Scan Log FITS file.
    • This is a show-stopper in that the online filler (and anything which uses the Scan Log) cannot find the files
    • May be cosmetic, but may indicate an actual error (i.e. something getting the time wrong).
  • There appears to be new behavior, in that if you abort through Astrid, you get a bunch of "Manager hung in aborting" messages.
    • These clear, so in some sense again this is just cosmetic. But it looks alarming, and does seem to indicate an error in the behavior of the state machine.
    • Joe can reproduce this just with the VEGAS CLEO Screen, in the new and the old systems.
    • I checked, and it doesn't happen (or doesn't always happen) aborting through Astrid in the old system.
  • We got about four aborts in ten hours on July 25th and, I think, zero aborts in eight hours on July 26th.
  • I think the Operator recovered from the aborts by just resubmitting the SB, with no need to turn the Manager off and on.
  • The "VALON frequency" message is an old, known problem.
  • An HPC program apparently died at least once, in Bank D.
  • Still getting "arm time in the past", Manager turned off, and other messages.
    • Joe and / or Ray should check the message logs for 25th and 26th ET for all of the details.

-- RichardPrestage - 2017-07-25
Topic revision: r4 - 2017-07-27, RichardPrestage