10:00am - get logged in. Need to wait for Ray to switch to classic version. Encounter problems; perhaps it needed to be compiled for a different CUDA version?
10:40am - submit RunRegressLSP, with 10 sec scans. Boom is retracted, but Amber has lost control so SW status may be incorrect
UT 14:41:07 - Bank D "HPC program taking too long to be ready: 4803ms" on mode 4.
10:44am - resubmit
UT 14:46:15 - Bank D aborts again.
10:50am - resubmit
UT 14:54:27 - aborts due to detection of invalid data
Failure always occurs when trying to start Mode 4.
Ray runs scans independently of Astrid, with consistent failures. The manager on hpc4 had terminated.
Ray decides to switch to new HPC (Matrix version)
11:25am - submit RunRegressLSP, modified to start with Mode 4.
UT 15:26:31 - all banks apart from A turn off. In mode 10. Abort, and restart from here.
11:29am - submit RunRegressLSP, modified to start with Mode 10. This does not turn the banks back on; have to do this manually.
Mode 10 config causes all banks but A to turn off again.
Bank A thinks it is in mode 24, all others were (correctly) in Mode 9.
Difficulties changing bank A's mode through CLEO. Changes in Device explorer - seems to be a CLEO VEGAS window problem.
Run Mode 10 config through Astrid again
Bank A now goes (c.f. dev explorer) to Mode 23
Ray tries controlling through VEGAS coordinator
Works for Ray through CLEO. Could this be a config tool problem?
The problem is in my script! I was setting vegas.subband = None, which meant the config tool could pick Mode 23 (23.44 MHz BW, 32768 channels, 8 windows) instead of Mode 10! Modify regress.py accordingly.
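The ambiguity can be illustrated with a small sketch. This is not the config tool's actual selection logic; the table below is a hypothetical two-entry subset, the single window for Mode 10 is an assumption, and the `nwin` argument only stands in for the role vegas.subband plays in the script.

```python
# Hypothetical two-entry mode table, for illustration only.
# Mode 23 values are from the log entry above; nwin=1 for Mode 10 is an assumption.
MODES = {
    10: {"bw_mhz": 23.44, "nchan": 32768, "nwin": 1},
    23: {"bw_mhz": 23.44, "nchan": 32768, "nwin": 8},
}


def matching_modes(bw_mhz, nchan, nwin=None):
    """Return every mode consistent with the request.

    Leaving nwin as None (analogous to vegas.subband = None) means the
    number of windows does not constrain the match.
    """
    return sorted(
        mode
        for mode, params in MODES.items()
        if params["bw_mhz"] == bw_mhz
        and params["nchan"] == nchan
        and (nwin is None or params["nwin"] == nwin)
    )


ambiguous = matching_modes(23.44, 32768)        # windows unspecified
explicit = matching_modes(23.44, 32768, nwin=1)  # windows pinned down
```

With the number of windows left unspecified, both modes satisfy the request, so which one is chosen is up to the tool; pinning it down leaves only Mode 10.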
Fix up the scripts - runs modes 1 through 29 with no problems
Ray and Ryan do pulsar testing
4:30pm - switch back to new manager, new hpc, spectral line mode
4:30pm run RunRegressLSP - starting from mode 1
Deliberately abort after Mode 7 to decrease a wait time in VEGAS - need to cycle managers off and on
4:42pm re-submit
UT 20:43:36 - Bank C took too long to be ready: 2801ms - Ray increases wait time again.
4:50pm run RunRegressLSP - starting from mode 1, 60 sec scans
UT 20:51:56 - VEGAS taking too long to be ready - cannot get into Mode 4 again. Ray changes back to original delay.
UT 21:01:45 - fails on bank 4 again
Ray makes one more fix
5:50pm fill up the queue, and leave running with Amber
8:00am the next day...
Appears to have run all night. There were two instances where scans were out of sequence, indicating an Abort / Restart. From the CLEO message analyzer, these seem to have been related to "Astrid hangs". There were no relevant VEGAS messages at those times, but there was evidence that grail was restarted.
Scan start times
I wrote a little python script (~rprestag/vegas/regression/checkTimes.py) to check that STRTMJD in the primary header agrees with the start time encoded in the file name. I did not check that the MJDs in the DATA table are correct (no reason to suspect they aren't). The STRTMJDs were always in agreement, apart from when they weren't there! It appears that at least sometimes (maybe always) when a scan gets aborted, this header keyword is absent.
See for example /lustre/gbtdata/TGBT15A_915_111/VEGAS/2017_08_09_14:52:14D.fits
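A minimal sketch of this kind of check follows. It is not the contents of checkTimes.py. It assumes the file name encodes a UT start time as YYYY_MM_DD_HH:MM:SS followed by a bank letter (as in the example path above), that STRTMJD is a floating-point MJD in days, and that a one-second tolerance is appropriate. The function names are hypothetical. Reading the keyword from the file is left to the caller (for instance via astropy.io.fits), so a missing keyword is passed in as None.

```python
import re
from datetime import datetime, timezone

# MJD 0 corresponds to 1858-11-17 00:00 UTC.
MJD_EPOCH = datetime(1858, 11, 17, tzinfo=timezone.utc)

# Assumed file name pattern: YYYY_MM_DD_HH:MM:SS<bank letter>.fits
NAME_RE = re.compile(
    r"(\d{4})_(\d{2})_(\d{2})_(\d{2}):(\d{2}):(\d{2})([A-Z])\.fits$"
)


def mjd_from_filename(path):
    """Convert the start time encoded in the file name to an MJD in days."""
    match = NAME_RE.search(path)
    if match is None:
        raise ValueError("unrecognised file name: %s" % path)
    year, month, day, hour, minute, second = (int(g) for g in match.groups()[:6])
    start = datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)
    return (start - MJD_EPOCH).total_seconds() / 86400.0


def check_start_time(path, strtmjd, tol_sec=1.0):
    """Compare a header STRTMJD value (or None if absent) with the file name.

    Returns "missing", "ok" or "mismatch".
    """
    if strtmjd is None:
        return "missing"
    delta_sec = abs(float(strtmjd) - mjd_from_filename(path)) * 86400.0
    return "ok" if delta_sec <= tol_sec else "mismatch"


example = "/lustre/gbtdata/TGBT15A_915_111/VEGAS/2017_08_09_14:52:14D.fits"
example_mjd = mjd_from_filename(example)
```

Reporting "missing" separately from "mismatch" makes the aborted-scan case stand out from a real timing disagreement.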