In previous tests, VEGAS did not balance independently when used in dual-backend mode with GUPPI. Issuing a Balance("VEGAS") caused Astrid to report that balancing had failed, with an option to abort, even though the IF rack appeared to be balanced. Ray has made changes to the balancing routine, so today we will test them and try to resolve any remaining balancing issues.
Details
Session begins at 12:00
Ray previously switched versions to the regression test candidate
Also got a "VEGAS HPC program taking too long to be ready" fault. Bank B aborted.
We do not appear to be in pulsar mode. Taking another scan
SB submitted at 12:12
Scan #2 is i0800x0512
Same behavior. The manager has the right value for mode, but shared memory is configured for spectral line and guppi_daq is not running. We seem to be in matrix-based spectral line mode.
Trying to stop the vegas_matrix_server through Task Master.
Task Master just hung and didn't seem to stop the server.
It seems like Task Master completely died on vegas-hpc1 (from where I was trying to stop the matrix server)
Ray is putting things back into pulsar mode.
Restarting at 12:41
Scan #3 is c0800x0512
Astrid aborted. No CLEO error message, but the log file for vegas-hpc1 says "ERROR: No CUDA-capable device found"
Banks B-H did take some data, but Bank A did not because of the CUDA error.
Ray is not sure what went wrong on bank A, but guppi_daq was restarted successfully.
Resubmitted SB at 12:55, but once again Bank A issued an abort. It cleared quickly but the scan was not successful.
Resubmitted SB at 12:58. Scan started but then aborted part way through. More aborts at end of scan. No messages in CLEO. Not sure what is happening.
Resubmitted SB at 13:01. This time Bank C didn't start properly and aborted. CLEO message from bank C is "VEGAS HPC program taking too long to be ready"
Incidentally, we took the Balance("VEGAS") call. We confirmed that the IF rack is balanced for GUPPI but not for VEGAS.
Ray was unable to get the manager working. There were issues with loading the correct BOFs and with getting the correct MAC addresses for the HPC nodes.
Conclusions
We did not resolve the balancing issue because other problems got in the way. However, I confirmed that we can easily run a server on each node to access shared memory parameters via vpmHPCStatus.py (the analog of guppi_gpu_status). We should be able to run the server via Task Master.
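As a rough illustration of the per-node server idea, here is a minimal sketch using only the Python standard library. It is not the vpmHPCStatus.py implementation: read_status() is a hypothetical placeholder for the shared memory read, the key names ("mode", "daq_running") are made up for the example, and the choice of HTTP with a JSON body is an assumption.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_status():
    """Hypothetical placeholder for reading the node's shared memory status.

    A real server would read the status buffer here; the keys and values
    below are invented for illustration only.
    """
    return {"mode": "c0800x0512", "daq_running": False}


class StatusHandler(BaseHTTPRequestHandler):
    """Answer every GET with the current status parameters as JSON."""

    def do_GET(self):
        body = json.dumps(read_status()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress the default per-request logging to stderr.
        pass


def start_server(host="127.0.0.1", port=0):
    """Start the status server in a background thread (port 0 = any free port)."""
    server = HTTPServer((host, port), StatusHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def query_status(host, port):
    """Fetch the status dictionary from one node's server."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/")
        return json.loads(conn.getresponse().read().decode("utf-8"))
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_server()
    host, port = server.server_address[:2]
    print(query_status(host, port))
    server.shutdown()
    server.server_close()
```

With one such server per HPC node, a client could loop over the node hostnames and call query_status() on each to check, for example, whether every bank reports the expected mode before a scan starts.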