TGBT15A_915_131 - Tests over Thanksgiving Shutdown

Goals

Take advantage of the time available during the Thanksgiving shutdown to:
  • Thoroughly test all spectral line modes using position and frequency switching
  • Thoroughly test all pulsar observing and dedispersion modes (incoherent DD and coherent DD, search, cal, and fold modes)
  • Test new changes to config tool and managers that should allow for smooth switching between spectral line and pulsar modes
  • Test changes to network initialization on ROACHs that should prevent ARP storms (and possibly help with lost packets in coherent 1500 MHz modes)
  • Test offline script validation

Details

  • Ray began switch to regression candidate version at approximately 18:30
  • Ryan took control at approximately 19:20
  • L-Band is in focus, telescope is parked. Operator has left and Ryan will do tests during shutdown.
  • The first round of tests will use the SpecLineTestsTP SB. This will do on/off observations in all 29 spectral line modes with maximum number of windows, 4-sec integrations, and 60-sec scan lengths
    • SB submitted at 19:22
    • First few scans seem to be running alright. Ryan will monitor while doing other holiday things.
    • SB ended about 20:45. Scan numbers from this block are 1 -- 58
    • Quick spot checks look good. HI is still a thing.
  • Will now run SpecLineTestsFS SB. This will do frequency switched observations in all 29 spectral line modes with maximum number of windows, 4-sec integrations, and 60-sec scan lengths.
    • SB Submitted at 20:49
    • Astrid had an "Aborting" state briefly, but there were no CLEO messages or errors in VEGAS. It seemed to clear as configuration continued but then Astrid asked to abort. The only obvious issue is an "LO1 activate inhibitted because of illegal value".
      • LO1 does not like the switching frequencies of -2,2 MHz.
      • Upon further investigation, I had swmode='sw' instead of 'sp'. This did not cause a config tool error when I saved and validated the script but it seems to have confused the LO manager, since it didn't know it was supposed to be frequency switching. Seems happier after correcting this.
    • Back up and running. SB resubmitted at 20:59
    • Scans not showing up in gbtidl. SDFITS does seem to be running. Data is on lustre. Suspect that the filler is not able to keep up.
    • Bank B died during scan #82 (mode 24) and caused an abort. Banks B, D, F, G, and H turned off. I ignored the abort and let the block continue to see how things recovered. Next scan started but the banks that were turned off did not turn back on.
      • Letting the block continue. I'm going to run this block again with longer integration times. We'll see if this happens again, and if it happens in the same mode.
    • Seem to have lost connection to some other banks during scan #87 (mode 29). Banks A and E are the only ones still running.
      • Astrid threw an abort. CLEO messages reported "Bank C died during a scan. Aborting scan."
    • SB ended at 21:47
    • Scan numbers in this block are 59 -- 87 (note: no scan #58)
  • Will run same block with same settings, except using 10-sec integrations to see if that helps with online SDFITS.
    • Did an off/on cycle of VEGAS
    • SB submitted at 21:48.
    • VEGAS running well at start.
    • Online SDFITS is filling now
    • Bank H died during scan #95 (mode 8). Will abort and restart block but will skip modes 1--7. Note that the following scans will have "sp" instead of "tp" in the source name (I apparently didn't change that in the first few runs of this block).
    • Another abort during scan #112 (mode 24). Bank B died again. Will resubmit starting with mode 24.
      • After a VEGAS off/on cycle, mode 24 scan ran fine
    • Scan #118 (mode 29) aborted. Bank B again. Banks B and D--H all also turned off their managers.
      • Cycled VEGAS off/on and resubmitted mode 29 scan.
    • Scan numbers in this block are 88 -- 119.
    • SB ended at 23:07
  • Moving on to pulsar modes.
  • Incoherent modes will use fftshift = 0xaaaaaaaf
  • Using IncoherentModeTestsCal SB. This will test all 800 and 1500 MHz incoherent DD, cal observing modes using 65-sec scan lengths.
    • SB submitted at 23:56.
    • i0800x0064 bandpass looks bad. This might still be a bad BOF file.
    • Lots of artifacts in the i0800x0256 phase plot. May be related to FFT overflow. Seems to have similar structure in frequency.
    • i1500x0064 bandpass also looks bad.
    • Lots of artifacts in the i1500x0256 phase plot. May be related to FFT overflow. Seems to have similar structure in frequency.
    • Scans in this block are numbers 120 -- 135
    • SB ended at 00:27
    • No issues with VEGAS managers.
  • Will now use IncoherentModeTestsFold SB. This will test all 800 and 1500 MHz incoherent DD, fold observing modes using 65-sec scan lengths. The parfile is appropriate for the 25 Hz cal, which will be on. Note that there will be a known cal frequency drift due to the unnecessary barycentric correction when generating polycos.
    • SB submitted at 00:34
    • Scans 136 -- 137 were accidentally in search mode. Stopping SB and resubmitting to fix.
    • Got an abort with "BankAMgr hung in aborting" (scan #138)
      • This seems to be because the Manager does not know where to find tempo.
      • Edited the path_list and lib_path entries in vegas.conf to include pulsar software. Also did a full stop/start cycle on VEGAS to pull in changes.
    • Resubmitting at 00:54. Looks good now.
    • Fold mode scans in this block are numbers 139 -- 154
    • No further issues.
  • Will now use IncoherentModeTestsSearch SB. This will test all 800 and 1500 MHz incoherent DD, search observing modes using 65-sec scan lengths. Diode will still be on so that the data can be folded and checked offline.
    • SB submitted at 01:24
    • vegas_status reported high dropped packet rate in scan #162 (i0800x4096). May need to use longer integration times than 40.96 us.
    • vegas_status reported high dropped packet rate in scan #170 (i1500x4096). May need to use longer integration times than 40.96 us.
    • Scans in this block are numbers 155 -- 170
  • Will move on to coherent DD modes.
  • Will use CoherentModeTestsCal SB. This will test all 800 and 1500 MHz coherent DD, cal observing modes using 65-sec scan lengths.
    • SB submitted at 01:55
    • Samplers on Banks B -- H are not off. This is new behavior, but perhaps expected given the work Ray and Joe did this week? Doesn't seem to affect data quality.
    • Looks like the 2048 channel modes experience a high dropped packet rate for both bandwidths, but the c0800x4096 channel modes do not. Will need to investigate further.
    • ROACH samplers turned off during c1500x4096 scan, and NETSTAT hung in waiting. No aborts triggered, but it doesn't seem like we got any packets. No data written.
    • No obvious 1-ms spurs in c1500 modes! That is great!
    • Scans in this block are numbers 171 -- 183
  • Ryan is going to go to bed but will run the CoherentModeTestsFold and CoherentModeTestsSearch SBs overnight. These will test all the 800 and 1500 MHz coherent DD, fold and search observing modes using 65-sec scan lengths. As with incoherent modes, the 25 Hz diode will be left on. The fold modes will use a par file to fold at this frequency but we expect drift due to unnecessary barycentric corrections. Will also submit the IncoherentModeTestsScales and CoherentModeTestsScales SBs to test different values of vegas.scale for all the modes.
    • SBs submitted to queue at 02:26
    • Seems as though the Bank A ROACH is not responding. Got an abort with faults in several registers.
    • Was able to manually load the c0800x0064 BOF.
    • Resubmitted SB at 02:32, but skipping c1500x4096 mode to avoid possible issues with ROACH.
    • Seems to be OK now.
    • Ryan signing off at 02:35.
  • Ryan signing back on 08:44.
    • No apparent issues overnight! There are no messages in CLEO, uncleared or otherwise. There are no missing scans at all. This is excellent.
    • Note that I did seem to accidentally use coherent cal for the c0800x4096 and all the c1500 modes, instead of coherent fold. If there is time I'll run through these again, but it doesn't seem crucial. Fold modes seem to be working well. The data are there, just with the wrong obs mode.
    • The IncoherentModeTestScales SB is currently running and is on the i1500x2048 modes. I am going to cancel the CoherentModeTestsScales SB since the levels seemed OK last night, and it would be better to test the switch between spectral line and pulsar modes.
    • This group of SBs finished at around 09:36
    • Scans are 184 -- 434
  • Will now test switching between Modes 1, 20, i0800x0512, and c0800x0512. Mode 1 will use 4 windows, and Mode 20 will use 64. All scans will be 65 seconds long. Using a permutation over scan types so all modes will be tested in all combinations. Python itertools.permutations is sweet.
  • Using TestModeSwitching SB.
    • Submitted at 09:41. Ryan is going to have some breakfast and coffee, chill with his nephew.
    • Scans in this block are numbered 435 -- 531
    • Not all banks are balancing reliably, but so far the switch between modes seems fine.
    • Checking in at 10:27. Seems like we may have had some issues:
      • At 10:02 there was a problem with vegasr2-1: Valon frequency 800 MHz not equal to that set.
      • At around 10:13 it looks like the manager on Bank E crashed. Core in /home/gbt/etc/cores/core.23996. Seems related to a bad switching state. vegas-hpc5 reports "Illegal value in BankEMgr actual_switch_period"
      • At 10:23 Switching Signal Selector reported "Compare error in Sig/Ref"
      • None of the above stopped the SB from executing. In all cases things seemed to recover automatically.
    • At 10:45 VEGAS reported "LBW Balance failed. Check Digital IF snaps" during mode 20 scan. Scan continued and there were no errors in Astrid. Cleared on next scan.
    • At 10:48 VEGAS reported "IF Balance failed for VEGAS" in mode 1. Scan continued.
    • Looks like we had a manager crash at 10:38 on Bank G. TaskMaster on vegas-hpc7 reported "vegasManager: vegasManager Terminated abnormally:signal = 6". Also, there was a core dump: "TasMaster Failed to unlink /home/gbt/etc/cores/core.4935". Seemed to recover on its own.
    • At 11:11 Bank G got "Illegal value in BankGMgr actual_switch_period"
    • At 11:27 there was a fault and abort. Bank G reported "Aborting: VEGAS HPC program taking too long to be ready: 2803 mS".
      • This occurred when going from i0800x0512 with one bank to c0800x0512.
      • Scan number on failure was 488.
    • Picking the permutations block back up near the iteration where this failed. Cycled Bank G off/on first.
    • Resumed SB.
    • At 12:05, "IF Balance failed for VEGAS" in Mode 1. Cleared on next scan.
    • At 12:11 got a couple of crashes on Banks G and H. This was while going from Mode 20 to i0800x0512. Cores are core.11255 and core.5891. Same problem with illegal value in Bank[GH]Mgr actual_switch_period. Once again, recovered on its own.
    • LBW balance for mode 20 seems to have failed on Bank H at 12:14. This was the first scan on this bank after the Manager crashed.
    • Another failed LBW balance in mode 20 at 12:22
    • May have gotten another manager crash around 12:34. Core is core.17813
    • VEGAS reports balance failed at 13:12
    • During c0800x0512, noticed the following in Astrid:
      [18:14:20] Warning: Configuration complete but inconsistencies were found between the M&C system and the configuration tool. Verify the configuration 
      [18:14:20] ('(manager,   ', 'parameter,  ', 'expected value, ', 'actual value )')
      [18:14:20] ('ConverterRack', 'Gfrequency,1', '13100.0', '12750')
      I did not notice if this happened more frequently. Will have to check the Astrid log.
    • SB ended at 13:30.
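The permutation scheme used for the TestModeSwitching SB can be sketched in plain Python. This is a minimal illustration, not the actual scheduling block: the mode labels and the helper names scan_sequence and transition_counts are placeholders chosen here, and the real SB would also configure and run each scan in Astrid. It shows that running every ordering of the four modes exercises each mode-to-mode switch in both directions.

```python
from collections import Counter
from itertools import permutations

# Placeholder labels for the four configurations named in the log.
MODES = ["Mode1", "Mode20", "i0800x0512", "c0800x0512"]


def scan_sequence(modes):
    """Concatenate every ordering of `modes` into one flat list of scans."""
    return [mode for ordering in permutations(modes) for mode in ordering]


def transition_counts(modes):
    """Count each (from, to) switch that occurs inside an ordering."""
    counts = Counter()
    for ordering in permutations(modes):
        counts.update(zip(ordering, ordering[1:]))
    return counts


sequence = scan_sequence(MODES)
counts = transition_counts(MODES)
print(len(sequence), "scans in one full pass")
print(len(counts), "distinct ordered switches")
```

With four modes there are 24 orderings, so one full pass is 96 scans, and each of the 12 ordered switches occurs 6 times within the orderings. The switch from the end of one ordering to the start of the next is not counted here.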
  • Going to collect some raw mode data for Natalia using the RawModeTests SB. This will take raw mode scans in 800x0064 and 1500x0064 modes, hopefully with good balancing. Diode will be on.
    • SB submitted at 13:31
    • 800 MHz BW mode ran well, didn't notice dropped packets on Bank A
    • 1500 MHz BW mode ran but with quite a bit of dropped packets. NETSTAT kept going from receiving to blocked.
  • Going to go back and re-run the CoherentModeTestsFold SB so that we actually use fold mode for the c0800x4096 and c1500 modes. Will also turn on the daq_server agent and run the multi-bank status monitor, as well as the autoplot routine.
    • SB submitted at 13:39
    • Manager is printing a DBUG message about the pid of the codd_net thread. That can probably be removed.
    • Autoplotter crashed with some psrplot errors. This is something Ryan will have to look into.
      psrplot: error while processing /tmp/vpm/D/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=0 nsubint=0                                                     
      
      psrplot: error while processing /tmp/vpm/D/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Profile (Archive, Index subint, pol, chan)                  
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=0 nsubint=0                                                     
      
      psrplot: error while processing /tmp/vpm/D/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=4294967295 nsubint=0                                            
      
      psrplot: error while processing /tmp/vpm/D/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Profile (Archive, Index subint, pol, chan)                  
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=4294967295 nsubint=0                                            
      
      psrplot: error while processing /tmp/vpm/D/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::Archive::start_time                                             
      Error::InvalidState                                                             
      Error::message                                                                  
              no Integrations                                                         
      
      psrplot: error while processing /tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=0 nsubint=0                                                     
      
      psrplot: error while processing /tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Profile (Archive, Index subint, pol, chan)                  
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=0 nsubint=0                                                     
      
      psrplot: error while processing /tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=4294967295 nsubint=0                                            
      
      psrplot: error while processing /tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Profile (Archive, Index subint, pol, chan)                  
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=4294967295 nsubint=0                                            
      
      psrplot: error while processing /tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::Archive::start_time                                             
      Error::InvalidState                                                             
      Error::message                                                                  
              no Integrations                                                         
      
      psrplot: error while processing /tmp/vpm/H/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=0 nsubint=0                                                     
      
      psrplot: error while processing /tmp/vpm/H/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Profile (Archive, Index subint, pol, chan)                  
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=0 nsubint=0                                                     
      
      psrplot: error while processing /tmp/vpm/H/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack                                                                    
              Pulsar::get_Integration (Index)                                         
              IntegrationManager::get_Integration                                     
      Error::InvalidRange                                                             
      Error::message                                                                  
              isubint=4294967295 nsubint=0                                            
      
      psrplot: error while processing /tmp/vpm/H/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:                                                                     
      Error::stack
              Pulsar::get_Profile (Archive, Index subint, pol, chan)
              Pulsar::get_Integration (Index)
              IntegrationManager::get_Integration
      Error::InvalidRange
      Error::message
              isubint=4294967295 nsubint=0
      
      psrplot: error while processing /tmp/vpm/H/vegas_58080_67728_c0800x1024_fold_0544_0001.scr:
      Error::stack
              Pulsar::Archive::start_time
      Error::InvalidState
      Error::message
              no Integrations
      
      
      Error::stack
              Pulsar::Archive::dedisperse
              Pulsar::FITSArchive::load_Integration
              Pulsar::ProfileColumn::load_amps<>
              ProfileColumn::load_amps
      Error::FailedCall
      Error::message
              Error reading subint data iprof=3/4 ichan=127/128
              colnum=17 firstrow=5 firstelem=1046529 nelements=2048: tried to move past end of file
      
      psrplot: '/tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr' not found
      psrplot: please specify filename[s]
      psrplot: '/tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr' not found
      psrplot: please specify filename[s]
      psrplot: '/tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr' not found
      psrplot: please specify filename[s]
      psrplot: '/tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr' not found
      psrplot: please specify filename[s]
      psrplot: '/tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr' not found
      psrplot: please specify filename[s]
      Traceback (most recent call last):
        File "vpmCoherentAutoplot.py", line 80, in <module>
          os.remove(tmpfile)
      OSError: [Errno 2] No such file or directory: '/tmp/vpm/E/vegas_58080_67728_c0800x1024_fold_0544_0001.scr'
    • Noticing some dropped packets on certain banks in the c1500 modes. Seems to be an issue for 128, 256, 512 channels. In these scans it was limited to Banks F, G, and H.
    • Noticing error messages in the Manager log on Bank A about not being able to set priority for net thread. I don't think this is critical but it would be good to make it look less scary if that is the case.
    • Still getting packet loss issues in the c1500x2048 modes (on all banks)
    • Samplers turned off again on Bank A. Seems to be the same issue with the ROACH as last night. Will re-program to c0800x0064.
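The Python traceback in the autoplotter output above ends with os.remove(tmpfile) raising OSError because the .scr file was already gone. One possible way to make that cleanup step tolerant is sketched below, assuming that a missing temporary file is harmless at that point. remove_if_present is a hypothetical helper, not code from vpmCoherentAutoplot.py. The last line is a side note on the psrplot output: isubint=4294967295 is what -1 looks like as an unsigned 32-bit integer, which would be consistent with psrplot asking for the last subintegration of a file that has nsubint=0.

```python
import errno
import os


def remove_if_present(path):
    """Delete path. Return True if removed, False if it was already gone."""
    try:
        os.remove(path)
    except OSError as err:
        # Only swallow "No such file or directory"; re-raise anything else.
        if err.errno != errno.ENOENT:
            raise
        return False
    return True


# -1 wrapped into an unsigned 32-bit integer, as in "isubint=4294967295".
WRAPPED_MINUS_ONE = (0 - 1) % 2**32
```

This only hides the final crash. The empty-archive errors from psrplot (nsubint=0) would still need to be looked at separately.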
  • It is 14:11 and I think I am just about done. I am going to get the CoherentModeTestScales SB running again and just let it go, but I will also tell Ray and Joe that they can just kill it whenever they are ready to do the software switch back.
    • SB submitted at 14:13.
    • Ryan is not going to monitor too closely now. In-laws are here and Thanksgiving stuff is starting. But we'll see how this runs!
  • The previous SB finished at some point. Joe switched back to version 16.3
  • Joe ran some test scans under session number 132. These are scans 684 -- 690.
  • Ryan ran CheckMode1, CheckMode4, CheckMode10, and CheckMode20 SBs.
    • All check out. Scans are 691 -- 695.
  • Session ends at 20:25 on Nov 23.
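
The scan ranges recorded above can be cross-checked with a few lines of Python. This is only a bookkeeping sketch: the ranges are copied from this log, while the block labels and the helpers scan_count and find_gaps are placeholders chosen here.

```python
# (label, first scan, last scan) as recorded in the log; labels are informal.
BLOCKS = [
    ("SpecLineTestsTP", 1, 58),
    ("SpecLineTestsFS, 4-sec", 59, 87),
    ("SpecLineTestsFS, 10-sec", 88, 119),
    ("IncoherentModeTestsCal", 120, 135),
    ("IncoherentModeTestsFold", 139, 154),
    ("IncoherentModeTestsSearch", 155, 170),
    ("CoherentModeTestsCal", 171, 183),
    ("Overnight SBs", 184, 434),
    ("TestModeSwitching", 435, 531),
    ("Session 132 test scans", 684, 690),
    ("CheckMode SBs", 691, 695),
]


def scan_count(first, last):
    """Number of scans in an inclusive range."""
    return last - first + 1


def find_gaps(blocks):
    """Return (first, last) scan ranges not covered by consecutive blocks."""
    gaps = []
    for (_, _, prev_last), (_, next_first, _) in zip(blocks, blocks[1:]):
        if next_first > prev_last + 1:
            gaps.append((prev_last + 1, next_first - 1))
    return gaps


for label, first, last in BLOCKS:
    print(label, scan_count(first, last))
print("gaps:", find_gaps(BLOCKS))
```

The two gaps it reports are 136 -- 138 (the accidental search-mode scans and the aborted scan #138) and 532 -- 683, which covers the later SBs for which no scan numbers were written down.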

Conclusions

  • All in all, this was a very successful session.
    • Total power spectral line scans executed in all modes without any problems.
    • Most frequency switched scans executed without any problems. See below for a few exceptions.
    • Most incoherent cal, fold, and search mode pulsar scans executed without any problems once vegas.conf was modified to include paths for tempo. There are some potential issues with the 64 and 256 channel BOF files but these seem unrelated to the M&C system (see below for details).
    • Most coherent cal, fold, and search mode pulsar scans executed without any problems. See below for a few exceptions.
    • Related to the c1500 modes, a spot check of the data shows no sign of the 1-ms dropped packet spurs. Some of the higher channel modes had high dropped packet rates, but this seems to be more closely related to data rates than networking issues.
    • Switching between spectral and pulsar modes went mostly smoothly, but there were some errors that we will need to look into more closely.
    • Offline validation works.
    • No issues with coherent modes while running the monitoring utilities.
    • c0800x0064 raw mode scan seemed to run well.
  • A few issues that we do need to look at:
    • VEGAS is regularly sending out "Test message (please ignore)". It would be good to turn this off.
    • Bank B seemed to have issues with some of the frequency switched scans. Specifically with modes 24 and 29.
    • Bandpasses in i0800x0064 and i1500x0064 are bad. This may be a case of having the wrong BOF file. Ryan will need to check this.
    • There seem to be a lot of artifacts in the i0800x0256 and i1500x0256 scans. Need to investigate if this is related to FFT overflow issues.
    • There were no packets being sent in the c1500x4096 mode. This seemed to be an issue with the ROACH. Need to investigate further.
    • During the mode-switching tests, there were more frequent than usual errors while balancing, and a couple of manager crashes. Mostly related to illegal values of actual_switch_period. All these errors seemed to self correct without any intervention but we'll need to look into the causes. Bank G also failed to activate in time during a coherent mode scan. That did not self correct and required an off/on cycle, but I was able to resume after that.
    • c1500x0064 raw mode scan had lots of dropped packets and NETSTAT was frequently blocked. Maybe a data rate issue.
  • Some minor things:
    • Need to check the current vegas.conf into the working branch, since I made changes to the path lists.
    • Need to ask Ray and Joe if samplers should now remain on for Banks B -- H in coherent modes.
    • Can probably remove DBUG and non-critical error messages in Manager logs.
-- RyanLynch - 2017-11-23
Topic revision: r6 - 2017-11-23, RyanLynch