TGBT15A_915_126 - Coherent Pulsar Mode Troubleshooting

Goals

  • During last week's tests the coherent pulsar modes would not run because of an incorrect MAC address for the 10-gigE ports on vegas-hpc3 and vegas-hpc4. I will try to recreate and investigate this issue. I will also look into the 1-ms packet loss artifacts in the c1500x0128 through c1500x4096 modes.
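
One way to check the MAC address a Linux host reports for an interface is to read it from sysfs. This is a minimal sketch and not part of the VEGAS software: the helper names, the interface name, and the expected address are all hypothetical, and it assumes the address of interest is the one on the HPC host's own port rather than a value configured elsewhere in the system.

```python
import os


def read_mac(iface, sysfs_root="/sys/class/net"):
    """Return the MAC address Linux reports for iface, in lower case."""
    # On Linux, /sys/class/net/<iface>/address holds the hardware address.
    path = os.path.join(sysfs_root, iface, "address")
    with open(path) as f:
        return f.read().strip().lower()


def mac_matches(iface, expected, sysfs_root="/sys/class/net"):
    """Compare the reported MAC with an expected value, ignoring case."""
    return read_mac(iface, sysfs_root) == expected.strip().lower()
```

For example, `mac_matches("eth2", "aa:bb:cc:dd:ee:ff")` would return True only if the (hypothetical) interface eth2 reports that address. The real 10-gigE interface names on vegas-hpc3 and vegas-hpc4 would need to be substituted.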

Details

  • Session begins 14:15
  • Joe switches versions to regression test candidate
  • Will use CoherentModeTests scheduling block and L-Band
  • SB submitted at 14:29
    • Scan #34 is c0800x0064 (forgot to reset scan numbers)
      • Scan seems to run OK, although there is a lot of packet loss on vegas-hpc4
      • Getting some packet loss on other banks as well, though not as much. Little bursts of maybe a few %. Haven't seen this with this mode before...
    • Scan #35 is c0800x0512
      • Same problems with vegas-hpc4 and some small packet loss on other banks.
    • Both scans seem to run alright but Astrid throws an abort at the end of the scan. Will look more closely at scan coordinator to see when this happens.
  • SB submitted at 14:49
    • Scan #36 is c0800x0512
    • Scan aborted. Faults on Banks C and D.
    • I suspect this is the MAC address issue. I am not sure how to check the MAC addresses.
    • One possibility (however remote) is that my server script for checking the HPC status is causing some issue. I'll try killing that and see if it makes any difference.
  • Cycling VEGAS Off/On
  • SB submitted at 14:55
    • Scan #37 is c0800x0512
    • Still aborted, but this time it was only Bank C
  • Not really sure where to go from here. Will look into incoherent mode scaling instead.
  • Switching to IncoherentModeTests scheduling block.
  • Will run through incoherent modes. fftshift is set to 0xaaaaaaaf
  • SB submitted at 15:04
    • Scan #38 is i0800x0064
      • This is the one with the screwy bandpass, but levels look OK
      • Got an abort at end of scan, but continuing
    • Scan #39 is i0800x0128
      • Levels are too low
      • Otherwise seems to have gone smoothly
    • Scan #40 is i0800x0256
      • Levels are low but the bandpass monitoring plot has some periodic structure in the max values. Will need to look at data
      • VEGAS tried to abort about 10 seconds too early with "Abort due to scan terminating too early -- Try cycling Vegas Off/On"
    • Scan #41 is i0800x0512
      • Very good agreement between GUPPI and VEGAS levels
      • Once again got abort about 10 seconds before end of scan with same error message
    • Scan #42 is i0800x1024
      • Levels are good.
      • Data seems to be flowing in VEGAS too soon. The bandpass plot appears before the scan coordinator says the scan has started and while VEGAS is still in a committed state. I wonder if this is why the scans seem to be ending early.
      • Abort and early termination again.
    • Scan #43 is i0800x2048
      • Levels are low
      • Data flow on VEGAS once again seems to start too soon
      • Abort again
    • Scan #44 is i0800x4096
      • Levels are good
      • Data flow starts early again, maybe 14 seconds into "Countdown" in scan coordinator
      • No abort. Interesting. I was running GUPPI during the previous scans but not this one. I wonder if that has something to do with it....
    • Scan #45 is i1500x0064
      • Bandpass monitor died with:
        /home/pulsar64/lib/python2.7/site-packages/setuptools-3.4.4-py2.7.egg/pkg_resources.py:1031: UserWarning: /users/rlynch/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
        Traceback (most recent call last):
          File "./vpmMonitor_noninteractive.py", line 57, in <module>
            main_spec = data[:nspec_sum, 0].mean(0)
        IndexError: too many indices for array
      • Need to look into this
      • No abort
    • Scan #46 is i1500x0128
      • Bandpass monitor died again
      • Levels too low
      • No abort
    • Pausing SB to look at monitor issue
    • Restarting at 15:23
    • Scan #47 is i1500x0256
      • Levels are a little low
      • No abort
    • Scan #48 is i1500x0512
      • Levels are low
      • No abort
    • Scan #49 is i1500x1024
      • Levels are good
      • No problem with bandpass plotter
      • No abort
    • Scan #50 is i1500x2048
      • Levels are OK
      • No problem with bandpass plotter
      • No abort
    • Scan #51 is i1500x4096
      • Bandpass plotter crashed. Data buffer had wrong shape. Maybe some issue when things are activating. Should be easy to catch
      • Levels look OK
      • No abort
  • Will adjust levels on i0800x0064, i0800x0128, i0800x0256, i0800x2048, i1500x0064, i1500x0128, i1500x0256, i1500x1024
  • SB submitted at 15:38
    • Scan #52 is i0800x0064
      • Scan aborted with "Aborting: failed to start the VEGAS HPC subprocess"
  • Resubmitted at 15:41
    • Scan #53 is i0800x0064
      • Aborted
      • Seems like guppi_daq is hung in running
  • Cycling VEGAS Off/On
    • DAQSTATE is still running
  • Doing a stop/start on Bank A in TM
    • DAQSTATE switched to exiting
  • Resubmitting at 15:47
    • Scan #54 is i0800x0064
      • DAQSTATE definitely switches to running before the rest of the stuff in scan coordinator. So scan starts about 10-12 s early, then ends 10-12 s early, causing an abort. But this only seems to happen in dual backend mode.
      • Bandpass is bad
      • Didn't actually trigger an abort that time
    • Scan #55 is i0800x0128
      • Levels are still a bit low
      • Got an abort that time. DAQSTATE switched to off too soon.
    • Scan #56 is i0800x0256
      • Levels are OK. Could be a bit higher, but nothing too much to worry about
      • Scan seemed to end OK that time
    • Scan #57 is i0800x2048
      • DISKSTAT started writing about 10 s too early
      • Levels are still too low
      • Aborted early
    • Scan #58 is i1500x0064
      • Bandpass still weird
      • No abort, things seem to have started OK.
    • Scan #59 is i1500x0128 (seem to have got my scan numbers mixed up here)
      • Levels are OK
      • No abort
    • Scan #60 is i1500x0256
      • Levels look good
      • No abort
    • Scan #61 is i1500x0512
      • Levels look good
  • Calling Joe to switch versions at 16:04
  • Session ended at 16:
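
The bandpass monitor crashes on scans #45, #46 and #51 came from indexing a data buffer with an unexpected shape. A guard along the following lines is one way to catch that case. This is only a sketch, not the actual fix made to vpmMonitor_noninteractive.py: the function name and the `pol` argument are hypothetical, and it assumes only that the buffer should be at least 2-D with spectra along the first axis, as the failing expression `data[:nspec_sum, 0].mean(0)` implies.

```python
import numpy as np


def mean_spectrum(data, nspec_sum, pol=0):
    """Average the first nspec_sum spectra, or return None if the buffer is not ready."""
    data = np.asarray(data)
    # A buffer with fewer than two dimensions is what raises
    # "IndexError: too many indices for array" in the original expression.
    if data.ndim < 2:
        return None
    # Nothing to average yet, or the requested second-axis index is missing.
    if data.shape[0] == 0 or data.shape[1] <= pol:
        return None
    return data[:nspec_sum, pol].mean(0)
```

A caller could skip the plot update whenever this returns None and try again on the next read, rather than letting the monitor exit.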

Conclusions

  • Didn't make much progress on the issues with the coherent modes
  • Did fix some issues with the bandpass plotter
  • Discovered problems with VEGAS arm/stop times when configuring with GUPPI. Ray or Joe will need to look into this.
  • Tweaked vegas.scale values for incoherent modes.
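
The arm/stop timing problem can be put in numbers as follows. This is a rough sketch under stated assumptions: the function and argument names are hypothetical, the times are taken to be seconds on a common clock (for example, read by hand from the DAQSTATE display and the scan coordinator), and it assumes the DAQ records for the full scan length starting from the moment DAQSTATE switches to running.

```python
def check_early_start(daq_running_s, scan_start_s, scan_length_s, tolerance_s=1.0):
    """Estimate how early the DAQ started and how early it would therefore stop."""
    # Lead time: how long DAQSTATE was "running" before the scan officially began.
    lead_s = scan_start_s - daq_running_s
    # If the DAQ records for scan_length_s from its own start, it finishes
    # before the scan coordinator expects the scan to end.
    daq_end_s = daq_running_s + scan_length_s
    scan_end_s = scan_start_s + scan_length_s
    return {
        "lead_s": lead_s,
        "ends_early_by_s": scan_end_s - daq_end_s,
        "early": lead_s > tolerance_s,
    }
```

With a DAQ start 11 s ahead of the scan start, both `lead_s` and `ends_early_by_s` come out at 11 s, consistent with the 10-12 s early start and early end seen in dual backend mode.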
-- RyanLynch - 2017-10-05
Topic revision: r2 - 2017-10-06, RyanLynch