Code Changes 2/12/2009

  • commented out stream verification messages
  • commented out the "no interrupts" message in PARBuffer::wait. Put back in on 2/13/2009.
  • commented out _pci_reset messages; reproduced the problem without them, so re-enabled them:
    • line 285
    • line 1475
    • line 1594
    • line 2671
  • commented out lines 397 and 256 in DataCollect
  • commented out lines 850 and 291 in Parser.cc
  • changed minPublishSize in the conf file from 2000 to 500.

Things to try on 2/13/2009

  • If CPU utilization drops, parsing has stopped.
  • Something stops working within parsing code.
  • ddd parsing approach:
    1. how many threads are running when the manager is happy? how many are running when parsing stops?
    2. ps | grep ./Rcvr_PAR to get PID. top -p PID then H to see threads.
    3. cd /proc/PID/fd to look at file descriptors. strace -p to look at system calls with file descriptors.
    4. are there interrupts? cat /proc/aa_info
    5. to set breakpoints:
      • ddd Rcvr_PAR & attach PID or execute with ddd...
      • click Interrupt
      • (gdb) break DataCollect.cc:line#
      • click cont
    6. what is the timeout value in PARBuffer::wait() (line 191)? is readyDevCount incrementing (line 250)?
    7. DataCollect::_run: are the sentinels OK (line 255)?
    8. DataCollect::_run: is _destripe returning a valid size (line 262)?
    9. DataCollect::_run: is _parse_destriped_data getting PARPackets from parser.parse_buffer (line 846)? Are there subscribers?
    10. DataCollect::_run: is the buffer being saved? it's just a memcpy after all...
    11. DataCollect::_run: is the data being published (_publish_parsed_data, line 919)?
  • If the parsing is OK, then maybe a subscriber is not processing the data.
  • ddd subscriber approach:
    1. The only PARDataPak subscribers are Rcvr_PARFitsIO (Rcvr_PARFitsIO::write & convertPublishedDataToFitsData, lines 742 and 1058) and DetBias (DetBias.cc Rcvr_PARMgr::processBiasData, line 387)
    2. Is convertPublishedDataToFitsData working (line 754)?
    3. Is processBiasData cleaning up properly?
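
Steps 1 and 3 of the parsing approach (how many threads are running, which file descriptors are open) can also be checked from a small helper instead of top and ls. The following is a minimal sketch, assuming a Linux /proc filesystem; the function names are hypothetical and are not part of Rcvr_PAR.

```cpp
#include <dirent.h>
#include <string>
#include <sys/types.h>
#include <unistd.h>   // pid_t, getpid()

// Count the entries of a directory, skipping "." and "..".
// Returns -1 if the directory cannot be opened (e.g. the PID no longer exists).
int count_entries(const std::string& dir) {
    DIR* d = opendir(dir.c_str());
    if (!d) return -1;
    int n = 0;
    while (dirent* e = readdir(d)) {
        if (e->d_name[0] != '.') ++n;
    }
    closedir(d);
    return n;
}

// On Linux, /proc/<pid>/task holds one subdirectory per thread.
int count_threads(pid_t pid) {
    return count_entries("/proc/" + std::to_string(pid) + "/task");
}

// On Linux, /proc/<pid>/fd holds one entry per open file descriptor.
int count_open_fds(pid_t pid) {
    return count_entries("/proc/" + std::to_string(pid) + "/fd");
}
```

Calling count_threads(PID) once while the manager is happy and again after parsing stops gives the two numbers asked for in step 1; a difference would point at a thread that exited rather than one that is blocked.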

CLEO Info for 2/13/2009

  1. Did DBC scan. Clicked Blast Bias. CPU -> 0%. The PCI Card was reset, many bad frame counter errors. Interrupts were OK.
  2. Couldn't attach with ddd.
  3. Restarted with ddd.
  4. Did DBC scan. Clicked Blast Bias. PCI reset but no bad frame counters. CPU ~ 14%.
  5. Did Default scan. OK
  6. Did DBC scan. OK
  7. Clicked Blast Bias. Bad frame counters. CPU -> 0%. Interrupts were OK. Set break points in PARBuffer & DataCollect but nothing broke.
  8. Set a break point in Cryocontrol - broke as expected and resumed as expected.
  9. Restarted with ddd and verified that breakpoints should work in PARBuffer & DataCollect while parsing worked.
  10. Did DBC scan. Selected Default scanType, CPU -> 0% with bad frame counters.
  11. Did DBC scan. Did Default scan. Clicked BlastBias. No PCI reset. No problems.
  12. Did DBC scan. Clicked Blast Bias. Did Default Scan. Clicked BlastBias. No problems.
  13. Did both scan types, did many conform parameters. No problems. Can't stop parsing. Grrr.
  14. Joe mentioned pthread_cleanup_push. Did a round of testing focusing on threads. Got parsing to stop but all the threads look OK.
  15. Started using strace -p PID where PID is the PID of the parsing thread, not the main program. Could not reproduce the problem.
  16. Summary: Different things cause parsing to stop; it's not the same thing every time. PCI errors are usually involved, but that could be coincidental. Break points not firing suggests a deadlock in PARBuffer, DataCollect or DataPublisher. When the deadlock occurs, all of the parsing functions stop and data subscribers get no data. The only remaining functionality is Ygor parameters and cryogenics, which is why CPU% -> 0. The goal is to use strace to determine where the deadlock is.
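
Item 14 mentions pthread_cleanup_push, and the summary suspects a deadlock. One generic way a lock can be left held forever is a thread being cancelled at a cancellation point while it owns a mutex. The sketch below illustrates that mechanism and how a cleanup handler avoids it. It is an illustration only: whether any Rcvr_PAR thread is cancelled this way is an assumption that these notes do not confirm, and all names (Shared, guarded_worker, unguarded_worker) are hypothetical.

```cpp
#include <pthread.h>
#include <unistd.h>
#include <atomic>

struct Shared {
    pthread_mutex_t lock;
    std::atomic<bool> holding{false};  // set once the worker owns the lock
};

// Cleanup handler: runs if the thread is cancelled between push and pop.
static void unlock_on_cancel(void* m) {
    pthread_mutex_unlock(static_cast<pthread_mutex_t*>(m));
}

// Takes the lock, then blocks in sleep(), which is a cancellation point.
// If cancelled here, the mutex stays locked by a thread that no longer exists.
void* unguarded_worker(void* arg) {
    Shared* s = static_cast<Shared*>(arg);
    pthread_mutex_lock(&s->lock);
    s->holding = true;
    for (;;) sleep(1);
    return nullptr;
}

// Same body, but a cleanup handler releases the mutex on cancellation.
void* guarded_worker(void* arg) {
    Shared* s = static_cast<Shared*>(arg);
    pthread_mutex_lock(&s->lock);
    pthread_cleanup_push(unlock_on_cancel, &s->lock);
    s->holding = true;
    for (;;) sleep(1);
    pthread_cleanup_pop(1);
    return nullptr;
}

// Start the worker, cancel it while it holds the lock, then report whether
// another thread can still acquire the mutex.
bool mutex_released_after_cancel(void* (*body)(void*)) {
    Shared s;
    pthread_mutex_init(&s.lock, nullptr);
    pthread_t t;
    pthread_create(&t, nullptr, body, &s);
    while (!s.holding) usleep(1000);
    pthread_cancel(t);
    pthread_join(t, nullptr);
    bool released = (pthread_mutex_trylock(&s.lock) == 0);
    if (released) pthread_mutex_unlock(&s.lock);
    return released;
}
```

With the guarded worker the mutex can be re-acquired after the cancel; with the unguarded worker it cannot, and any thread doing a blocking lock on it would hang without ever reaching a break point.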

CLEO Info for 2/13/2009

  • cleo -cleodir ~rmaddale/Tcl rcvr_PAR

  • You'll see all the high-level widgets now have blue foreground values.
Blue now indicates widgets that contain 'high-level' values that are internal to that instance of CLEO. The initial values of these widgets come from the first value of the array of whatever parameter the widget is summarizing. They are color coded blue to indicate that they may or may not reflect all of the values of the underlying fields in the arrays.

When you try changing a value, the widget turns magenta to indicate that the value has not yet been passed down to the actual underlying parameter array. (Thus, magenta is used here the same way as for any widget that acts directly on a parameter.) When you hit "Enter", the value of the high-level widget is used to set the underlying fields in the parameter array, and a setParam, regChange, and prepare are executed. The widget then turns back to blue.

From then on, these widgets retain the value last specified by the user of that instance of the CLEO application. These blue widgets Do Not reflect any changes made to the underlying fields by users of DevExplorer or another instance of the CLEO application.

Even these changes required a pretty deep hack of the CLEO infrastructure with possible consequences for the non-Mustang applications. So a release of the changes is extremely risky.
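
The blue/magenta behaviour described above amounts to a small state machine. The following is a minimal sketch of it, not the CLEO implementation (CLEO is Tcl-based; C++ is used here only for illustration). The type and method names are hypothetical, and it assumes that "Enter" writes the widget value into every field of the parameter array.

```cpp
#include <vector>

enum class Color { Blue, Magenta };

struct HighLevelWidget {
    int value;
    Color color;

    // Initial value comes from the first element of the summarized array.
    explicit HighLevelWidget(const std::vector<int>& param)
        : value(param.empty() ? 0 : param.front()), color(Color::Blue) {}

    // User types a new value: not yet passed down, so the widget turns magenta.
    void edit(int v) {
        value = v;
        color = Color::Magenta;
    }

    // User hits Enter: the value is written to the underlying fields
    // (the setParam/regChange/prepare sequence would follow here),
    // and the widget turns back to blue.
    void enter(std::vector<int>& param) {
        for (int& field : param) field = value;
        color = Color::Blue;
    }
};
```

Note that nothing in the widget listens for changes to the array, which mirrors the caveat above: a change made through DevExplorer or another CLEO instance leaves the blue value stale.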

  • CLEO doesn't change border colors directly. Instead, borders change colors when the underlying manager tells CLEO that these parameters have been touched/activated. In this way, the borders indicate changes that other users of CLEO, Astrid, etc. are making to the system. Thus, if you are seeing borders changing for widgets other than the ones you are altering, then it's some interaction within the manager (or something else like Astrid) that is touching these parameters.

Additionally, BlastBias is pretty much a dumb function. It:
  1. stores away the current 8 values for detBias
  2. changes the value of the 8 DetBias fields to 60000
  3. Asks the manager to do a:
    • Rcvr_PAR setParam DetBias
    • Rcvr_PAR regChange
    • Rcvr_PAR prepare
  4. waits 1 sec
  5. Restores the old values and repeats (2)

Step (2) is the same step that is performed every time you modify anything in CLEO (when you have the "Auto Prepare" checkbox selected at the bottom of the application). CLEO's BlastBias is not touching or doing anything explicitly with the other parameters whose borders are turning yellow/green.
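
The BlastBias steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the CLEO code: ManagerLog is a hypothetical stand-in that only records the requests a real manager would receive, and "repeats (2)" is read here as re-sending the setParam/regChange/prepare sequence with the restored values.

```cpp
#include <array>
#include <chrono>
#include <string>
#include <thread>
#include <vector>

using BiasArray = std::array<int, 8>;  // the 8 DetBias fields

// Hypothetical stand-in for the manager: records each request as text.
struct ManagerLog {
    std::vector<std::string> calls;
};

// The three requests sent to the manager after DetBias is changed.
void send_prepare_sequence(ManagerLog& mgr, const BiasArray& bias) {
    mgr.calls.push_back("setParam DetBias " + std::to_string(bias[0]));
    mgr.calls.push_back("regChange");
    mgr.calls.push_back("prepare");
}

void blast_bias(ManagerLog& mgr, BiasArray& det_bias,
                std::chrono::milliseconds hold = std::chrono::seconds(1)) {
    const BiasArray saved = det_bias;      // 1. store the current 8 values
    det_bias.fill(60000);                  // 2. set all 8 fields to 60000
    send_prepare_sequence(mgr, det_bias);  // 3. setParam, regChange, prepare
    std::this_thread::sleep_for(hold);     // 4. wait (1 s by default)
    det_bias = saved;                      // 5. restore the old values...
    send_prepare_sequence(mgr, det_bias);  //    ...and send them down again
}
```

In this reading, one click produces two back-to-back prepare sequences about a second apart, which is the kind of rapid reconfiguration exercised in the tests above.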

CLEO Info for 2/17/2009

  • Added cout's to PARPublisher for _par_data_fifo and _par_coadd_fifo <<REMOVED>>
  • Set debug_parse_config_data to 1 in Parser.cc.
  • Did DetBiasScan, set scanType to Default, CLICKED PREPARE, saw lots of parameter output, config data output shows garbage config data.
  • Commented out conform in Rcvr_PARMgr.cc (line 1118). Could not reproduce parsing problem.
  • Included conform. It gets set after a DBC has been collected but does not get activated until the Prepare button is clicked. Saw problem.
  • Commented out conform and added comment indicating a conform should not be executed.
  • Set debug_parse_config_data back to 0 in Parser.cc. Retested since debug output could affect timing.
  • Fixed bug with DBC txt file names.

-- MarkWhitehead - 2009-10-29
Topic revision: r1 - 2009-10-29, MarkWhitehead