Az/El System Watchdogs and Failure Scenarios

Requirements

CCU Task Overrun Checks

  • The CCU task sequencer shall monitor its controlled tasks for overrun conditions. The tolerance for overruns shall be configurable (TBD).
  • The CCU task sequencer shall be able to clearly identify the overrunning task.
  • The CCU sequencer tasks shall be verified to be running by a check performed periodically in the IRIG interrupt handler.
  • The IRIG driver shall enforce an 'exclusive' (one-at-a-time) device open policy. (This prevents inadvertently running more than one CCU process.)
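
The overrun requirements above can be sketched as a small model. This is only an illustration, not the CCU implementation: the names `TaskRecord`, `Sequencer`, `report_cycle`, the task name `az_servo`, and the choice to count consecutive overruns against the tolerance are all assumptions made here, since the actual tolerance rule is still TBD.

```python
from dataclasses import dataclass, field


@dataclass
class TaskRecord:
    """Bookkeeping for one sequenced task (hypothetical structure)."""
    name: str
    deadline_s: float      # time budget per cycle, in seconds
    overruns: int = 0      # consecutive overruns seen so far


@dataclass
class Sequencer:
    """Toy model of a task sequencer with a configurable overrun tolerance."""
    tolerance: int         # consecutive overruns allowed (assumed meaning of TBD)
    tasks: dict = field(default_factory=dict)

    def add_task(self, name, deadline_s):
        self.tasks[name] = TaskRecord(name, deadline_s)

    def report_cycle(self, name, elapsed_s):
        """Record one cycle; return the task name if it exceeded the tolerance."""
        task = self.tasks[name]
        if elapsed_s > task.deadline_s:
            task.overruns += 1
        else:
            task.overruns = 0          # a good cycle clears the count
        return name if task.overruns > self.tolerance else None


# Usage: a task with a 10 ms budget overruns three cycles in a row.
seq = Sequencer(tolerance=2)
seq.add_task("az_servo", deadline_s=0.010)
results = [seq.report_cycle("az_servo", t) for t in (0.012, 0.011, 0.013)]
```

Returning the task name rather than a bare flag is one way to meet the requirement that the overrunning task be clearly identified.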

PLC Watching CCU

The PLC periodically checks for CCU updates to a 32-bit register at V3000 and for activity on a bit register at PLC address C301. If activity on either register ceases for more than TBD seconds, the PLC watchdog shall expire, disabling any active axes through a brake-sequenced stop. The CCU shall update the register at V3000 at least once a second (every 500 ms typical) with a new value (e.g. a counter or 32-bit millisecond value). The CCU shall toggle the register at C301 at least once a second (every 200 ms typical).
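
A rough model of the PLC-side check may help when choosing the timeout. The sketch below is plain Python, not PLC ladder logic. The class name `PlcWatchdog`, the `scan` method, and the 2.0 second timeout used in the example are assumptions; the real timeout is TBD.

```python
class PlcWatchdog:
    """Toy model of the PLC check on the V3000 value and the C301 bit."""

    def __init__(self, timeout_s, now_s=0.0):
        self.timeout_s = timeout_s          # TBD in the requirement
        self.last_v3000 = None
        self.last_c301 = None
        self.v3000_changed_at = now_s
        self.c301_changed_at = now_s

    def scan(self, v3000, c301, now_s):
        """One PLC scan; return True if the watchdog has expired."""
        if v3000 != self.last_v3000:
            self.last_v3000 = v3000
            self.v3000_changed_at = now_s
        if c301 != self.last_c301:
            self.last_c301 = c301
            self.c301_changed_at = now_s
        # Activity on EITHER register stopping is enough to expire.
        stalled_s = max(now_s - self.v3000_changed_at,
                        now_s - self.c301_changed_at)
        return stalled_s > self.timeout_s


# Usage: V3000 keeps counting, but C301 stops toggling after t = 1.0 s.
wd = PlcWatchdog(timeout_s=2.0)             # placeholder timeout
history = [
    wd.scan(1, True, 0.5),
    wd.scan(2, False, 1.0),
    wd.scan(3, False, 2.0),
    wd.scan(4, False, 3.5),
]
```

In this model a stalled C301 bit trips the watchdog even though V3000 is still updating, which is the case where some CCU tasks run and others do not.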

CCU Watching ACU

The CCU checks for updates to a 'deadman' token which is passed from ACU to the CCU, incremented, and then sent back with the next request for status. If the activity ceases (no updates for a period of 10 seconds) while the ACU is in control, the CCU will reject control and command any active axes to disable using a servoed-stop.
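
The CCU side of the deadman check could look roughly like the following. It assumes the CCU only needs to see the token value change between status requests; which side performs the increment is not modelled. `DeadmanMonitor`, `on_request`, and `expired` are hypothetical names.

```python
class DeadmanMonitor:
    """Toy model of the CCU watching a deadman token from a remote unit."""

    def __init__(self, timeout_s, now_s=0.0):
        self.timeout_s = timeout_s
        self.last_token = None
        self.last_update_s = now_s

    def on_request(self, token, now_s):
        """Called when a status request arrives carrying the token."""
        if token != self.last_token:        # only a changed token counts
            self.last_token = token
            self.last_update_s = now_s

    def expired(self, now_s, remote_in_control):
        """True if the remote is in control and the token has gone stale."""
        stale = (now_s - self.last_update_s) >= self.timeout_s
        return remote_in_control and stale


# Usage: the ACU sends two updates, then goes quiet.
acu = DeadmanMonitor(timeout_s=10.0)
acu.on_request(token=1, now_s=1.0)
acu.on_request(token=2, now_s=2.0)
still_ok = acu.expired(now_s=11.0, remote_in_control=True)
tripped = acu.expired(now_s=12.5, remote_in_control=True)
ignored = acu.expired(now_s=12.5, remote_in_control=False)
```

When `expired` returns True, the CCU would reject control and command a servoed-stop. The same monitor with a 5 second timeout would fit the OCU link.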

CCU Watching OCU

The CCU checks for updates to a 'deadman' token which is passed from OCU to the CCU, incremented, and then sent back with the next request for status. If the activity ceases (no updates for a period of 5 seconds) while the OCU is in control, the CCU will reject control and command any active axes to disable using a servoed-stop.

CCU Watching MCI/PEI

Each EtherCAT message sent by the CCU includes a sequence number. The PEI/MCIs signal a successfully completed I/O cycle by returning the most recent command sequence number with the feedback data.
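
The echo check can be illustrated with a few lines. This is a sketch only: the class `SequenceChecker`, its method names, and the 16-bit counter width are assumptions, and no real EtherCAT frame layout is implied.

```python
class SequenceChecker:
    """Toy model of the CCU checking echoed command sequence numbers."""

    def __init__(self, modulo=2 ** 16):     # counter width is an assumption
        self.modulo = modulo
        self.next_seq = 0
        self.last_sent = None

    def send(self):
        """Return the sequence number to place in the next command."""
        self.last_sent = self.next_seq
        self.next_seq = (self.next_seq + 1) % self.modulo
        return self.last_sent

    def feedback_ok(self, echoed_seq):
        """True if the feedback echoes the most recent command number."""
        return echoed_seq == self.last_sent


# Usage: the first cycle completes, the second returns a stale echo.
chk = SequenceChecker()
first = chk.send()
first_ok = chk.feedback_ok(first)
chk.send()
second_ok = chk.feedback_ok(first)          # device still echoing the old number
```

A stale echo means the PEI/MCI did not complete the latest I/O cycle. How many stale cycles to tolerate before declaring a fault is not specified here.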

PLC Watching MCI

The PLC monitors the PEI/MCI watchdog status and MC Inhibit lines from the MCI. This provides a cross-check to verify that the embedded processor watchdog has not fired and that the inhibit line is not active.

MCI Watching CCU

The MCI monitors EtherCAT activity. If EtherCAT activity ceases, a timeout (currently set at ~40 ms) expires and the MCI zeroes the armature current DAC outputs. (Does MC_Inhibit become asserted? Is the PEI connected to the PLC in a similar way?)
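
The timeout behaviour can be modelled as below. The class `MciCommandTimeout` and its methods are hypothetical, and the sketch deliberately leaves out MC_Inhibit because its behaviour is still an open question.

```python
class MciCommandTimeout:
    """Toy model of the MCI zeroing its DAC output when commands stop."""

    def __init__(self, timeout_s=0.040, now_s=0.0):
        self.timeout_s = timeout_s          # ~40 ms per the text above
        self.last_frame_s = now_s
        self.dac_output = 0.0

    def on_frame(self, current_cmd, now_s):
        """A command frame arrived: latch the value and restart the timer."""
        self.last_frame_s = now_s
        self.dac_output = current_cmd

    def tick(self, now_s):
        """Periodic check; zero the output if no frame arrived in time."""
        if now_s - self.last_frame_s > self.timeout_s:
            self.dac_output = 0.0
        return self.dac_output


# Usage: one frame at 10 ms, then silence.
mci = MciCommandTimeout()
mci.on_frame(current_cmd=1.5, now_s=0.010)
before = mci.tick(now_s=0.030)              # 20 ms since last frame
after = mci.tick(now_s=0.060)               # 50 ms since last frame
```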

Failure Modes

CCU Host Software Failures

Thread(s) hang accessing a resource or deadlock conditions

Sequencers can detect overrunning threads. If a sequencer thread deadlocks, the event ISR should detect it. What can/should the ISR check do if it fails?

Death Grip: A thread starts running in a tight loop

Depending upon the task's base priority, this may be caught by either the sequencer checks or the ISR check.

Thread(s) exit prematurely

I think it would cause effects similar to deadlocks/overruns, but with a more permanent status.

Segmentation violation, bus error, abort, out of memory, memory error etc.

It is likely that the process would exit, which in itself is probably OK, as it would trigger the 'PLC watching CCU' watchdog. (The condition we must guard against is having some tasks running (satisfying watchdogs) and other tasks not doing their part.)

OS-Kernel context 'OOPS' (in kernel Seg-V)

A kernel OOPS sometimes takes out the calling process, although sometimes a system freeze will result.

Duplicate CCU Process Running

One easy method to prevent this is to implement the 'exclusive open' policy in the IRIG driver. (Used elsewhere with good results.)
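
The exclusive-open idea amounts to a single ownership slot in the driver. The sketch below is a user-space model in Python, not driver code. `ExclusiveDevice` and the process IDs are hypothetical, and returning `EBUSY` to the second opener is an assumption about how the driver would report the refusal.

```python
import errno


class ExclusiveDevice:
    """Toy model of a driver that allows only one open at a time."""

    def __init__(self):
        self._owner = None                  # PID currently holding the device

    def open(self, pid):
        if self._owner is not None:
            # A second opener is refused (assumed errno: EBUSY).
            raise OSError(errno.EBUSY, "device already open")
        self._owner = pid

    def close(self, pid):
        if self._owner == pid:              # only the owner can release it
            self._owner = None


# Usage: a second CCU process fails to open while the first holds the device.
dev = ExclusiveDevice()
dev.open(pid=100)
try:
    dev.open(pid=200)
    second_errno = None
except OSError as exc:
    second_errno = exc.errno
dev.close(pid=100)
dev.open(pid=200)                           # succeeds once the first has closed
```

A duplicate CCU process would then fail at startup instead of running alongside the first.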

System Misconfigured

  • Sim vs. Real EtherCAT (e.g. using the real PLC with simulated ECAT or vice versa)
  • Incorrect PLC host (Modbus/PLC interface)

Probably the worst case would be the CCU talking to the real PLC (satisfying the 'PLC watching CCU' watchdog) but with one or both EtherCAT modules in simulation mode. The CCU would be happy until the axes were enabled. Somewhere down the line, the PLC would note that the system is enabled, but most/all of the MCs are not enabled. [Seems to me that the MCIs should be toggling a line to the PLC in response to EtherCAT activity.]

CCU Host Hardware Failures

  • Memory Error
  • Processor Failure
  • Processor Thermal Slowdown
    • I don't know if the OS will catch this first and attempt a system shutdown or if the system just slows down. If it just slows down to the point where tasks are overrunning, the usual monitors should catch it. If the system does a shutdown, the 'PLC watching CCU' watchdog should shut the telescope down. A third possibility is that the OS doesn't detect it, and tasks don't overrun (so the system is not impacted).
  • Host Component Failure (e.g. processor support chips, bus, disk cable, disk drive, cooling fans, power supply)
    • Disk failure. This is one mode where almost nothing can be trusted. In failure modes where disks slowly fail by requiring a number of 'retries', I/O operations take more and more time. Buffers fill up (both in programs and in the OS), further draining system resources.

Inter-system Link Failures

  • CCU-AZ MCI/PEI EtherCAT
  • CCU-EL MCI/PEI EtherCAT
  • PLC-CCU Ethernet
  • CCU-OCU Ethernet
  • CCU-ACU Ethernet
  • PLC-MCI Digital signals
  • CCU-IRIG Clock signal

PLC Failure

  • PLC component loss (I/O module, Network module)
  • PLC CPU loss
  • Power supply loss
  • Processor watchdog/reboot (due to PLC error)

MCI Failure

  • MC fuse blown
  • MCI supply loss
  • Processor watchdog/reboot (due to MCI wd or reset)

PEI/Encoder Failure

  • Encoder power loss
  • PEI supply loss
  • Processor watchdog/reboot (due to PEI wd or reset)

Summary of Failures vs. Detection Mechanisms

| Component Class | Failure | Detection Mechanism |
| CCU Host SW | One or more thread(s) hang accessing a resource or thread deadlock conditions | CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | A thread starts running in a tight loop (Death grip) | CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | A thread exits prematurely | CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | Segmentation violation (wild pointer, or memory error) | CCU Host OS, CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | Kernel-context segmentation violation (wild pointer, or memory error) | CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | Duplicate CCU process running | ??, CCU-internal, ... |
| CCU Host SW | System misconfigured (e.g. real PLC with Sim ECAT) | ?? |
| CCU Host SW | System crash | PLC watching CCU |
| CCU Host HW | Memory Error (non-catastrophic) | ?? |
| CCU Host HW | Processor/Total System Failure | PLC watching CCU |
| CCU Host HW | Processor thermal slow-down, (slow) disk drive failure | CCU-overrun, CCU-seqchk, PLC... |
| CCU Host HW | IRIG card failure | CCU-internal |
| Inter-system Link | AZ MCI/PEI ring broken, link status change, MCI/PEI/COMX reset/failure | CCU-internal, CCU-overrun, CCU-seqchk, PLC watching MCI/PEI |
| Inter-system Link | EL MCI/PEI ring broken, link status change, MCI/PEI/COMX reset/failure | CCU-internal, CCU-overrun, CCU-seqchk, PLC watching MCI/PEI |
| Inter-system Link | CCU IRIG loss | CCU-internal |
| Inter-system Link | CCU ACU link down | CCU watching ACU |
| Inter-system Link | CCU OCU link down | CCU watching OCU |
| Inter-system Link | PLC-MCI digital signal loss | PLC watching MCI/PEI |

Detection Mechanisms:
  • CCU-internal - Program logic (e.g. Fault Analysis, basic return code checks etc.)
  • CCU-overrun - Sequencers check coordinated threads
  • CCU-seqchk - Interrupt level check on sequencer threads
  • PLC watching CCU - PLC deadman
  • PLC watching MCI/PEI
  • CCU watching PLC -
  • CCU watching MCI - Sequence numbers, link status
  • CCU watching PEI - Sequence numbers, link status
  • CCU watching ACU - Mlink check
  • CCU watching OCU - Olink check
  • MCI watching CCU - No command received timeout
  • CCU Host OS - [Usually results in process termination]

Available Recovery Actions:
  1. log event
  2. display event
  3. disable axis (soft-stop)
  4. disable axis (brake-stop)
  5. disable both axes (soft-stop) and reject control
  6. disable both axes (brake-stop) and reject control
  7. halt updating PLC WD
  8. exit process
  9. reboot system

How to test any of this?

  • Insert test code which overruns (on purpose) every N cycles
  • Can we use the scheduling freezer? (taskSuspend() equivalent)
  • Pulling ECAT/ENET cables (CCU-PLC, PLC-OCU, AZECAT, ELECAT, CCU-ACU)
  • Test code or suspend sequencer thread(s)
  • Test code in rtdm module to cause kernel OOPS
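
The first test idea, code that overruns on purpose every N cycles, could be sketched as follows. The helper `make_overrun_injector` and its parameters are hypothetical. The sketch only produces simulated run times; real test code would busy-wait or sleep for the returned duration inside the task under test.

```python
def make_overrun_injector(every_n, normal_s, overrun_s):
    """Return a function giving the simulated run time for each cycle.

    Every `every_n`-th call returns `overrun_s`; all others return `normal_s`.
    """
    count = 0

    def next_duration():
        nonlocal count
        count += 1
        return overrun_s if count % every_n == 0 else normal_s

    return next_duration


# Usage: a 2 ms task that takes 20 ms on every 4th cycle.
inject = make_overrun_injector(every_n=4, normal_s=0.002, overrun_s=0.020)
durations = [inject() for _ in range(8)]
```

Making N and the overrun length parameters lets the same hook exercise both the sequencer's overrun tolerance and the interrupt-level check.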

-- JoeBrandt - 2011-05-18
Topic revision: r3 - 2011-05-19, JoeBrandt