Az/El System Watchdogs and Failure Scenarios
Requirements
CCU Task Overrun Checks
- The CCU task sequencer shall monitor its controlled tasks for overrun conditions. The tolerance for overruns shall be configurable (TBD).
- The CCU task sequencer shall be able to clearly identify the overrunning task.
- The CCU sequencer tasks shall be verified to be running by a check performed periodically in the IRIG interrupt handler.
- The IRIG driver shall enforce an 'exclusive' (one-at-a-time) device open policy. (This prevents inadvertently running more than one CCU process.)
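The overrun requirements above could be sketched roughly as follows. This is only an illustration in Python, not the CCU implementation: the class names, the per-task time budget, the task names and the "consecutive overruns" interpretation of the tolerance are all assumptions, since the actual tolerance is still TBD.

```python
from dataclasses import dataclass


@dataclass
class TaskRecord:
    """Per-task bookkeeping kept by the (hypothetical) sequencer."""
    name: str
    budget_ms: float              # time the task is allowed per cycle (assumed)
    consecutive_overruns: int = 0
    total_overruns: int = 0


class OverrunMonitor:
    """Flags a task once it overruns more than `tolerance` cycles in a row."""

    def __init__(self, tolerance):
        self.tolerance = tolerance    # configurable; the real value is TBD
        self.tasks = {}

    def register(self, name, budget_ms):
        self.tasks[name] = TaskRecord(name, budget_ms)

    def report(self, name, elapsed_ms):
        """Record one completed cycle; return True if the task is now in fault."""
        rec = self.tasks[name]
        if elapsed_ms > rec.budget_ms:
            rec.consecutive_overruns += 1
            rec.total_overruns += 1
        else:
            rec.consecutive_overruns = 0
        return rec.consecutive_overruns > self.tolerance

    def offenders(self):
        """Names of the tasks currently beyond the overrun tolerance."""
        return [r.name for r in self.tasks.values()
                if r.consecutive_overruns > self.tolerance]
```

Keeping the counters per task is what lets the sequencer name the overrunning task rather than only report that some overrun occurred.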
PLC Watching CCU
The PLC periodically checks for CCU updates to a 32-bit register at V3000 and for activity on a bit register at PLC address C301. If activity on either register ceases for more than TBD seconds, the PLC watchdog shall expire, disabling any active axes through a brake-sequenced stop. The CCU shall update the register at V3000 at least once a second (every 500 ms typical) with a new value (e.g. a counter or a 32-bit millisecond value). The CCU shall toggle the bit at C301 at least once a second (every 200 ms typical).
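The PLC-side logic described above might look something like the sketch below. It is written in Python purely for illustration (the real logic would live in the PLC program); the class and method names are hypothetical, the timeout is a constructor argument because the real value is TBD, and latching the expired state is an assumption.

```python
class PlcWatchdog:
    """Models the 'PLC watching CCU' check on V3000 (counter) and C301 (bit)."""

    def __init__(self, timeout_s, now_s=0.0):
        self.timeout_s = timeout_s          # TBD in the requirements
        self.last_v3000 = None
        self.last_c301 = None
        self.v3000_changed_at = now_s
        self.c301_changed_at = now_s
        self.expired = False

    def scan(self, now_s, v3000, c301):
        """One PLC scan. Returns True once the watchdog has expired."""
        if v3000 != self.last_v3000:
            self.last_v3000 = v3000
            self.v3000_changed_at = now_s
        if c301 != self.last_c301:
            self.last_c301 = c301
            self.c301_changed_at = now_s
        # Activity on EITHER register ceasing is enough to expire.
        stale_s = max(now_s - self.v3000_changed_at,
                      now_s - self.c301_changed_at)
        if stale_s > self.timeout_s:
            self.expired = True             # assumed latched; the PLC would
                                            # start the brake-sequenced stop
        return self.expired
```

Checking both registers independently covers the case where one CCU task keeps its register alive while the task behind the other register has stopped.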
CCU Watching ACU
The CCU checks for updates to a 'deadman' token which is passed from ACU to the CCU, incremented, and then sent back with the next request for status. If the activity ceases (no updates for a period of 10 seconds) while the ACU is in control, the CCU will reject control and command any active axes to disable using a servoed-stop.
CCU Watching OCU
The CCU checks for updates to a 'deadman' token which is passed from the OCU to the CCU, incremented, and then sent back with the next request for status. If the activity ceases (no updates for a period of 5 seconds) while the OCU is in control, the CCU will reject control and command any active axes to disable using a servoed-stop.
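The deadman-token exchange used for both the ACU and the OCU can be sketched as below. This is a hedged illustration only: it assumes the CCU increments the token it receives and expects the controller to echo that incremented value with its next status request, and it assumes a 32-bit token. The class and method names are hypothetical.

```python
class DeadmanMonitor:
    """CCU-side view of the deadman token for one controller (ACU or OCU)."""

    def __init__(self, timeout_s, now_s=0.0):
        self.timeout_s = timeout_s      # 10 s for the ACU, 5 s for the OCU
        self.expected = None            # token the controller should send next
        self.last_update_s = now_s

    def on_status_request(self, now_s, token):
        """Handle a status request; return the token to send in the reply."""
        if self.expected is None or token == self.expected:
            self.last_update_s = now_s  # the controller is keeping up
        # Increment and hand back; 32-bit wrap is an assumption.
        self.expected = (token + 1) & 0xFFFFFFFF
        return self.expected

    def timed_out(self, now_s):
        """True when no token update has been seen within the timeout."""
        return now_s - self.last_update_s > self.timeout_s
```

When `timed_out()` returns true while that controller is in control, the CCU would reject control and command a servoed-stop, as described above. A controller that keeps resending a stale token never refreshes `last_update_s`, so a hung peer with a live network stack is still caught.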
CCU Watching MCI/PEI
Each EtherCAT message sent by the CCU includes a sequence number. The PEI/MCIs signify a successfully completed I/O cycle by returning the most recent command sequence number with the feedback data.
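A minimal sketch of the CCU side of this sequence-number check is shown below. The 16-bit width of the sequence number, the missed-cycle counter and all names are assumptions made for the example, not details of the actual EtherCAT frame layout.

```python
class SequenceTracker:
    """Tracks the command sequence number echoed back by one PEI/MCI."""

    def __init__(self):
        self.next_seq = 0
        self.last_sent = None
        self.missed = 0                 # consecutive cycles without a valid echo

    def next_command(self):
        """Return the sequence number to place in the outgoing command."""
        seq = self.next_seq
        self.next_seq = (seq + 1) & 0xFFFF   # 16-bit wrap is an assumption
        self.last_sent = seq
        return seq

    def check_feedback(self, echoed_seq):
        """True if the feedback echoes the most recent command number."""
        ok = echoed_seq == self.last_sent
        self.missed = 0 if ok else self.missed + 1
        return ok
```

Counting consecutive misses, rather than faulting on the first one, is a design choice left open here; the source does not state how many missed cycles are tolerated.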
PLC watching MCI
The PLC monitors the PEI/MCI watchdog status and MC Inhibit lines from the MCI. This provides a cross-check that the embedded processor watchdog has not fired and that the inhibit line is not active.
MCI watching CCU
The MCI monitors EtherCAT activity. If EtherCAT activity ceases, a timeout (currently set at ~40 ms) expires and the MCI zeros the armature current DAC outputs. (Does MC_Inhibit become asserted? Is the PEI connected to the PLC in a similar way?)
Failure Modes
CCU Host Software Failures
Thread(s) hang accessing a resource or deadlock conditions
Sequencers can detect overrunning threads. If a sequencer thread deadlocks, the event ISR should detect it.
What can/should the ISR check do if it fails?
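One way the interrupt-level check could detect a stalled sequencer is a per-sequencer heartbeat counter that the ISR compares against the value it saw on the previous interrupt. The sketch below illustrates that idea in Python; the names, the sequencer names and the "N interrupts without progress" threshold are assumptions, and the real check would run inside the IRIG interrupt handler rather than in Python.

```python
class SequencerLivenessCheck:
    """Heartbeat comparison performed once per (simulated) IRIG interrupt."""

    def __init__(self, max_stalled_ticks):
        self.max_stalled_ticks = max_stalled_ticks   # assumed threshold
        self.heartbeats = {}    # name -> counter bumped by the sequencer
        self._seen = {}         # name -> counter value at the last interrupt
        self._stalled = {}      # name -> interrupts without progress

    def register(self, name):
        self.heartbeats[name] = 0
        self._seen[name] = 0
        self._stalled[name] = 0

    def beat(self, name):
        """Called by a sequencer thread each time it completes a cycle."""
        self.heartbeats[name] += 1

    def on_interrupt(self):
        """Called from the interrupt handler; returns the stalled sequencers."""
        stalled = []
        for name, count in self.heartbeats.items():
            if count == self._seen[name]:
                self._stalled[name] += 1
            else:
                self._seen[name] = count
                self._stalled[name] = 0
            if self._stalled[name] >= self.max_stalled_ticks:
                stalled.append(name)
        return stalled
```

What the handler does with a non-empty result is the open question above; halting the PLC watchdog updates is one of the recovery actions it could trigger.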
Death Grip: A thread starts running in a tight loop
Depending upon the task's base priority, this may be caught by either the sequencer checks or the ISR check.
Thread(s) exit prematurely
I think it would cause effects similar to deadlocks/overruns, but with a more permanent status.
Segmentation violation, bus error, abort, out of memory, memory error etc.
It is likely that the process would exit, which in itself is probably OK, as it would trigger the 'PLC watching CCU' watchdog. (The condition we must guard against is having some tasks running and satisfying the watchdogs while other tasks are not doing their part.)
OS-Kernel context 'OOPS' (in kernel Seg-V)
A kernel OOPS sometimes takes out the calling process, although sometimes a system freeze will result.
Duplicate CCU Process Running
One easy method to prevent this is to implement the 'exclusive open' policy in the IRIG driver. (Used elsewhere with good results.)
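The 'exclusive open' policy amounts to an atomic test-and-set in the driver's open path: the first opener succeeds and every later open fails until the device is closed. The Python model below only illustrates the behaviour; it is not the IRIG driver, and the class name, owner labels and the choice of EBUSY as the error are assumptions.

```python
import errno
import threading


class ExclusiveDevice:
    """User-space model of a device that allows only one open at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def open(self, owner):
        # Non-blocking acquire acts as the atomic test-and-set.
        if not self._lock.acquire(blocking=False):
            raise OSError(errno.EBUSY, "device already open")
        self.owner = owner

    def close(self):
        self.owner = None
        self._lock.release()
```

A second CCU process would then fail at start-up when it tries to open the device, before it can touch the PLC or EtherCAT interfaces.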
- Sim vs. Real EtherCAT (e.g. using real PLC to run with Simulated ECAT or vice versa)
- Incorrect PLC host (Modbus/PLC interface)
Probably the worst case would be the CCU talking to the real PLC (satisfying the 'PLC watching CCU' watchdog) but with one or both EtherCAT modules in simulation mode. The CCU would be happy until the axes were enabled. Somewhere down the line, the PLC would note that the system is enabled, but most/all of the MCs are not enabled. [Seems to me that the MCIs should be toggling a line to the PLC in response to EtherCAT activity.]
CCU Host Hardware Failures
- Memory Error
- Processor Failure
- Processor Thermal Slowdown
- I don't know if the OS will catch this first and attempt a system shutdown or if the system just slows down. If it just slows down to the point where tasks are overrunning, the usual monitors should catch it. If the system does a shutdown, the 'PLC watching CCU' watchdog should shut the telescope down. A third possibility is that the OS doesn't detect it, and tasks don't overrun (so the system is not impacted).
- Host Component Failure (e.g. processor support chips, bus, disk cable, disk drive, cooling fans, power supply)
- Disk failure. This is one mode where almost nothing can be trusted. When a disk fails slowly by requiring a growing number of 'retries', I/O operations take more and more time. Buffers fill up (both in programs and in the OS), further draining system resources.
Inter-System Link Failures
- CCU-AZ MCI/PEI EtherCAT
- CCU-EL MCI/PEI EtherCAT
- PLC-CCU Ethernet
- CCU-OCU Ethernet
- CCU-ACU Ethernet
- PLC-MCI Digital signals
- CCU-IRIG Clock signal
PLC Failure
- PLC component loss (I/O module, Network module)
- PLC CPU loss
- Power supply loss
- Processor watchdog/reboot (due to PLC error)
MCI Failure
- MC fuse blown
- MCI supply loss
- Processor watchdog/reboot (due to MCI wd or reset)
PEI/Encoder Failure
- Encoder power loss
- PEI supply loss
- Processor watchdog/reboot (due to PEI wd or reset)
Summary of Failures vs. Detection Mechanisms
| Component Class | Failure | Detection Mechanism |
| CCU Host SW | One or more thread(s) hang accessing a resource or thread deadlock conditions | CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | A thread starts running in a tight loop (Death grip) | CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | A thread exits prematurely | CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | Segmentation violation (wild pointer, or memory error) | CCU Host OS, CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | Kernel-context Segmentation violation (wild pointer, or memory error) | CCU-overrun, CCU-seqchk, ... |
| CCU Host SW | Duplicate CCU process running | ??, CCU-internal, ... |
| CCU Host SW | System misconfigured (e.g. real PLC with Sim ECAT) | ?? |
| CCU Host SW | System crash | PLC watching CCU |
| CCU Host HW | Memory Error (non-catastrophic) | ?? |
| CCU Host HW | Processor/Total System Failure | PLC watching CCU |
| CCU Host HW | Processor thermal slow-down, (slow) disk drive failure | CCU-overrun, CCU-seqchk, PLC... |
| CCU Host HW | IRIG card failure | CCU-internal |
| Inter-system Link | AZ MCI/PEI ring broken, link status change, MCI/PEI/COMX reset/failure | CCU-internal, CCU-overrun, CCU-seqchk, PLC watching MCI/PEI |
| Inter-system Link | EL MCI/PEI ring broken, link status change, MCI/PEI/COMX reset/failure | CCU-internal, CCU-overrun, CCU-seqchk, PLC watching MCI/PEI |
| Inter-system Link | CCU IRIG loss | CCU-internal |
| Inter-system Link | CCU ACU link down | CCU watching ACU |
| Inter-system Link | CCU OCU link down | CCU watching OCU |
| Inter-system Link | PLC-MCI digital signal loss | PLC watching MCI/PEI |
Detection Mechanisms:
- CCU-internal - Program logic (e.g. Fault Analysis, basic return code checks etc.)
- CCU-overrun - Sequencers check coordinated threads
- CCU-seqchk - Interrupt level check on sequencer threads
- PLC watching CCU - PLC deadman
- PLC watching MCI/PEI
- CCU watching PLC -
- CCU watching MCI - Sequence numbers, link status
- CCU watching PEI - Sequence numbers, link status
- CCU watching ACU - Mlink check
- CCU watching OCU - Olink check
- MCI watching CCU - No command received timeout
- CCU Host OS - [Usually results in process termination]
Available Recovery Actions:
- log event
- display event
- disable axis (soft-stop)
- disable axis (brake-stop)
- disable both axes (soft-stop) and reject control
- disable both axes (brake-stop) and reject control
- halt updating PLC WD
- exit process
- reboot system
How to test any of this?
- Insert test code which overruns (on purpose) every N cycles
- Can we use the scheduling freezer? (taskSuspend() equivalent)
- Pulling ECAT/ENET cables (CCU-PLC, PLC-OCU, AZECAT, ELECAT, CCU-ACU)
- Test code or suspend sequencer thread(s)
- Test code in rtdm module to cause kernel OOPS
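The first test idea above (a task that overruns on purpose every N cycles) could be done with a small wrapper around the task body, roughly as sketched here. This is an illustration under assumptions: the function name, its parameters and the injectable `sleep` hook are hypothetical, and in the real CCU the delay would more likely be a busy-wait or a blocking call inside the task.

```python
import time


def make_overrunning_task(task, every_n, extra_s, sleep=time.sleep):
    """Wrap `task` so that every `every_n`-th call is delayed by `extra_s`."""
    cycle = 0

    def wrapped(*args, **kwargs):
        nonlocal cycle
        cycle += 1
        result = task(*args, **kwargs)
        if cycle % every_n == 0:
            sleep(extra_s)      # deliberately blow the cycle budget
        return result

    return wrapped
```

Passing a fake `sleep` lets the injection schedule itself be unit-tested without waiting, while the default `time.sleep` produces a real overrun for the sequencer checks to catch.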
--
JoeBrandt - 2011-05-18