Nagios

Nagios is a watchdog daemon that can be configured to check various properties of a computer system and to issue warnings or alarms when one of those properties goes out of spec. Among the things it can check are free disk space, CPU load, and the ability to ping other hosts.
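
As an illustration of what such a check looks like, here is a minimal sketch of a disk-space check in Python. It follows the standard Nagios plugin convention: one status line on stdout and an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The function names and the 20%/10% thresholds are assumptions made for this example and are not the checks actually configured on this system.

```python
import shutil

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_free_space(free_pct, warn_pct=20.0, crit_pct=10.0):
    """Map a free-space percentage to an (exit_code, label) pair.

    The thresholds are illustrative defaults, not site policy.
    """
    if free_pct < crit_pct:
        return CRITICAL, "CRITICAL"
    if free_pct < warn_pct:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/", warn_pct=20.0, crit_pct=10.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, f"DISK UNKNOWN - cannot stat {path}: {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero size"
    free_pct = 100.0 * usage.free / usage.total
    code, label = classify_free_space(free_pct, warn_pct, crit_pct)
    return code, f"DISK {label} - {free_pct:.1f}% free on {path}"
```

A real plugin would print the status line and then call sys.exit() with the returned code, since Nagios reads the result from the process exit status.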

Nagios is installed in /usr/local/nagios. The main configuration files are located below that directory in etc/objects.

The Nagios daemon on server-1 can also run checks on other hosts via NRPE. Each SWC runs a systemd nrpe service, which executes requests from the Nagios daemon on the server, and the SWC firewall allows the "nrpe" service through. Some fairly simple checks are run on each host.
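
For manual testing from the server, a remote check can be requested with the check_nrpe plugin. The sketch below only shows the general shape of such a call. The plugin path (libexec under the install directory), the host name "swc-001" and the command name "check_load" are assumptions for illustration; the real command names are whatever is defined in each SWC's nrpe configuration.

```python
import subprocess

# Assumed location of the check_nrpe plugin; adjust to the actual install.
CHECK_NRPE = "/usr/local/nagios/libexec/check_nrpe"


def nrpe_argv(host, command, port=5666, timeout=10):
    """Build the argument list for one check_nrpe request.

    -H is the remote host, -p the NRPE port (5666 is the usual default),
    -t the timeout in seconds, -c the command name defined on the remote side.
    """
    return [CHECK_NRPE, "-H", host, "-p", str(port),
            "-t", str(timeout), "-c", command]


def run_nrpe_check(host, command):
    """Run the check and return (exit_code, first line of output)."""
    try:
        proc = subprocess.run(nrpe_argv(host, command),
                              capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # 3 is the Nagios UNKNOWN state.
        return 3, f"UNKNOWN - could not run check_nrpe: {exc}"
    lines = proc.stdout.strip().splitlines()
    return proc.returncode, lines[0] if lines else ""
```

For example, run_nrpe_check("swc-001", "check_load") would return the remote plugin's exit code and status line, provided the hypothetical host and command exist.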

Acknowledging alerts

Most Nagios alerts are reissued periodically until the underlying condition is fixed. For a problem that cannot be fixed immediately, this can generate a lot of annoying email. To avoid this, you can acknowledge the problem, which causes Nagios to stop sending email about it until the problem is fixed.

A set of scripts to assist in acknowledging problems can be found in /opt/services/bin; their names start with "nagios". The alert email provides the name of the host and the name of the service. You can acknowledge all alerts for a given host (nagios-ack-host-problem) or the alert for a particular service on a host (nagios-ack-service-problem). Acknowledging a problem usually generates one final "alert" email stating that the problem has been acknowledged.
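
The scripts themselves are not reproduced here, and how they are implemented is not documented on this page. For reference, Nagios accepts acknowledgements as external commands written to its command file, and the sketch below shows how such lines are formed. The command file path, the host "swc-001", the service "Disk Space" and the author "jdoe" are assumptions for illustration only.

```python
import time

# Assumed default command file for an install under /usr/local/nagios.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def _validate(*fields):
    """Reject values that would corrupt the semicolon-separated command."""
    for value in fields:
        if ";" in value or "\n" in value:
            raise ValueError(f"invalid character in field: {value!r}")


def ack_service_line(host, service, author, comment, now=None):
    """Format an ACKNOWLEDGE_SVC_PROBLEM external command.

    sticky=2 keeps the acknowledgement until the service recovers,
    notify=1 sends an acknowledgement notification to the contacts,
    persistent=1 keeps the comment across Nagios restarts.
    """
    _validate(host, service, author)
    ts = int(time.time() if now is None else now)
    return (f"[{ts}] ACKNOWLEDGE_SVC_PROBLEM;"
            f"{host};{service};2;1;1;{author};{comment}\n")


def ack_host_line(host, author, comment, now=None):
    """Format an ACKNOWLEDGE_HOST_PROBLEM external command."""
    _validate(host, author)
    ts = int(time.time() if now is None else now)
    return f"[{ts}] ACKNOWLEDGE_HOST_PROBLEM;{host};2;1;1;{author};{comment}\n"


def submit(line, command_file=COMMAND_FILE):
    """Write one command to the Nagios command pipe (blocks if Nagios is down)."""
    with open(command_file, "w") as fifo:
        fifo.write(line)
```

With notify set to 1, Nagios sends a notification about the acknowledgement itself, which is consistent with the final "alert" email described above.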

-- JimJacobs - 2020-12-11

Topic revision: 2020-12-11, JimJacobs