Neo Correlator Patching

The patch process for the Neo correlator (installed at USNO in November 2019) is outlined below. A set of "checkboxes" is provided to help track progress through the patching process (see below for the checkbox legend and reset control).

Using This Page

Draft

Check List Legend and Reset

Reset here --> todo / doing / done (Reset all checkboxes)

The actions below are split into tables with two columns. The left column contains the hosts where a particular command/operation is to be performed (or at least launched). The right column contains a description and one or more actions, which are shown in green boldface. The host names are suffixed with a "t" or a "p" to indicate whether the host is on the test ("t") or production ("p") correlator (the correlator DNSs do not recognize the suffixes; they are used for documentation purposes only).

Checkboxes (see the legend just above) are for keeping track of the status of the patch. They have three states: todo, doing, done. All of them can be reset to "todo" using the legend. Clicking an icon moves it from one state to the next; clicking a "done" icon changes it back to "todo".

Patch Eve

This is usually Monday.

Update Patch Variables

Server-1t As root on server-1t:

edit-patch-variables

  1. Increase both versions by one. (RHEL-7.6.0.7 goes to RHEL-7.6.0.8)
  2. Correct NRAO user name, if needed (required to get into gygax).
  3. Set date to the patch day (e.g., 31-October-2020)

spread-patch-variables

This copies the patch variables to server-2t, server-1p and server-2p. Since the BGFS and SWC hosts mount the script directory on server-1t and server-1p, they'll get them from there.
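
For reference, the spread operation amounts to copying the variables file from server-1t to the other admin servers. A minimal sketch of the idea (assuming the file lives at /opt/services/bin/patch-variables.sh and that the production servers are directly reachable; see the DC network caveat under get-patches below):

# Illustrative only -- the spread script does the real work.
for host in server-2t server-1p server-2p; do
    scp /opt/services/bin/patch-variables.sh ${host}:/opt/services/bin/
done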

Create New Diskless Image

Server-1t

Server-1p
On each of the two primary servers:

usno_create-diskless-image.sh

This uses rsync to create a copy of the current diskless image (${OLDVER}) named /opt/services/diskless_boot/${NEWVER}.
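
Conceptually the copy is just an rsync of the old image tree into a new one. A rough sketch of the idea (the real script also handles logging, sanity checks, etc.; the variable names are assumed to come from patch-variables.sh):

# Illustrative only -- usno_create-diskless-image.sh does the real work.
source /opt/services/bin/patch-variables.sh       # assumed to define OLDVER and NEWVER
rsync -aHAX --delete \
    /opt/services/diskless_boot/${OLDVER}/ \
    /opt/services/diskless_boot/${NEWVER}/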

Update Secondary Servers (Patch Eve)

Server-2t

Server-2p
On each of the two secondary servers:

nohup usno_sync-server-2.sh > /opt/services/bin/log/sync-server2.log & tail -f /opt/services/bin/log/sync-server2.log

This will synchronize important files from the primary server onto the corresponding secondary server. It can be fairly slow, so it is done on Patch Eve (Monday) and then repeated on Patch Day to pick up any changes.
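
Because the sync runs under nohup it can be left unattended and checked later from its log, for example:

tail -n 20 /opt/services/bin/log/sync-server2.log              # confirm the sync finished
grep -iE 'error|fail' /opt/services/bin/log/sync-server2.log   # look for trouble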

Patch Day Morning—AOC Activities

This is usually Tuesday.

Check for patches

Server-1t Check for patches using:

check-for-patches

This will indicate the number of patches available for the various system types (servers, SWCs, file servers, etc.).
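
check-for-patches presumably wraps the usual yum machinery; as a rough manual cross-check, the pending-update counts can be eyeballed directly (illustrative only -- the script knows the real repositories and the diskless/BGFS targets):

yum -q check-update | grep -c '^[[:alnum:]]'      # rough count of pending server updates
yum -q --installroot=/opt/services/diskless_boot/${NEWVER} check-update | grep -c '^[[:alnum:]]'   # same for the diskless image (path assumed)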

Fetch Patches on Server-1t

Server-1t

Server-1p
Get patches using:

usno_patch.sh get-patches

Complete patch fetch on s1t before starting patch fetch on s1p.

On the test cluster this will pull down the patches from Red Hat for the servers, the SWCs, and the BGFS servers. It will also bundle up the relevant parts of the home directory trees for root and admin, along with /opt/services/bin (the idea being that server-1t holds the reference copy of these files). On the production cluster, the script will instead copy over the update files prepared on server-1t (with the current DC network setup this is especially awkward; see Remote access of server-1t from server-1p).
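
For orientation only: the fetch amounts to downloading the pending RPMs for each host type into the update tree (see Patch File Locations below) and bundling the reference copies of the admin files. A very rough sketch with all specifics assumed (the real script also separates kernel from non-kernel packages and handles the SWC and BGFS targets):

# Illustrative only -- usno_patch.sh get-patches does the real work.
DEST=/export/home/usno-serv/usno/updates/${NEWVER}/server/basic
yum update --downloadonly --downloaddir=${DEST}          # fetch server RPMs without installing them
tar czf /export/home/usno-serv/usno/updates/${NEWVER}/admin-files.tar.gz \
    /root /home/admin /opt/services/bin                  # bundle the reference copies (paths assumed)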

Redo Update of Secondary Test Server

Server-2t On the test secondary server:

nohup usno_sync-server-2.sh > /opt/services/bin/log/sync-server2.log & tail -f /opt/services/bin/log/sync-server2.log

This will synchronize important files from the primary server onto the corresponding secondary server. Since this only picks up mods since the previous day the operation should be fairly quick.

Baseline Test Correlator

Server-1t Baseline the system:

host-status --all

The output of this command provides a quick status summary for the correlator systems. It can be handy to see how things were before system modifications are performed (e.g., an SWC might already be down prior to a diskless image patch, so if it won't come back up afterwards, the diskless image probably isn't the cause).
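
It can be worth capturing the baseline to a file so it can be diffed against the post-patch state, e.g.:

host-status --all > /tmp/host-status.prepatch
# ... after patching ...
host-status --all | diff /tmp/host-status.prepatch -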

Patch BGFS Servers—at AOC

Server-1p usno_patch.sh apply-bgfs

After the patching appears to succeed:

bgfs-restart-or-halt restart

This tries to restart the BGFS hosts in an order that should allow them to come up synchronized. If there is a kernel patch, it's possible that the Infiniband network device will be down after reboot. Test for this by running:

do-on-hosts --bgs ip link show dev ib0

If this fails, then the drivers need to be installed:

do-on-hosts --timeout=180 --bgs bgfs-install-mellanox-drivers

This should install the drivers that were built on bg-mds-1t and copied over. No reboot should be required. 10/8/20 (JHJ): It's possible that the drivers could be installed as part of the BGFS patching operation.

Repeat the ip command above to double check.
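
To confirm the link state as well (not just that the device exists), the brief form of ip is handy; each host should report ib0 as UP:

do-on-hosts --bgs ip -br link show dev ib0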

Patch Diskless Test Image, Reboot and Validate

Server-1t

Server-1p
On each of the two primary servers:

usno_patch.sh apply-diskless

This will apply the patches to the recently created diskless image.

Server-1t su admin -c "ssh swc-005 reboot"; ping-diskless swc-005

After it's back up check its status:

host-status swc-005

Note: The SWCs become unreachable by ping after they start shutting down, answer a few pings partway through the reboot, and then become unreachable again until they finally come back up.
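
If ping-diskless is unavailable for some reason, a plain ping loop does much the same job; a trivial sketch:

# Keep pinging until swc-005 answers again (Ctrl-C to abort).
until ping -c 1 -W 2 swc-005 > /dev/null 2>&1; do sleep 5; done; echo swc-005 is back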

Server-1t do-on-hosts --swcs reboot; ping-diskless

Check correlator status:

host-status --all

Note: The SWCs become unreachable by ping after they start shutting down, answer a few pings partway through the reboot, and then become unreachable again until they finally come back up.

Patch Test Admin Servers

Patch Server-2t

Server-2t

Server-2p
Apply the patches to the two secondary servers:

usno_patch.sh apply-host

Reboot.

Patch Server-1t

Server-1t Apply the patches to the test primary server:

usno_patch.sh apply-host

Reboot.

Validate Test Correlator

Server-1t host-status --all

Patch Day Morning—DC Activities

These activities are best done after all of the AOC activities since those may generate additional patch files.

Fetch Patches on Server-1p

Server-1p Get patches using:

usno_patch.sh get-patches

Complete patch fetch on s1t before starting patch fetch on s1p.

On the test cluster this will pull down the patches from Red Hat for the servers, the SWCs, and the BGFS servers. It will also bundle up the relevant parts of the home directory trees for root and admin, along with /opt/services/bin (the idea being that server-1t holds the reference copy of these files). On the production cluster, the script will instead copy over the update files prepared on server-1t (with the current DC network setup this is especially awkward; see Remote access of server-1t from server-1p).

Redo Update of Secondary Production Server

Server-2p On the production secondary server:

nohup usno_sync-server-2.sh > /opt/services/bin/log/sync-server2.log & tail -f /opt/services/bin/log/sync-server2.log

This will synchronize important files from the primary server onto the corresponding secondary server. Since this only picks up mods since the previous day the operation should be fairly quick.

Patch Server-2p

Server-2p Apply the patches to the production secondary server:

usno_patch.sh apply-host

Reboot.

Patch Day Afternoon

During the maintenance window (3-5 PM Mountain Time on Tuesdays or Thursdays).

Baseline Systems

Server-1p Baseline the system:

host-status --all

The output of this command provides a quick status summary for the correlator systems. It can be handy to see how things were before system modifications are performed (e.g., an SWC might already be down prior to a diskless image patch, so if it won't come back up afterwards, the diskless image probably isn't the cause).

Patch BGFS Servers—at DC

Server-1p usno_patch.sh apply-bgfs

After the patching appears to succeed:

bgfs-restart-or-halt restart

This tries to restart the BGFS hosts in an order that should allow them to come up synchronized. If there is a kernel patch, it's possible that the Infiniband network device will be down after reboot. Test for this by running:

do-on-hosts --bgs ip link show dev ib0

If this fails, then the drivers need to be installed:

do-on-hosts --timeout=180 --bgs bgfs-install-mellanox-drivers

This should install the drivers that were built on bg-mds-1t and copied over. No reboot should be required. 10/8/20 (JHJ): It's possible that the drivers could be installed as part of the BGFS patching operation.

Repeat the ip command above to double check.

Reboot and Validate Correlator Nodes

Server-1p su admin -c "ssh swc-016 reboot"; ping-diskless swc-016

After it's back up check its status:

host-status swc-016

swc-016 was chosen because it has dual Infiniband NICs.

Note: The SWCs become unreachable by ping after they start shutting down, answer a few pings partway through the reboot, and then become unreachable again until they finally come back up.

Server-1p do-on-hosts --swcs reboot; ping-diskless

After the systems stabilize:

host-status --all

Note: The SWCs become unreachable by ping after they start shutting down, answer a few pings partway through the reboot, and then become unreachable again until they finally come back up.

Patch Server-1p

Server-1p Apply the patches to the DC primary server:

usno_patch.sh apply-host

Reboot.

Do one final sanity check:

hostStatus.py --all

Validate Production Correlator

Server-1p host-status --all

Email Patch Notice

Send out an email to interested parties. Clicking on the link below will trigger a browser mailto link. This should open an email window on your desktop with the message partially prefilled. Edit the message as needed and then send it from your mail widget.

Announce by email to: jjacobs@nrao.edu,jacobsj@q.com,usno-admins@nrao.edu

Edit the wiki text of this page to alter the addressees, etc. (note the comment in page source).

Patching Overview

Per USNO requirements, the systems are patched weekly. There are two maintenance windows for patching: the primary window is Tuesday 3-5 MT (5-7 ET) and the secondary window is Thursday 3-5 MT (5-7 ET). Patching is potentially a two-day process. On Patch Eve, new diskless images are created on server-1{p,t} and the secondary servers are synced with their primaries. Each of these operations can take 1-2 hours depending on the size of the diskless image. On Patch Day, the patches, if any, are fetched. Patches are potentially fetched for the servers, the diskless systems and eventually the BGFS file servers; thus each system type can evolve at its own rate. Patches are only fetched from Red Hat on the test system at the AOC and are then pulled over to the production cluster via server-1p.

Nomenclature

Anytime there are patches, the USNO system version will update. The USNO system version takes the form RHEL-7.6.x.x; the first two components track the Red Hat system version in use on the diskless systems. The fourth component updates with each patching cycle (e.g., RHEL-7.6.0.7 goes to RHEL-7.6.0.8), the third component is incremented if something significant occurs, and the first two components change if the operating system version updates (they might even go to RHEL 8 some day). Because of the independent patching of each system type, it's possible that the diskless version will be incremented but the image will effectively remain unchanged.

Software support

Many of the patching operations are carried out by scripts located in /opt/services/bin and many of those scripts use subscripts located in the patching subdirectory below the top-level scripts. One important top-level script is patch-variables.sh which is used to specify the old and new USNO system versions as well as the person performing the update (it uses the NRAO user name of the admin) and the date the patching is being performed (use the date of Patch Day itself).
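
The variables file itself is small; a hypothetical example of its contents after running edit-patch-variables (only OLDVER and NEWVER appear elsewhere on this page, the other variable names and values are illustrative):

# /opt/services/bin/patch-variables.sh -- illustrative contents only
OLDVER=RHEL-7.6.0.7          # version currently deployed
NEWVER=RHEL-7.6.0.8          # version being produced by this patch cycle
PATCH_USER=jsmith            # NRAO user name of the admin doing the update (illustrative)
PATCH_DATE=31-October-2020   # date of Patch Day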

Patch File Locations

The get-patches operation downloads patches for the admin servers, the SWCs and the BGFS hosts. To maintain the independence of the installations for these three host types, the patches pulled down for them are kept independent. The patches are downloaded below /export/home/usno-serv/usno/updates. In that directory will be a directory of the form RHEL-7.6.x.y which will contain the patches for the most recent patch version. Below that directory are three more: bgfs, server, and swc. These hold the patches for a particular host type. Each of these three type directories has two subdirectories: basic and kernel; kernel holds all of the kernel-related patches and basic holds all of the rest. In weeks where there are no kernel patches, that directory will be empty. They are separated out this way because the kernel patches are applied differently (installed rather than updated); in addition, when the kernel is updated other actions must potentially be taken (e.g., moving the updated diskless kernel to the tftp directory, rebuilding the BGFS kernel modules, etc.).
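
Laid out as a tree, the structure described above looks like this (version number illustrative):

/export/home/usno-serv/usno/updates/
└── RHEL-7.6.0.8/
    ├── bgfs/
    │   ├── basic/      # non-kernel RPMs for the BGFS hosts
    │   └── kernel/     # kernel RPMs (empty in weeks with no kernel patch)
    ├── server/
    │   ├── basic/
    │   └── kernel/
    └── swc/
        ├── basic/
        └── kernel/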

Patches are downloaded on server-1t itself, for a diskless image chroot (effectively anyway, via the yum --installroot option), and on bg-mds-1t.

-- JimJacobs - 2019-12-16
