NAASC Lustre 2.x Activities

Background

The version of Lustre we run (1.8.9) is very ancient in "Internet Years" and is no longer actively maintained by Sun, lustre.org, Whamcloud... I mean, Intel.

Some of the recent hardware work was to also get us ready for an upgrade of the Lustre software itself.

We need to do 2 upgrades (1.8.9 to 2.4.3, then 2.4.3 to 2.5.3), and each one has to be applied twice: once to the computer room servers and once to the clients. That is 4 outages. Below is an optimized path, from a technology perspective.

  • Upgrade the computer room from 1.8.9 to 2.4.3.
  • Upgrade all the clients from 1.8.9 to 2.4.3.
  • Upgrade the computer room from 2.4.3 to 2.5.3.
  • Upgrade the clients to 2.5.3.
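
After each of these hops it's worth confirming that every node actually reports the expected Lustre version before moving on. A minimal sketch of that check, assuming passwordless ssh from an admin host and that lctl get_param -n version prints a "lustre: <version>" line (the host names here are illustrative, not a real inventory):

    #!/usr/bin/env python
    # Sketch: confirm every node reports the expected Lustre version.
    # Assumes passwordless ssh from an admin host; host names are illustrative
    # (POLARIS/ELWOOD/CARTMAN appear below, the server names are made up).
    import subprocess

    EXPECTED = "2.4.3"
    HOSTS = ["naasc-mds", "naasc-oss1", "polaris", "elwood", "cartman"]

    def lustre_version(host):
        """Return the version string reported by `lctl get_param -n version`."""
        out = subprocess.check_output(["ssh", host, "lctl get_param -n version"])
        # First line looks like "lustre: 2.4.3" (output format assumed here).
        return out.decode().splitlines()[0].replace("lustre:", "").strip()

    for host in HOSTS:
        ver = lustre_version(host)
        status = "OK" if ver.startswith(EXPECTED) else "MISMATCH"
        print("%-12s %-8s %s" % (host, ver, status))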

June 2 +/-
Computer room downtime - 3 hours, starting at 0600 on Thursday, June 2nd. We (Jessica and I) did an activity just like this at USNO in January on behalf of James et al. in NM. We will build new system drives with the updated software and then clone the old configuration information onto them.
Clients - 100% reboot of all NAASC Linux desktops and of client machines like POLARIS, ELWOOD, and CARTMAN (which impacts the Request Handler web application).
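
When a client comes back from its reboot, the Lustre mount should be sanity-checked before the machine is handed back. A minimal sketch, assuming the client mounts Lustre from fstab at /lustre/naasc (the mount point and filesystem name are assumptions, not taken from this page):

    #!/usr/bin/env python
    # Sketch: verify a rebooted client remounted Lustre and can reach it.
    # The mount point /lustre/naasc is an assumed example, not from this page.
    import os
    import subprocess

    MOUNT_POINT = "/lustre/naasc"  # hypothetical

    def is_lustre_mounted(path):
        """Check /proc/mounts for a lustre-type filesystem at `path`."""
        with open("/proc/mounts") as mounts:
            return any(parts[1] == path and parts[2] == "lustre"
                       for parts in (line.split() for line in mounts))

    if not is_lustre_mounted(MOUNT_POINT):
        # Remount from fstab; equivalent to `mount /lustre/naasc` by hand.
        subprocess.check_call(["mount", MOUNT_POINT])

    # Light-touch test: statfs the filesystem rather than writing to it.
    stat = os.statvfs(MOUNT_POINT)
    print("Lustre OK: %d GB free" % (stat.f_bavail * stat.f_frsize // 2**30))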

July 7 +/-

Upgrade the computer room from 2.4.3 to 2.5.3. Upgrade all the clients to 2.5.3.

Computer room downtime - more like 60-90 minutes to update the machines and reboot them, starting at 0600 on Thursday, July 7th.
Clients - same drill as June: 100% reboot of all desktops and of POLARIS, ELWOOD, and CARTMAN (Request Handler web application impact), at 0600 on Monday, July 11th.
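
Because 2.4.3 to 2.5.3 is a package update plus a reboot rather than a rebuild of the system drives, the server-side pass can be scripted. A rough sketch under stated assumptions (the host names and the yum invocation are illustrative; the real runbook lives on the whiteboards mentioned below):

    #!/usr/bin/env python
    # Sketch: roll the 2.4.3 -> 2.5.3 server update during the outage window.
    # Hypothetical names for the 1 MDS + 8 OSSs listed under Hardware below;
    # the matching kernel packages would be updated in the same pass.
    import subprocess

    SERVERS = ["naasc-mds"] + ["naasc-oss%d" % n for n in range(1, 9)]

    for host in SERVERS:
        # Update every installed lustre* package on the node.
        subprocess.check_call(["ssh", host, "yum -y update 'lustre*'"])
        # `reboot` drops the ssh connection, so a nonzero exit is not fatal.
        subprocess.call(["ssh", host, "reboot"])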

"How come we can't just stay on 2.4.3 and save 2 outages?"

2.4.3 is the "bridge" version between Lustre 1.x and Lustre 2.x; there is no supported direct hop from 1.8.9 to 2.5.3, so we have to pass through it.

2.5.3 has many bug fixes and enhancements on top of 2.4.3 and is the "stable" Production release.

These dates and activities strike a balance between an aggressive-but-safe schedule and minimizing end-user aggravation. We can compress the gaps between the activities if that would help. We don't see any real benefit in stretching the gaps out, AKA "pulling the Band-Aid off slowly".

TL;DR

This summer, NAASC Lustre will be unavailable twice. Any machine that mounts NAASC Lustre will be rebooted twice.

Technical Blah-blah Starts Here (the detail is actually living on 3 whiteboards and 6,217 e-mails)

Hardware

  • 1 MDS
  • 8 OSSs
  • 2 LNET Routers
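
Those two LNET routers are what let the client networks reach the computer-room servers, and every client names them in its lnet module options. A minimal sketch of generating that configuration, assuming a simple tcp-to-tcp routed layout with made-up addresses (the real network plan is not spelled out on this page):

    #!/usr/bin/env python
    # Sketch: emit /etc/modprobe.d/lustre.conf for a client that reaches the
    # servers through the two LNET routers. Network names and addresses are
    # illustrative assumptions, not the real NAASC layout.
    ROUTERS = ["10.2.0.1", "10.2.0.2"]  # hypothetical router NIDs, client side
    CLIENT_NET = "tcp1(eth0)"           # network the clients live on
    SERVER_NET = "tcp0"                 # network the MDS/OSSs live on

    routes = "; ".join("%s %s@tcp1" % (SERVER_NET, ip) for ip in ROUTERS)
    conf = 'options lnet networks="%s" routes="%s"\n' % (CLIENT_NET, routes)

    with open("/etc/modprobe.d/lustre.conf", "w") as f:
        f.write(conf)
    print(conf.strip())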

Software

Kernels

Lustre RPMs

Testing

  • TLMDS
  • TLOSS1
  • TLCLIENT
  • TLHPC
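
These four test machines mirror production in miniature, so they are the natural place to script a smoke test between the upgrade hops. A minimal sketch to run on TLCLIENT, assuming the test filesystem is mounted at /mnt/tlustre (the mount point and file size are illustrative):

    #!/usr/bin/env python
    # Sketch: smoke test to run on TLCLIENT after each upgrade hop.
    # Mount point and payload size are illustrative assumptions.
    import os
    import subprocess

    MOUNT = "/mnt/tlustre"  # hypothetical test mount
    probe = os.path.join(MOUNT, "upgrade_smoke_test.dat")

    # 1. Space report from every OST/MDT the client can see.
    subprocess.check_call(["lfs", "df", "-h", MOUNT])

    # 2. Round-trip a small file through the filesystem.
    payload = os.urandom(1 << 20)  # 1 MiB
    with open(probe, "wb") as f:
        f.write(payload)
    with open(probe, "rb") as f:
        assert f.read() == payload, "read-back mismatch"
    os.remove(probe)
    print("smoke test passed")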
