BGFS File Servers
Overview
The new correlator uses BeeGFS (bgfs) in place of Lustre to provide a high-speed, high-capacity file system. The bgfs servers are of two types: metadata servers (named bg-mds-x), which provide a directory-like service for the filesystem, and storage servers (named bg-ss-x), each of which stores files on several RAID arrays. The test correlator has one MDS and one SS, while the production correlator has two MDSs and four SSs. At both sites, the first MDS (bg-mds-1) also hosts the administration and monitoring services that support BGFS operation.
The BGFS hosts were installed and configured by the vendor. As such, they differ somewhat from the NRAO-installed hosts and from each other. Some of these differences are persistent (e.g., the BGFS hosts boot via BIOS while the NRAO-installed hosts use UEFI), while others are slowly converging as software patching occurs on the systems.
Network Access
All the bgfs hosts reside on the admin subnet (10.1.36.*). This is used for administration purposes and also for some high-level communications operations during correlation. The heavy network communication tasks are carried out over an Infiniband subnet.
On the test cluster, all of the SWCs access the data over a single Infiniband subnet (10.1.37), called the "data" network; to reach a host via this network, use its DNS name with a "-data" suffix. On the production cluster, only the first sixteen correlator nodes, swc-001..016, directly access the file servers; access is over a separate IB subnet (10.1.38), called the "disk" subnet; to reach a host via this network, use its DNS name with a "-disk" suffix.
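The suffix convention above can be captured in a small shell function. This is only an illustrative sketch: the function name bgfs_ib_name and its "test"/"production" arguments are assumptions made for the example, not part of the cluster tooling.

```shell
# Hypothetical helper (not part of the cluster tooling): map a host's base
# name to the DNS name that reaches it over the Infiniband subnet.
#   test cluster       -> "-data" suffix (10.1.37 subnet)
#   production cluster -> "-disk" suffix (10.1.38 subnet)
# $1 = base host name, $2 = test | production
bgfs_ib_name() {
    case "$2" in
        test)       printf '%s-data\n' "$1" ;;
        production) printf '%s-disk\n' "$1" ;;
        *)          printf 'unknown cluster: %s\n' "$2" >&2; return 1 ;;
    esac
}

bgfs_ib_name bg-ss-1 test          # name on the "data" network
bgfs_ib_name bg-ss-1 production    # name on the "disk" subnet
```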
Host Filesystems
Ideally, the test and production systems would be parallel, with the production system simply having more instances of each host type. However, that turns out not to be the case, as shown below in the output of
lsblk -f
on all BGFS hosts on the two clusters. Even within a host type (e.g., ss) on the production cluster, there are differences between the hosts; this makes some device-aware options (e.g., adding a UUID to the GRUB command line in
/etc/default/grub
) different across boxes.
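Because the UUIDs and device names differ from box to box, a per-host lookup is safer than copying values between hosts. The sketch below is a hypothetical helper (the name uuid_for_mount is an assumption) that reads lsblk -f output on standard input and prints the UUID of the filesystem mounted at a given mount point. It relies only on the mount point being the last field of a row and the UUID the field before it.

```shell
# Hypothetical helper: read `lsblk -f` output on stdin and print the UUID of
# the filesystem mounted at the mount point given as $1.
# Rows without a mount point end in the UUID itself, so they never match.
# Mirrored md devices are listed once per member disk, so stop at the first
# matching row.
uuid_for_mount() {
    awk -v mp="$1" 'NF >= 2 && $NF == mp { print $(NF - 1); exit }'
}

# Example on a live host (assumes lsblk is installed):
#   lsblk -f | uuid_for_mount /
```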
Output of lsblk -f on each test cluster BGFS host:
--------- bg-mds-1
NAME FSTYPE LABEL UUID MOUNTPOINT
sda
├─sda1 xfs a9907d98-424c-47b5-926a-0bc224eff39a /boot
└─sda2 LVM2_member yw92iV-Kg0B-j0dc-Tfok-buFS-zHyH-KDMBWt
├─rhel-root xfs d92e260a-752b-48ea-b970-77d798784e79 /
├─rhel-swap swap fc0a3be4-b927-4020-9e67-9a96772f95ba [SWAP]
└─rhel-home xfs a3d81701-5a60-4f3d-8240-b35b57e0dca8 /home
sdb ext4 7905d090-87f4-4037-ae51-f33ce566c3f0 /data/metavol01
--------- bg-ss-1
NAME FSTYPE LABEL UUID MOUNTPOINT
sda xfs 788b6129-1686-4137-bdaf-406d49d1238c /bgfsvol01
sdb
├─sdb1 xfs d7a804ac-dc74-4a7f-8b0c-47681284044d /boot
└─sdb2 LVM2_member SXy4bL-isoe-ENjH-qKmb-NANr-roPm-oAdHpm
├─rhel-root xfs 52d0852f-208f-4775-a3f4-2013a88cc6c7 /
├─rhel-swap swap 7677f438-d040-4293-98e0-8f964d73677f [SWAP]
└─rhel-home xfs cffce913-db6f-4497-ae05-5c8a16988033 /home
sdc
└─sdc1 LVM2_member Isilv4-p0c2-Tbd0-H2gk-EJnR-3lLa-VJvTHq
└─rhel-home xfs cffce913-db6f-4497-ae05-5c8a16988033 /home
Output from lsblk -f on all production cluster BGFS hosts
--------- bg-mds-1
NAME FSTYPE LABEL UUID MOUNTPOINT
sda
├─sda1 xfs 7e1a1e19-9435-48dc-8662-33bb71b6cf37 /boot
└─sda2 LVM2_member cu3oGl-EOBZ-inSl-Vnvm-lWjd-sH04-Ca0kWV
├─rhel-root xfs 041fbdb9-0296-431b-b61d-5d87b035a62d /
├─rhel-swap swap 27cc7d28-5331-4470-bc73-6f710e9e3453 [SWAP]
└─rhel-home xfs 2c94ed21-491f-4fff-aeea-2f27dbe5ff35 /home
sdb ext4 52e3c366-3890-4158-aef0-cd81ca00d838 /data/metavol01
--------- bg-mds-2
NAME FSTYPE LABEL UUID MOUNTPOINT
sda
├─sda1 xfs a936f8d6-0e78-4f25-bdbd-c0e6c046c59c /boot
└─sda2 LVM2_member 9Tmffn-hpu2-N5AH-POTR-VA3O-sTfv-Rq6AYN
├─rhel-root xfs 7209a162-aab6-4afc-92b2-44d01f82cb0e /
├─rhel-swap swap 091d121c-66f8-440b-bbe7-d566defb0d15 [SWAP]
└─rhel-home xfs 0a1e36d1-3952-48a4-b856-49b22903f5a7 /home
sdb ext4 9da9882f-b74f-4b54-9bd9-3ab069a98a75 /data/metavol02
--------- bg-ss-1
NAME FSTYPE LABEL UUID MOUNTPOINT
sda xfs af5393cd-b14c-4ef2-9974-04057ff60eec /bgfsvol01
sdb xfs 3798da59-0760-4603-80a6-5b02aa4a982a /bgfsvol02
sdc xfs d911e023-de26-41d5-a909-01865551b78e /bgfsvol03
sdd
├─sdd1 linux_raid_member bg-ss-1:root c3ecc5b4-b7a0-c43e-0c24-cc1a6fc3c117
│ └─md127 xfs ca4cfa14-d617-49d9-a869-59c1a209e30d /
├─sdd2 linux_raid_member bg-ss-1:swap 9c5d3d5a-507c-2846-becd-86ac608eda66
│ └─md126 swap a111fbb1-bd76-41d0-ba15-15b0028f752a [SWAP]
└─sdd3 linux_raid_member bg-ss-1:boot 33276865-ed02-377e-ed89-7bae2309c6cd
└─md125 xfs 39a3c48e-130e-4430-b6e5-6b44f6dacdee /boot
sde
├─sde1 linux_raid_member bg-ss-1:root c3ecc5b4-b7a0-c43e-0c24-cc1a6fc3c117
│ └─md127 xfs ca4cfa14-d617-49d9-a869-59c1a209e30d /
├─sde2 linux_raid_member bg-ss-1:swap 9c5d3d5a-507c-2846-becd-86ac608eda66
│ └─md126 swap a111fbb1-bd76-41d0-ba15-15b0028f752a [SWAP]
└─sde3 linux_raid_member bg-ss-1:boot 33276865-ed02-377e-ed89-7bae2309c6cd
└─md125 xfs 39a3c48e-130e-4430-b6e5-6b44f6dacdee /boot
--------- bg-ss-2
NAME FSTYPE LABEL UUID MOUNTPOINT
sda xfs 67cb97a6-92dd-46c6-9315-35398acb1b89 /bgfsvol04
sdb xfs 0be779b4-d285-4e1e-9b34-93f3ad855a6f /bgfsvol05
sdc xfs 0d1a517e-f45e-447c-8090-418ba3fd1e28 /bgfsvol06
sdd
├─sdd1 xfs 7e4328f3-ffe0-4b0a-a693-07b13551e723 /boot
├─sdd2 linux_raid_member bg-ss-2:root 5932dc4f-a99b-a708-7385-d4e4f10ddc6f
│ └─md127 xfs 9ddc6ce3-65cb-4ed7-a9ff-a49a62017ee9 /
└─sdd3 linux_raid_member bg-ss-2:swap 3de966fe-a68c-660a-3ac3-4cf93753d1bb
└─md126 swap 8d73c18e-78e0-4a16-9451-c799389f70b8 [SWAP]
sde
├─sde1 linux_raid_member bg-ss-2:root 5932dc4f-a99b-a708-7385-d4e4f10ddc6f
│ └─md127 xfs 9ddc6ce3-65cb-4ed7-a9ff-a49a62017ee9 /
└─sde2 linux_raid_member bg-ss-2:swap 3de966fe-a68c-660a-3ac3-4cf93753d1bb
└─md126 swap 8d73c18e-78e0-4a16-9451-c799389f70b8 [SWAP]
--------- bg-ss-3
NAME FSTYPE LABEL UUID MOUNTPOINT
sda xfs 45632d39-05e3-4982-acdf-af81b7799cf7 /bgfsvol07
sdb xfs fa0ee31f-e9c9-4fa6-a6e9-757da78ab9e4 /bgfsvol08
sdc xfs 93697c9c-980f-48b4-932e-168b17836ec0 /bgfsvol09
sdd
├─sdd1 linux_raid_member bg-ss-3:root c618f0c1-6222-0de1-3fdd-5e15d52539e7
│ └─md126 xfs 1a6de6ca-4e34-4923-b66e-958fe3a14fbb /
├─sdd2 linux_raid_member bg-ss-3:boot 020637c4-a53b-b359-524e-c98479edf1b7
│ └─md125 xfs 03c6742a-6c44-4673-8b53-b17129a99e23 /boot
└─sdd3 linux_raid_member bg-ss-3:swap 1d37de78-b19c-6aa6-0ddb-e4215d26681e
└─md127 swap 08435f2a-5d46-4bf5-bae7-c77b29fa6958 [SWAP]
sde
├─sde1 linux_raid_member bg-ss-3:root c618f0c1-6222-0de1-3fdd-5e15d52539e7
│ └─md126 xfs 1a6de6ca-4e34-4923-b66e-958fe3a14fbb /
├─sde2 linux_raid_member bg-ss-3:boot 020637c4-a53b-b359-524e-c98479edf1b7
│ └─md125 xfs 03c6742a-6c44-4673-8b53-b17129a99e23 /boot
└─sde3 linux_raid_member bg-ss-3:swap 1d37de78-b19c-6aa6-0ddb-e4215d26681e
└─md127 swap 08435f2a-5d46-4bf5-bae7-c77b29fa6958 [SWAP]
--------- bg-ss-4
NAME FSTYPE LABEL UUID MOUNTPOINT
sda xfs 01edac31-33ae-48dd-8136-fe1ca160228d /bgfsvol10
sdb xfs 4e235b6e-1ed5-4cc1-a7c3-b8324bf9decf /bgfsvol11
sdc xfs d51f4ad8-adca-4a43-8efc-d33c2564fff6 /bgfsvol12
sdd
├─sdd1 linux_raid_member bg-ss-4:root 390bd5ec-c817-bae9-665e-098c8ca5d433
│ └─md125 xfs 28c951cf-fa67-4d1d-8d23-cb02353e7709 /
├─sdd2 linux_raid_member bg-ss-4:swap a11c7b96-7e85-9ee9-8f95-5df0cf54519b
│ └─md126 swap ac3b8196-9b67-4aba-a5c7-6bde2e0e036f [SWAP]
└─sdd3 linux_raid_member bg-ss-4:boot 054ac831-5038-367a-e625-7c3d9cd7c5f8
└─md127 xfs 6e2b4061-09cc-4eef-8e02-8f80a689cccb /boot
sde
├─sde1 linux_raid_member bg-ss-4:root 390bd5ec-c817-bae9-665e-098c8ca5d433
│ └─md125 xfs 28c951cf-fa67-4d1d-8d23-cb02353e7709 /
├─sde2 linux_raid_member bg-ss-4:swap a11c7b96-7e85-9ee9-8f95-5df0cf54519b
│ └─md126 swap ac3b8196-9b67-4aba-a5c7-6bde2e0e036f [SWAP]
└─sde3 linux_raid_member bg-ss-4:boot 054ac831-5038-367a-e625-7c3d9cd7c5f8
└─md127 xfs 6e2b4061-09cc-4eef-8e02-8f80a689cccb /boot
Patching
The BGFS boxes were configured by the vendor and have their own RHEL 7 installation. The various hosts were configured at different times and appear to have different software installed on them; this is especially true between the MDSs and the SSs. The patching process allows for this, so that the system images of the BGFS boxes evolve independently of the admin servers. The test cluster's bg-mds-1 downloads patches on behalf of all BGFS hosts on both correlators; at this writing (11/13/20), bg-mds-1t appears to be more current than the DC-located hosts. This can sometimes cause updates to not be applied to the DC-resident hosts.
Another concern is the Mellanox Infiniband drivers. Although updated drivers are often available, Mellanox does not guarantee this, and the cluster has experienced a situation where a BGFS kernel update broke the drivers and no new ones were available. Since the cluster undergoes weekly security scans, postponing or rolling back an update has visible consequences and is to be avoided. Because of this, the patching process specifies applying the updates to the BGFS systems and then checking whether the Mellanox drivers are broken (the
ib0
link devices will vanish); if so, the drivers are rebuilt on the test cluster and propagated to the DC correlator.
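The post-update check for a vanished ib0 link could be scripted roughly as follows. This is a sketch, not the documented procedure: the function name ib_link_present is an assumption, and it relies only on the standard Linux behavior of exposing one entry per network link under /sys/class/net.

```shell
# Hypothetical check: succeed if the named network link exists.
# $1 = link name (e.g. ib0)
# $2 = optional directory to look in, defaulting to /sys/class/net
#      (overriding it is handy for dry runs on a machine without IB)
ib_link_present() {
    [ -e "${2:-/sys/class/net}/$1" ]
}

if ib_link_present ib0; then
    echo "ib0 present: Mellanox drivers appear intact"
else
    echo "ib0 missing: drivers may need to be rebuilt"
fi
```

Run on each BGFS host after its post-patch reboot, a "missing" result would be the cue to rebuild the drivers on the test cluster.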
Start/Stop
A script,
/opt/services/bin/bgfs-restart-or-halt
, can be used to shut down all of the bgfs hosts or to restart them. Restarting is the most common operation and is done after patching. Halt should be reserved for times when the system needs to be shut down (e.g., loss of air conditioning, work on the site electrical system, etc.); after a halt, someone on-site must manually press the power switch, as toggling the PDU outlet will not cause a halted system to come back on.
Shutdown
bg-ss-1..4 --> bg-mds-2 --> bg-mds-1
Power the storage servers (bg-ss-x) down first, then bg-mds-2 (production only), and finally bg-mds-1 (that's where the admin service resides).
Startup
bg-mds-1 --> bg-mds-2 --> bg-ss-1..4
Power on the primary MDS (bg-mds-1) first, then the secondary bg-mds-2 (production only), and finally the bg-ss-x hosts. The other hosts contact the admin service on bg-mds-1 as they come up, and it registers them for access.
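The two orderings above are mirror images of each other, which the following sketch makes explicit. It is a hypothetical helper, not the bgfs-restart-or-halt script: the function name bgfs_host_order and its arguments are assumptions. The text does not fix an order among the storage servers themselves, so the sketch simply reverses the startup list for shutdown.

```shell
# Hypothetical helper (not the real bgfs-restart-or-halt script): print the
# BGFS hosts of a cluster, one per line, in the order they should be
# powered on or off.
# $1 = startup | shutdown, $2 = test | production
bgfs_host_order() {
    local hosts rev h
    case "$2" in
        test)       hosts="bg-mds-1 bg-ss-1" ;;
        production) hosts="bg-mds-1 bg-mds-2 bg-ss-1 bg-ss-2 bg-ss-3 bg-ss-4" ;;
        *)          echo "unknown cluster: $2" >&2; return 1 ;;
    esac
    case "$1" in
        startup)
            # Admin service host first, storage servers last.
            printf '%s\n' $hosts ;;
        shutdown)
            # Reverse of startup: storage servers first, bg-mds-1 last.
            rev=""
            for h in $hosts; do rev="$h $rev"; done
            printf '%s\n' $rev ;;
        *)
            echo "unknown action: $1" >&2; return 1 ;;
    esac
}

bgfs_host_order shutdown production
```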
--
JimJacobs - 2020-01-23