BGFS File Servers

Overview

The new correlator uses BeeGFS (bgfs) in place of Lustre to provide a high-speed, high-capacity file system. The bgfs servers are of two types: metadata servers (named bg-mds-x), which provide a directory-like service for the filesystem, and storage servers (named bg-ss-x), each of which stores files on several RAID arrays. The test correlator has one MDS and one SS, while the production correlator has two MDSs and four SSs. At both sites, bg-mds-1 also hosts the administration and monitoring services that support BGFS operation.

The BGFS hosts were installed and configured by the vendor. As such, they differ somewhat from the NRAO-installed hosts and from each other. Some of these differences are persistent (e.g., the BGFS hosts boot via BIOS while the NRAO-installed hosts use UEFI), while others are slowly converging as software patching occurs on the systems.

Network Access

All the bgfs hosts reside on the admin subnet (10.1.36.*). This subnet is used for administration and also for some high-level communications during correlation. The heavy network communication tasks are carried out over an InfiniBand subnet.

On the test cluster, all of the SWCs will access the data over a single InfiniBand subnet (10.1.37), called the "data" network. To reach a host over this network, use its DNS name with a "-data" suffix. On the production cluster, only the first sixteen correlator nodes, swc-001..016, will directly access the file servers; access will be on a separate IB subnet (10.1.38), called the "disk" subnet. To reach a host over this IB network, use its DNS name with a "-disk" suffix.
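The naming rule above can be sketched as a small helper. This is only an illustration: the function name and the "test"/"production" labels are assumptions, and it assumes the suffix is appended to the short host name.

```python
# Suffixes come from the text above; the cluster labels used as keys
# and the function name are illustrative assumptions.
IB_SUFFIX = {"test": "-data", "production": "-disk"}


def ib_hostname(host: str, cluster: str) -> str:
    """Return the DNS name that reaches `host` over the InfiniBand subnet.

    `host` is assumed to be a short host name such as "bg-ss-1".
    """
    try:
        suffix = IB_SUFFIX[cluster]
    except KeyError:
        raise ValueError(f"unknown cluster: {cluster!r}") from None
    return host + suffix
```

For example, ib_hostname("bg-ss-1", "test") gives "bg-ss-1-data", and the same host on the production cluster gives "bg-ss-1-disk".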

Host Filesystems

Ideally, the test and production systems would be parallel, with the production system simply having more instances of each host type. That turns out not to be the case, as the output of lsblk -f from all BGFS hosts on the two clusters shows below. Even within a host type (e.g., ss) on the production cluster, there are differences between the hosts; this makes some device-aware options (e.g., adding a UUID to the grub command line in /etc/default/grub) different across boxes.
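Because the UUIDs and device layouts differ per host, any device-aware setting has to be looked up on each box rather than copied. A minimal sketch of that lookup follows, assuming the listing is taken with lsblk -P -o NAME,FSTYPE,LABEL,UUID,MOUNTPOINT (one KEY="value" line per device) instead of the tree format; the function names are hypothetical.

```python
import shlex


def parse_lsblk_pairs(text: str) -> list[dict[str, str]]:
    """Parse `lsblk -P` output: KEY="value" pairs, one device per line."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex strips the double quotes and keeps empty values (LABEL="").
        rows.append(dict(token.split("=", 1) for token in shlex.split(line)))
    return rows


def uuid_for_mountpoint(rows: list[dict[str, str]], mountpoint: str):
    """Return the UUID of the first filesystem mounted at `mountpoint`, or None."""
    for row in rows:
        if row.get("MOUNTPOINT") == mountpoint:
            return row.get("UUID")
    return None


# Two rows from the test cluster's bg-mds-1, rewritten by hand in -P form.
SAMPLE = (
    'NAME="sda1" FSTYPE="xfs" LABEL="" '
    'UUID="a9907d98-424c-47b5-926a-0bc224eff39a" MOUNTPOINT="/boot"\n'
    'NAME="rhel-root" FSTYPE="xfs" LABEL="" '
    'UUID="d92e260a-752b-48ea-b970-77d798784e79" MOUNTPOINT="/"\n'
)

if __name__ == "__main__":
    print(uuid_for_mountpoint(parse_lsblk_pairs(SAMPLE), "/boot"))
```

On hosts where a mirrored md device appears under two disks, the same row is listed twice; returning the first match is sufficient because both rows carry the same UUID.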

Output of lsblk -f on all test cluster BGFS hosts:
--------- bg-mds-1
NAME          FSTYPE      LABEL UUID                                   MOUNTPOINT
sda                                                                    
├─sda1        xfs               a9907d98-424c-47b5-926a-0bc224eff39a   /boot
└─sda2        LVM2_member       yw92iV-Kg0B-j0dc-Tfok-buFS-zHyH-KDMBWt 
  ├─rhel-root xfs               d92e260a-752b-48ea-b970-77d798784e79   /
  ├─rhel-swap swap              fc0a3be4-b927-4020-9e67-9a96772f95ba   [SWAP]
  └─rhel-home xfs               a3d81701-5a60-4f3d-8240-b35b57e0dca8   /home
sdb           ext4              7905d090-87f4-4037-ae51-f33ce566c3f0   /data/metavol01
--------- bg-ss-1
NAME          FSTYPE      LABEL UUID                                   MOUNTPOINT
sda           xfs               788b6129-1686-4137-bdaf-406d49d1238c   /bgfsvol01
sdb                                                                    
├─sdb1        xfs               d7a804ac-dc74-4a7f-8b0c-47681284044d   /boot           
└─sdb2        LVM2_member       SXy4bL-isoe-ENjH-qKmb-NANr-roPm-oAdHpm                 
  ├─rhel-root xfs               52d0852f-208f-4775-a3f4-2013a88cc6c7   /               
  ├─rhel-swap swap              7677f438-d040-4293-98e0-8f964d73677f   [SWAP]          
  └─rhel-home xfs               cffce913-db6f-4497-ae05-5c8a16988033   /home           
sdc                                                                                    
└─sdc1        LVM2_member       Isilv4-p0c2-Tbd0-H2gk-EJnR-3lLa-VJvTHq                 
  └─rhel-home xfs               cffce913-db6f-4497-ae05-5c8a16988033   /home           

Output from lsblk -f on all production cluster BGFS hosts
--------- bg-mds-1
NAME          FSTYPE      LABEL UUID                                   MOUNTPOINT
sda                                                                    
├─sda1        xfs               7e1a1e19-9435-48dc-8662-33bb71b6cf37   /boot
└─sda2        LVM2_member       cu3oGl-EOBZ-inSl-Vnvm-lWjd-sH04-Ca0kWV 
  ├─rhel-root xfs               041fbdb9-0296-431b-b61d-5d87b035a62d   /
  ├─rhel-swap swap              27cc7d28-5331-4470-bc73-6f710e9e3453   [SWAP]
  └─rhel-home xfs               2c94ed21-491f-4fff-aeea-2f27dbe5ff35   /home
sdb           ext4              52e3c366-3890-4158-aef0-cd81ca00d838   /data/metavol01
--------- bg-mds-2
NAME          FSTYPE      LABEL UUID                                   MOUNTPOINT
sda                                                                    
├─sda1        xfs               a936f8d6-0e78-4f25-bdbd-c0e6c046c59c   /boot
└─sda2        LVM2_member       9Tmffn-hpu2-N5AH-POTR-VA3O-sTfv-Rq6AYN 
  ├─rhel-root xfs               7209a162-aab6-4afc-92b2-44d01f82cb0e   /
  ├─rhel-swap swap              091d121c-66f8-440b-bbe7-d566defb0d15   [SWAP]
  └─rhel-home xfs               0a1e36d1-3952-48a4-b856-49b22903f5a7   /home
sdb           ext4              9da9882f-b74f-4b54-9bd9-3ab069a98a75   /data/metavol02
--------- bg-ss-1
NAME      FSTYPE            LABEL        UUID                                 MOUNTPOINT
sda       xfs                            af5393cd-b14c-4ef2-9974-04057ff60eec /bgfsvol01
sdb       xfs                            3798da59-0760-4603-80a6-5b02aa4a982a /bgfsvol02
sdc       xfs                            d911e023-de26-41d5-a909-01865551b78e /bgfsvol03
sdd                                                                           
├─sdd1    linux_raid_member bg-ss-1:root c3ecc5b4-b7a0-c43e-0c24-cc1a6fc3c117 
│ └─md127 xfs                            ca4cfa14-d617-49d9-a869-59c1a209e30d /
├─sdd2    linux_raid_member bg-ss-1:swap 9c5d3d5a-507c-2846-becd-86ac608eda66 
│ └─md126 swap                           a111fbb1-bd76-41d0-ba15-15b0028f752a [SWAP]
└─sdd3    linux_raid_member bg-ss-1:boot 33276865-ed02-377e-ed89-7bae2309c6cd 
  └─md125 xfs                            39a3c48e-130e-4430-b6e5-6b44f6dacdee /boot
sde                                                                           
├─sde1    linux_raid_member bg-ss-1:root c3ecc5b4-b7a0-c43e-0c24-cc1a6fc3c117 
│ └─md127 xfs                            ca4cfa14-d617-49d9-a869-59c1a209e30d /
├─sde2    linux_raid_member bg-ss-1:swap 9c5d3d5a-507c-2846-becd-86ac608eda66 
│ └─md126 swap                           a111fbb1-bd76-41d0-ba15-15b0028f752a [SWAP]
└─sde3    linux_raid_member bg-ss-1:boot 33276865-ed02-377e-ed89-7bae2309c6cd 
  └─md125 xfs                            39a3c48e-130e-4430-b6e5-6b44f6dacdee /boot
--------- bg-ss-2
NAME      FSTYPE            LABEL        UUID                                 MOUNTPOINT
sda       xfs                            67cb97a6-92dd-46c6-9315-35398acb1b89 /bgfsvol04
sdb       xfs                            0be779b4-d285-4e1e-9b34-93f3ad855a6f /bgfsvol05
sdc       xfs                            0d1a517e-f45e-447c-8090-418ba3fd1e28 /bgfsvol06
sdd                                                                           
├─sdd1    xfs                            7e4328f3-ffe0-4b0a-a693-07b13551e723 /boot
├─sdd2    linux_raid_member bg-ss-2:root 5932dc4f-a99b-a708-7385-d4e4f10ddc6f 
│ └─md127 xfs                            9ddc6ce3-65cb-4ed7-a9ff-a49a62017ee9 /
└─sdd3    linux_raid_member bg-ss-2:swap 3de966fe-a68c-660a-3ac3-4cf93753d1bb 
  └─md126 swap                           8d73c18e-78e0-4a16-9451-c799389f70b8 [SWAP]
sde                                                                           
├─sde1    linux_raid_member bg-ss-2:root 5932dc4f-a99b-a708-7385-d4e4f10ddc6f 
│ └─md127 xfs                            9ddc6ce3-65cb-4ed7-a9ff-a49a62017ee9 /
└─sde2    linux_raid_member bg-ss-2:swap 3de966fe-a68c-660a-3ac3-4cf93753d1bb 
  └─md126 swap                           8d73c18e-78e0-4a16-9451-c799389f70b8 [SWAP]
--------- bg-ss-3
NAME      FSTYPE            LABEL        UUID                                 MOUNTPOINT
sda       xfs                            45632d39-05e3-4982-acdf-af81b7799cf7 /bgfsvol07
sdb       xfs                            fa0ee31f-e9c9-4fa6-a6e9-757da78ab9e4 /bgfsvol08
sdc       xfs                            93697c9c-980f-48b4-932e-168b17836ec0 /bgfsvol09
sdd                                                                           
├─sdd1    linux_raid_member bg-ss-3:root c618f0c1-6222-0de1-3fdd-5e15d52539e7 
│ └─md126 xfs                            1a6de6ca-4e34-4923-b66e-958fe3a14fbb /
├─sdd2    linux_raid_member bg-ss-3:boot 020637c4-a53b-b359-524e-c98479edf1b7 
│ └─md125 xfs                            03c6742a-6c44-4673-8b53-b17129a99e23 /boot
└─sdd3    linux_raid_member bg-ss-3:swap 1d37de78-b19c-6aa6-0ddb-e4215d26681e 
  └─md127 swap                           08435f2a-5d46-4bf5-bae7-c77b29fa6958 [SWAP]
sde                                                                           
├─sde1    linux_raid_member bg-ss-3:root c618f0c1-6222-0de1-3fdd-5e15d52539e7 
│ └─md126 xfs                            1a6de6ca-4e34-4923-b66e-958fe3a14fbb /
├─sde2    linux_raid_member bg-ss-3:boot 020637c4-a53b-b359-524e-c98479edf1b7 
│ └─md125 xfs                            03c6742a-6c44-4673-8b53-b17129a99e23 /boot
└─sde3    linux_raid_member bg-ss-3:swap 1d37de78-b19c-6aa6-0ddb-e4215d26681e 
  └─md127 swap                           08435f2a-5d46-4bf5-bae7-c77b29fa6958 [SWAP]
--------- bg-ss-4
NAME      FSTYPE            LABEL        UUID                                 MOUNTPOINT
sda       xfs                            01edac31-33ae-48dd-8136-fe1ca160228d /bgfsvol10
sdb       xfs                            4e235b6e-1ed5-4cc1-a7c3-b8324bf9decf /bgfsvol11
sdc       xfs                            d51f4ad8-adca-4a43-8efc-d33c2564fff6 /bgfsvol12
sdd                                                                           
├─sdd1    linux_raid_member bg-ss-4:root 390bd5ec-c817-bae9-665e-098c8ca5d433 
│ └─md125 xfs                            28c951cf-fa67-4d1d-8d23-cb02353e7709 /
├─sdd2    linux_raid_member bg-ss-4:swap a11c7b96-7e85-9ee9-8f95-5df0cf54519b 
│ └─md126 swap                           ac3b8196-9b67-4aba-a5c7-6bde2e0e036f [SWAP]
└─sdd3    linux_raid_member bg-ss-4:boot 054ac831-5038-367a-e625-7c3d9cd7c5f8 
  └─md127 xfs                            6e2b4061-09cc-4eef-8e02-8f80a689cccb /boot
sde                                                                           
├─sde1    linux_raid_member bg-ss-4:root 390bd5ec-c817-bae9-665e-098c8ca5d433 
│ └─md125 xfs                            28c951cf-fa67-4d1d-8d23-cb02353e7709 /
├─sde2    linux_raid_member bg-ss-4:swap a11c7b96-7e85-9ee9-8f95-5df0cf54519b 
│ └─md126 swap                           ac3b8196-9b67-4aba-a5c7-6bde2e0e036f [SWAP]
└─sde3    linux_raid_member bg-ss-4:boot 054ac831-5038-367a-e625-7c3d9cd7c5f8 
  └─md127 xfs                            6e2b4061-09cc-4eef-8e02-8f80a689cccb /boot

Patching

The BGFS boxes were configured by the vendor and have their own RHEL 7 installation. The various hosts were configured at different times and appear to have different software installed on them; this is especially true between the MDSs and the SSs. The patching process allows for this, so the system images of the BGFS boxes will evolve independently of the admin servers. The test cluster's bg-mds-1 downloads patches on behalf of all BGFS hosts on both correlators; at this writing (11/13/20), bg-mds-1t appears to be more current than the DC-located hosts. This can sometimes cause updates not to be applied to the DC-resident hosts.

Another concern is the Mellanox InfiniBand drivers. Although updated drivers are often available, Mellanox does not guarantee this, and the cluster has experienced a situation where a BGFS kernel update broke the drivers and no new ones were available. Since the cluster undergoes weekly security scans, postponing or rolling back an update has visible consequences and is to be avoided. Because of this, the patching process specifies applying the updates to the BGFS systems and then observing whether the Mellanox drivers are broken (the ib0 link devices will vanish); if so, the drivers are rebuilt on the test cluster and propagated to the DC correlator.
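The post-update check for vanished link devices could be automated along the following lines. This is a sketch, not the site's actual procedure: it assumes a link device shows up as an entry under /sys/class/net, and the function name and default arguments are hypothetical.

```python
from pathlib import Path


def missing_links(expected=("ib0",), sysfs_net="/sys/class/net"):
    """Return the expected link names that have no entry under `sysfs_net`."""
    root = Path(sysfs_net)
    return [name for name in expected if not (root / name).exists()]


if __name__ == "__main__":
    gone = missing_links()
    if gone:
        # A missing ib0 after a kernel update suggests the drivers need a rebuild.
        print("InfiniBand links missing:", ", ".join(gone))
    else:
        print("InfiniBand links present")
```

Running this on each BGFS host after patching gives a quick yes/no answer before deciding whether the driver rebuild on the test cluster is needed.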

Start/Stop

A script, /opt/services/bin/bgfs-restart-or-halt, can be used to shut down or restart all of the bgfs hosts. Restarting is the most common operation and is done after patching. Halt should be reserved for times when the system needs to be shut down (e.g., loss of air conditioning, work on the site electrical system, etc.); halting the hosts will require someone on-site to manually press the power switch, as toggling the PDU outlet will not cause a halted system to come back on.

Shutdown

bg-ss-1..4 --> bg-mds-2 --> bg-mds-1

Power the storage servers (bg-ss-x) down first, then bg-mds-2 (production) and finally bg-mds-1 (that's where the admin service resides).

Startup

bg-mds-1 --> bg-mds-2 --> bg-ss-1..4

Power on the primary MDS (bg-mds-1) first, then the secondary, bg-mds-2 (production), and finally the bg-ss-x hosts. The other hosts contact the admin service on bg-mds-1 as they come up and it registers them for access.
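The two orderings can be generated from the host counts, which keeps the test (one MDS, one SS) and production (two MDSs, four SSs) clusters consistent. The sketch below is illustrative only; the function names are hypothetical, and the order among the storage servers themselves is an assumption, since the text does not specify one.

```python
def startup_order(n_mds: int, n_ss: int) -> list[str]:
    """Metadata servers first (bg-mds-1 leading), then the storage servers."""
    mds = [f"bg-mds-{i}" for i in range(1, n_mds + 1)]
    ss = [f"bg-ss-{i}" for i in range(1, n_ss + 1)]
    return mds + ss


def shutdown_order(n_mds: int, n_ss: int) -> list[str]:
    """Storage servers first, then metadata servers so that bg-mds-1 is last."""
    ss = [f"bg-ss-{i}" for i in range(1, n_ss + 1)]
    mds = [f"bg-mds-{i}" for i in range(n_mds, 0, -1)]
    return ss + mds


if __name__ == "__main__":
    print(" --> ".join(shutdown_order(2, 4)))
    print(" --> ".join(startup_order(2, 4)))
```

In both directions bg-mds-1 is the host that is up the longest, which matches the requirement that the admin service be available whenever any other host is running.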

-- JimJacobs - 2020-01-23
Topic revision: r6 - 2020-12-08, JimJacobs