USNO Correlator II Overview
The USNO uses a computing cluster built and maintained by NRAO to perform correlations using DiFX. In 2019 the NRAO will install a second cluster, using newer hardware, to replace the existing correlator. During a transition period the two clusters will run side by side, after which most components of the existing cluster will be removed from NRAO maintenance.
The new cluster will consist of:
- 64 compute hosts (named swc-001 to swc-064)
- 2 administration file servers (server-1 and server-2)
- A BGFS file server cluster with two metadata servers and four storage servers
- Auxiliary hardware such as network switches, PDUs and KVMs
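The compute host naming scheme above is regular enough to generate rather than type out, which can be handy for inventory or scripting. The following is a minimal sketch only; the function name `compute_hosts` and its parameters are hypothetical and not part of any NRAO tooling.

```python
def compute_hosts(count=64, prefix="swc-"):
    """Return zero-padded compute host names: swc-001 ... swc-064 by default."""
    # The three-digit zero padding matches the names listed above.
    return [f"{prefix}{n:03d}" for n in range(1, count + 1)]


hosts = compute_hosts()
print(hosts[0], hosts[-1], len(hosts))
```

A list like this could feed a loop that checks each node in turn, although the document does not say how such checks are actually performed.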
The main file/administration server will be server-1. It provides overall network services to the other devices including NFS, DNS, RADIUS, DHCP, etc. In addition, it serves as the boot server to provide a boot image for the diskless compute nodes (swc-xxx). The other server, server-2, is a warm backup for the main file/admin server. It also serves as a guinea pig for patching, etc., to reduce the likelihood that changes to the primary server will bring the correlator down.
The various devices all reside on the administration network, 10.1.36.* (1G Ethernet), to allow administration from server-1 and server-2. High-speed 40G Infiniband connections allow the nodes to transfer data among themselves and with the BGFS file system; this is necessary because the DiFX correlation is I/O bound. Two Infiniband networks were created: the first, 10.1.37.*, for data transmission between the SWCs, and the second, 10.1.38.*, between the BGFS hosts and the first 16 SWCs for reading BGFS data.
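The network layout can be summarized in a few lines of code. This sketch assumes that the wildcard notation (10.1.36.* and so on) corresponds to /24 subnets. The network labels and function names are hypothetical, and the sample addresses are made up for illustration.

```python
import ipaddress

# Assumption: each "10.1.x.*" network is a /24. Labels are illustrative only.
NETWORKS = {
    "admin": ipaddress.ip_network("10.1.36.0/24"),               # 1G Ethernet
    "infiniband-compute": ipaddress.ip_network("10.1.37.0/24"),  # SWC to SWC
    "infiniband-storage": ipaddress.ip_network("10.1.38.0/24"),  # BGFS to SWCs 1-16
}


def network_of(address):
    """Return the label of the cluster network containing address, or None."""
    ip = ipaddress.ip_address(address)
    for name, net in NETWORKS.items():
        if ip in net:
            return name
    return None


def networks_for_swc(index):
    """Return the networks that compute node number `index` is attached to."""
    names = ["admin", "infiniband-compute"]
    # Only the first 16 SWCs are on the BGFS-facing Infiniband network.
    if 1 <= index <= 16:
        names.append("infiniband-storage")
    return names


print(network_of("10.1.37.12"))  # made-up address on the compute fabric
print(networks_for_swc(16))
print(networks_for_swc(17))
```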
Only the two servers and one of the compute nodes (swc-001) can be directly accessed from outside the cluster networks. An administrator must SSH into one of these three hosts to access any of the other hardware on the network. Since the servers do not have Infiniband NICs, the state of the high speed networks cannot be directly assessed from the two servers; it is necessary to SSH onto one of the SWCs or BGFS nodes to assess the health of the 10.1.37 and 10.1.38 nets.
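The access rule above amounts to a two-hop SSH connection for any host that is not one of the three gateways. The sketch below builds such a command line. It assumes an OpenSSH client recent enough to support the `-J` (ProxyJump) option and that the short host names resolve from the administrator's machine. The helper name `ssh_argv` is hypothetical, and the document does not state how administrators actually make the hop.

```python
# Hosts reachable from outside the cluster networks, per the text above.
GATEWAYS = ("server-1", "server-2", "swc-001")


def ssh_argv(target, gateway="server-1"):
    """Return an ssh argument list that reaches target, hopping via gateway if needed."""
    if gateway not in GATEWAYS:
        raise ValueError(f"{gateway} is not reachable from outside the cluster")
    if target in GATEWAYS:
        # Gateways can be reached directly; no jump host is required.
        return ["ssh", target]
    # Assumption: OpenSSH's -J (ProxyJump) option is available on the client.
    return ["ssh", "-J", gateway, target]


print(" ".join(ssh_argv("swc-017")))
print(" ".join(ssh_argv("swc-017", gateway="swc-001")))
```

Because the servers have no Infiniband NICs, checking the 10.1.37 and 10.1.38 networks means running the check on the final hop (an SWC or BGFS node) rather than on the gateway server.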
In addition to the correlator-related hosts, there are several USNO hosts that have direct access to the correlator subnets. Two hosts, nominally named ISAN and OSAN, are on the Infiniband nets to allow data to be transferred to and from the correlator. A third host, referred to as ACAS, lives on the admin network (10.1.36) so that it can perform security scans of every host on the network.
The 64 swc-xxx nodes perform the actual correlation. For ease of administration, all of these nodes are diskless and boot off an image served up by server-1. There is a provision for node-specific data to reside on the server, but its use is kept to a minimum.
At the SOC in Socorro, the NRAO maintains a much smaller version (the "test correlator") of the correlator installed at the USNO (the "production correlator"). It consists of two servers, around 5 compute nodes, a BGFS system and support hardware. Its internal networks mirror those of the production system, although there is only one Infiniband network. The test correlator allows testing and prototyping without interfering with the essential operations of the production USNO cluster; it can also serve as a backup system in the event that the USNO correlator goes offline (e.g., because of facility issues). This cluster will also be upgraded so that it consists of the same types of hardware and nearly identical software as the production cluster.