Lustre Overview

Lustre is a parallel distributed filesystem used at most large-scale computing facilities. It allows NRAO/NM desktops, public machines, and clusters to share a large file space, removing the need to repeatedly copy data between systems for processing. It is designed primarily for performance, which it achieves by aggregating individual disk throughput across a large number of disks; as a side effect the resulting storage volume is typically large compared to desktop storage. In NRAO/NM's case the current Lustre filesystem provides 90TB of storage and can sustain roughly 4GB/s of reads or writes. In similarly designed systems, each OSS contributes 30TB of storage and about 1.4GB/s of I/O.
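
As a quick check from any machine that mounts the filesystem, per-OST and per-MDT usage can be listed with the standard lfs df utility. Below is a minimal Python sketch of that, assuming the Lustre client tools are installed and using /lustre as a placeholder mount point:

  import subprocess

  # Minimal sketch: list per-OST and per-MDT usage for a mounted Lustre
  # filesystem via the standard 'lfs df' utility.  The mount point
  # '/lustre' is a placeholder and may not match the actual mount.
  def lustre_usage(mount_point="/lustre"):
      result = subprocess.run(
          ["lfs", "df", "-h", mount_point],
          capture_output=True, text=True, check=True,
      )
      return result.stdout

  if __name__ == "__main__":
      print(lustre_usage())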

The Lustre configuration described here is designed to produce maximum throughput and storage volume for minimal cost; the cost per node is only about 60% greater than the raw cost of the disks. It is not a suitable design for high availability across a large number of nodes, or for large volumes of small I/Os. The configuration balances per-spindle speed limits (125MB/s per disk), RAID card limits (~500MB/s per card), chassis capacity (24 disks), uniform distribution of data across 2^n data disks so that 1MB I/Os stripe evenly, and network throughput over InfiniBand (>10Gbit/s).
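
The arithmetic behind this balance can be recomputed directly from the numbers above; the short Python sketch below does so (the figures are those quoted in the text, not independent measurements):

  # Recompute the design balance from the figures quoted above.
  disk_rate_mb_s = 125          # per-spindle streaming limit
  raid_card_limit_mb_s = 500    # per-RAID-card limit
  data_disks_per_ost = 4        # 4+2 RAID 6: 4 data + 2 parity disks
  disks_per_ost = 6
  chassis_disks = 24
  io_size_kb = 1024             # a 1MB client I/O

  # Four data spindles saturate one RAID card: 4 * 125MB/s = 500MB/s.
  raw_per_ost = data_disks_per_ost * disk_rate_mb_s
  assert raw_per_ost == raid_card_limit_mb_s

  # A 1MB I/O striped across 2^2 data disks lands as 256KB per spindle.
  chunk_per_disk_kb = io_size_kb // data_disks_per_ost

  # One 24-disk chassis holds four 6-disk RAID groups (OSTs).
  osts_per_oss = chassis_disks // disks_per_ost

  print("raw throughput per OST: %d MB/s" % raw_per_ost)
  print("chunk per data disk:    %d KB" % chunk_per_disk_kb)
  print("OSTs per OSS chassis:   %d" % osts_per_oss)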

The final design consists of 2 OSSes, each of which hosts four OSTs built as 4+2 RAID 6 arrays (24 disks total per OSS). Each RAID group reaches the spindle and RAID card limits of around 500MB/s raw, or 375 to 400MB/s formatted. The four OSTs provide ~1.4GB/s of total I/O per OSS, which is delivered to clients over 40Gbit QDR InfiniBand.
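
Clients only see this aggregate throughput when large files are striped across several OSTs. The Python sketch below sets a 4-way stripe with a 1MB stripe size on a hypothetical directory and reads the layout back, using the standard lfs setstripe and lfs getstripe utilities; note that older Lustre releases spell the stripe-size flag -s rather than -S, and the defaults on this system may differ:

  import subprocess

  def set_stripe(path, count=4, size="1M"):
      # 'lfs setstripe' controls how new files under 'path' are striped
      # across OSTs: -c is the stripe count, -S the stripe (chunk) size.
      subprocess.run(["lfs", "setstripe", "-c", str(count), "-S", size, path],
                     check=True)

  def show_stripe(path):
      # 'lfs getstripe' reports the striping layout of a file or directory.
      return subprocess.run(["lfs", "getstripe", path],
                            capture_output=True, text=True, check=True).stdout

  if __name__ == "__main__":
      target = "/lustre/scratch/mydata"   # hypothetical directory on the mount
      set_stripe(target)
      print(show_stripe(target))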

A schematic drawing showing the physical layout of the OSSes and their network connectivity to the MDS, post-processing machines, and the archive can be found here: lustre-schematic.pdf

Full documentation for Lustre can be found at http://wiki.lustre.org/index.php/Use:Use

Definitions

OSS:
Object Storage Server, consists of 1 or more OSTs, stores actual block data.
OST:
Object Storage Target, consists of 1 or more physical disks in a RAID configuration
MDS:
Metadata Server, consists of MDT and MGS, stores file metadata (owner, date stamps, permissions, etc.)
MDT:
Metadata Target, physical disk which contains the metadata
MGS:
Management Server, communications server for OSS/MDS/client traffic

-- JamesRobnett - 2011-07-15