GBO / LBO / JAO / NRAO HPC (High Performance Computing) Support Wiki

Introduction

This wiki tracks open tasks and collects detailed logs for HPC-related issues. It supplements the HPC mailing list for topics that are too verbose to handle by email.

Relevant Topics

  1. High Performance Computing (HPC) Cluster Operations
  2. Lustre Filesystem Operations
  3. Cluster Software - TORQUE / Maui / Robinhood
  4. Diagnostics - CASA / TORQUE / Maui
  5. "Lustre Lab" in Charlottesville
  6. Hardware Planning
  7. HPC Troubleshooting Log
  8. HPC Group Meetings
  9. NAASC Lustre FID-in-dirent activity, April 2017

Configurations

  1. High Level
  2. Server Specific
  3. Client Specific

Advice and How-tos

  1. File distribution across OSTs (2.5.3)
  2. Manual balancing of Lustre filesystem ("catch and release")
  3. Using nagios to monitor OST capacity
  4. Tracking Lustre bugs we care about
  5. Upgrade guidance, including writeconf and fsck

1 - High Level Site Overview

| *Location* | *Name* | *Client Mount Point* | *Size* | *Lustre* | *Redhat* | *Kernel* | *Inter-connect* | *Notes* |
| NRAO - Charlottesville | CV Lustre | /.lustre/cv | 656 TB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.x86_64 | 10 Gb Ethernet | Non-HPC Lustre for Charlottesville |
| NRAO - Charlottesville | NAASC Lustre | /.lustre/naasc | 651 TB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.x86_64 | 40 Gb InfiniBand | NAASC HPC Lustre |
| GBO - Green Bank | VEGAS Lustre | /.lustre/vegas | 786 TB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | 40 Gb InfiniBand | VEGAS+Pulsar Lustre |
| NRAO - Socorro | aoclst03 | /.lustre/aoc | 1,600 TB | 2.5.5 | 6.8/6.9 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | 40 Gb InfiniBand |  |
| NRAO - Socorro | evla | /.lustre/evla | 59 TB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | 40 Gb InfiniBand |  |
| JAO - Santiago | sciops | /mnt/sciops | 350 TB | 2.5.3 | 6.6 | 2.6.32-504.12.2.el6.x86_64 | 40 Gb InfiniBand | PI reduction |
| JAO - Santiago | jaosco | /mnt/jaosco | 270 TB | 2.5.3 | 6.5 | 2.6.32-431.23.3.el6.x86_64 | 40 Gb InfiniBand | ALMA Pipeline |
| NAOJ - Tokyo |  | /lustre0 | 80 TB | 2.1.5 | 6.3 | 2.6.32-279.19.1.el6_lustre.x86_64 | 40 Gb InfiniBand | ALMA Pipeline |
| ESO - Garching | ? |  |  |  |  |  |  |  |

2 - Server Specific

| *Location* | *Name* | *RAM* | *Lustre* | *Redhat* | *Kernel* | *Inter-connect* | *Notes* |
| NRAO - Charlottesville | lnet0[1-2].cv.nrao.edu | 32 GB | 2.5.3 | 6.8 | 2.6.32-642.6.2.el6.x86_64 | IB / 10g | LNET Routers for CV and NAASC Lustres |
| NRAO - Charlottesville | cv-mds.cv.nrao.edu & naasc-mds.cv.nrao.edu | 24 GB / 48 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB (NAASC) and 10g (CV) | CV and NAASC Lustre MDSs |
| NRAO - Charlottesville | cv-oss-[4-16].cv.nrao.edu | 4 GB - 64 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | 10 Gb Ethernet | CV Lustre OSSs |
| NRAO - Charlottesville | naasc-oss-[1-9].cv.nrao.edu | 16 GB - 64 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | 40 Gb InfiniBand | NAASC Lustre OSSs |
| NRAO - Socorro | aocmds.aoc.nrao.edu | 64 GB | 2.4.3 | 6.8 | 2.6.32-358.23.2.el6_lustre.x86_64 | IB |  |
| NRAO - Socorro | aocoss[01-09].aoc.nrao.edu | 64 GB/128 GB | 2.5.5 | 6.8/6.9 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | IB |  |
| NRAO - Socorro | aoclet[01-02].aoc.nrao.edu | 24 GB | 2.5.5 | 6.8 | 2.6.32-642.4.2.el6.x86_64 | IB / 10Gb / 1Gb |  |
| NRAO - Socorro | evlamds | 24 GB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | IB |  |
| NRAO - Socorro | evlaoss01 | 32 GB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | IB |  |
| NRAO - Socorro | evlalnet[01-02] | 24 GB/8 GB | 2.5.5 | 6.8 | 2.6.32-642.11.1.el6.x86_64 | IB / 1Gb |  |
| GBO - Green Bank | vegas-mds | 16 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | LNET Routers for 10g and 1g access |
| GBO - Green Bank | vegas-oss-[01-15] | 12 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | LNET Routers for 10g and 1g access |
| JAO - Santiago | jaopost0[01-40] | 64 GB/128 GB | 2.5.3 | 6.5 | 2.6.32-431.23.3.el6.x86_64 | IB | ALMA Pipeline |
| JAO - Santiago | jaopost-oss[01-06] | 64 GB | 2.5.3 | 6.6 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | ALMA Pipeline |
| JAO - Santiago | jaopost-mds | 49 GB | 2.5.3 | 6.7 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | ALMA Pipeline |
| JAO - Santiago | sciops[01-04] | 64 GB | 2.5.3 | 6.6 | 2.6.32-504.12.2.el6.x86_64 | IB | PI reduction |
| JAO - Santiago | lst-oss-[01-02] | 128 GB | 2.5.3 | 6.6 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | PI reduction |
| JAO - Santiago | lst-mds-[01-02] | 64 GB | 2.5.3 | 6.6 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | PI reduction |
| NAOJ - Tokyo | lustre-mds | 32 GB | 2.1.5 | 6.3 | 2.6.32-279.19.1.el6_lustre.x86_64 | 40 Gb InfiniBand |  |
| NAOJ - Tokyo | lustre-oss0[1-5] | 32 GB | 2.1.5 | 6.3 | 2.6.32-279.19.1.el6_lustre.x86_64 | 40 Gb InfiniBand |  |
| ESO - Garching | ? |  |  |  |  |  |  |

The NM (Socorro) OSSs carry one tweak in /etc/rc.local: the read-ahead is set with "blockdev" for each OST device, and the OSS read cache is limited to files of at most 6M:

# Lustre
/sbin/blockdev --setra 16384 /dev/sda
/sbin/blockdev --setra 16384 /dev/sdb
/sbin/blockdev --setra 16384 /dev/sdc

lctl set_param obdfilter.*.readcache_max_filesize=6M

GB (Green Bank) does the same thing.
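
The per-device lines above could also be generated by a loop. The following is a minimal sketch, not the script in use at any site: the function name set_ost_readahead, the DRY_RUN switch, and the device list passed as arguments are illustrative assumptions. Note that "blockdev --setra" takes its value in 512-byte sectors, so 16384 corresponds to 8 MiB of read-ahead.

```shell
#!/bin/bash
# Sketch: apply the same read-ahead to every OST block device.
# Assumptions (not from the wiki): the function name, DRY_RUN, and the
# device list given as arguments are illustrative only.

READAHEAD_SECTORS=16384   # in 512-byte sectors, i.e. 8 MiB
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the commands, 0 = run them

set_ost_readahead() {
    local dev
    for dev in "$@"; do
        if [ "$DRY_RUN" = "1" ]; then
            # Show what would be executed without touching the device.
            echo "/sbin/blockdev --setra $READAHEAD_SECTORS $dev"
        else
            # Set the read-ahead, then read it back as a sanity check.
            /sbin/blockdev --setra "$READAHEAD_SECTORS" "$dev" &&
                echo "$dev read-ahead: $(/sbin/blockdev --getra "$dev") sectors"
        fi
    done
}

# Example (dry run): set_ost_readahead /dev/sda /dev/sdb /dev/sdc
```

With DRY_RUN=0 the function needs root and real OST devices; the dry-run default only prints the commands so the device list can be reviewed first.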

JAOPOST tweaks: Server:

/usr/sbin/lctl set_param osp.*.max_rpcs_in_progress=65536
/usr/sbin/lctl set_param osp.*.max_rpcs_in_flight=32
/usr/sbin/lctl set_param timeout=600
/usr/sbin/lctl set_param ldlm_timeout=200
/usr/sbin/lctl set_param at_min=250
/usr/sbin/lctl set_param at_max=600

JAOPOST tweaks: Client:

/usr/sbin/lctl set_param osc.*.checksums=0
/usr/sbin/lctl set_param timeout=600
/usr/sbin/lctl set_param at_min=250
/usr/sbin/lctl set_param at_max=600
/usr/sbin/lctl set_param osc.*OST*.max_rpcs_in_flight=32 
/usr/sbin/lctl set_param osc.*OST*.max_dirty_mb=128
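
Since values set with "lctl set_param" do not survive a reboot, it can help to apply them from one list and read each value back. The following is a minimal sketch under stated assumptions: the function name apply_lustre_params and the DRY_RUN switch are hypothetical, and only the parameter names and values come from the JAOPOST client list above.

```shell
#!/bin/bash
# Sketch: apply a list of "name=value" Lustre tunables and read each back.
# Assumptions (not from the wiki): apply_lustre_params and DRY_RUN are
# illustrative; the settings themselves are supplied by the caller.

LCTL=/usr/sbin/lctl
DRY_RUN=${DRY_RUN:-1}     # 1 = only print the commands, 0 = run them

apply_lustre_params() {
    local setting name
    for setting in "$@"; do
        name=${setting%%=*}               # text before the first "="
        if [ "$DRY_RUN" = "1" ]; then
            echo "$LCTL set_param $setting"
        else
            "$LCTL" set_param "$setting" || return 1
            "$LCTL" get_param "$name"     # show the value now in effect
        fi
    done
}

# Example (dry run); quote patterns so the shell does not expand "*":
# apply_lustre_params 'osc.*.checksums=0' timeout=600 at_min=250 at_max=600 \
#     'osc.*OST*.max_rpcs_in_flight=32' 'osc.*OST*.max_dirty_mb=128'
```

The same function could be given the server-side list; the read-back step makes it easier to spot a parameter that was rejected or does not exist on a given Lustre version.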

3 - Client Specific

| *Location* | *Name* | *Client Mount Point* | *RAM* | *Swap* | *Lustre* | *Redhat* | *Kernel* | *Inter-connect* | *Notes* |
| ESO - Garching | ? |  |  |  |  |  |  |  |  |
| NRAO - Charlottesville | Desktop | /.lustre/cv & /.lustre/naasc | 8 GB - 64 GB | 50%-100% of RAM | 2.5.3 | 6.8 | 2.6.32-642.6.2.el6.x86_64 | 1 Gb Ethernet | Linux Desktops |
| GBO - Green Bank | Data Reduction machines | /.lustre/vegas | 16 GB - 256 GB |  | 2.5.3 | 6.8 | 2.6.32-642.6.2.el6.x86_64 | IB / 10g / 1g |  |
| JAO - Santiago | sciops | /mnt/sciops | 64 GB |  | 2.5.3 | 6.6 | 2.6.32-504.12.2.el6.x86_64 | IB | PI reduction |
| JAO - Santiago | jaopost | /mnt/jaosco | 64 GB/128 GB |  | 2.5.3 | 6.5 | 2.6.32-431.23.3.el6.x86_64 | IB | ALMA Pipeline |
| NAOJ - Tokyo | arc-pl-proc[01-14] | /lustre0 | 64 GB/128 GB | 2 GB for 64 GB RAM / 4 GB for 128 GB RAM | 2.1.5 | 6.4 | 2.6.32-279.19.1.el6.x86_64 | 40 Gb InfiniBand | ALMA Pipeline |
| NRAO - Socorro | nmpost{001..090} | /.lustre/aoc | 128 GB - 512 GB | 240 GB | 2.5.5 | 6.9 | 2.6.32-696.13.2.el6.x86_64 | IB |  |
| NRAO - Charlottesville | cvpost0[02-65] | /.lustre/cv & /.lustre/naasc | 256 GB | 128 GB | 2.5.3 | 6.8 | 2.6.32-642.4.2.el6.x86_64 | 40 Gb InfiniBand | NAASC HPC Compute Nodes |

HPC Troubleshooting Requests

Rather than bogging down the HPC mailing list with megabytes of log files, please create a new entry in the table and attach your logs here. At a minimum, a troubleshooting request should include:
  1. MDS log
  2. OSS log(s)
  3. Client log(s)
  4. Which Lustre filesystem is affected (for example, NRAO - /.lustre/naasc)
  5. The Lustre and Redhat versions running on the MDS, OSSs, and clients
  6. The interconnect in use (InfiniBand or 10 Gb Ethernet)
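
The version and interconnect details requested above can be gathered with a few standard commands. The following is a minimal sketch, not an agreed procedure: the function names report and collect_node_info and the output layout are illustrative assumptions. It uses "lctl get_param -n version" and "lctl list_nids"; values that cannot be read are reported as "unavailable".

```shell
#!/bin/bash
# Sketch: gather the basic facts requested above on one node.
# Assumptions (not from the wiki): the function names and the output
# layout are illustrative; run it on the MDS, each OSS, and a client.

# Print "label: value", or "label: unavailable" when the value is empty.
report() {
    local label value
    label=$1
    value=$2
    echo "$label: ${value:-unavailable}"
}

collect_node_info() {
    report "Host"   "$(hostname 2>/dev/null)"
    report "Kernel" "$(uname -r 2>/dev/null)"
    report "Redhat" "$(cat /etc/redhat-release 2>/dev/null)"
    report "Lustre" "$(lctl get_param -n version 2>/dev/null | head -n 1)"
    report "NIDs"   "$(lctl list_nids 2>/dev/null | tr '\n' ' ')"
}

# Example: collect_node_info > "$(hostname)-info.txt"
```

The network suffix of each NID (for example o2ib for InfiniBand or tcp for Ethernet) indicates which interconnect the node uses for Lustre traffic. The MDS, OSS, and client logs still need to be attached separately.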

Questions / Tasks List

  1. [X] 2016-10-27 - HPC mailing list - created. All NRAO and JAO and ARC personnel contacted. Management notification sent out too - Mike Hatz / José Parra
  2. [ ] 2016-10-31 - JAO access to NRAO. abarrien, bmalet, msimmonds are active. Verify group / NIS permissions are correct at NRAO - Mike Hatz / Pat Murphy
  3. [X] 2016-11-03 - Wiki area created and initially populated - Mike Hatz / Pat Murphy
  4. [ ] 2016-11-04 - NRAO access to JAO - in progress. cjallen, jotey, mhatz are active - Mike Hatz / Christian Saldias
  5. [ ] 2016-11-04 - Which commands need to be added to /etc/sudoers at JAO / NRAO for diagnostics?
    1. lctl
    2. ibping

-- MikeHatz - 2016-11-03
Topic revision: r27 - 2018-10-22, MikeHatz