Introduction
This wiki tracks open tasks and collects detailed logs for HPC-related issues. It augments the HPC mailing list when we need a place to track more verbose topics.
Relevant Topics
- High Performance Computing (HPC) Cluster Operations
- Lustre Filesystem Operations
- Cluster Software - TORQUE / Maui / Robinhood
- Diagnostics - CASA / TORQUE / Maui
- "Lustre Lab" in Charlottesville
- Hardware Planning
- HPC Troubleshooting Log
- HPC Group Meetings
- NAASC Lustre FID-in-direct activity, April 2017
Configurations
- High Level
- Server Specific
- Client Specific
Advice and How-tos
- File distribution across OSTs (2.5.3)
- Manual balancing of Lustre filesystem ("catch and release")
- Using nagios to monitor OST capacity
- Tracking Lustre bugs we care about
- Upgrade guidance, including writeconf and fsck
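As context for the OST-balancing and capacity-monitoring topics above, per-OST usage can be checked from any client with `lfs df`. A minimal sketch (the mount point and the 90% threshold are example choices, not a site standard):

```shell
# Show per-OST capacity and usage for a mounted Lustre filesystem.
# /.lustre/naasc is one of the mount points from the tables below.
lfs df -h /.lustre/naasc

# Flag OSTs above 90% full (example threshold); columns are
# UUID, 1K-blocks, Used, Available, Use%, Mounted on.
lfs df /.lustre/naasc | awk '/OST/ { pct = $3 / $2 * 100; if (pct > 90) print $1, int(pct) "% full" }'
```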
1 - High Level Site Overview
| Location | Name | Client Mount Point | Size | Lustre | Redhat | Kernel | Inter-connect | Notes |
| NRAO - Charlottesville | CV Lustre | /.lustre/cv | 656 TB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.x86_64 | 10 Gb Ethernet | Non-HPC Lustre for Charlottesville |
| NRAO - Charlottesville | NAASC Lustre | /.lustre/naasc | 651 TB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.x86_64 | 40 Gb InfiniBand | NAASC HPC Lustre |
| GBO - Green Bank | VEGAS Lustre | /.lustre/vegas | 786 TB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | 40 Gb InfiniBand | VEGAS+Pulsar Lustre |
| NRAO - Socorro | aoclst03 | /.lustre/aoc | 1,600 TB | 2.5.5 | 6.8/6.9 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | 40 Gb InfiniBand | |
| NRAO - Socorro | evla | /.lustre/evla | 59 TB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | 40 Gb InfiniBand | |
| JAO - Santiago | sciops | /mnt/sciops | 350 TB | 2.5.3 | 6.6 | 2.6.32-504.12.2.el6.x86_64 | 40 Gb InfiniBand | PI reduction |
| JAO - Santiago | jaosco | /mnt/jaosco | 270 TB | 2.5.3 | 6.5 | 2.6.32-431.23.3.el6.x86_64 | 40 Gb InfiniBand | ALMA Pipeline |
| NAOJ - Tokyo | | /lustre0 | 80 TB | 2.1.5 | 6.3 | 2.6.32-279.19.1.el6_lustre.x86_64 | 40 Gb InfiniBand | ALMA Pipeline |
| ESO - Garching | ? | | | | | | | |
2 - Server Specific
| Location | Name | RAM | Lustre | Redhat | Kernel | Inter-connect | Notes |
| NRAO - Charlottesville | lnet0[1-2].cv.nrao.edu | 32 GB | 2.5.3 | 6.8 | 2.6.32-642.6.2.el6.x86_64 | IB / 10g | LNET Routers for CV and NAASC Lustres |
| NRAO - Charlottesville | cv-mds.cv.nrao.edu & naasc-mds.cv.nrao.edu | 24 GB / 48 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB (NAASC) and 10g (CV) | CV and NAASC Lustre MDSs |
| NRAO - Charlottesville | cv-oss-[4-16].cv.nrao.edu | 4 GB - 64 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | 10 Gb Ethernet | CV Lustre OSSs |
| NRAO - Charlottesville | naasc-oss-[1-9].cv.nrao.edu | 16 GB - 64 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | 40 Gb InfiniBand | NAASC Lustre OSSs |
| NRAO - Socorro | aocmds.aoc.nrao.edu | 64 GB | 2.4.3 | 6.8 | 2.6.32-358.23.2.el6_lustre.x86_64 | IB | |
| NRAO - Socorro | aocoss[01-09].aoc.nrao.edu | 64 GB/128 GB | 2.5.5 | 6.8/6.9 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | IB | |
| NRAO - Socorro | aoclet[01-02].aoc.nrao.edu | 24 GB | 2.5.5 | 6.8 | 2.6.32-642.4.2.el6.x86_64 | IB / 10Gb / 1Gb | |
| NRAO - Socorro | evlamds | 24 GB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | IB | |
| NRAO - Socorro | evlaoss01 | 32 GB | 2.5.5 | 6.8 | 2.6.32-642.6.2.el6_lustre.2.5.5.x86_64 | IB | |
| NRAO - Socorro | evlalnet[01-02] | 24 GB/8 GB | 2.5.5 | 6.8 | 2.6.32-642.11.1.el6.x86_64 | IB / 1Gb | |
| GBO - Green Bank | vegas-mds | 16 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | LNET Routers for 10g and 1g access |
| GBO - Green Bank | vegas-oss-[01-15] | 12 GB | 2.5.3 | 6.8 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | LNET Routers for 10g and 1g access |
| JAO - Santiago | jaopost0[01-40] | 64 GB/128 GB | 2.5.3 | 6.5 | 2.6.32-431.23.3.el6.x86_64 | IB | ALMA Pipeline |
| JAO - Santiago | jaopost-oss[01-06] | 64 GB | 2.5.3 | 6.6 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | ALMA Pipeline |
| JAO - Santiago | jaopost-mds | 49 GB | 2.5.3 | 6.7 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | ALMA Pipeline |
| JAO - Santiago | sciops[01-04] | 64 GB | 2.5.3 | 6.6 | 2.6.32-504.12.2.el6.x86_64 | IB | PI reduction |
| JAO - Santiago | lst-oss-[01-02] | 128 GB | 2.5.3 | 6.6 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | PI reduction |
| JAO - Santiago | lst-mds-[01-02] | 64 GB | 2.5.3 | 6.6 | 2.6.32-431.23.3.el6_lustre.x86_64 | IB | PI reduction |
| NAOJ - Tokyo | lustre-mds | 32 GB | 2.1.5 | 6.3 | 2.6.32-279.19.1.el6_lustre.x86_64 | 40 Gb InfiniBand | |
| NAOJ - Tokyo | lustre-oss0[1-5] | 32 GB | 2.1.5 | 6.3 | 2.6.32-279.19.1.el6_lustre.x86_64 | 40 Gb InfiniBand | |
| ESO - Garching | ? | | | | | | |
We have one tweak in /etc/rc.local on NM's OSSs, where the read-ahead is set with blockdev for each OST's device:
# Lustre
/sbin/blockdev --setra 16384 /dev/sda
/sbin/blockdev --setra 16384 /dev/sdb
/sbin/blockdev --setra 16384 /dev/sdc
lctl set_param obdfilter.*.readcache_max_filesize=6M
Green Bank does the same thing.
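To confirm those rc.local settings took effect after a boot, the values can be read back. A sketch, assuming the same OST device names as the example above:

```shell
# Read back the read-ahead value (in 512-byte sectors) for each OST device;
# the rc.local tweak above sets these to 16384.
for dev in /dev/sda /dev/sdb /dev/sdc; do
    echo "$dev: $(/sbin/blockdev --getra $dev)"
done

# Confirm the obdfilter read cache limit on the OSS
lctl get_param obdfilter.*.readcache_max_filesize
```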
JAOPOST tweaks, server side:
/usr/sbin/lctl set_param osp.*.max_rpcs_in_progress=65536
/usr/sbin/lctl set_param osp.*.max_rpcs_in_flight=32
/usr/sbin/lctl set_param timeout=600
/usr/sbin/lctl set_param ldlm_timeout=200
/usr/sbin/lctl set_param at_min=250
/usr/sbin/lctl set_param at_max=600
JAOPOST tweaks, client side:
/usr/sbin/lctl set_param osc.*.checksums=0
/usr/sbin/lctl set_param timeout=600
/usr/sbin/lctl set_param at_min=250
/usr/sbin/lctl set_param at_max=600
/usr/sbin/lctl set_param osc.*OST*.max_rpcs_in_flight=32
/usr/sbin/lctl set_param osc.*OST*.max_dirty_mb=128
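Note that plain `lctl set_param` changes do not survive a reboot, so they must be reapplied at boot (e.g., from rc.local). Current values can be checked with the matching `get_param`; a sketch for the client-side tunables above:

```shell
# Read back the JAOPOST client tunables to verify they are in effect
/usr/sbin/lctl get_param osc.*.checksums
/usr/sbin/lctl get_param timeout at_min at_max
/usr/sbin/lctl get_param osc.*OST*.max_rpcs_in_flight osc.*OST*.max_dirty_mb
```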
3 - Client Specific
| Location | Name | Client Mount Point | RAM | Swap | Lustre | Redhat | Kernel | Inter-connect | Notes |
| ESO - Garching | ? | | | | | | | | |
| NRAO - Charlottesville | Desktop | /.lustre/cv & /.lustre/naasc | 8 GB - 64 GB | 50%-100% of RAM | 2.5.3 | 6.8 | 2.6.32-642.6.2.el6.x86_64 | 1 Gb Ethernet | Linux Desktops |
| GBO - Green Bank | Data Reduction machines | /.lustre/vegas | 16 GB - 256 GB | | 2.5.3 | 6.8 | 2.6.32-642.6.2.el6.x86_64 | IB / 10g / 1g | |
| JAO - Santiago | sciops | /mnt/sciops | 64 GB | | 2.5.3 | 6.6 | 2.6.32-504.12.2.el6.x86_64 | IB | PI reduction |
| JAO - Santiago | jaopost | /mnt/jaosco | 64 GB/128 GB | | 2.5.3 | 6.5 | 2.6.32-431.23.3.el6.x86_64 | IB | ALMA Pipeline |
| NAOJ - Tokyo | arc-pl-proc[01-14] | /lustre0 | 64 GB/128 GB | 2 GB for 64 GB RAM / 4 GB for 128 GB RAM | 2.1.5 | 6.4 | 2.6.32-279.19.1.el6.x86_64 | 40 Gb InfiniBand | ALMA Pipeline |
| NRAO - Socorro | nmpost{001..090} | /.lustre/aoc | 128 GB - 512 GB | 240 GB | 2.5.5 | 6.9 | 2.6.32-696.13.2.el6.x86_64 | IB | |
| NRAO - Charlottesville | cvpost0[02-65] | /.lustre/cv & /.lustre/naasc | 256 GB | 128 GB | 2.5.3 | 6.8 | 2.6.32-642.4.2.el6.x86_64 | 40 Gb InfiniBand | NAASC HPC Compute Nodes |
HPC Troubleshooting Requests
Rather than bogging down the HPC mailing list with megabytes of log files, please create a new entry in the table and attach your logs here.
At a minimum, troubleshooting logs should include:
- MDS log
- OSS log(s)
- Client log(s)
- Which Lustre filesystem are you having a problem with (for example, NRAO /.lustre/naasc)?
- Which versions of Lustre / Redhat are you running on the MDS, OSSs, and clients?
- InfiniBand or 10 Gb interconnects?
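A minimal collection script along the lines below can be run on the affected MDS, OSS, or client before attaching the output here. This is a sketch, not an official tool; the output paths are assumptions, and it only grabs what exists on a stock RHEL 6 host:

```shell
#!/bin/sh
# Gather basic logs for a Lustre troubleshooting report.
# Output lands in a per-host, timestamped directory under /tmp (assumed location).
out=/tmp/lustre-logs-$(hostname -s)-$(date +%Y%m%d-%H%M%S)
mkdir -p "$out"

# Kernel ring buffer often holds LustreError / LNet messages
dmesg > "$out/dmesg.txt" 2>&1

# Syslog, if present at the usual RHEL 6 location
[ -f /var/log/messages ] && cp /var/log/messages "$out/messages"

# Lustre internal debug log, if the lustre tools are installed
command -v lctl >/dev/null 2>&1 && lctl dk > "$out/lctl-dk.txt" 2>&1

# Record kernel and Lustre versions for the questions above
uname -r > "$out/kernel-version.txt"
command -v lctl >/dev/null 2>&1 && lctl get_param version > "$out/lustre-version.txt" 2>&1

echo "Logs collected in $out"
```

Tar the resulting directory and attach it to the table entry.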
Questions / Tasks List
- [X] 2016-10-27 - HPC mailing list - created. All NRAO and JAO and ARC personnel contacted. Management notification sent out too - Mike Hatz / José Parra
- [ ] 2016-10-31 - JAO access to NRAO. abarrien, bmalet, msimmonds are active. Verify group / NIS permissions are correct at NRAO - Mike Hatz / Pat Murphy
- [X] 2016-11-03 - Wiki area created and initially populated - Mike Hatz / Pat Murphy
- [ ] 2016-11-04 - NRAO access to JAO - in progress. cjallen, jotey, mhatz are active - Mike Hatz / Christian Saldias
- [ ] 2016-11-04 - What commands will need to be added to /etc/sudoers at JAO / NRAO for diagnostics?
  - lctl
  - ibping

-- MikeHatz - 2016-11-03