File distribution across OSTs (Applies to Lustre 2.5.3)
1 There's a bug...
1.1 About the bug
- Wolfgang reports a bug in the round-robin allocation in 2.5.3: on a filesystem with unbalanced OSTs, Lustre always picks the same (incorrect) set of OSTs.
- You can use the one-liner he wrote to see this in action on your system.
- MAKE SURE you don't have any files starting with "t." in the directory where you run it!
# You can test whether you suffer from an OST allocation issue with this one-liner:
touch t.{1..2000}; lfs getstripe t.*|fgrep -A1 obdidx|fgrep -v obdidx|fgrep -v -- --|awk '{ print $1 }'|sort|uniq -c; rm -f t.*
# Slight variation: just COUNTS the number of OSTs being round-robined
touch t.{1..2000}; lfs getstripe t.*|fgrep -A1 obdidx|fgrep -v obdidx|fgrep -v -- --|awk '{ print $1 }'|sort|uniq -c | wc -l; rm -f t.*
# Shows how many OSTs are being hit out of the total on a given filesystem. Replace FSNAME with the name of your Lustre filesystem.
touch t.{1..2000}; lfs getstripe t.*|fgrep -A1 obdidx|fgrep -v obdidx|fgrep -v -- --|awk '{ print $1 }'|sort|uniq -c | wc -l; echo of; lfs df -h | grep -v MDT: | grep ^FSNAME | wc -l; rm -f t.*
# # # OMG, I'm seeing double! # # #
# If you are doing this on a node where the automounter will print everything twice, use this
# as before, replace FSNAME with name of your lustre filesystem
touch t.{1..2000}; lfs getstripe t.*|fgrep -A1 obdidx|fgrep -v obdidx|fgrep -v -- --|awk '{ print $1 }'|sort|uniq -c | wc -l; echo of; lfs df -h | grep -v MDT: | grep /.lustre | grep ^FSNAME | wc -l; rm -f t.*
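The one-liners above create and delete t.* in the current directory, which is why existing "t." files are at risk. A minimal sketch of a safer variant follows. It assumes bash and GNU mktemp, and the names count_osts, ost_spread and the "ostcheck" directory prefix are made up for illustration. It uses the same filter chain, with grep -F standing in for fgrep.

```shell
#!/bin/bash
# Read `lfs getstripe` output on stdin and print "count ost_index" pairs.
# Same filters as the one-liners: keep the line after each obdidx header,
# drop the header itself and grep's "--" group separators.
count_osts() {
    grep -F -A1 obdidx | grep -F -v obdidx | grep -F -v -- -- \
        | awk '{ print $1 }' | sort -n | uniq -c
}

# Create the 2000 test files in a throwaway subdirectory of the current
# directory (hypothetical "ostcheck" prefix), so that pre-existing t.* files
# are never touched and cleanup is a single rm -rf of that subdirectory.
ost_spread() {
    local dir
    dir=$(mktemp -d ./ostcheck.XXXXXX) || return 1
    ( cd "$dir" && touch t.{1..2000} && lfs getstripe t.* | count_osts )
    rm -rf "$dir"
}
```

Run ost_spread from a directory on the Lustre filesystem you want to check. Piping its output through wc -l gives the count of distinct OSTs, as in the second one-liner.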
1.2 Workaround
- Set qos_threshold_rr to 100 (this forces Lustre to round-robin because it believes the OSTs are balanced)
- See section 2.1 for how to get and set it
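As a rough sketch, the workaround can be applied to every lod device on an MDS in one go, since lctl accepts wildcards in parameter names. The function name force_round_robin is made up for illustration. Note that a plain set_param is not persistent, so the value would need to be reapplied after a remount or reboot.

```shell
#!/bin/bash
# Hypothetical helper: set qos_threshold_rr to 100 on all lod devices.
# Run as root on the MDS. The pattern is quoted so that lctl, not the
# shell, expands the wildcard.
force_round_robin() {
    lctl set_param 'lod.*.qos_threshold_rr=100'
}
```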
2 Two settings that control round robin in Lustre
2.1 qos_threshold_rr
- This setting controls the threshold at which Lustre should consider the OSTs balanced (at which point, it will round robin)
Assume $max is the maximum amount of free space on any OST in the file system and $min is the minimum amount of free space on any OST.
If ($max - $min) <= (qos_threshold_rr/100)*($max), then the OSTs are considered balanced.
In other words, the free space on every OST is within some window of the emptiest OST's free space (17% by default).
If qos_threshold_rr=100, then the previous equation is always satisfied and Lustre thinks the OSTs are always balanced.
- The higher this number, the more imbalance Lustre will tolerate before it stops round-robining.
- When set to 100%, the equation for balance is always satisfied, so it will simply round-robin.
- Get and set as follows:
[root@asimov proc]# lctl get_param lod.naaschpc-MDT0000-mdtlov.qos_threshold_rr
lod.naaschpc-MDT0000-mdtlov.qos_threshold_rr=95%
[root@asimov proc]# lctl set_param lod.naaschpc-MDT0000-mdtlov.qos_threshold_rr=0
lod.naaschpc-MDT0000-mdtlov.qos_threshold_rr=0
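To make the balance test in this section concrete, here is a small sketch that evaluates the inequality exactly as written above. The function name osts_balanced and the sample numbers are made up for illustration, and this mirrors the formula in these notes rather than Lustre's actual code.

```shell
#!/bin/bash
# osts_balanced MAX_FREE MIN_FREE THRESHOLD_PERCENT
# Exit status 0 if (max - min) <= (threshold/100) * max, i.e. "balanced"
# according to the formula in these notes; exit status 1 otherwise.
# MAX_FREE and MIN_FREE may be in any unit as long as both use the same one.
osts_balanced() {
    awk -v max="$1" -v min="$2" -v thr="$3" \
        'BEGIN { exit !((max - min) <= (thr / 100) * max) }'
}

# Hypothetical numbers: free space of 1000 and 900 with the default 17%.
if osts_balanced 1000 900 17; then
    echo "balanced: round-robin"
else
    echo "unbalanced: weighted allocation"
fi
```

With a threshold of 100 the right-hand side equals $max, so the test passes for any $min >= 0, which is why the workaround forces round-robin.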
2.2 qos_prio_free
- This setting controls how much Lustre prioritizes free space (versus location) in allocation.
- The higher this number, the more Lustre takes empty space on an OST into consideration for its allocation.
- When set to 100%, Lustre uses ONLY empty space as the deciding factor for writes.
- Remember, this setting is only taken into consideration when Lustre believes the OSTs to be imbalanced.
- If you have set qos_threshold_rr to 100, this setting will have no effect.
- Get and set as follows:
[root@asimov proc]# lctl get_param lod.naaschpc-MDT0000-mdtlov.qos_prio_free
lod.naaschpc-MDT0000-mdtlov.qos_prio_free=90%
[root@asimov proc]# lctl set_param lod.naaschpc-MDT0000-mdtlov.qos_prio_free=100
lod.naaschpc-MDT0000-mdtlov.qos_prio_free=100
- Note that the parameter name used to address these settings can vary across file systems (here, for example, there is no MDT0000 component):
[root@sauron ~]# lctl get_param lod.cvlustre-mdtlov.qos_prio_free
lod.cvlustre-mdtlov.qos_prio_free=91%
[root@sauron ~]# lctl get_param lod.cvlustre-mdtlov.qos_threshold_rr
lod.cvlustre-mdtlov.qos_threshold_rr=17%
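Because the device name differs between file systems, one way to avoid guessing it is to let lctl expand a wildcard. A minimal sketch follows; the function name show_qos_settings is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print both allocator settings for every lod device
# on this MDS, whatever the device is called. The patterns are quoted so
# that lctl, not the shell, expands the wildcards.
show_qos_settings() {
    lctl get_param 'lod.*.qos_threshold_rr' 'lod.*.qos_prio_free'
}
```

The output lines carry the full parameter names, which can then be copied into an exact set_param command.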