File distribution across OSTs (Applies to Lustre 2.5.3)

1 There's a bug...

1.1 About the bug

  • Wolfgang reports a bug in the round-robin allocator in Lustre 2.5.3: on a filesystem with unbalanced OSTs, Lustre always picks the same (incorrect) set of OSTs.
  • You can use the one-liner he wrote to see this in action on your system.
  • MAKE SURE you don't have any files starting with "t." in the directory where you run it! Each one-liner ends with rm -f t.* and will delete them.
# You can test whether you suffer from an OST allocation issue with this one-liner:
touch t.{1..2000}; lfs getstripe t.*|fgrep -A1 obdidx|fgrep -v obdidx|fgrep -v -- --|awk '{ print $1 }'|sort|uniq -c; rm -f t.*

# Slight variation: just COUNTS the number of OSTs being round-robined
touch t.{1..2000}; lfs getstripe t.*|fgrep -A1 obdidx|fgrep -v obdidx|fgrep -v -- --|awk '{ print $1 }'|sort|uniq -c | wc -l; rm -f t.*

# Shows how many OSTs are being hit out of the total on a given filesystem; replace FSNAME with the name of your Lustre filesystem
touch t.{1..2000}; lfs getstripe t.*|fgrep -A1 obdidx|fgrep -v obdidx|fgrep -v -- --|awk '{ print $1 }'|sort|uniq -c | wc -l; echo of; lfs df -h | grep -v MDT: | grep ^FSNAME | wc -l; rm -f t.*

# # # OMG, I'm seeing double! # # #
# If you are doing this on a node where the automounter makes lfs df list everything twice, use this instead.
# As before, replace FSNAME with the name of your Lustre filesystem.
touch t.{1..2000}; lfs getstripe t.*|fgrep -A1 obdidx|fgrep -v obdidx|fgrep -v -- --|awk '{ print $1 }'|sort|uniq -c | wc -l; echo of; lfs df -h | grep -v MDT: | grep /.lustre | grep ^FSNAME | wc -l; rm -f t.*
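
The one-liners above create and then delete t.* in the current directory. A more defensive variant can be sketched as follows: the parsing moves into a small function, and the commented usage lines run the test inside a scratch directory so no existing files are touched. The function name ost_histogram and the scratch directory name ostcheck.XXXXXX are made up for illustration. The sketch assumes the same lfs getstripe layout the one-liners rely on, where the OST index is the first field on the line after the obdidx header.

```shell
# Hypothetical helper: reads `lfs getstripe` output on stdin and prints
# one line per OST index in the form "<number of files> <OST index>".
ost_histogram() {
    # The line following each "obdidx" header carries the OST index in field 1.
    awk '/obdidx/ { if ((getline) > 0) print $1 }' | sort -n | uniq -c
}

# Usage sketch (not executed here; needs a Lustre client with lfs, and bash
# for the {1..2000} brace expansion). Run it from the Lustre directory you
# want to test, so the scratch directory is created on that filesystem:
#   d=$(mktemp -d ./ostcheck.XXXXXX) &&
#     ( cd "$d" && touch t.{1..2000} && lfs getstripe t.* | ost_histogram )
#   [ -n "$d" ] && rm -rf "$d"
```

Appending | wc -l to the ost_histogram call gives the count of distinct OSTs, as in the second one-liner.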

1.2 Workaround

  • Set qos_threshold_rr to 100 (this forces Lustre to round-robin because it believes the OSTs are balanced)
  • See how below

2 Two settings that control round robin in Lustre

2.1 qos_threshold_rr

  • This setting controls the threshold at which Lustre should consider the OSTs balanced (at which point, it will round robin)
Assume $max is the maximum amount of free space on any OST in the file system and $min is the minimum amount of free space on any OST.  
If ($max - $min) <= (qos_threshold_rr/100)*($max), then the OSTs are considered balanced.  
Basically, this means that all the OST usages are within some small window of each other (which by default is 17%).  
If qos_threshold_rr=100, then the previous equation is always satisfied and Lustre thinks the OSTs are always balanced.
  • The higher this number, the more imbalance Lustre will tolerate before it stops round-robining.
  • When set to 100%, the equation for balance is always satisfied, so it will simply round-robin.
  • Get and set as follows (the session below sets the value to 0; for the workaround in 1.2, set it to 100 instead):
[root@asimov proc]# lctl get_param lod.naaschpc-MDT0000-mdtlov.qos_threshold_rr
lod.naaschpc-MDT0000-mdtlov.qos_threshold_rr=95%

[root@asimov proc]# lctl set_param lod.naaschpc-MDT0000-mdtlov.qos_threshold_rr=0
lod.naaschpc-MDT0000-mdtlov.qos_threshold_rr=0
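
The balance test described in this section can be sketched as a small shell function, which may help when deciding what threshold a given filesystem needs. This only illustrates the inequality as written above and is not Lustre's actual implementation. The function name osts_balanced is hypothetical, and the free-space figures in the comments are invented for the example.

```shell
# Hypothetical helper illustrating the balance inequality from this section.
# Arguments: MAX_FREE MIN_FREE THRESHOLD_PERCENT (free space in any one unit).
# Returns 0 (success) if (max - min) <= (threshold/100) * max, else 1.
osts_balanced() {
    awk -v max="$1" -v min="$2" -v thr="$3" \
        'BEGIN { exit !((max - min) <= (thr / 100) * max) }'
}

# Example with invented numbers, in GB of free space:
#   osts_balanced 10000 9000 17   -> balanced     (gap 1000 <= 1700)
#   osts_balanced 10000 8000 17   -> not balanced (gap 2000 >  1700)
#   osts_balanced 10000 0 100     -> balanced     (threshold 100 always passes)
```

For instance, osts_balanced 10000 8000 17 && echo round-robin || echo qos prints "qos", because a 2000 GB gap exceeds 17% of 10000 GB.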

2.2 qos_prio_free

  • This setting controls how much Lustre prioritizes free space (versus location) in allocation.
  • The higher this number, the more Lustre takes empty space on an OST into consideration for its allocation.
  • When set to 100%, Lustre uses ONLY empty space as the deciding factor for writes.
    • Remember, this setting is only taken into consideration when Lustre believes the OSTs to be imbalanced.
    • If you have set qos_threshold_rr to 100, this setting will have no effect.
  • Get and set as follows:
[root@asimov proc]# lctl get_param lod.naaschpc-MDT0000-mdtlov.qos_prio_free
lod.naaschpc-MDT0000-mdtlov.qos_prio_free=90%

[root@asimov proc]# lctl set_param lod.naaschpc-MDT0000-mdtlov.qos_prio_free=100
lod.naaschpc-MDT0000-mdtlov.qos_prio_free=100

3 Your performance may vary

  • Note that the parameter name used to address the /proc files can vary across file systems. Here the device is lod.cvlustre-mdtlov rather than lod.naaschpc-MDT0000-mdtlov:
[root@sauron ~]# lctl get_param lod.cvlustre-mdtlov.qos_prio_free
lod.cvlustre-mdtlov.qos_prio_free=91%

[root@sauron ~]# lctl get_param lod.cvlustre-mdtlov.qos_threshold_rr
lod.cvlustre-mdtlov.qos_threshold_rr=17%
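
Because the device part of the name differs between file systems, one way to avoid guessing it is to let lctl match it with a wildcard. The sketch below assumes that both settings sit under lod.<device>. as in the sessions above, and that it is run on a node where lctl can see those lod devices. The function name show_rr_settings is hypothetical.

```shell
# Hypothetical helper: print both round-robin settings for every lod device.
show_rr_settings() {
    if ! command -v lctl >/dev/null 2>&1; then
        echo "lctl not found on this node" >&2
        return 1
    fi
    # The patterns are quoted so the shell leaves them alone and lctl
    # expands the wildcard against its own parameter tree.
    lctl get_param 'lod.*.qos_threshold_rr' 'lod.*.qos_prio_free'
}
```

The same wildcard form should work with lctl set_param, but it then changes every matching device at once, so check the get_param output first.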
Topic revision: r6 - 2017-03-13, JessicaOtey