User Manual

OpenSM – Subnet ManagerRev 2.1-1.0.6
Mellanox Technologies
160
Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8
SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free rout-
ing on a 3D torus. Torus-2QoS routes around link failure by "taking the long way around" any 1D
ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches
are denoted by [+a-zA-Z]:
For a pristine fabric the path from S to D would be S-n-T-r-D. In the event that either link S-n or
n-T has failed, torus-2QoS would use the path S-m-p-o-T-r-D.
Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has
been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction
dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of
this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is
broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has
been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues,
reports if they are found, and refuses to route such fabrics.
Note that in the case where there are multiple parallel links between a pair of switches, torus-
2QoS will allocate routes across such links in a round-robin fashion, based on ports at the path
destination switch that are active and not used for inter-switch links. Should a link that is one of
severalsuch parallel links fail, routes are redistributed across the remaining links. When the last
of such a set of parallel links fails, traffic is rerouted as described above.
Handling a failed switch under DOR requires introducing into a path at least one turn that would
be otherwise "illegal", i.e. not allowed by DOR rules. Torus-2QoS will introduce such a turn as
close as possible to the failed switch in order to route around it. n the above example, suppose
switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path S-n-I-r-
D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n. Normal
DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic arriving
from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which
would be otherwise unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under
DOR. This causes the first hop after any such turn to use a separate set of VL values, and pre-
vents deadlock in the presence of a single failed switch. For any given path, only the hops after a
turn that is illegal under DOR can contribute to a credit loop that leads to deadlock. So in the
example above with failed switch T, the location of the illegal turn at I in the path from S to D
requires that any credit loop caused by that turn must encircle the failed switch at T. Thus the
second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop