User Manual

Rev 2.3-1.0.1
Mellanox Technologies
Note that it can do this without changing the path SL value; once the 1D ring m-S-n-T-o-p-m has
been broken by failure, path segments using it cannot contribute to deadlock, and the x-direction
dateline (between, say, x=5 and x=0) can be ignored for path segments on that ring. One result of
this is that torus-2QoS can route around many simultaneous link failures, as long as no 1D ring is
broken into disjoint segments. For example, if links n-T and T-o have both failed, that ring has
been broken into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such issues,
reports if they are found, and refuses to route such fabrics.
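
The disjoint-segment condition above can be illustrated with a minimal sketch. It is not taken from the torus-2QoS implementation; the function names `ring_segments` and `is_routable` and the list-based ring representation are assumptions made only for this example.

```python
def ring_segments(ring, failed_links):
    """Return the segments a 1D ring falls into once failed links are removed.

    ring: switch names in ring order; the last switch links back to the first.
    failed_links: iterable of (a, b) pairs naming adjacent switches.
    (Hypothetical helper, not part of torus-2QoS.)
    """
    failed = {frozenset(link) for link in failed_links}
    n = len(ring)
    # Indices i for which the link ring[i] -> ring[i + 1] has failed.
    cuts = [i for i in range(n)
            if frozenset((ring[i], ring[(i + 1) % n])) in failed]
    if not cuts:
        return [list(ring)]  # ring is intact
    segments = []
    for k, cut in enumerate(cuts):
        start = (cut + 1) % n                 # first switch after this cut
        end = cuts[(k + 1) % len(cuts)]       # last switch before the next cut
        length = (end - start) % n + 1
        segments.append([ring[(start + j) % n] for j in range(length)])
    return segments


def is_routable(ring, failed_links):
    # A single failed link leaves one connected segment; two or more failed
    # links split the ring into disjoint segments.
    return len(ring_segments(ring, failed_links)) <= 1


ring = ["m", "S", "n", "T", "o", "p"]
print(ring_segments(ring, [("n", "T"), ("T", "o")]))
print(is_routable(ring, [("n", "T")]))
```

With links n-T and T-o both failed, the sketch yields the two segments named in the text, T and o-p-m-S-n, so the ring would be reported as unroutable.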
Note that where there are multiple parallel links between a pair of switches, torus-2QoS
allocates routes across those links in a round-robin fashion, based on the ports at the path
destination switch that are active and not used for inter-switch links. Should one of several
such parallel links fail, routes are redistributed across the remaining links. When the last
of such a set of parallel links fails, traffic is rerouted as described above.
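
The round-robin allocation and the redistribution after a link failure can be sketched as follows. This is an illustration only: the function `allocate_routes`, the port numbers, and the link names are hypothetical, and the exact ordering torus-2QoS uses is not specified here.

```python
def allocate_routes(dest_ports, parallel_links):
    """Assign each destination-switch port to a parallel link, round-robin.

    dest_ports: active ports at the destination switch that are not used
                for inter-switch links (hypothetical port numbers).
    parallel_links: the parallel links that are still up (hypothetical names).
    """
    if not parallel_links:
        # Last parallel link has failed: traffic must be rerouted instead.
        raise ValueError("no parallel links remain")
    return {port: parallel_links[i % len(parallel_links)]
            for i, port in enumerate(dest_ports)}


ports = [1, 2, 3, 4, 5, 6]
print(allocate_routes(ports, ["L0", "L1", "L2"]))  # all three links up
print(allocate_routes(ports, ["L0", "L2"]))        # after L1 fails
```

Recomputing the allocation over the surviving links is the simplest way to model redistribution; a real implementation might instead move only the routes that used the failed link.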
Handling a failed switch under DOR requires introducing into a path at least one turn that would
otherwise be "illegal", i.e., not allowed by DOR rules. Torus-2QoS will introduce such a turn as
close as possible to the failed switch in order to route around it. In the above example, suppose
switch T has failed, and consider the path from S to D. Torus-2QoS will produce the path
S-n-I-r-D, rather than the S-n-T-r-D path for a pristine torus, by introducing an early turn at n.
Normal DOR rules will cause traffic arriving at switch I to be forwarded to switch r; for traffic
arriving from I due to the "early" turn at n, this will generate an "illegal" turn at I.
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL bit 1 (which
would otherwise be unused) for y-x, z-x, and z-y turns, i.e., those turns that are illegal under
DOR. This causes the first hop after any such turn to use a separate set of VL values, and
prevents deadlock in the presence of a single failed switch. For any given path, only the hops
after a turn that is illegal under DOR can contribute to a credit loop that leads to deadlock.
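
The turn classification and the VL bit 1 rule described above can be sketched as follows. This is a minimal illustration, assuming DOR routes x, then y, then z; the names `is_illegal_turn` and `vl_after_turn` and the `base_vl` parameter are hypothetical and do not come from the SL2VL tables that torus-2QoS actually programs.

```python
# Assumed DOR order: x is routed first, then y, then z.
DOR_ORDER = {"x": 0, "y": 1, "z": 2}

VL_BIT_1 = 0b10  # the VL bit reserved for hops that follow an illegal turn


def is_illegal_turn(in_dim, out_dim):
    """A turn is illegal under DOR if it goes back to an earlier dimension."""
    return DOR_ORDER[out_dim] < DOR_ORDER[in_dim]


def vl_after_turn(base_vl, in_dim, out_dim):
    """VL for the first hop after a turn (hypothetical helper).

    base_vl: the VL the hop would otherwise use, with bit 1 clear.
    """
    if is_illegal_turn(in_dim, out_dim):
        return base_vl | VL_BIT_1
    return base_vl


for turn in [("x", "y"), ("y", "x"), ("z", "x"), ("z", "y")]:
    print(turn, is_illegal_turn(*turn), vl_after_turn(0, *turn))
```

Only the y-x, z-x, and z-y turns set bit 1, so the hop that follows them uses a VL disjoint from the one used by hops that obey DOR.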
So in the example above with failed switch T, the location of the illegal turn at I in the path
from S to D requires that any credit loop caused by that turn must encircle the failed switch at T.
Thus the second and later hops after the illegal turn at I (i.e., hop r-D) cannot contribute to a
credit loop because they cannot be used to construct a loop encircling T. The hop I-r uses a
separate VL, so it cannot contribute to a credit loop encircling T. Extending this argument shows
that in addition to being capable of routing around a single switch failure without introducing
deadlock, torus-2QoS can also route around multiple failed switches on the condition that they are
adjacent in the last dimension routed by DOR. For example, consider the following case on a 6x6
2D torus:
Suppose switches T and R have failed, and consider the path from S to D. Torus-2QoS will
generate the path S-n-q-I-u-D, with an illegal turn at switch I, and with hop I-u using a VL with bit 1