6 High Availability
In general, the size of the SAN, measured in the number of physical switches, determines the design that is most effective to employ. A single switch operates stand-alone, with no connections to other switches. Two switches are cascaded, that is, simply connected to each other with enough ISLs to meet oversubscription ratios. As the switch count grows, mesh or partial-mesh designs can be considered up to about six or seven switches. Beyond that point, a core-edge design should be considered.
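The guidance above can be summarized as a simple rule of thumb. The sketch below, in Python, codifies the thresholds described in this section for illustration only; the exact cut-over points are judgment calls and should be adjusted to the environment and to vendor guidance.

    def recommend_topology(switch_count: int) -> str:
        """Rule-of-thumb fabric design by physical switch count (illustrative only)."""
        if switch_count <= 1:
            return "stand-alone switch (no ISLs)"
        if switch_count == 2:
            return "cascade: connect the pair with enough ISLs to meet the oversubscription ratio"
        if switch_count <= 7:
            return "mesh or partial mesh"
        return "core-edge: one or two core switches/directors cross-connected to all edge switches"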
In a core-edge design, the core can consist of either one or two core switches or directors with cross connections to all of the edge switches. The core switch acts as the focal point for the SAN, with connections to the high-throughput devices (typically storage and high-end servers), while the edge switches provide connectivity for the rest of the initiators. Traffic that must traverse an ISL (inter-switch link) should adhere to oversubscription ratios.
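As a rough illustration, the oversubscription ratio across the ISLs of an edge switch can be estimated from the aggregate host-facing bandwidth divided by the aggregate ISL bandwidth. The port counts, speeds, and target ratio below are hypothetical and only demonstrate the arithmetic.

    # Hypothetical example: estimate ISL oversubscription for one edge switch.
    host_ports = 32          # host-facing ports on the edge switch
    host_port_gbps = 8       # negotiated speed per host port (Gbps)
    isl_count = 4            # ISLs from this edge switch to the core
    isl_gbps = 16            # speed per ISL (Gbps)

    oversubscription = (host_ports * host_port_gbps) / (isl_count * isl_gbps)
    print(f"Oversubscription ratio: {oversubscription:.1f}:1")   # 4.0:1 in this example

    # If the result exceeds the design target (for example 7:1), add ISLs or
    # move high-throughput devices onto the core.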
An often overlooked item in the layout of port connections within a fabric is the grouping, or clumping, of connections to systems with a high port count. If a director with multiple line cards is used, do not place all of the connections to a storage system on the same line module; if a given line module fails, accessibility to the storage system in the fabric is then only reduced, not completely removed. The same concept should be applied to the deployment of ISLs between switches, as well as to servers with four or more HBA ports capable of high traffic levels. Modern switches will automatically attempt some degree of traffic balancing across multiple ISLs or trunks (or port-channels). From a resiliency point of view, having two or more smaller trunk groups spread across multiple line modules is better than having a single large trunk on just one line module.
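As a minimal sketch of this idea, the helper below round-robins a set of connections (storage ports, ISLs, or HBA ports) across the available line modules so that no single module carries them all. The names and data structures are purely illustrative; actual placement is performed in the switch or director management tooling.

    from collections import defaultdict

    def spread_across_line_modules(connections, line_modules):
        """Round-robin connections across line modules (illustrative only)."""
        placement = defaultdict(list)
        for i, conn in enumerate(connections):
            module = line_modules[i % len(line_modules)]
            placement[module].append(conn)
        return dict(placement)

    # Hypothetical example: four storage ports spread over two line modules.
    print(spread_across_line_modules(
        ["storage_port_1", "storage_port_2", "storage_port_3", "storage_port_4"],
        ["line_module_1", "line_module_2"],
    ))
    # {'line_module_1': ['storage_port_1', 'storage_port_3'],
    #  'line_module_2': ['storage_port_2', 'storage_port_4']}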
Peak loads, and not just average loads, should always be taken into consideration. For instance, while a database server may use only 20 MBps during regular production workloads, it may perform a backup at significantly higher data rates. Congestion at one switch in a large fabric can cause performance issues throughout the entire fabric, including for traffic between hosts and their associated storage resources, even if they are not directly attached to the congested switch.
The reasons for this behavior are inherent to the Fibre Channel flow control mechanisms, which are simply not designed to handle fabric congestion. This means that any estimates for required bandwidth prior to implementation should have a safety factor built in. On top