6 High Availability
In general, the size of the SAN, measured in the number of physical switches, determines the design that is most effective to employ. A single switch operates stand-alone, with no connections to other switches. Two switches are cascaded, that is, simply connected to each other with enough ISLs to meet oversubscription ratios. As the switch count grows, mesh or partial-mesh designs can be considered up to about six or seven switches. Beyond that point, a core-edge design should be considered.
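The guidance above can be summarized as a simple rule of thumb. The sketch below, in Python, codifies the thresholds described in this section for illustration only; the exact cut-over points are judgment calls and should be adjusted to the environment and to vendor guidance.

    def recommend_topology(switch_count: int) -> str:
        """Rule-of-thumb fabric design by physical switch count (illustrative only)."""
        if switch_count <= 1:
            return "stand-alone switch (no ISLs)"
        if switch_count == 2:
            return "cascade: connect the pair with enough ISLs to meet the oversubscription ratio"
        if switch_count <= 7:
            return "mesh or partial mesh"
        return "core-edge: one or two core switches/directors cross-connected to all edge switches"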
In a core-edge design, the core can consist of either one or two core switches or directors with cross connections to all of the edge switches. The core switch acts as the focal point for the SAN, with connections to the high-throughput devices (typically storage and high-end servers), while the edge switches provide connectivity for the rest of the initiators. Traffic that must traverse an ISL (inter-switch link) should adhere to oversubscription ratios.
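As a rough illustration, the oversubscription ratio across the ISLs of an edge switch can be estimated from the aggregate host-facing bandwidth divided by the aggregate ISL bandwidth. The port counts, speeds, and target ratio below are hypothetical and only demonstrate the arithmetic.

    # Hypothetical example: estimate ISL oversubscription for one edge switch.
    host_ports = 32          # host-facing ports on the edge switch
    host_port_gbps = 8       # negotiated speed per host port (Gbps)
    isl_count = 4            # ISLs from this edge switch to the core
    isl_gbps = 16            # speed per ISL (Gbps)

    oversubscription = (host_ports * host_port_gbps) / (isl_count * isl_gbps)
    print(f"Oversubscription ratio: {oversubscription:.1f}:1")   # 4.0:1 in this example

    # If the result exceeds the design target (for example 7:1), add ISLs or
    # move high-throughput devices onto the core.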
An often overlooked item in the layout of port connections within a fabric is the grouping, or clumping, of connections to systems with a high port count. If a director with multiple line cards is used, do not place all of the connections to a storage system on the same line module; if a given line module fails, accessibility to the storage system in the fabric is then only reduced, not completely removed. The same concept should be applied to the deployment of ISLs between switches, as well as to servers with four or more HBA ports capable of high traffic levels. Modern switches will automatically attempt some degree of traffic balancing across multiple ISLs or trunks (or port-channels). From a resiliency point of view, having two or more smaller trunk groups spread across multiple line modules is better than having a single large trunk on just one line module.
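As a minimal sketch of this idea, the helper below round-robins a set of connections (storage ports, ISLs, or HBA ports) across the available line modules so that no single module carries them all. The names and data structures are purely illustrative; actual placement is performed in the switch or director management tooling.

    from collections import defaultdict

    def spread_across_line_modules(connections, line_modules):
        """Round-robin connections across line modules (illustrative only)."""
        placement = defaultdict(list)
        for i, conn in enumerate(connections):
            module = line_modules[i % len(line_modules)]
            placement[module].append(conn)
        return dict(placement)

    # Hypothetical example: four storage ports spread over two line modules.
    print(spread_across_line_modules(
        ["storage_port_1", "storage_port_2", "storage_port_3", "storage_port_4"],
        ["line_module_1", "line_module_2"],
    ))
    # {'line_module_1': ['storage_port_1', 'storage_port_3'],
    #  'line_module_2': ['storage_port_2', 'storage_port_4']}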
Peak loads, and not just average loads, should always be taken into consideration. For instance, while a database server may use only 20 MBps during regular production workloads, it may perform a backup at significantly higher data rates. Congestion at one switch in a large fabric can cause performance issues throughout the entire fabric, including for traffic between hosts and their associated storage resources, even if they are not directly attached to the congested switch.
The reasons for this behavior are inherent to the Fibre Channel flow control mechanisms, which are simply not designed to handle fabric congestion. This means that any estimates for required bandwidth prior to implementation should have a safety factor built in. On top