Specifications


per week. Each update item will have a very brief description as well as a link to

browse directly to the information.

9 Monitoring

Host path monitoring is one of the most often overlooked aspects of monitoring a

SAN. A very common reason for server outages in general is missing paths that

turn what should be non-disruptive maintenance (or failures) into a real problem. For

example: For whatever reason, a server loses connectivity over one of its HBAs. The

multi-pathing utility is designed to re-route traffic over to a different path (typically across

the redundant fabric), and no symptoms are noticed on the host. If the original path

meets certain criteria, the path will be flagged as dead and not usable until manual

intervention by an administrator. Some time later (it could be hours, days, or months),

the fail-over path for this host fails for some reason, such as a damaged fiber cable or

routine maintenance. The host now experiences a full outage, which could have been

prevented with routine path monitoring. Since paths may fail due to internal software

errors, not all path loss will be discovered via SAN switch monitoring, or by dedicated

management applications. Host-based path monitoring is therefore a useful backup to

monitoring of the SAN itself.
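
The failure sequence described above can be caught by a periodic host-side check that compares the paths each disk currently has against the number it is supposed to have. The following is only a minimal sketch: the PathRecord layout, the state names, and the function find_degraded_devices are hypothetical, and it assumes that a site-specific collector has already parsed the output of whatever multi-pathing utility the host uses.

```python
from dataclasses import dataclass


# Hypothetical record layout: one entry per path, as a site-specific
# collector might produce after parsing the multi-pathing utility's output.
@dataclass(frozen=True)
class PathRecord:
    device: str  # logical disk name (illustrative)
    hba: str     # host adapter that the path runs through
    state: str   # assumed vocabulary: "active", "dead", "offline"


def find_degraded_devices(paths, expected_paths, min_hbas=2):
    """Return {device: reason} for every device that has lost redundancy."""
    by_device = {}
    for record in paths:
        by_device.setdefault(record.device, []).append(record)

    degraded = {}
    for device, records in sorted(by_device.items()):
        healthy = [r for r in records if r.state == "active"]
        healthy_hbas = {r.hba for r in healthy}
        if len(healthy) < expected_paths:
            # Covers both dead paths and paths that have vanished entirely.
            degraded[device] = f"{len(healthy)} of {expected_paths} paths active"
        elif len(healthy_hbas) < min_hbas:
            # All paths are up, but they share an adapter (no HBA redundancy).
            degraded[device] = f"active paths use only {len(healthy_hbas)} HBA(s)"
    return degraded


if __name__ == "__main__":
    sample = [
        PathRecord("disk0", "hba0", "active"),
        PathRecord("disk0", "hba1", "dead"),
        PathRecord("disk1", "hba0", "active"),
        PathRecord("disk1", "hba1", "active"),
    ]
    # disk0 is still serving I/O, but one more failure would be an outage.
    print(find_degraded_devices(sample, expected_paths=2))
```

Run on a schedule, a check of this kind reports the first lost path while the host is still healthy, which is the window in which the outage described above can be prevented.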

Server logs can be an important bellwether of storage problems, since after all it is the

server that ultimately must process the data requests. There are many different kinds of

storage problems that will show up only in server logs, and never in disk or fabric logs.

For instance, a bad transmitter in a switch port will raise no alarm in the switch itself, as

the switch is not capable of monitoring the integrity of data that it transmits; only the

receiving port can perform that function.

Ideally, every error in a host error log needs to be understood, if not eliminated. Even

“temp” disk errors can be important advance indicators of future problems. If an

innocuous system error is repeated many times, measures should be taken to prevent

the error from recurring.
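
One way to notice errors that repeat is to count identical messages in the host error log and report those that recur. The sketch below assumes a hypothetical one-line-per-entry format of "timestamp severity message" with severities TEMP and PERM; real host error logs differ by operating system, so the pattern LINE_RE and the function recurring_errors are illustrative only.

```python
import re
from collections import Counter

# Assumed (hypothetical) line format: "<timestamp> <severity> <message>".
# Adjust the pattern to match the error log format of the actual host.
LINE_RE = re.compile(r"^(?P<ts>\S+)\s+(?P<sev>TEMP|PERM)\s+(?P<msg>.+)$")


def recurring_errors(lines, min_count=3):
    """Return [(severity, message, count)] for messages seen min_count+ times."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an error entry in the assumed format
        counts[(match.group("sev"), match.group("msg"))] += 1
    # most_common() lists the most frequently repeated messages first.
    return [(sev, msg, n) for (sev, msg), n in counts.most_common() if n >= min_count]


if __name__ == "__main__":
    log = [
        "2024-01-01T00:00:01 TEMP disk operation error disk3",
        "2024-01-01T00:05:01 TEMP disk operation error disk3",
        "2024-01-01T00:09:30 PERM adapter error hba0",
        "2024-01-01T00:10:01 TEMP disk operation error disk3",
    ]
    for severity, message, count in recurring_errors(log):
        print(f"{count}x {severity}: {message}")
```

The default of three repetitions is an arbitrary starting point, not a recommendation; temporary errors are deliberately counted together with permanent ones, since they can be advance indicators of future problems.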

When monitoring switches, the ports of particular interest are those connected to

disk storage systems and the SVC. Setting appropriate thresholds requires some

degree of performance monitoring over a period of time, so that the resulting

thresholds generate true alerts and not numerous

false alarms. As painful as it is to say “it depends”, there really are not any appropriate