Specifications
© IBM Copyright, 2012 Version: January 26, 2012
www.ibm.com/support/techdocs
Summary of Best Practices for Storage Area Networks
per week. Each update item will have a very brief description as well as a link to
browse directly to the information.
9 Monitoring
Host path monitoring is one of the most often overlooked aspects of SAN monitoring. A
very common cause of server outages is missing paths, which turn what should be
non-disruptive maintenance (or failures) into a real problem. For example, suppose a
server loses connectivity over one of its HBAs. The multi-pathing utility re-routes
traffic over a different path (typically across the redundant fabric), and no symptoms
are noticed on the host. If the original path meets certain criteria, it is flagged as
dead and remains unusable until an administrator intervenes manually. Some time later
(it could be hours, days, or months), the fail-over path for this host fails for some
reason, such as a damaged fiber cable or routine maintenance. The host now experiences
a full outage that routine path monitoring could have prevented. Because paths may fail
due to internal software errors, not all path loss will be discovered by SAN switch
monitoring or by dedicated management applications. Host-based path monitoring is
therefore a useful backup to monitoring of the SAN itself.
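The failure sequence described above can be caught early by a periodic host-side check.
The following is a minimal sketch, not a vendor tool: the PathStatus record, the state
names "active" and "dead", and the function find_path_alerts are all hypothetical, and
it assumes the path states have already been collected from whatever multi-pathing
utility the host uses.

```python
from dataclasses import dataclass


@dataclass
class PathStatus:
    """One host-to-storage path (hypothetical record layout)."""
    host: str
    path_id: str
    fabric: str  # assumed fabric label, e.g. "A" or "B"
    state: str   # assumed state names: "active" or "dead"


def find_path_alerts(paths, expected_paths_per_host):
    """Return {host: [warning, ...]} for hosts whose paths need attention."""
    by_host = {}
    for p in paths:
        by_host.setdefault(p.host, []).append(p)

    alerts = {}
    for host, expected in expected_paths_per_host.items():
        host_paths = by_host.get(host, [])
        active = [p for p in host_paths if p.state == "active"]
        messages = []

        # Report every path that is not carrying traffic.
        for p in host_paths:
            if p.state != "active":
                messages.append(
                    f"path {p.path_id} on fabric {p.fabric} is {p.state}")

        # Fewer active paths than the host was configured with.
        if len(active) < expected:
            messages.append(
                f"{len(active)} of {expected} expected paths active")

        # All remaining traffic rides on a single fabric: the next
        # failure or maintenance window on that fabric is an outage.
        if len({p.fabric for p in active}) < 2:
            messages.append("active paths use fewer than two fabrics")

        if messages:
            alerts[host] = messages
    return alerts
```

Run on a schedule, a check of this kind reports the first lost path while the host is
still running on its fail-over path, which is the window in which the problem can be
repaired without an outage.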
Server logs can be an important bellwether of storage problems, since after all it is the
server that ultimately must process the data requests. There are many different kinds of
storage problems that will show up only in server logs, and never in disk or fabric logs.
For instance, a bad transmitter in a switch port will raise no alarm in the switch itself, as
the switch is not capable of monitoring the integrity of data that it transmits; only the
receiving port can perform that function.
Ideally, every error in a host error log needs to be understood, if not eliminated. Even
“temp” disk errors can be important advance indicators of future problems. If an
innocuous system error is repeated many times, measures should be taken to prevent
it from recurring.
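One way to find repeated errors is to tally the host error log by error identifier. The
sketch below assumes a simplified, hypothetical log format in which each line begins
with a severity and an error identifier separated by whitespace; real host error logs
differ by operating system, so the parsing step would need to be adapted. The function
name summarize_errors and the default threshold of 10 are illustrative assumptions.

```python
from collections import Counter


def summarize_errors(log_lines, repeat_threshold=10):
    """Tally errors by (identifier, severity), most frequent first.

    Assumes each line looks like "SEVERITY ERROR_ID free text..."
    (a hypothetical format). Returns tuples of
    (error_id, severity, count, is_repeated).
    """
    counts = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        severity, error_id = fields[0], fields[1]
        counts[(error_id, severity)] += 1

    # Temporary errors are kept in the summary on purpose: they can be
    # advance indicators of future problems.
    return [
        (error_id, severity, n, n >= repeat_threshold)
        for (error_id, severity), n in counts.most_common()
    ]
```

Every entry in the summary still deserves an explanation; the is_repeated flag only
helps decide which errors to investigate first.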
When monitoring switches, the ports of particular interest are those connecting disk
storage systems and SVC. Setting appropriate thresholds requires some degree of
performance monitoring over a period of time, so that the thresholds are meaningful
and generate true alerts rather than numerous
false alarms. As painful as it is to say “it depends”, there really are not any appropriate