HP-UX 11i v3 Native Multi-Pathing for Mass Storage (August 2012)

messages in the system log, then the SCSI subsystem resorts to path failover and recovery mechanisms

to provide applications with continuous access to the LUN end devices; only when certain critical

errors occur is administrator intervention required.

In summary, after detecting a SCSI component error, the operating system reports the error to system

administrators and offers a palette of proactive recovery actions: automatic path failover or dynamic

replacement of the failing component.

Path error reporting

HP-UX 11i v3 offers the following new and enhanced mechanisms to report failures on SCSI

components. The goal is to help administrators diagnose problems quickly and efficiently so they can take

the most appropriate action.

• Error messages — A comprehensive set of error messages of various severity levels is used to

report a wide range of errors. These messages can be monitored in syslog and STM.

• Statistics — A detailed set of statistics is available for each SCSI component to help with troubleshooting

and to quickly identify a faulty component. Administrators can use scsimgr to display these

statistics.

• EVM events — The mass storage subsystem generates events to which other modules in the kernel

or user space can subscribe to be notified of changes to LUN and lunpath properties. The

SCSI stack monitors every LUN and every lunpath availability change. It also monitors LUN property

changes such as LUN size.

• I/O error triggered events — The mass storage subsystem also reports failures on the lunpaths and

LUN upon detection of certain I/O errors.

Path failover

When a lunpath goes offline, I/O operations issued on that lunpath fail. How the mass

storage subsystem handles this scenario depends on the following factors:

• Path bound I/O operations

User applications can request the mass storage subsystem to issue I/O operations on a specific

lunpath. Such I/O operations are called path bound I/O operations. When a path bound I/O fails,

the SCSI stack retries it a certain number of times on the same lunpath before failing it back to the

upper layer (for example, applications, file systems, or volume managers). There is no path failover.

I/O operations sent to a LUN using the path lock down load balancing policy are path bound I/O

operations.

• I/O operations not bound to a path

Path failover is applied to I/O operations that are not bound to a lunpath. When an I/O operation

that is not bound to a path fails and can be retried (see I/O retry policy below), it is failed

over to the next lunpath selected by the path selection algorithm of the LUN load balancing policy.

• I/O retry policy

The mass storage subsystem retries a failing I/O operation a certain number of times before

returning failure to the application using one of the following retry policies:

– Time based — The mass storage subsystem retries the I/O operation within a certain time interval

which is either set by upper layer modules such as the volume manager or the file system, or

determined by a default LUN attribute. For disk LUNs, the esd_secs LUN attribute holds the time

credit for an I/O operation across different retries. For tape LUNs, the read_secs attribute and

write_secs attribute hold the read and write time credits.

– Count based — The I/O operation is retried a number of times bounded by a high-water mark threshold. For disk

LUNs, I/O operations can be retried indefinitely (if the disk LUN infinite_retries_enable

attribute is set), or a finite number of times (corresponding to the disk LUN max_retries

attribute value).
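The failover and retry rules above can be illustrated with a small user-space model. This is only a sketch and not HP-UX code: the attribute names max_retries and infinite_retries_enable come from the text, while the LunPath and Lun classes, the issue_io function, the round-robin path selection, and the secs_per_try parameter are hypothetical names chosen for illustration. The sketch also assumes that one initial attempt is followed by up to max_retries retries, and that a time credit (such as the one held by esd_secs), when supplied, takes the place of the count-based limit.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional, Tuple


@dataclass
class LunPath:
    """Hypothetical stand-in for a lunpath; 'online' models its availability."""
    name: str
    online: bool = True


class Lun:
    """Toy LUN with several lunpaths and the count-based retry attributes."""

    def __init__(self, paths: List[LunPath], max_retries: int = 3,
                 infinite_retries_enable: bool = False) -> None:
        self.paths = paths
        self.max_retries = max_retries
        self.infinite_retries_enable = infinite_retries_enable
        self._cursor = 0

    def next_path(self) -> LunPath:
        # Assumption: round-robin stands in for the load balancing policy.
        path = self.paths[self._cursor % len(self.paths)]
        self._cursor += 1
        return path


def issue_io(lun: Lun, bound_path: Optional[LunPath] = None,
             time_credit: Optional[float] = None,
             secs_per_try: float = 1.0) -> Tuple[Optional[str], List[str]]:
    """Return (name of the path that served the I/O or None, paths tried).

    With infinite_retries_enable set, no time credit, and no online path,
    this loop never ends, which mirrors "retried indefinitely".
    """
    tried: List[str] = []
    elapsed = 0.0
    for attempt in count():
        if time_credit is not None:
            # Time based: stop once the time credit is used up.
            if elapsed >= time_credit:
                break
        elif not lun.infinite_retries_enable and attempt > lun.max_retries:
            # Count based: one initial attempt plus max_retries retries.
            break
        # Path bound I/O stays on its lunpath; other I/O fails over to the
        # next lunpath chosen by the load balancing policy.
        path = bound_path if bound_path is not None else lun.next_path()
        tried.append(path.name)
        elapsed += secs_per_try
        if path.online:
            return path.name, tried
    return None, tried


if __name__ == "__main__":
    a, b = LunPath("path_a", online=False), LunPath("path_b")
    print(issue_io(Lun([a, b])))                # fails over to path_b
    print(issue_io(Lun([a, b]), bound_path=a))  # retried on path_a only
```

In this model the unbound request succeeds on the second lunpath after one failover, whereas the path bound request exhausts its retries on the offline lunpath and is failed back to the caller.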