HP-UX System Administrator's Guide: Configuration Management

What is PCI Error Recovery?
If PCI Error Recovery is enabled and an error occurs on a PCI bus containing an I/O
card that supports PCI Error Recovery, the following steps are taken:
1. The PCI bus is quarantined, isolating it from further I/O so that the
error cannot damage the system.
2. The PCI Error Recovery feature attempts to recover from the error and re-initialize
the bus so I/O can resume.
If an error occurs during the automated error recovery process, the bus and I/O card
will remain quiesced.
If the bus contains a card that supports online addition, replacement, or deletion (OL*)
and the card is in a hot-pluggable slot, you can use the olrad command (or the attention
button) to recover from the error manually by replacing the card.
For information on OL* operations, see the Interface Card OL* Support Guide. To
determine if OL* is supported, see the documentation or support matrix for the specific
I/O card.
If the PCI Error Recovery feature is disabled and an error occurs on a PCI bus, a Machine
Check Abort (MCA, on Integrity servers) or a High Priority Machine Check (HPMC, on
PA-RISC servers) occurs and the system crashes.
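The outcomes described above can be summarized as a small decision model. The following Python sketch is a conceptual illustration only, not HP-UX code. The names Outcome, PciErrorEvent, model_pci_error, and can_recover_manually are hypothetical. The case of an enabled feature combined with a card that does not support PCI Error Recovery is deliberately left unmodeled, because this section does not describe it.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    """End states described in this section (conceptual labels only)."""
    IO_RESUMED = "bus re-initialized; I/O resumes"
    QUIESCED = "bus and I/O card remain quiesced"
    SYSTEM_CRASH = "MCA or HPMC; system crashes"


@dataclass
class PciErrorEvent:
    """Hypothetical inputs for this model; not an HP-UX data structure."""
    recovery_enabled: bool        # effective value of pci_eh_enable
    card_supports_recovery: bool  # I/O card on the bus supports PCI Error Recovery
    recovery_succeeded: bool      # automated recovery completed without a further error


def model_pci_error(event: PciErrorEvent) -> Outcome:
    """Return the documented outcome of a PCI error for the given conditions."""
    if not event.recovery_enabled:
        # Feature disabled: the error is fatal to the whole system.
        return Outcome.SYSTEM_CRASH
    if not event.card_supports_recovery:
        # This section does not describe this combination, so the model refuses to guess.
        raise NotImplementedError("enabled feature with unsupported card is not modeled")
    # Step 1: the bus is quarantined (no further I/O).
    # Step 2: recovery is attempted and the bus is re-initialized.
    if event.recovery_succeeded:
        return Outcome.IO_RESUMED
    # An error during automated recovery leaves the bus and card quiesced.
    return Outcome.QUIESCED


def can_recover_manually(card_supports_ol_star: bool, in_hot_pluggable_slot: bool) -> bool:
    """A quiesced card can be replaced with olrad (or the attention button) only if
    the card supports OL* and sits in a hot-pluggable slot."""
    return card_supports_ol_star and in_hot_pluggable_slot
```

Manual recovery is modeled as a separate check because it applies only after the bus has been left quiesced.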
CAUTION: If you use HP Serviceguard, HP recommends that you enable the PCI
Error Recovery feature only if your storage devices are configured with multiple paths
and you have not disabled HP-UX native multipathing. If PCI Error Recovery is enabled
but your storage devices are configured with only a single path, HP Serviceguard may
not detect that connectivity has been lost. Because HP Serviceguard initiates a failover
only when it detects a loss of connectivity, the expected failover may not occur.
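The CAUTION amounts to a simple rule. The following minimal Python sketch assumes a hypothetical input that maps each storage device to its number of configured paths. The function name and the device names used in the test are illustrative only and are not part of any HP-UX or Serviceguard interface.

```python
from typing import Mapping


def caution_permits_pci_eh(uses_serviceguard: bool,
                           paths_per_device: Mapping[str, int],
                           native_multipathing_disabled: bool) -> bool:
    """Return True if the Serviceguard CAUTION does not argue against enabling
    PCI Error Recovery. paths_per_device is a hypothetical input mapping each
    storage device to its number of configured paths."""
    if not uses_serviceguard:
        # The CAUTION applies only to systems that use HP Serviceguard.
        return True
    if native_multipathing_disabled:
        # HP recommends enabling the feature only with native multipathing in place.
        return False
    # Any single-path device is the risky case: Serviceguard may miss the lost path.
    return all(paths >= 2 for paths in paths_per_device.values())
```

For example, caution_permits_pci_eh(True, {"disk4": 2, "disk5": 1}, False) returns False because disk5 has only one path.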
Controlling PCI Error Recovery
PCI Error Recovery is controlled by two kernel tunables that you can configure using HP
SMH, kcweb, or kctune. See “Managing Kernel Tunable Parameters with kctune”
(page 170) and “Managing Kernel Tunable Parameters with HP SMH” (page 175).
pci_eh_enable
This tunable enables or disables the PCI Error Recovery feature. It is enabled by
default. Since pci_eh_enable is not a dynamic tunable, a reboot is required for
changes to take effect.
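The following Python sketch illustrates the kctune route. It builds the kctune command line that sets pci_eh_enable, using kctune's tunable=value form. The helper names are hypothetical. By default the sketch only returns the command without running it, so confirm the exact behavior and output against kctune(1M) on your system. Because pci_eh_enable is not dynamic, the system must still be rebooted after the command runs.

```python
import shutil
import subprocess
from typing import List


def kctune_set_argv(tunable: str, value: int) -> List[str]:
    """Build a kctune command line in its tunable=value form."""
    return ["kctune", f"{tunable}={value}"]


def set_pci_eh_enable(enabled: bool, dry_run: bool = True) -> List[str]:
    """Return (and, unless dry_run, run) the command that sets pci_eh_enable.

    pci_eh_enable is not dynamic, so a new value takes effect only after a reboot.
    """
    argv = kctune_set_argv("pci_eh_enable", 1 if enabled else 0)
    if not dry_run:
        if shutil.which("kctune") is None:
            raise FileNotFoundError("kctune not found; run this on an HP-UX system")
        # check=True raises CalledProcessError if kctune exits with a nonzero status.
        subprocess.run(argv, check=True)
    return argv
```

For example, set_pci_eh_enable(False) returns ['kctune', 'pci_eh_enable=0'] without running anything. Pass dry_run=False only on an HP-UX system where you intend to make the change.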
pci_error_tolerance_time
This tunable determines whether automatic PCI error recovery occurs on
an I/O slot, based on the time interval between two PCI errors. If two PCI errors
occur on a PCI slot within the time interval specified by
134 Configuring Peripherals