Building Disaster Recovery Serviceguard Solutions Using Metrocluster with Continuous Access for P9000 and XP A.11.00
Package starts on the adoptive node at the remote site, it detects that the active complex workload's
packages have failed. Consequently, the Site Controller package performs a site failover and starts
the corresponding complex workload's packages on the site where the cluster has reformed.
Disk array and SAN failure
When a disk array or the host access SAN at a site fails, the active complex workload database
running on the site might hang or fail, depending on which component has failed. If the SAN failure
causes the complex workload database processes to fail, and the complex-workload packages fail
as a result, the Site Controller package initiates a site failover.
Replication link failure
A failure in a replication link between sites stalls the replication from the active complex-workload
package configuration to the remote site. The impact of a replication link failure on the running
complex-workload packages is based on the configured replication mode.
In synchronous replication mode with the fence level set to Data, the primary site disk array starts
failing I/Os. This causes the active complex workload configuration to fail. If a complex-workload
package is configured as a critical_package, the Site Controller package then performs a site
failover.
If the fence level is set to Never, I/O on the PVOL side does not fail, and the active complex
workload continues to run successfully.
In asynchronous replication mode, the complex workload configuration is not interrupted and
continues to run.
If the complex workload is mounted read-only, is idle, or is completing read-only transactions
when the replication link fails, it might not encounter any failure and continues to be available
from the site.
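The replication link failure cases above can be summarized as a small decision function. The following Python sketch is illustrative only and is not part of Metrocluster or Serviceguard: the names ReplicationMode, FenceLevel, and link_failure_outcome, and the outcome labels, are hypothetical. It treats a workload that is not writing (read-only or idle) as unaffected, although the text only says such a workload might not encounter a failure.

```python
from enum import Enum


class ReplicationMode(Enum):
    SYNC = "sync"
    ASYNC = "async"


class FenceLevel(Enum):
    DATA = "data"
    NEVER = "never"


def link_failure_outcome(mode, fence, workload_is_writing, is_critical_package):
    """Return a hypothetical label for what happens to the active complex
    workload when the replication link between sites fails."""
    # Asynchronous replication: the workload continues to run.
    if mode is ReplicationMode.ASYNC:
        return "workload_continues"
    # Synchronous replication, fence level Never: PVOL I/O does not fail.
    if fence is FenceLevel.NEVER:
        return "workload_continues"
    # Synchronous replication, fence level Data: only writes are failed, so
    # a read-only or idle workload is assumed here to keep running.
    if not workload_is_writing:
        return "workload_continues"
    # The primary array fails I/Os and the workload fails. A site failover
    # follows only if a package is configured as a critical_package.
    return "site_failover" if is_critical_package else "workload_fails"
```

For example, link_failure_outcome(ReplicationMode.SYNC, FenceLevel.DATA, True, True) returns "site_failover", while the same call with FenceLevel.NEVER returns "workload_continues".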
Site Controller package failure
The Site Controller package can fail for many reasons, such as a node crash, while the active
complex-workload package stack on the site is up and running. The Site Controller package then
fails over to an adoptive node, which can be a node on the same site or a node on the remote site.
The Site Controller package behaves differently in each case so that complex workload
availability is not disrupted.
NOTE: When the adoptive node is in the same site where the active complex workload stack is
currently running, the failover is considered a local failover for the Site Controller package.
On a Site Controller package local failover, the disaster tolerant complex workload remains
uninterrupted on that site. The Site Controller package continues to monitor the managed packages
or the critical packages on the site, as configured from the current node.
When the Site Controller package fails over to an adoptive node at the remote site, it is considered
a failover across sites for the Site Controller package. When the Site Controller package fails over
across sites while the active complex-workload package stack is still running on its site, the Site
Controller package fails on the remote site adoptive node without affecting the running
complex workload configuration stack in the cluster. The complex workload configuration continues
to be available in the cluster. However, because the Site Controller package has failed in the
cluster, the complex workload configuration can no longer fail over automatically to the remote site.
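The local and cross-site cases for a Site Controller package failover can be modeled as follows. This is a minimal Python sketch under stated assumptions, not Serviceguard code: FailoverResult and site_controller_failover are hypothetical names. The third branch (the workload packages have already failed when the package starts at the remote site) follows the site failover behavior described earlier in this section.

```python
from dataclasses import dataclass


@dataclass
class FailoverResult:
    kind: str                      # "local" or "across_sites"
    controller_running: bool       # is the Site Controller package up afterwards?
    workload_site: str             # site where the complex workload is available
    automatic_site_failover: bool  # can the workload still fail over automatically?


def site_controller_failover(active_site, adoptive_site, workload_running):
    """Model the outcome of a Site Controller package failover (hypothetical)."""
    if adoptive_site == active_site:
        # Local failover: the workload is undisturbed and monitoring
        # continues from the new node.
        return FailoverResult("local", True, active_site, True)
    if workload_running:
        # Across sites while the workload stack is still up: the package
        # fails on the remote node and the workload keeps running, but
        # automatic site failover is lost.
        return FailoverResult("across_sites", False, active_site, False)
    # Across sites after the workload packages have failed: the package
    # performs a site failover and starts the workload at the adoptive site.
    return FailoverResult("across_sites", True, adoptive_site, True)
```

For instance, site_controller_failover("siteA", "siteB", True) yields a result with controller_running set to False and workload_site still "siteA", which matches the case where the workload stays available but can no longer fail over automatically.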
Site failure
A site failure is a scenario where a disaster or an equivalent failure results in the failure of all the
nodes in a site. The Serviceguard cluster detects this failure, and reforms the cluster without the
nodes from the failed site. The Site Controller Package that was running on a node on the failed
site fails over to an adoptive node in the remote site.