1.1

Table Of Contents
Configuring suspect-member Alerts
When any member of the distributed system fails, it is important for other services to detect the loss quickly and
transition application clients to other members. Any peer or server in the cluster can detect a problem with another
member of the cluster, which initiates "SUSPECT" processing with the membership coordinator. The membership
coordinator then determines whether the suspect member should remain in the distributed system or should be
removed.
Use the ack-wait-threshold property to congure how long a SQLFire peer or server waits to receive an
acknowledgment from other members that are replicating a table's data. The default value is 15 seconds; you
specify a value from 0 to 2147483647 seconds. After this period, the replicating peer sends a severe alert warning
to other members in the distributed system, raising a "suspect_member" alert in the cluster.
To congure how long the cluster waits for this alert to be acknowledged, set the
ack-severe-alert-threshold property. The default value is zero, which disables the property.
How Replication Failure Occurs
Failures during replication can occur in the following ways:
A replica fails before sending an acknowledgment.
The most common failure occurs when a member process is terminated during replication. When this occurs,
the TCP connection from all members is terminated, and the membership view is updated quickly to reect
the change. The member who initiated replication continues replicating to other members.
If instead of terminating, the process stays alive (but fails to respond) the initiating member waits for a period
of time and then raises an alert with the distributed system membership coordinator. The membership coordinator
evaluates the health of the suspect member based on heartbeats and health reports from other members in the
distributed system. The coordinator may decide to evict the member from the distributed system, in which case
it communicates this change in the membership view to all members. At this point, the member that initiated
replication proceeds and completes replication using available peers and servers. In addition, clients connected
to this member are automatically re-routed to other members.
An "Owning" member fails.
If the designated owner of data for a certain key fails, the system automatically chooses another replica to
become the owner for the key range that the failed member managed. The updating thread is blocked while
this transfer takes place. If at least one replica is available, the operations always succeeds from the application's
viewpoint.
vFabric SQLFire User's Guide68
Managing Your Data in vFabric SQLFire