1.1

Table Of Contents

Configuring suspect-member Alerts

When any member of the distributed system fails, it is important for other services to detect the loss quickly and

transition application clients to other members. Any peer or server in the cluster can detect a problem with another

member of the cluster, which initiates "SUSPECT" processing with the membership coordinator. The membership

coordinator then determines whether the suspect member should remain in the distributed system or should be

removed.

Use the ack-wait-threshold property to conﬁgure how long a SQLFire peer or server waits to receive an

acknowledgment from other members that are replicating a table's data. The default value is 15 seconds; you

specify a value from 0 to 2147483647 seconds. After this period, the replicating peer sends a severe alert warning

to other members in the distributed system, raising a "suspect_member" alert in the cluster.

To conﬁgure how long the cluster waits for this alert to be acknowledged, set the

ack-severe-alert-threshold property. The default value is zero, which disables the property.

How Replication Failure Occurs

Failures during replication can occur in the following ways:

• A replica fails before sending an acknowledgment.

The most common failure occurs when a member process is terminated during replication. When this occurs,

the TCP connection from all members is terminated, and the membership view is updated quickly to reﬂect

the change. The member who initiated replication continues replicating to other members.

If instead of terminating, the process stays alive (but fails to respond) the initiating member waits for a period

of time and then raises an alert with the distributed system membership coordinator. The membership coordinator

evaluates the health of the suspect member based on heartbeats and health reports from other members in the

distributed system. The coordinator may decide to evict the member from the distributed system, in which case

it communicates this change in the membership view to all members. At this point, the member that initiated

replication proceeds and completes replication using available peers and servers. In addition, clients connected

to this member are automatically re-routed to other members.

• An "Owning" member fails.

If the designated owner of data for a certain key fails, the system automatically chooses another replica to

become the owner for the key range that the failed member managed. The updating thread is blocked while

this transfer takes place. If at least one replica is available, the operations always succeeds from the application's

viewpoint.

vFabric SQLFire User's Guide68

Managing Your Data in vFabric SQLFire