2.7

Table 14-2. Aurora_mon Parameters (Continued)
Parameter Description
app_stop_cmd (required)
Command (any program, script, or executable file) that stops the application.
This command is typically used during system shutdown or when restarting
applications. The stop command is successful if it exits with a zero exit code.
The command must shut down the application cleanly (remove all processes,
files, locks, and so on) so that a subsequent start command executes without
problems. If the command does not complete within 300 seconds, it is forcibly
terminated.
If required, you can use the -o and -e options to have the aurora_mon daemon
capture stdout/stderr; otherwise, output is redirected to /dev/null. To run the
command as a specific user, you must use an su -c wrapper or set the setuid
bit of the application.
If you do not require a stop command, use a command that always exits with a
zero exit code, for example /bin/true. For example, if the application only
monitors the amount of disk space on a mount point, there is no application to
stop.
heartbeat_check_cmd (required)
Command (any program, script, or executable file) that checks whether the
application is alive. The ping is successful (the application is considered
alive) if the command exits with a zero exit code.
If required, you can use the -o and -e options to have the aurora_mon daemon
capture stdout/stderr; otherwise, output is redirected to /dev/null. To run the
command as a specific user, you must use an su -c wrapper or set the setuid
bit of the application.
heartbeat_period (optional, defaults to 30)
The time in seconds between heartbeat pings (heartbeat_check_cmd is issued
every heartbeat_period seconds). The value can range from 1 to 600 seconds
and defaults to 30 seconds if not specified. A new heartbeat_check_cmd is not
issued until the previous one finishes.
heartbeat_ignore_fail_count (optional, defaults to 0)
Specifies the number of consecutive heartbeat_check_cmd failures to ignore
before the application is considered to have failed. For example, if
heartbeat_ignore_fail_count is 3, the first three consecutive failures are
ignored, and the application is considered to have failed when the fourth
consecutive heartbeat_check_cmd fails. This reduces the possibility of a false
positive caused by intermittent application problems or transient network
problems that make heartbeat_check_cmd fail.
app_restart_retry_count (optional, defaults to 3)
app_restart_retry_freq (optional, defaults to 10 minutes)
The number of times aurora_mon attempts to restart an application after a
failure, and the time that elapses before aurora_mon attempts to restart the
application again. For example, if app_restart_retry_count is 3 and
app_restart_retry_freq is 10 minutes, aurora_mon makes three attempts to
restart the application and waits 10 minutes before trying again.
heartbeat_fail_action (optional, defaults to RESTART_APP)
The action taken when an application is considered to have failed (after more
than heartbeat_ignore_fail_count consecutive heartbeat_check_cmd failures).
The following values are acceptable:
- JUST ALERT. Send an alert only.
- RESTART_APP. Restart the application (attempt app_restart_retry_count
  times, and wait app_restart_retry_freq before trying again).
- RESTART_VM. Restart the virtual machine by stopping its Application
  Monitoring SDK heartbeat to the underlying VMware HA service. The HA virtual
  machine properties of the cluster determine the virtual machine restart
  interval and count.
- RESTART_APP_THEN_VM. Attempt to restart the application
  app_restart_retry_count times. If the application cannot be restarted, reset
  the virtual machine by using the Guest and HA Application Monitoring SDK.
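Because aurora_mon judges both app_stop_cmd and heartbeat_check_cmd only by their exit codes, a single helper script can serve as both. The following is a minimal sketch that assumes the monitored application writes its process ID to a PID file. The PID file path (/var/run/myapp.pid), the check and stop subcommands, and the 60-second grace period are hypothetical choices for illustration and are not part of aurora_mon. The grace period is kept well below the 300-second limit after which aurora_mon forcibly terminates a stop command.

```python
#!/usr/bin/env python3
"""Hypothetical helper usable as heartbeat_check_cmd ("check") and app_stop_cmd ("stop").

Assumption: the monitored application writes its PID to PID_FILE.
Exit code 0 means success (alive / stopped); nonzero means failure.
"""
import os
import signal
import sys
import time
from pathlib import Path
from typing import List, Optional

PID_FILE = Path("/var/run/myapp.pid")  # hypothetical path
STOP_GRACE_S = 60.0  # hypothetical; well under aurora_mon's 300-second limit


def read_pid(pid_file: Path) -> Optional[int]:
    """Return the PID stored in pid_file, or None if it is missing or invalid."""
    try:
        pid = int(pid_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        return None
    return pid if pid > 0 else None  # PID 0 would address the whole process group


def pid_alive(pid: int) -> bool:
    """Signal 0 only checks that the process exists; nothing is delivered."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check(pid_file: Path) -> int:
    """heartbeat_check_cmd: exit 0 only if the recorded process is running."""
    pid = read_pid(pid_file)
    return 0 if pid is not None and pid_alive(pid) else 1


def stop(pid_file: Path, grace_s: float = STOP_GRACE_S) -> int:
    """app_stop_cmd: send SIGTERM, wait for the process to exit, remove the PID file."""
    pid = read_pid(pid_file)
    if pid is not None and pid_alive(pid):
        try:
            os.kill(pid, signal.SIGTERM)
        except ProcessLookupError:
            pass  # the process exited on its own in the meantime
        deadline = time.monotonic() + grace_s
        while pid_alive(pid):
            if time.monotonic() >= deadline:
                return 1  # still running: report failure with a nonzero exit code
            time.sleep(0.5)
    # Nothing is running any more: clean up so the next start command succeeds.
    pid_file.unlink(missing_ok=True)
    return 0


def main(argv: List[str]) -> int:
    if len(argv) != 2 or argv[1] not in ("check", "stop"):
        print(f"usage: {argv[0]} check|stop", file=sys.stderr)
        return 2
    return check(PID_FILE) if argv[1] == "check" else stop(PID_FILE)


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

With this sketch, heartbeat_check_cmd could be set to something like `/usr/local/bin/myapp_ctl.py check` and app_stop_cmd to `/usr/local/bin/myapp_ctl.py stop` (hypothetical install path). Treating a missing PID file as a successful stop keeps the stop command idempotent, which matches the requirement that a subsequent start command runs without problems.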
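The interaction between heartbeat_period and heartbeat_ignore_fail_count can be modeled in a few lines. The following sketch encodes only the documented rules: the period defaults to 30 seconds and must be 1 to 600 seconds, and the application is considered failed on the (heartbeat_ignore_fail_count + 1)-th consecutive failure. The assumption that a successful heartbeat resets the count follows from the word "consecutive." The function names are hypothetical, and this is not aurora_mon's implementation.

```python
"""Hypothetical model of the documented heartbeat rules; not aurora_mon's code."""
from typing import Iterable, Optional

DEFAULT_HEARTBEAT_PERIOD_S = 30


def heartbeat_period(value: Optional[int] = None) -> int:
    """Apply the documented default (30 s) and allowed range (1-600 s)."""
    if value is None:
        return DEFAULT_HEARTBEAT_PERIOD_S
    if not 1 <= value <= 600:
        raise ValueError(f"heartbeat_period must be 1-600 seconds, got {value}")
    return value


def failure_detected_at(results: Iterable[bool],
                        ignore_fail_count: int = 0) -> Optional[int]:
    """Return the 1-based heartbeat number at which the application is
    considered failed, or None if that never happens.

    Each result is True when heartbeat_check_cmd exited with code 0.
    The first ignore_fail_count consecutive failures are ignored, and
    any success resets the count of consecutive failures.
    """
    consecutive = 0
    for n, ok in enumerate(results, start=1):
        if ok:
            consecutive = 0
            continue
        consecutive += 1
        if consecutive > ignore_fail_count:
            return n
    return None


if __name__ == "__main__":
    # heartbeat_ignore_fail_count = 3: failure is declared on the 4th consecutive failure.
    print(failure_detected_at([False] * 4, ignore_fail_count=3))
    # A success in between resets the streak, so detection happens later.
    print(failure_detected_at([False, False, True, False, False, False, False], 3))
```

The first call reports heartbeat 4, matching the table's example. The second reports heartbeat 7, because the success at heartbeat 3 restarts the count. Because a new heartbeat_check_cmd is not issued until the previous one finishes, heartbeat number n is reached no earlier than about (n - 1) × heartbeat_period seconds after the first check.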
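The heartbeat_fail_action values can likewise be sketched as a small dispatch function. This is a hypothetical model, not aurora_mon's code, and it rests on two stated assumptions. First, app_restart_retry_freq is treated as the wait between consecutive restart attempts. Second, because the table does not say what RESTART_APP does after its last attempt fails, the sketch only records that failure. Action names are spelled exactly as in the table. The restart and sleep functions are injected, so the logic can be exercised without waiting 10 minutes.

```python
"""Hypothetical model of heartbeat_fail_action handling; not aurora_mon's code.

Assumptions: app_restart_retry_freq is the wait between restart attempts,
and RESTART_APP only records the failure after its last attempt.
"""
import time
from typing import Callable, List

ACTIONS = ("JUST ALERT", "RESTART_APP", "RESTART_VM", "RESTART_APP_THEN_VM")


def handle_failure(action: str,
                   restart_app: Callable[[], bool],
                   retry_count: int = 3,
                   retry_freq_s: float = 600.0,
                   sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Return the steps taken for one detected application failure.

    restart_app performs a restart and returns True on success.
    retry_count and retry_freq_s default to the documented 3 and 10 minutes.
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown heartbeat_fail_action: {action!r}")
    if action == "JUST ALERT":
        return ["alert"]
    if action == "RESTART_VM":
        return ["restart_vm"]  # left to VMware HA once the SDK heartbeat stops

    steps: List[str] = []
    for attempt in range(1, retry_count + 1):
        if restart_app():
            steps.append(f"app_restarted(attempt {attempt})")
            return steps
        steps.append(f"restart_failed(attempt {attempt})")
        if attempt < retry_count:
            sleep(retry_freq_s)  # wait app_restart_retry_freq before retrying
    if action == "RESTART_APP_THEN_VM":
        steps.append("restart_vm")  # escalate after all application restarts fail
    return steps


if __name__ == "__main__":
    waits: List[float] = []
    outcomes = iter([False, False, False])  # every restart attempt fails
    print(handle_failure("RESTART_APP_THEN_VM", lambda: next(outcomes),
                         sleep=waits.append))
    print(waits)
```

In this run, all three restart attempts fail and RESTART_APP_THEN_VM escalates to a virtual machine restart. The recorded waits show two 600-second pauses between the three attempts.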
Chapter 14 Monitoring the Data Director Environment
VMware, Inc. 181