3.5.1 Matrix Server Administration Guide

Chapter 17: Advanced Monitor Topics 272

When the monitor executes the testpid script, the script first determines whether the /var/run/application/pid file exists. If the file does not exist, the script exits with a non-zero exit status, which the monitor interprets as a failure.

If the file does exist, the script reads the pid from the file into the variable pid. The kill command then determines whether the process with that pid is running. The exit status of the kill command becomes the exit status of the script.

If the kill command finds that the process is running, it exits with status 0, and the script exits with status 0. The monitor interprets the 0 exit status as “success” and signals to the matrix that the application is up.

If the kill command finds that the process is not running, it exits with a non-zero status, and the script exits with that same status. The monitor interprets that exit status as “failure” and signals to the matrix that the application is down. Matrix Server then takes the action configured for the service monitor, which is typically to fail over the virtual host associated with the monitor.
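
The behavior described above can be sketched as a small shell function. This is a reconstruction from the description, not necessarily the exact text of the testpid script: it assumes the check is done with kill -0 (signal 0 tests for the process without sending it a signal), and the names probe_pid and PIDFILE are illustrative.

```shell
#!/bin/sh
# Sketch of a pid-file probe. PIDFILE defaults to the path used in the text;
# the environment override is an assumption added for testing by hand.
PIDFILE="${PIDFILE:-/var/run/application/pid}"

probe_pid() {
    # No pid file: return non-zero, which the monitor treats as failure.
    [ -f "$PIDFILE" ] || return 1

    # Read the pid from the file into the variable pid.
    pid=$(cat "$PIDFILE")

    # Signal 0 delivers nothing; kill only reports whether the process
    # exists. Its exit status becomes the status of the function.
    kill -0 "$pid" 2>/dev/null
}
```

In a standalone probe script, calling probe_pid as the last command makes its status the exit status of the script.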

When you create the custom service or device monitor for the probe script, you can set both the frequency at which the probe script should be executed and the timeout period, which is the maximum amount of time that the monitor_agent daemon will wait for the probe to complete.
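
Before choosing a timeout period, it can help to run the probe by hand under a time limit and see how long it takes. The sketch below uses the GNU coreutils timeout command, which exits with status 124 when the time limit is reached. It does not reproduce how monitor_agent enforces its timeout, and the function name run_with_limit is an assumption.

```shell
#!/bin/sh
# run_with_limit SECONDS COMMAND [ARGS...]
# Run a probe by hand with a time limit, using GNU coreutils timeout.
run_with_limit() {
    limit=$1
    shift

    timeout "$limit" "$@"
    status=$?

    # GNU timeout reports status 124 when the limit was reached.
    if [ "$status" -eq 124 ]; then
        echo "probe exceeded ${limit}s" >&2
    fi
    return "$status"
}
```

For example, run_with_limit 10 ./testpid runs the probe with a 10-second limit and returns the probe's own exit status if it finishes in time.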

You can create more elaborate probe scripts as necessary. The key points are to check whether the service or device is up and then to return a corresponding exit status. The service or device monitor uses only the exit status to determine whether the probe succeeded or failed, with 0 indicating success and any other value indicating failure.
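
One way to structure a more elaborate probe is to run several checks in order and fail on the first one that fails. The sketch below is illustrative only: the two check functions, their default paths, and the APP_CONF and APP_SPOOL variables are hypothetical and stand in for whatever your application actually requires.

```shell
#!/bin/sh
# run_checks CHECK...
# Run each named check in order; return 1 at the first failure, 0 otherwise.
run_checks() {
    for check in "$@"; do
        if ! "$check"; then
            # The monitor uses only the exit status; this message is an
            # aid when running the probe by hand.
            echo "probe: $check failed" >&2
            return 1
        fi
    done
    return 0
}

# Hypothetical checks; the paths are placeholders, not product defaults.
check_config_readable() { [ -r "${APP_CONF:-/etc/application.conf}" ]; }
check_spool_writable()  { [ -w "${APP_SPOOL:-/var/spool/application}" ]; }
```

A probe script could then end with run_checks check_config_readable check_spool_writable, so that the script exits 0 only when every check passes.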

Recovery Scripts

A recovery script runs after a monitor probe fails. The script attempts to restore the service and prevent failover of the virtual host(s) associated with the monitor.

Recovery scripts are useful if there is an automatic way to recover from a common failure mode for an application. For example, if you are monitoring an application called myservice that is normally started at boot time, but which is buggy and crashes occasionally, you could use a