User's guide

System Troubleshooting and Diagnostics
5.2 Product Fault Management and Symptom-Directed Diagnosis
In the case of Direct Memory Access (DMA) transactions where the NCA or
NMC detects the error, the errors are typically signaled back to the CDAL-Bus
device, but not posted to the NVAX CPU. In these cases the CDAL-Bus device
typically posts a device level interrupt to the NVAX CPU via the NCA. In
almost all cases, error state is latched by the NMC and NCA. Although these
errors do not result in a machine check exception or high-level interrupt (that
is, they produce a device-level interrupt at IPL 14–17 rather than an
error-level interrupt at IPL 1A or 1D), the OpenVMS machine check handler has a
polling routine that searches for this latched state at one-second intervals.
When it finds such state, the host logs a polled error entry.
These conditions cover all of the cases that will eventually be handled by the
OpenVMS error handler. The OpenVMS error handler will generate entries
that correspond to the machine check exception, hard or soft error interrupt
type, or polled error.
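The polling behavior described above can be illustrated with a short sketch.
This is not the OpenVMS routine; it is a minimal model in Python under stated
assumptions. The names poll_latched_errors, run_poller, read_error_state, and
log_entry are hypothetical, and the latched NMC/NCA state is modeled as a
single integer in which nonzero means "error state is latched".

```python
import time


def poll_latched_errors(read_error_state, log_entry):
    """Run one polling pass; log a polled error entry if state is latched.

    read_error_state and log_entry are hypothetical callbacks standing in
    for the hardware register reads and the error logger.
    """
    state = read_error_state()
    if state:  # assumption: nonzero means error state is latched
        log_entry({"type": "polled error", "state": state})
        return True
    return False


def run_poller(read_error_state, log_entry, passes,
               interval_seconds=1.0, sleep=time.sleep):
    """Poll `passes` times, waiting one interval after each pass.

    The one-second default comes from the text; `sleep` is injectable so
    the loop can be exercised without real delays.
    """
    for _ in range(passes):
        poll_latched_errors(read_error_state, log_entry)
        sleep(interval_seconds)
```

A caller would supply its own state reader and logger; for example,
run_poller(reader, entries.append, passes=3) appends one entry to entries for
each pass in which reader returns a nonzero value.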
5.2.2 OpenVMS Error Handling
Upon detection of a machine check exception, hard error interrupt, soft error
interrupt or polled error, the OpenVMS operating system will perform the
following actions:
  • Snapshot the state of the kernel.
  • In most entry points, disable the caches.
  • If it is a machine check and if the machine check is recoverable, determine
    if instruction retry is possible.
    Instruction retry is possible if one of the following conditions is true:
      - If PCSTS <10> PTE_ER = 0:
        Check that (ISTATE2 <07> VR = 1) or (PSL <27> FPD = 1).
        Otherwise crash the system or process depending on PSL <25:24>
        Current Mode.
      - If PCSTS <10> PTE_ER = 1:
        Check that (ISTATE2 <07> VR = 1) and (PSL <27> FPD = 0) and
        (PCSTS <09> PTE_ER_WR = 0).
        Otherwise crash the system.
    ISTATE2 is a longword in the machine check stack frame at offset (SP)+24;
    PSL is a longword in the machine check stack frame at offset (SP)+32; VR
    is the VAX Restart flag; and FPD is the First Part Done flag.
  • Check to see if the threshold has been exceeded for various errors
    (typically the threshold is exceeded if 3 errors occur within a 10-minute
    interval).
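The instruction retry test and the error threshold check listed above can be
sketched as follows. This is an illustrative model in Python, not OpenVMS
code. The bit positions are taken from the text; the names Action,
retry_decision, current_mode, and ErrorThreshold are hypothetical. The sketch
does not decide which Current Mode values lead to a system crash rather than
a process crash, because the text does not say; it only extracts the field.
Treating errors exactly 10 minutes apart as inside the window is also an
assumption.

```python
from collections import deque
from enum import Enum

# Bit positions as given in the text.
PCSTS_PTE_ER = 1 << 10      # PCSTS<10>
PCSTS_PTE_ER_WR = 1 << 9    # PCSTS<09>
ISTATE2_VR = 1 << 7         # ISTATE2<07>, VAX Restart flag
PSL_FPD = 1 << 27           # PSL<27>, First Part Done flag
PSL_CUR_MODE_SHIFT = 24     # PSL<25:24>, Current Mode


class Action(Enum):
    RETRY = "retry instruction"
    CRASH_SYSTEM = "crash system"
    CRASH_BY_MODE = "crash system or process, depending on PSL<25:24>"


def current_mode(psl):
    """Return the two-bit Current Mode field, PSL<25:24>."""
    return (psl >> PSL_CUR_MODE_SHIFT) & 0x3


def retry_decision(pcsts, istate2, psl):
    """Apply the two retry conditions from the text to the saved values."""
    vr = bool(istate2 & ISTATE2_VR)
    fpd = bool(psl & PSL_FPD)
    if not pcsts & PCSTS_PTE_ER:
        # PTE_ER = 0: retry if VR = 1 or FPD = 1.
        return Action.RETRY if (vr or fpd) else Action.CRASH_BY_MODE
    # PTE_ER = 1: retry only if VR = 1, FPD = 0, and PTE_ER_WR = 0.
    pte_er_wr = bool(pcsts & PCSTS_PTE_ER_WR)
    if vr and not fpd and not pte_er_wr:
        return Action.RETRY
    return Action.CRASH_SYSTEM


class ErrorThreshold:
    """Sliding-window counter: 3 errors within 600 seconds by default."""

    def __init__(self, limit=3, window_seconds=600):
        self.limit = limit
        self.window = window_seconds
        self._times = deque()

    def record(self, now):
        """Record an error at time `now` (seconds); True if threshold hit."""
        self._times.append(now)
        # Drop errors that fall outside the window ending at `now`.
        while now - self._times[0] > self.window:
            self._times.popleft()
        return len(self._times) >= self.limit
```

In a real handler, the ISTATE2 and PSL arguments would be the longwords read
from the machine check stack frame at offsets (SP)+24 and (SP)+32, as the text
describes. A separate ErrorThreshold instance per error type is one way to
model "various errors", though the text does not specify how the counts are
kept.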