Applications paper

replay, the erroneous bit will be included in the instruction and
will continue to prevent proper execution.
The next-generation Intel Itanium processor solves this challenge
by implementing a “refetch” of the affected instruction
instead of a replay. A refetch removes all instructions in the
instruction execution pipeline, the instruction buffer, and the
instruction fetch pipeline (Figure 2). It then re-reads the instruction,
and the instructions after it, from cache. This cleans out
the soft error, after which program execution continues without
incident. A refetch takes only about twice as long as a replay and
serves the same function. The hardware detects the problem
and corrects it without any impact on the software.
A refetch can also be used for handling most soft errors that
occur in the instruction cache. Parity errors in the first-level
instruction cache can be cured with a refetch, as can single-bit
or double-bit ECC errors in the mid-level instruction cache. The
erroneous cached instruction is removed from cache during the
refetch and then fetched from the next higher level of cache or
from main memory. Since these locations won’t have the transient
error, the soft error is fixed without software interruption
and with only a short delay of execution.
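
The refetch sequence described above can be sketched as a small Python model. This is only an illustration under stated assumptions, not Intel's implementation: the `FrontEnd` class, its field names, the one-parity-bit-per-word scheme, and the choice to refill the instruction cache with the clean copy are all hypothetical.

```python
from dataclasses import dataclass, field


def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


@dataclass
class FrontEnd:
    """Toy model of a fetch path (all names are hypothetical)."""
    memory: dict                                   # next level: addr -> word, assumed clean
    icache: dict = field(default_factory=dict)     # addr -> (word, stored parity bit)
    fetch_pipe: list = field(default_factory=list)
    buffer: list = field(default_factory=list)
    exec_pipe: list = field(default_factory=list)

    def fetch(self, addr: int) -> int:
        """Read one word, dropping a cache entry that fails its parity check."""
        entry = self.icache.get(addr)
        if entry is not None and parity(entry[0]) == entry[1]:
            return entry[0]
        # Miss or parity error: discard the entry and read the next level.
        self.icache.pop(addr, None)
        word = self.memory[addr]
        self.icache[addr] = (word, parity(word))   # refill is an assumption
        return word

    def refetch(self, addr: int, count: int) -> list:
        """Flush every front-end and pipeline structure, then re-read from cache."""
        self.fetch_pipe.clear()
        self.buffer.clear()
        self.exec_pipe.clear()
        self.buffer.extend(self.fetch(a) for a in range(addr, addr + count))
        return list(self.buffer)


memory = {0: 0b1010, 1: 0b0111, 2: 0b0001}
fe = FrontEnd(memory=memory)
fe.icache[1] = (0b0110, parity(0b0111))   # cached word with one flipped bit
fe.buffer = [0b1010, 0b0110]              # stale, corrupted buffer contents
fe.exec_pipe = [0b0110]
recovered = fe.refetch(0, 3)
```

In this sketch the corrupted word never reaches execution a second time, because the refetch discards both the buffered copy and the failing cache entry before reading the clean copy from the next level.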
in the instruction buffer until the instruction has successfully
traversed the pipeline and is no longer needed. If necessary, an
instruction can replay multiple times.
Intel Instruction Replay Technology combines the replay process
described above with error detection mechanisms to enable fast,
automated correction of soft errors in the pipeline—with very
low performance overhead (Figure 1).
Soft error detectors are located in several stages of the instruction
execution pipeline to check parity, residues, and ECC. If a
transient error occurs while the instruction is flowing down the
pipeline, a simple replay of the affected instruction will correct
the error. The instruction is simply reread correctly from the
instruction buffer and restarted through the pipeline as if the
error had never occurred. The instruction executes properly,
gets the correct result, and finishes normally.
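
The replay behavior described above can be illustrated with a short Python sketch. It is a simplified model under stated assumptions rather than the processor's actual logic: the names `parity` and `run_with_replay`, the single parity bit per instruction word, and the flipped bit position are all hypothetical.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def run_with_replay(buffer_word, flip_on_attempts, max_attempts=4):
    """Send one instruction word down a modelled pipeline, replaying on error.

    buffer_word      -- clean copy held in the instruction buffer
    flip_on_attempts -- set of attempt numbers (0-based) on which a
                        transient single-bit flip is injected
    Returns (word that completed, number of replays needed).
    """
    stored_parity = parity(buffer_word)
    for attempt in range(max_attempts):
        in_flight = buffer_word            # (re)read from the instruction buffer
        if attempt in flip_on_attempts:
            in_flight ^= 1 << 5            # hypothetical upset in a pipeline latch
        if parity(in_flight) == stored_parity:
            return in_flight, attempt      # detector sees no error: complete
        # Parity mismatch: discard the in-flight copy and replay.
    raise RuntimeError("error persisted across replays")


word = 0b10110010
clean_run = run_with_replay(word, set())
one_upset = run_with_replay(word, {0})
two_upsets = run_with_replay(word, {0, 1})
```

Because the buffer copy is never modified in this model, each replay starts from clean data, which is why a transient upset in the pipeline is cured by simply rereading the instruction.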
Transient errors can also occur when data is read from cache
into the pipeline, or they may already be present in cache before
the data is read. Normally, data is accessed from the single-cycle
first-level data cache. When a parity calculation identifies a
soft error during a read from cache, the erroneous cache entry
is removed from cache and a replay of the affected instruction
is performed. As the instruction comes down the pipeline again,
it accesses the second-level data cache instead of the first-level
data cache for the needed data value. The error is thereby corrected
and the instruction completes normally.
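
As a rough sketch of the data-side recovery just described, the Python model below invalidates a first-level entry that fails its parity check and lets the replayed load read the second-level cache. The function `load_with_replay`, the dictionary-based caches, and the one-parity-bit-per-entry scheme are assumptions for illustration only.

```python
def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def load_with_replay(addr, l1, l2):
    """Model a load that replays after a first-level parity error.

    l1 -- dict addr -> (value, stored parity bit), may hold a soft error
    l2 -- dict addr -> value, assumed free of the transient error
    Returns (value delivered to the instruction, number of replays).
    """
    replays = 0
    while True:
        entry = l1.get(addr)
        if entry is None:
            # First-level entry absent: the load reads the second level.
            return l2[addr], replays
        value, stored = entry
        if parity(value) == stored:
            return value, replays
        # Parity mismatch: remove the bad entry and replay the instruction.
        del l1[addr]
        replays += 1


good = 0b1100
l1 = {0x40: (good ^ 0b0001, parity(good))}   # one bit flipped after the fill
l2 = {0x40: good}
result = load_with_replay(0x40, l1, l2)
```

Here the load returns the clean value after exactly one replay, and the corrupted first-level entry is gone afterwards.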
These instruction replay mechanisms provide very quick recovery
for soft errors. They delay instruction execution by only
seven core clock cycles, which is the length of the instruction
execution pipeline. This delay is too short to be visible to software.
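
To give a feel for the scale of this delay, the short Python calculation below converts cycle counts to time. The seven-cycle replay cost and the "about twice as long" refetch cost come from the text; the 2.5 GHz clock is a hypothetical round figure chosen only for the arithmetic, not a quoted specification.

```python
REPLAY_CYCLES = 7          # replay cost stated in the text
REFETCH_FACTOR = 2         # refetch takes "about twice as long" (approximate)
ASSUMED_CLOCK_GHZ = 2.5    # hypothetical core clock, for illustration only


def delay_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a cycle count to nanoseconds at the given clock frequency."""
    return cycles / clock_ghz


replay_ns = delay_ns(REPLAY_CYCLES, ASSUMED_CLOCK_GHZ)
refetch_ns = delay_ns(REPLAY_CYCLES * REFETCH_FACTOR, ASSUMED_CLOCK_GHZ)
```

Under these assumptions a replay costs roughly 2.8 ns and a refetch roughly 5.6 ns, both far below any timescale that software could observe.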
Refetch—identifying and fixing most other soft errors
Some soft errors that are detected in the instruction execution
pipeline cannot be cured by a replay. For example, a soft error
can occur in the instruction buffer itself. When the affected
instruction is read from the buffer into the pipeline during a
[Figure 2 diagram. Labels recoverable from the image: Front-End, Instruction Buffer, Back-End, Re-fetch Path; pipeline stages IPG, FET, FDC, REN, DEC, REG, EXE, DET, WRB, WB2.]
Figure 2. Poulson refetch pipeline. The Intel Itanium processor 9500 series
invalidates and refetches without impact on software.
Ratchet Up Reliability for Mission-Critical Applications