A Detailed Look Inside the Intel® NetBurst™ Micro-Architecture of the Intel Pentium® 4 Processor
November 2000
Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document.
Revision History
Revision Date    Revision    Major Changes
11/2000          1.
Table of Contents
ABOUT THIS DOCUMENT
INTRODUCTION
SIMD TECHNOLOGY AND STREAMING SIMD EXTENSIONS 2
About this Document
The Intel® NetBurst™ micro-architecture is the foundation for the Intel® Pentium® 4 processor. It includes several important new features and innovations that will allow the Intel Pentium 4 processor and future IA-32 processors to deliver industry-leading performance for the next several years.
Introduction
The Intel® Pentium® 4 processor, utilizing the Intel® NetBurst™ micro-architecture, is a complete processor redesign that delivers new technologies and capabilities while advancing many of the innovative features, such as “out-of-order speculative execution” and “super-scalar execution”, introduced on prior Intel® micro-architectural generations.
computations to operate on packed double-precision floating-point data elements and 128-bit packed integers. There are 144 instructions in SSE2 that can operate on two packed double-precision floating-point data elements, or on 16 packed byte, 8 packed word, 4 packed doubleword, or 2 packed quadword integers.
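The packed layouts listed above all occupy the same 128 bits. A minimal portable C sketch of that layout follows. The union `xmm128_t` and both helper functions are hypothetical names introduced here for illustration; they model only how the data is packed and do not use SSE2 instructions themselves. The sketch assumes a 64-bit `double`.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit register; the type and field
   names are illustrative and not part of any Intel header. */
typedef union {
    double   f64[2];   /* 2 packed double-precision floating-point values */
    uint8_t  u8[16];   /* 16 packed bytes */
    uint16_t u16[8];   /* 8 packed words */
    uint32_t u32[4];   /* 4 packed doublewords */
    uint64_t u64[2];   /* 2 packed quadwords */
} xmm128_t;

/* Number of elements of the given width (in bits) that fit in 128 bits. */
static size_t lanes_for_width(size_t element_bits)
{
    return 128 / element_bits;
}

/* Element-wise addition of two packed doubles, written as scalar C.
   A packed SSE2 add performs both additions with one instruction. */
static xmm128_t add_packed_double(xmm128_t a, xmm128_t b)
{
    xmm128_t r;
    for (size_t i = 0; i < 2; i++)
        r.f64[i] = a.f64[i] + b.f64[i];
    return r;
}
```

Calling `lanes_for_width` with 8, 16, 32, and 64 reproduces the element counts in the text.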
The SSE instructions are useful for 3D geometry, 3D rendering, speech recognition, and video encoding and decoding. For more information on the Streaming SIMD Extensions, refer to the IA-32 Intel® Architecture Software Developer’s Manual, Volume 1, available at http://developer.intel.com/design/pentium4/manuals/.
Intel® NetBurst™ Micro-architecture
The Pentium® 4 processor is the first hardware implementation of a new micro-architecture, the Intel NetBurst micro-architecture.
Overview of the Intel® NetBurst™ Micro-architecture Pipeline
The pipeline of the Intel NetBurst micro-architecture contains three sections:
§ the in-order issue front end
§ the out-of-order superscalar execution core
§ the in-order retirement unit.
Figure 3. The Intel® NetBurst™ Micro-architecture
The front end supplies instructions in program order to the out-of-order core.
µops called traces, which are stored in the execution trace cache. The execution trace cache stores these µops in the path of program execution flow, where the results of branches in the code are integrated into the same cache line.
Prefetching
The Intel NetBurst micro-architecture supports three prefetching mechanisms:
§ the first is for instructions only
§ the second is for data only
§ the third is for code or data.
The first mechanism is a hardware instruction fetcher that automatically prefetches instructions. The second is a software-controlled mechanism that fetches data into the caches using the prefetch instructions.
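As an illustration of the software-controlled mechanism, the sketch below requests data a fixed distance ahead of where a loop is reading. It relies on the GCC/Clang builtin `__builtin_prefetch`, which the compiler may lower to a prefetch instruction on targets that have one. The function name `sum_with_prefetch` and the distance of 16 elements are assumptions made for this example, not values recommended by the text.

```c
#include <stddef.h>

/* Assumed tuning value: how many elements ahead to prefetch. A suitable
   distance depends on memory latency and loop cost; 16 is a placeholder. */
#define PREFETCH_DISTANCE 16

/* Sum an array while hinting that upcoming elements will be read soon.
   The prefetch is only a hint; the result is the same without it. */
static long sum_with_prefetch(const int *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + PREFETCH_DISTANCE < n) {
            /* Arguments: address, 0 = read access, 3 = keep in all cache levels. */
            __builtin_prefetch(&data[i + PREFETCH_DISTANCE], 0, 3);
        }
#endif
        total += data[i];
    }
    return total;
}
```

A simple linear scan like this is also the kind of access pattern a hardware prefetcher can often detect by itself, so the example shows the mechanics rather than a case where a speedup should be expected.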
The Static Predictor. Once the branch instruction is decoded, the direction of the branch (forward or backward) is known. If there was no valid entry in the BTB for the branch, the static predictor makes a prediction based on the direction of the branch. The static prediction mechanism predicts backward conditional branches (those with negative displacement), such as loop-closing branches, as taken.
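The rule above can be written as a small decision function. This is a sketch only: the function names and the boolean BTB inputs are hypothetical, and treating forward conditional branches as predicted not taken is the complementary case assumed here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: static prediction from the sign of the branch
   displacement. Backward (negative displacement) is predicted taken;
   forward is assumed to be predicted not taken. */
static bool static_predict_taken(int32_t displacement)
{
    return displacement < 0;
}

/* Hypothetical combination: use the BTB's prediction when it has a
   valid entry, otherwise fall back to the static rule. */
static bool predict_taken(bool btb_hit, bool btb_says_taken, int32_t displacement)
{
    return btb_hit ? btb_says_taken : static_predict_taken(displacement);
}
```

For example, a loop-closing branch with displacement -24 and no BTB entry is predicted taken, while a forward branch with displacement +40 and no BTB entry is predicted not taken.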
§ selecting IA-32 instructions that can be decoded into fewer than four µops and/or have short latencies
§ ordering IA-32 instructions to preserve available parallelism by minimizing long dependence chains and covering long instruction latencies
§ ordering instructions so that their operands are ready and their corresponding issue ports and execution units are free when they reach the scheduler.
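The advice about minimizing long dependence chains can be illustrated at the source level. In the sketch below, the first function forms one chain in which every addition waits on the previous one; the second splits the work across four independent accumulators so that an out-of-order core has several additions it can execute in parallel. Both function names and the choice of four accumulators are assumptions for illustration, and an optimizing compiler may perform a similar transformation without help.

```c
#include <stddef.h>
#include <stdint.h>

/* One accumulator: each add depends on the result of the previous add. */
static uint64_t sum_single_chain(const uint32_t *v, size_t n)
{
    uint64_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Four accumulators: four shorter, mutually independent chains. */
static uint64_t sum_four_chains(const uint32_t *v, size_t n)
{
    uint64_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; i++)   /* leftover elements when n is not a multiple of 4 */
        s0 += v[i];
    return s0 + s1 + s2 + s3;
}
```

Because the arithmetic is on integers, both functions return identical results; only the shape of the dependence graph differs.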
Port 3. Port 3 supports the dispatch of one store address operation per cycle. Thus the total issue bandwidth can range from zero to six µops per cycle. Each pipeline contains several execution units. Each µop is dispatched to the pipeline that corresponds to its type of operation.
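The upper bound of six µops per cycle can be checked with simple arithmetic. The sketch below assumes that ports 0 and 1 can each dispatch up to two µops per cycle and that port 2 dispatches one load per cycle; those per-port figures are not stated in the paragraph above and are labeled as assumptions here. Only the port 3 figure and the total come from the text.

```c
/* Assumed peak µops per cycle for ports 0 to 3. Only the last entry
   (port 3: one store address operation) is stated in the text above. */
static const int peak_uops_per_port[4] = { 2, 2, 1, 1 };

/* Sum the per-port peaks to get the maximum issue bandwidth. */
static int peak_issue_bandwidth(void)
{
    int total = 0;
    for (int port = 0; port < 4; port++)
        total += peak_uops_per_port[port];
    return total;
}
```

Under these assumptions the sum is 2 + 2 + 1 + 1 = 6, which matches the stated maximum.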
b) avoiding the need to access off-chip caches, which can increase the realized bandwidth compared to a normal load-miss, which returns data to all cache levels.
The situations that are less likely to benefit from software-controlled data prefetch are the following:
§ In cases that are already bandwidth bound, prefetching tends to increase bandwidth demands, and is thus unlikely to be effective.
branches are resolved. However, speculative loads cannot cause page faults. Reordering loads with respect to each other can prevent a load miss from stalling later loads. Reordering loads with respect to other loads and stores to different addresses can enable more parallelism, allowing the machine to execute more operations as soon as their inputs are ready.