Advanced Computer Architecture Unit 3 – Advanced Pipelining: Techniques & Hazards

Advanced pipelining techniques are crucial for enhancing processor performance. By dividing instruction execution into stages and employing strategies like superpipelining and out-of-order execution, processors can achieve higher clock frequencies and increased throughput. However, these advanced techniques introduce challenges such as data and control hazards. To mitigate these issues, processors employ sophisticated methods like forwarding, branch prediction, and speculative execution, balancing performance gains with the complexities of managing dependencies and resource conflicts.

Study Guides for Unit 3

3.1 Data Hazards and Forwarding
3.2 Control Hazards and Branch Prediction
3.3 Exception Handling in Pipelined Processors
3.4 Advanced Pipeline Optimizations

Pipelining Basics Recap

Pipelining improves processor performance by overlapping the execution of multiple instructions
Divides instruction execution into stages (fetch, decode, execute, memory access, write back)
Each stage operates concurrently on different instructions
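Shorter stages permit a higher clock frequency, and an ideal pipeline completes one instruction per cycle, although the latency of each individual instruction is not reduced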
Requires careful management of dependencies and hazards to ensure correct execution
- Data dependencies occur when an instruction relies on the result of a previous instruction
- Control dependencies arise from branch instructions that alter the program flow
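
The throughput gain from overlapping can be made concrete with a small cycle count. This is a minimal sketch, assuming every stage takes exactly one cycle, there are no hazards, and the unpipelined machine runs one instruction through all stages before starting the next. The function names are illustrative only.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles when each instruction finishes every stage before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Cycles for an ideal pipeline: fill it once, then finish one instruction per cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def ideal_speedup(n_instructions: int, n_stages: int) -> float:
    """Ratio of unpipelined to pipelined cycle counts (assumes n_instructions > 0)."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )
```

For 100 instructions on five stages this gives 500 cycles versus 104, and the speedup approaches the stage count as the instruction count grows. Hazards and stalls reduce it in practice.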

Advanced Pipeline Stages

Superpipelining increases the number of pipeline stages to achieve higher clock frequencies
Superpipelined processors such as the Intel Pentium 4 (20 stages in Willamette, 31 in Prescott) have far deeper pipelines than the classic five-stage design
Splitting complex stages (execute) into multiple substages allows for shorter clock cycles
Additional stages may include address generation, register read/write, and branch resolution
Introduces more opportunities for hazards and requires advanced techniques to mitigate them
- Forwarding paths become longer and more complex
- Branch prediction accuracy becomes critical because the misprediction penalty grows with pipeline depth
Careful balancing of stage latencies is necessary to prevent performance bottlenecks
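
The trade-off between a shorter cycle and a larger misprediction penalty can be sketched with a toy model. Everything here is an assumption made for illustration: the delay figures, the workload mix, and the rule that a flush costs a number of cycles proportional to pipeline depth. None of it describes a real processor.

```python
def time_per_instruction_ns(
    stages: int,
    logic_delay_ns: float = 10.0,    # assumed total logic delay for one instruction
    latch_overhead_ns: float = 0.5,  # assumed pipeline register overhead per stage
    branch_fraction: float = 0.2,    # assumed share of instructions that are branches
    mispredict_rate: float = 0.1,    # assumed fraction of branches mispredicted
    penalty_fraction: float = 1.0,   # assumed: a flush costs this share of the depth
) -> float:
    """Average time per instruction = CPI * cycle time under the toy model."""
    cycle_ns = logic_delay_ns / stages + latch_overhead_ns
    penalty_cycles = penalty_fraction * stages
    cpi = 1.0 + branch_fraction * mispredict_rate * penalty_cycles
    return cpi * cycle_ns


def best_depth(max_stages: int = 100, **params) -> int:
    """Depth in 1..max_stages that minimizes time per instruction."""
    return min(
        range(1, max_stages + 1),
        key=lambda k: time_per_instruction_ns(k, **params),
    )
```

With the default assumptions, going from 5 to 20 stages helps, but very deep pipelines lose again because flushes dominate. With a perfect predictor (`mispredict_rate=0.0`) the model always prefers the deepest pipeline allowed, which is why prediction accuracy matters more as depth increases.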

Instruction-Level Parallelism

ILP refers to the ability to execute multiple independent instructions simultaneously
Pipelining exploits ILP by overlapping the execution of instructions
Out-of-order execution allows instructions to be executed in a different order than the program sequence
- Requires hardware to track dependencies and reorder instructions
- Enables better utilization of pipeline resources and higher ILP
Superscalar architectures issue multiple instructions per clock cycle to multiple execution units
Very Long Instruction Word (VLIW) architectures bundle multiple independent operations into a single instruction, relying on the compiler rather than hardware to schedule them
ILP is limited by data dependencies, control dependencies, and resource constraints
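
One way to see how data dependencies cap ILP is to compute the dataflow limit of a short instruction sequence. This is a rough sketch under stated assumptions: every instruction takes one cycle, there are unlimited execution units, only true (RAW) dependencies count, and control dependencies are ignored. The `(dest, sources)` tuple format is made up for this example.

```python
def dataflow_ilp(program):
    """Return instructions / critical-path length for a list of (dest, sources)."""
    ready_at = {}  # register name -> cycle in which its latest value is produced
    depth = 0      # length of the longest dependency chain seen so far
    for dest, sources in program:
        # An instruction can run one cycle after its slowest producer.
        start = 1 + max((ready_at.get(s, 0) for s in sources), default=0)
        ready_at[dest] = start
        depth = max(depth, start)
    return len(program) / depth if depth else 0.0


example = [
    ("r1", ["r2", "r3"]),  # I0
    ("r4", ["r1", "r5"]),  # I1 depends on I0
    ("r6", ["r7", "r8"]),  # I2 is independent
    ("r9", ["r6", "r4"]),  # I3 depends on I1 and I2
]
```

Here the longest chain is I0, I1, I3, so four instructions need at least three cycles and `dataflow_ilp(example)` is 4/3. Resource constraints and control dependencies lower the achievable ILP further.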

Data Hazards and Forwarding

Data hazards occur when an instruction depends on the result of a previous instruction still in the pipeline
Read After Write (RAW) hazards are the most common type of data hazard
Forwarding (bypassing) is a technique used to mitigate RAW hazards
- Routes a result from a later pipeline register (for example EX/MEM or MEM/WB) directly to the ALU input of the dependent instruction, instead of waiting for it to be written back to the register file
- Requires additional forwarding paths and control logic
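Load-use hazards occur when an instruction uses the result of a load immediately after it
- Forwarding alone cannot remove the hazard, because the loaded value is not available until the end of the memory access stage
- In the classic five-stage pipeline, hardware stalls the dependent instruction for one cycle and then forwards the loaded value; a delayed load instead relies on the compiler to place an independent instruction after the load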
Compiler optimization techniques (instruction scheduling) can help reduce data hazards

Control Hazards and Branch Prediction

Control hazards occur due to branch instructions that alter the program flow
Until a branch is resolved, the pipeline must either stall or keep fetching along a guessed path; instructions fetched from the wrong path are discarded once the branch outcome is known
Branch prediction techniques are used to mitigate control hazards
- Static branch prediction uses heuristics (backward taken, forward not taken) to predict branch outcomes
- Dynamic branch prediction uses runtime information to make more accurate predictions
  - Branch history tables (BHTs) store recent branch outcomes, typically as 2-bit saturating counters indexed by low-order bits of the branch address
  - Two-level adaptive predictors (local and global history) improve prediction accuracy
Branch delay slots (used in MIPS and SPARC) always execute the instruction that follows a branch, so the compiler can place useful work there while the branch is being resolved
Speculative execution fetches and executes instructions based on predicted branch outcomes
- Requires mechanisms to discard speculative results if the prediction is incorrect

Pipeline Stalls and Bubbles

Pipeline stalls occur when an instruction cannot proceed to the next stage due to a hazard or resource conflict
Stalls introduce bubbles (empty stages) in the pipeline, reducing performance
Data hazards can cause stalls if the required data is not available in time
- Forwarding (bypassing) removes most data hazard stalls
Control hazards cost cycles when a branch is mispredicted: the wrong-path instructions are flushed, leaving bubbles in their place
- Accurate branch prediction minimizes control hazard stalls
Structural hazards arise when multiple instructions compete for the same hardware resource
- Sufficient hardware replication (multiple execution units) can alleviate structural hazards
Out-of-order execution and dynamic scheduling help reduce stalls by allowing independent instructions to proceed

Performance Optimization Techniques

Instruction prefetching fetches instructions ahead of time to reduce instruction cache misses and the fetch stalls they cause
Branch prediction and speculative execution optimize the handling of control hazards
Out-of-order execution and dynamic scheduling maximize resource utilization and minimize stalls
Register renaming eliminates false dependencies (WAR and WAW hazards) by using a larger set of physical registers
Superscalar and VLIW architectures exploit ILP by issuing multiple instructions per cycle
Compiler optimizations (loop unrolling, software pipelining) help expose more ILP and reduce hazards
Cache optimization techniques (prefetching, cache hierarchies) reduce memory access latencies
Multithreading allows multiple threads to share pipeline resources, hiding latencies and improving throughput

Real-world Pipeline Implementations

Intel's Skylake microarchitecture (Core family) uses a 14-19 stage pipeline with advanced features
- Out-of-order execution, branch prediction, and speculative execution
- Supports hyper-threading (simultaneous multithreading) for improved throughput
ARM Cortex-A series processors employ deep pipelines and advanced hazard mitigation techniques
- Cortex-A77 has a 13-stage pipeline with out-of-order execution and speculation
IBM Power processors (Power9) feature a deep pipeline and aggressive optimization techniques
- Out-of-order execution, branch prediction, and simultaneous multithreading
AMD Zen microarchitecture (Ryzen) utilizes a high-performance pipeline with advanced features
- Perceptron-based branch prediction, micro-op cache, and large instruction windows
RISC-V is an open instruction set architecture, so pipeline design varies from one implementation to another
- Range from simple in-order pipelines to advanced out-of-order designs with speculation