Advanced Computer Architecture

Advanced Computer Architecture Unit 10 – Multicore Architectures & Thread Parallelism

Multicore architectures and thread parallelism are key to modern computing. By integrating multiple processing cores on a single chip, these systems can execute multiple threads simultaneously, boosting performance for parallel workloads. Thread-level parallelism exploits multicore capabilities to improve efficiency. This approach involves dividing tasks into threads that run concurrently, leveraging techniques like symmetric multiprocessing and simultaneous multithreading to maximize resource utilization and enhance overall system performance.

Core Concepts and Terminology

  • Multicore architecture refers to a single computing component with two or more independent processing units called cores
  • Thread-level parallelism (TLP) exploits the availability of multiple cores to execute multiple threads simultaneously, improving overall performance
  • Symmetric multiprocessing (SMP) is a multiprocessor computer architecture where two or more identical processors are connected to a single shared main memory
  • Chip multiprocessor (CMP) integrates multiple processor cores on a single integrated circuit die
  • Simultaneous multithreading (SMT) is a technique that allows multiple independent threads to execute on a single core, sharing its resources (Intel Hyper-Threading)
  • Parallel programming involves writing software that can leverage multiple cores or processors to perform tasks concurrently
  • Scalability measures a system's ability to handle increased workload by adding more resources (cores, memory)
  • Amdahl's Law states that the speedup of a program using multiple processors is limited by the sequential fraction of the program
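Amdahl's Law can be made concrete with a short worked instance (the numbers below are illustrative, not from the text):

```latex
% Amdahl's Law: speedup on N processors when a fraction p of the
% program's work is parallelizable and (1 - p) must run sequentially
S(N) = \frac{1}{(1 - p) + \frac{p}{N}}
% Illustrative example: p = 0.95, N = 8 cores
% S(8) = 1 / (0.05 + 0.95/8) = 1 / 0.16875 \approx 5.93
% Upper bound as N \to \infty: S_{\max} = 1 / (1 - p) = 20
```

Even with unlimited cores, a 5% sequential fraction caps the speedup at 20x, which is why shrinking the sequential portion often matters more than adding cores.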

Evolution of Multicore Architectures

  • Early multiprocessor systems (1960s-1980s) used multiple discrete processors on separate chips, connected via a shared bus or interconnect
  • Advancements in semiconductor technology enabled the integration of multiple cores on a single chip, leading to the development of chip multiprocessors (CMPs) in the early 2000s
  • Intel introduced Hyper-Threading Technology (simultaneous multithreading) in 2002, allowing a single core to execute multiple threads concurrently
  • The number of cores per chip has steadily increased over time, with modern server processors featuring 64 or more cores (AMD EPYC)
  • Heterogeneous multicore architectures, such as the Cell Broadband Engine (PlayStation 3) and ARM big.LITTLE, combine different types of cores optimized for specific tasks
  • Many-core processors, such as Intel Xeon Phi and NVIDIA GPUs, feature hundreds or thousands of simpler cores designed for highly parallel workloads
  • Future trends include the integration of specialized accelerators (AI, graphics) and the adoption of 3D chip stacking technologies to increase core density

Thread-Level Parallelism Basics

  • Thread-level parallelism (TLP) aims to improve performance by executing multiple threads simultaneously on different cores or processors
  • Threads are lightweight units of execution that share the same memory space within a process
  • TLP can be achieved through explicit threading (programmer-defined threads) or implicit threading (automatically extracted by the compiler or runtime)
  • Data parallelism involves distributing data across multiple threads, each performing the same operation on a different subset of the data (SIMD)
  • Task parallelism involves distributing different tasks or functions across multiple threads, each performing a unique operation
  • Synchronization mechanisms, such as locks and semaphores, are used to coordinate access to shared resources and prevent data races (see the Pthreads sketch after this list)
  • Load balancing techniques, such as work stealing and dynamic scheduling, help distribute the workload evenly across threads to maximize resource utilization
  • TLP performance is influenced by factors such as the number of available cores, memory bandwidth, cache coherence, and communication overhead
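As a minimal sketch of explicit threading in the shared-memory style described above, the C program below sums an array with Pthreads: each thread processes a disjoint slice (data parallelism), keeps a private accumulator, and takes a mutex only once to combine results. Names such as NTHREADS, partial_sum, and the array size are illustrative assumptions, not from the text.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double total = 0.0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct { int begin, end; } range_t;

static void *partial_sum(void *arg) {
    range_t *r = (range_t *)arg;
    double local = 0.0;                 /* private accumulator: no sharing */
    for (int i = r->begin; i < r->end; i++)
        local += data[i];
    pthread_mutex_lock(&lock);          /* synchronize only once per thread */
    total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    range_t ranges[NTHREADS];
    for (int i = 0; i < N; i++) data[i] = 1.0;

    int chunk = N / NTHREADS;
    for (int t = 0; t < NTHREADS; t++) {
        ranges[t].begin = t * chunk;
        ranges[t].end = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, partial_sum, &ranges[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("sum = %f\n", total);
    return 0;
}
```

Accumulating locally and locking once per thread, rather than locking on every iteration, keeps synchronization overhead from dominating the parallel work. Compile with the -pthread flag.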

Multicore Processor Design

  • Multicore processors integrate multiple processing cores on a single chip, sharing resources such as cache and memory controllers
  • Symmetric multiprocessing (SMP) designs treat all cores as equal, with each core having access to the same shared memory and I/O resources
  • Asymmetric multiprocessing (AMP) designs feature cores with different capabilities or specialized functions, such as the ARM big.LITTLE architecture
  • Cache coherence protocols (MESI, MOESI) ensure that multiple copies of shared data in different caches remain consistent (a simplified MESI sketch follows this list)
  • Interconnect topologies, such as bus, ring, mesh, and crossbar, determine how cores and memory are connected and influence scalability and performance
  • Non-Uniform Memory Access (NUMA) architectures divide memory into local and remote regions, with each core having faster access to its local memory
  • Power management techniques, such as dynamic voltage and frequency scaling (DVFS) and power gating, help reduce energy consumption in multicore processors
  • Heterogeneous architectures combine general-purpose cores with specialized accelerators (GPUs, FPGAs) to optimize performance for specific workloads
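To make the coherence bullet more concrete, here is a deliberately simplified sketch of MESI state transitions for a single cache line in C. It is illustrative only: the function name mesi_next and the event encoding are invented for this example, and real protocols must also handle write-backs, bus arbitration, and transient states.

```c
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

/* Next MESI state for one cache line, given a local access or a
 * remote (bus-snooped) access from another core's cache. */
mesi_t mesi_next(mesi_t s, event_t e, int other_copies_exist) {
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                    /* read miss: fetch the line */
            return other_copies_exist ? SHARED : EXCLUSIVE;
        return s;                            /* M/E/S hits keep their state */
    case LOCAL_WRITE:
        return MODIFIED;                     /* from S/I, other copies are invalidated first */
    case REMOTE_READ:
        if (s == MODIFIED || s == EXCLUSIVE) /* supply the data, drop to Shared */
            return SHARED;
        return s;
    case REMOTE_WRITE:
        return INVALID;                      /* another core takes ownership */
    }
    return s;
}
```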

Memory Hierarchy in Multicore Systems

  • The memory hierarchy in multicore systems consists of multiple levels of cache (L1, L2, L3), main memory (DRAM), and storage (SSD, HDD)
  • Private caches (L1, L2) are dedicated to individual cores, providing fast access to frequently used data
  • Shared caches (L3) are accessible by all cores, facilitating data sharing and reducing off-chip memory accesses
  • Cache coherence protocols ensure that multiple copies of shared data in different caches remain consistent
    • The MESI protocol (Modified, Exclusive, Shared, Invalid) is commonly used in multicore systems
    • The MOESI protocol adds an Owned state to reduce memory traffic in certain scenarios
  • Non-Uniform Memory Access (NUMA) architectures partition memory into local and remote regions, with each core having faster access to its local memory
  • Memory controllers manage access to main memory (DRAM) and implement scheduling policies to optimize bandwidth utilization
  • Prefetching techniques, such as hardware prefetchers and software-guided prefetching, can hide memory latency by fetching data before it is needed (see the prefetching sketch after this list)
  • Memory consistency models (sequential, relaxed) define the allowable orderings of memory operations and influence programmability and performance
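As a small sketch of software-guided prefetching, the loop below uses the __builtin_prefetch intrinsic available in GCC and Clang. The PREFETCH_DISTANCE constant is an illustrative tuning knob whose best value depends on memory latency and per-iteration cost, and on many machines the hardware prefetcher already covers a simple streaming pattern like this one.

```c
#define PREFETCH_DISTANCE 16

double sum_with_prefetch(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++) {
        /* Request a future element ahead of use: arg 2 = 0 (read),
         * arg 3 = 1 (low temporal locality, streaming access). */
        if (i + PREFETCH_DISTANCE < n)
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
        s += a[i];
    }
    return s;
}
```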

Parallel Programming Models

  • Parallel programming models provide abstractions and tools for writing software that can exploit thread-level parallelism
  • Shared-memory models (OpenMP, Pthreads) allow multiple threads to access and modify shared data structures
    • OpenMP is a directive-based model that supports parallel loops, sections, and tasks (see the example after this list)
    • Pthreads is a low-level API for creating and managing threads explicitly
  • Distributed-memory models (MPI) enable parallel execution across multiple nodes, with each node having its own local memory
  • Partitioned global address space (PGAS) models (UPC, Coarray Fortran) provide a shared-memory abstraction over distributed memory
  • Task-based models (Cilk, Intel TBB) express parallelism through the decomposition of a program into tasks, which are scheduled dynamically
  • Stream processing models (CUDA, OpenCL) leverage the massive parallelism of GPUs to accelerate data-parallel computations
  • Functional programming languages (Haskell, Erlang) emphasize immutability and side-effect-free functions, making them well-suited for parallel execution
  • Domain-specific languages (DSLs) provide high-level abstractions tailored to specific application domains (SQL for databases, Halide for image processing)
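A minimal OpenMP example of the directive-based style mentioned above (the array contents and size are illustrative): the parallel for directive splits the loop iterations across threads, and the reduction clause gives each thread a private copy of sum that is combined at the end, avoiding a data race on the shared variable.

```c
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 0.5;

    double sum = 0.0;
    /* Each thread gets a private sum; partials are combined at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

Compile with -fopenmp; without that flag the pragma is ignored and the program runs correctly but sequentially, which is part of OpenMP's appeal as an incremental model.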

Performance Optimization Techniques

  • Load balancing distributes the workload evenly across threads or cores to maximize resource utilization
    • Static load balancing assigns work to threads at compile-time based on a fixed partitioning scheme
    • Dynamic load balancing adjusts the workload distribution at runtime based on factors such as system load and task progress
  • Data locality optimization involves structuring data and computation to maximize cache hits and minimize memory accesses
    • Spatial locality refers to accessing data elements that are close together in memory (array elements)
    • Temporal locality refers to reusing recently accessed data elements (loop variables)
  • Vectorization exploits data-level parallelism by performing the same operation on multiple data elements simultaneously using SIMD instructions
  • Loop transformations, such as unrolling, tiling, and fusion, can improve cache performance and enable vectorization
  • Synchronization overhead can be reduced by minimizing the use of locks and barriers and by using lock-free data structures when possible
  • False sharing occurs when multiple threads access different parts of the same cache line, leading to unnecessary invalidations and performance degradation (a padding sketch follows this list)
  • NUMA-aware scheduling and memory allocation policies can minimize remote memory accesses and improve performance on NUMA systems
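One common mitigation for false sharing is padding per-thread data so each item occupies its own cache line. The sketch below assumes a 64-byte line size, which is typical on x86 but not universal, and an 8-byte long, which holds on most 64-bit platforms:

```c
#define CACHE_LINE 64

struct padded_counter {
    long value;
    char pad[CACHE_LINE - sizeof(long)];  /* keep neighbors off this line */
};

struct padded_counter counters[8];  /* one per thread; no invalidation ping-pong */
```

Without the padding, threads incrementing adjacent long counters would repeatedly invalidate each other's copies of the same cache line even though no data is logically shared.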

Challenges and Future Directions

  • Scalability becomes increasingly difficult as the number of cores and threads grows, due to factors such as communication overhead, synchronization, and memory bandwidth limitations
  • Power and energy efficiency are critical concerns in multicore systems, requiring advanced power management techniques and energy-aware scheduling algorithms
  • Heterogeneous architectures, combining general-purpose cores with specialized accelerators (GPUs, FPGAs), introduce programming challenges and require new tools and frameworks
  • Dark silicon refers to the increasing proportion of a chip that must remain powered off due to thermal and power constraints, limiting the number of active cores
  • 3D chip stacking technologies, such as through-silicon vias (TSVs), enable the integration of multiple layers of cores, memory, and interconnects, improving performance and energy efficiency
  • Neuromorphic computing aims to emulate the brain's structure and function using specialized hardware, such as IBM TrueNorth and Intel Loihi, to achieve high energy efficiency for AI workloads
  • Quantum computing leverages the principles of quantum mechanics to perform certain computations exponentially faster than classical computers, with potential applications in cryptography, optimization, and machine learning
  • Non-volatile memory technologies, such as Intel Optane and memristors, blur the line between memory and storage, enabling new architectures and programming models

