The rise of GPUs

GPUs evolved from niche graphics accelerators into the engines of modern AI. This deep dive explores the architectural divergence, SIMT model, memory hierarchy, and the rise of Tensor Cores that made it all possible.

From Pixels to Planets: The Unlikely Rise of the GPU

For decades, the Central Processing Unit (CPU) was the undisputed monarch of computation. It was the generalist, the jack-of-all-trades designed to run operating systems, execute complex logic, and handle a chaotic mix of tasks with minimal latency. The Graphics Processing Unit (GPU), by contrast, was a specialist consigned to a niche: making video games look good. Its job was simple—transform 3D geometry into 2D pixels, paint them with textures, and do it over 60 times a second.

Today, the roles are reversed. The GPU is no longer just a peripheral; it’s the engine of modern artificial intelligence, a critical tool in scientific simulation, and a geopolitical asset so valuable that its export is restricted. This is the story of a profound architectural divergence, a happy accident of hardware evolution, and the rise of parallel processing from the silicon shadows to the center of the universe.

The Architectural Bet: Throughput Over Latency

The fundamental difference between a CPU and a GPU isn't speed; it's a philosophical approach to problem-solving. A CPU is optimized for latency. It's a sprinter. It must execute a single thread of complex, branching instructions as fast as possible.

To achieve this, a modern CPU devotes a staggering amount of silicon and power to features that hide latency:

Out-of-order execution: The CPU analyzes the instruction stream and executes independent instructions in a different order than specified to keep its pipelines full.

Branch prediction: It tries to guess the outcome of an if statement before it's fully resolved.

Deep, multi-level caches (L1, L2, L3): Large caches (which can occupy a substantial share of the die area) store data close to the execution units, avoiding the hundreds of cycles it takes to fetch data from main memory.

A GPU, on the other hand, is a cross-country team of ten thousand sprinters. It is optimized for throughput. Its primary goal is to maximize the total number of operations completed per second, not the time to complete any single operation.

To do this, it abandons the luxury features of a CPU. A GPU dedicates the vast majority of its transistor budget not to complex control logic or huge caches, but to a sea of relatively simple arithmetic logic units (ALUs).

The Anatomy of a Compute Unit:
Imagine a modern NVIDIA Streaming Multiprocessor (SM) or an AMD Compute Unit (CU). It's designed for a programming model called Single Instruction, Multiple Threads (SIMT). The hardware manages hundreds of concurrent threads in groups called "warps" (NVIDIA, 32 threads wide) or "wavefronts" (AMD, 32 or 64 threads wide depending on the architecture).

Here’s the critical part: every thread in a warp executes the same instruction at the same time, but on different data. If a branch occurs where some threads take the if path and others take the else path, the GPU doesn't predict. It serializes the execution: the threads on the if path run first while the others idle (their execution is masked out), and then the roles reverse. This is why divergent branching is the mortal enemy of GPU performance. Memory stalls are a different matter, and here the hardware's genius lies in hiding them not with caches, but with numbers.
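The masking behaviour described above can be modelled in a few lines of Python. This is only a sketch of the idea, not of any real scheduler: run_warp, the 32-lane width and the two toy branch bodies are all assumptions made for illustration. It counts how many times the warp has to issue a branch body.

```python
WARP_SIZE = 32  # assumed lane count for this sketch


def run_warp(data, predicate, if_op, else_op):
    """Model one warp executing `if predicate(x): if_op(x) else: else_op(x)`.

    Returns (results, passes). `passes` counts how many branch bodies the
    warp had to issue; lanes on the other path are masked out, not skipped.
    """
    assert len(data) == WARP_SIZE
    mask = [predicate(x) for x in data]      # per-lane branch outcome
    results = [None] * WARP_SIZE
    passes = 0

    if any(mask):                            # pass for lanes on the if path
        passes += 1
        for lane, x in enumerate(data):
            if mask[lane]:
                results[lane] = if_op(x)

    if not all(mask):                        # pass for lanes on the else path
        passes += 1
        for lane, x in enumerate(data):
            if not mask[lane]:
                results[lane] = else_op(x)

    return results, passes


def is_even(x):
    return x % 2 == 0


def halve(x):
    return x // 2


def triple_plus_one(x):
    return 3 * x + 1


uniform = list(range(0, 64, 2))   # 32 even values: every lane agrees
mixed = list(range(32))           # even and odd lanes interleaved

_, uniform_passes = run_warp(uniform, is_even, halve, triple_plus_one)
mixed_results, mixed_passes = run_warp(mixed, is_even, halve, triple_plus_one)

print(uniform_passes)  # 1
print(mixed_passes)    # 2
```

When every lane agrees, the warp issues one branch body; when lanes disagree, it issues both, and each pass leaves part of the warp idle.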

When Warp A stalls waiting for a memory fetch (which can take 400-800 clock cycles), a GPU's warp scheduler doesn't wait. It performs a zero-cost context switch, immediately swapping Warp A out for Warp B, which is ready to execute. This requires an enormous register file to hold the entire state of all active warps. Each SM in an NVIDIA AD102 GPU has a register file of 65,536 32-bit registers (256 KB). This capacity lets each SM keep many warps resident at once, so the chip as a whole maintains many thousands of "in-flight" threads, turning a memory access penalty into a scheduling opportunity. The CPU hides latency with a big cache; the GPU hides it with a big thread pool.
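A rough back-of-the-envelope version of this trade-off is sketched below. The 65,536-register figure comes from the text; the function names, the registers-per-thread values and the 20 busy cycles per warp are illustrative assumptions. The model also ignores the hardware's own cap on resident warps.

```python
import math

REGISTERS_PER_SM = 65_536   # 32-bit registers per SM, as quoted in the text
WARP_SIZE = 32              # assumed warp width


def warps_that_fit(registers_per_thread):
    """How many warps' worth of state the register file alone can hold."""
    return REGISTERS_PER_SM // (registers_per_thread * WARP_SIZE)


def warps_to_hide(latency_cycles, busy_cycles_per_warp):
    """Warps needed so that some warp is always ready during a stall.

    While one warp waits `latency_cycles`, each other warp contributes
    `busy_cycles_per_warp` of useful issue time before it stalls too.
    """
    return 1 + math.ceil(latency_cycles / busy_cycles_per_warp)


print(warps_that_fit(32))      # 64
print(warps_that_fit(128))     # 16
print(warps_to_hide(400, 20))  # 21
```

Under these assumptions, a kernel using 128 registers per thread leaves room for only 16 warps, fewer than the 21 needed to cover a 400-cycle fetch. This is why register pressure and latency hiding are so tightly linked.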

The Memory Hierarchy: A Tale of Two Bandwidths

This architectural choice dictates a different memory hierarchy. A CPU’s connection to main memory is a narrow but relatively fast highway (e.g., a dual-channel DDR5 interface with ~100 GB/s). The GPU needs an eight-lane superhighway. High-end GPUs pair a wide memory bus (384 to 512 bits for GDDR-based cards, several thousand bits for stacked HBM) with high-bandwidth memory technologies like GDDR6X or HBM (High Bandwidth Memory), achieving bandwidths from roughly 1 TB/s to several TB/s. This is essential to feed the thousands of ALUs.
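The bandwidth figures follow from simple arithmetic: bus width in bytes multiplied by the per-pin transfer rate. The sketch below assumes two illustrative configurations that are not taken from the text: dual-channel DDR5-6400 (two 64-bit channels) and a 384-bit GDDR6X bus at 21 Gb/s per pin. The function name is hypothetical.

```python
def peak_bandwidth_gb_s(bus_width_bits, giga_transfers_per_second):
    """Theoretical peak bandwidth in GB/s: bytes per transfer x GT/s."""
    return bus_width_bits / 8 * giga_transfers_per_second


# Assumed example configurations (illustrative, not from the article).
cpu = peak_bandwidth_gb_s(128, 6.4)    # dual-channel DDR5-6400
gpu = peak_bandwidth_gb_s(384, 21.0)   # 384-bit GDDR6X at 21 Gb/s per pin

print(round(cpu, 1))        # 102.4
print(round(gpu, 1))        # 1008.0
print(round(gpu / cpu, 1))  # 9.8
```

These are theoretical peaks; sustained bandwidth in real kernels is lower and depends heavily on access patterns.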

But the real secret weapon is the on-chip shared memory. This is a manually managed, software-controlled scratchpad; on NVIDIA hardware it sits alongside the L1 cache in the same on-chip SRAM, but the programmer, not the hardware, decides what lives in it. In CUDA, it’s declared with the __shared__ keyword. By explicitly staging data from global memory into shared memory, developers can orchestrate block-level data reuse, dramatically reducing the pressure on global bandwidth. This is the core of writing high-performance kernels for operations like matrix multiply, where a single tile of data can be reused for dozens of computations before being discarded.
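The effect of tiling on global-memory traffic can be illustrated with a plain Python model. This is a sketch rather than a CUDA kernel: tiled_matmul and its load counter are hypothetical, the staging lists stand in for __shared__ arrays, and the code assumes square matrices whose size is a multiple of the tile size.

```python
def tiled_matmul(A, B, tile):
    """Multiply square matrices A and B tile by tile.

    Returns (C, global_loads), where global_loads counts the elements
    copied from "global memory" (A and B) into the staging tiles.
    """
    n = len(A)
    assert n % tile == 0, "sketch assumes the tile size divides n"
    C = [[0.0] * n for _ in range(n)]
    global_loads = 0

    for bi in range(0, n, tile):          # one "thread block" per output tile
        for bj in range(0, n, tile):
            for bk in range(0, n, tile):
                # Stage one tile of A and one tile of B.
                a_tile = [row[bk:bk + tile] for row in A[bi:bi + tile]]
                b_tile = [row[bj:bj + tile] for row in B[bk:bk + tile]]
                global_loads += 2 * tile * tile

                # Each staged element is reused `tile` times below.
                for i in range(tile):
                    for j in range(tile):
                        acc = 0.0
                        for k in range(tile):
                            acc += a_tile[i][k] * b_tile[k][j]
                        C[bi + i][bj + j] += acc
    return C, global_loads


n = 8
A = [[float(i * n + j) for j in range(n)] for i in range(n)]
I = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

_, loads_untiled = tiled_matmul(A, I, 1)   # tile of 1: no reuse at all
C, loads_tiled = tiled_matmul(A, I, 4)

print(loads_untiled)  # 1024
print(loads_tiled)    # 256
```

In this model, global loads fall from 2n³ to 2n³/tile, so a 4x4 tile cuts traffic by a factor of four.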

The Killer App That Wasn't a Game

For years, this architecture was tailored for its "embarrassingly parallel" native task: graphics. Rendering a triangle involves an independent calculation for each vertex, then another independent calculation for each pixel the triangle covers, which maps perfectly onto the SIMT model. The fixed-function hardware for texture filtering and rasterization dominated the die.

The revolution began when researchers realized that the programmable shader cores, originally designed to run tiny C-like programs for vertex and pixel lighting effects, could be tricked. They formulated non-graphics problems as if they were rendering tasks. Data was packed into textures, and computations were written as pixel shaders. This was GPGPU (General-Purpose computing on GPUs) in its infancy—an exercise in arcane hacking using OpenGL or DirectX APIs.

The watershed moment came in 2006, when NVIDIA unveiled CUDA (Compute Unified Device Architecture); the first public toolkit followed in 2007. It was a risk. CUDA stripped away the graphics metaphors and exposed the parallel hardware directly to the programmer with a minimal extension of the C language. Now, a developer could launch a kernel, define a grid of thread blocks, and manage the memory hierarchy explicitly without pretending to render a triangle. This democratized supercomputing, and within a few GPU generations it put teraflop-scale performance into a $500 PCIe card.
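The grid-of-blocks indexing scheme can be mimicked in Python to show how each thread finds its own element. This is a sequential emulation for illustration only: launch, the idx object and the saxpy kernel are hypothetical stand-ins, not CUDA APIs. Real hardware runs these threads in parallel.

```python
from types import SimpleNamespace


def launch(kernel, grid_dim, block_dim, *args):
    """Call `kernel` once per (block, thread) pair, one after another."""
    for block in range(grid_dim):
        for thread in range(block_dim):
            idx = SimpleNamespace(
                block_idx=block, block_dim=block_dim, thread_idx=thread
            )
            kernel(idx, *args)


def saxpy(idx, n, a, x, y):
    """Each "thread" updates one element: y[i] = a * x[i] + y[i]."""
    i = idx.block_idx * idx.block_dim + idx.thread_idx  # global thread index
    if i < n:               # guard: the last block may run past the array
        y[i] = a * x[i] + y[i]


n = 10
x = [float(i) for i in range(n)]
y = [1.0] * n
block_dim = 4
grid_dim = (n + block_dim - 1) // block_dim   # ceiling division: 3 blocks

launch(saxpy, grid_dim, block_dim, n, 2.0, x, y)
print(grid_dim)  # 3
print(y[:4])     # [1.0, 3.0, 5.0, 7.0]
```

The bounds check matters because 3 blocks of 4 threads yield 12 threads for 10 elements; the two surplus threads must do nothing.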

The Epiphany: The AI "Goldilocks" Workload

What happened next was a case of a workload finding its perfect hardware. The core of deep learning is not complex, branching logic. It’s an almost laughably simple sequence of two operations performed billions of times over dense matrices and tensors: a multiply-accumulate (MAC) operation followed by a non-linear activation function.
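To make the two-step pattern concrete, here is a minimal pure-Python sketch of one dense layer. The function name is hypothetical, and ReLU is an assumed choice of activation; the text does not name a specific one.

```python
def dense_relu(weights, bias, x):
    """One dense layer: a chain of multiply-accumulates per output, then ReLU."""
    out = []
    for row, b in zip(weights, bias):
        acc = b
        for w, xi in zip(row, x):
            acc += w * xi              # multiply-accumulate (MAC)
        out.append(max(0.0, acc))      # non-linear activation (ReLU, assumed)
    return out


weights = [[1.0, -1.0],
           [0.5, 0.5]]
bias = [0.0, -2.0]

print(dense_relu(weights, bias, [2.0, 3.0]))  # [0.0, 0.5]
```

Every output element is independent of the others, and the inner loop contains no data-dependent branching. This is the shape of work a GPU is built for.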

Backpropagation, the learning algorithm, is just a systematic application of the chain rule from calculus, also expressible as a series of matrix multiplications and element-wise operations. This is the "Goldilocks" workload for a GPU:

Massively Parallel: Millions of independent multiplications and additions.

Compute-Intensive: The ratio of arithmetic operations to data fetched (arithmetic intensity) is high, especially for large matrices where a fetched weight is reused many times.

Regular and Predictable: Dense matrix multiplication involves predictable, strided memory access patterns, making memory coalescing and shared memory tiling highly effective.
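Arithmetic intensity can be estimated directly. The sketch below assumes ideal reuse (each matrix crosses the memory bus exactly once) and 2-byte FP16 elements; both function names are hypothetical. It compares a square matrix multiply with a simple element-wise update.

```python
def matmul_intensity(n, bytes_per_element=2):
    """FLOPs per byte for an (n x n) @ (n x n) multiply with ideal reuse."""
    flops = 2 * n ** 3                            # one multiply + one add per MAC
    bytes_moved = 3 * n * n * bytes_per_element   # read A, read B, write C
    return flops / bytes_moved


def elementwise_intensity(bytes_per_element=2):
    """FLOPs per byte for y[i] = a * x[i] + y[i]: 2 FLOPs, 3 memory accesses."""
    return 2 / (3 * bytes_per_element)


print(round(matmul_intensity(4096), 1))     # 1365.3
print(round(elementwise_intensity(), 2))    # 0.33
```

Under these assumptions, intensity grows linearly with matrix size (n/3 FLOPs per byte at FP16), while the element-wise operation stays fixed at a third of a FLOP per byte and is bound by memory bandwidth at any size.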

NVIDIA didn't just get lucky; they saw it coming. Starting with the Volta architecture in 2017, they introduced Tensor Cores. These are not general-purpose SIMT cores. They are specialized, systolic-array-like functional units designed to perform a single, hyper-optimized operation in one clock cycle:

D = A * B + C

This is a 4x4 matrix fused multiply-add. While a standard CUDA core might do one FMA per cycle, a single Volta Tensor Core, processing matrices in FP16 precision, can perform 64 FMAs (128 floating-point operations) in a single cycle. This is domain-specific acceleration built on top of an already parallel architecture, a silicon embodiment of the insight that AI is the new graphics.
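The operation count can be checked by writing the 4x4 fused multiply-add out as a loop. This sketch only reproduces the arithmetic; mma_4x4 is a hypothetical name, and real Tensor Cores perform the work in dedicated hardware rather than in a sequential loop.

```python
def mma_4x4(A, B, C):
    """Compute D = A * B + C for 4x4 matrices and count the FMAs used."""
    n = 4
    D = [row[:] for row in C]                  # start from the accumulator C
    fmas = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                D[i][j] += A[i][k] * B[k][j]   # one fused multiply-add
                fmas += 1
    return D, fmas


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
B = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ones = [[1.0] * 4 for _ in range(4)]

D, fmas = mma_4x4(identity, B, ones)
print(fmas)      # 64
print(fmas * 2)  # 128
print(D[0])      # [1.0, 2.0, 3.0, 4.0]
```

Sixteen outputs with four multiply-adds each give 64 FMAs. Counting the multiply and the add separately gives 128 floating-point operations.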

The Modern Superchip: A System of Nodes

The latest frontier, exemplified by NVIDIA's GH200 Grace Hopper Superchip and the Blackwell B200, breaks the GPU out of its PCIe peripheral cage. The architecture is now a system-level design. The PCIe bottleneck is bypassed with a high-speed, cache-coherent interconnect like NVLink-C2C, which provides 900 GB/s of bandwidth between a Grace CPU and a Hopper GPU, creating a single shared memory domain.

In a data center, the GPU is no longer a device; it’s a node in a high-performance computing mesh. The Blackwell B200 is not a single chip but two massive compute dies connected by a 10 TB/s NV-HBI (High-Bandwidth Interface), appearing to software as a single, monolithic GPU. These are then networked together through NVLink switches, with fifth-generation NVLink providing 1.8 TB/s of bandwidth per GPU, into a non-blocking, in-rack compute fabric. The unit of compute is no longer the server; it's a liquid-cooled rack-scale system delivering on the order of an exaflop of low-precision AI compute.

The Revenge of the Control Flow

The rise of the GPU is a testament to the power of specializing for the common case. It doesn't mean the CPU is dead; it’s been re-conscripted as the sidecar, the "control plane" responsible for launching kernels, managing irregular file I/O, and executing the serial fragments of an algorithm that Amdahl's Law insists will always remain.
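Amdahl's Law puts a number on why those serial fragments matter. In the sketch below, the 95% parallel fraction and the unit counts are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Overall speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_units)


print(round(amdahl_speedup(0.95, 100), 2))      # 16.81
print(round(amdahl_speedup(0.95, 10_000), 2))   # 19.96
```

With 5% of the work left serial, the speedup can never exceed 20x regardless of how many parallel units are added. The CPU's single-thread speed therefore still sets the ceiling.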

The GPU conquered computing because its designers bet on the right bottleneck. CPUs assumed that transistor budgets should go to making a single thread run faster by any means necessary. GPUs assumed the future was not one big problem, but billions of tiny, similar ones. As we move from training massive AI models to running inference on energy-constrained edge devices, this architectural specialization will only accelerate. The GPU is a living blueprint for a future where our processors are less like Swiss Army knives and more like a collection of perfectly honed scalpels.