MIT 6.S894 - Accelerated Computing
Fall 2025 graduate course at MIT EECS taught by Jonathan Ragan-Kelley (creator of Halide). Performance engineering for specialized accelerators, primarily NVIDIA GPUs via CUDA.
- Course site: https://accelerated-computing.academy/fall25/
- Slides URL pattern:
https://accelerated-computing.academy/fall25/lectures/slides/Lecture<N>.pdf
The three stated goals of the class (Lec 1, slide 38):
- Better understand hardware
- Practice programming it for high performance
- Reasoning about performance from first principles
Lectures
1. Intro, the PDP-11 Problem
Framing lecture. The thesis: hardware has changed by orders of magnitude since the 1970s, but mainstream programming models still look like they target a PDP-11 (sequential scalar execution, uniform global heap, subroutines + pointers). To use modern accelerators you have to step outside that model.
The motivating workload
Training a 1B-parameter LLaMA on 30B tokens at 6 FLOPs/param/token (forward + backward) is roughly 1.8×10^20 FLOPs (~180 exaFLOPs) of work. On a small DGX node (~8 GPUs), that completes in about a day. See FLOPS for the arithmetic anchor.
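The arithmetic, spelled out. The 6 FLOPs/param/token rule of thumb is from the lecture; the ~312 TFLOP/s bf16 per-GPU peak (roughly an A100) is an assumption of this sketch, not a slide number.

```python
params = 1e9                 # 1B-parameter model
tokens = 30e9                # 30B training tokens
flops = 6 * params * tokens  # forward + backward
assert flops == 1.8e20       # ~180 exaFLOPs total

# assumed node: 8 GPUs x 312 TFLOP/s bf16 each (A100-class)
dgx_flops_per_s = 8 * 312e12
seconds = flops / dgx_flops_per_s
print(f"{flops:.2e} FLOPs, ~{seconds / 86400:.1f} days on one node")  # ~0.8 days
```

The same total divided by the Cray-1's 160 MFLOPS gives ~1.1×10^12 seconds, which is where the ~35,650-year figure below comes from.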
The Cray-1 vs Apple M2 contrast (slides 8-12)
| | Cray-1 (1975) | Apple M2 Air (2022) |
|---|---|---|
| Peak | 160 MFLOPS | ~20 TFLOPS (~125,000×) |
| Memory | 8 MB | 16 GB |
| Power | 115 kW | a few W |
| Price | ~$56M (today's dollars) | ~$1500 |
A Cray-1 running the LLaMA workload above takes ~35,650 years and uses ~1 billion times more energy than the M2.
The matrix multiply performance ladder
The spine of the lecture. All of these compute the same N=2048 matmul on the same M2 chip:
| Implementation | GFLOPS | % of peak |
|---|---|---|
| Python (interpreted) | 0.01 | 0.00005% |
| JavaScript (JIT) | 0.5 | 0.0025% |
| C, clang -O3 -march=native | 1 | 0.005% |
| C with i,k,j loop reorder | 20 | 0.1% |
| C hand-tuned + vectorized + cache-blocked, 1 core | 40 | 0.2% |
| Same + 4+4 cores parallel | 200 | 1% |
| Apple Accelerate (AMX matrix units, bf16) | 1500 | 7.5% |
| Metal compute shaders (GPU) | 3200 | 15% |
| Neural Engine (proprietary ISA, int16) | 16000 | 80% |
Six orders of magnitude on the same chip. The naive C code touches ~0.005% of what the silicon can do. The ladder explains why the rest of the course exists: vectorization, cache blocking, parallelism, matrix units, and GPU kernels each climb one rung. See Matrix Multiplication (Compute) for the detail behind the lower rungs (loop reorder, tiling) and GPU Optimization for why the upper rungs need a different mental model.
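The i,k,j reorder rung, in miniature. Same arithmetic, but the inner loop now walks rows of B and C contiguously (stride-1), which is what a cache line and a vector unit want. Python is far too slow to show the speedup itself; this sketch only shows the two orders compute the same product, term-for-term.

```python
def matmul_ijk(A, B):
    # naive order: for fixed (i, j), the inner loop strides down a column of B
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_ikj(A, B):
    # reordered: A[i][k] is held in a scalar; inner loop is stride-1 in B and C
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_ijk(A, B) == matmul_ikj(A, B) == [[19.0, 22.0], [43.0, 50.0]]
```

Each C[i][j] still accumulates its k-terms in ascending order, so the results are bitwise identical, not just close.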
The PDP-11 framing (slides 28-36)
Almost every primitive of modern programming was already in place in the PDP-11 era (mid-1970s): UNIX, C, virtual memory, optimizing compilers, ML with type inference, Smalltalk, networked GUIs. Hardware since then has gained: deep cache hierarchies, SIMD, multicore, heterogeneous accelerators, non-uniform memory. But the dominant programming abstraction (sequential scalar threads over a flat address space, calling subroutines through pointers) has not evolved with it. Ragan-Kelley’s slide meme: “Intel inside, why yes, I am still basically a PDP-11”. The course is about the abstractions and intuitions you need past that model.
2. Throughput Processors
Two data points from this lecture are worth keeping; the rest is the standard GPU-vs-CPU story (SM, SIMT, GPU Optimization).
Raptor Cove diminishing returns: Geekbench 6 shows ~12× core area buys only ~4× single-thread perf, ~3× off linear. The silicon is there; a single OoO core just cannot absorb it. This is the quantitative justification for "many small cores" over "one fat core".
Per-core size vs per-core speed, Ryzen 7 7700 vs RTX 4000 Ada:
| | Ryzen 7 7700 (Zen 4) | RTX 4000 Ada (AD104) |
|---|---|---|
| Cores × exec units | 8 × 16 = 128 | 192 × 32 = 6,144 |
| Peak FP32 | 0.97–1.4 TFLOPS | 26.7 TFLOPS |
| Die area / core | ~8.75 mm² | ~1.05 mm² |
- GPU core is ~8× smaller
- CPU core is ~6-7× faster on single-threaded Mandelbrot
- So the trade is real: you pay ~7× in per-core perf to buy ~8× in core count, and you recover it only if the workload exposes enough parallelism to feed all 192 of them
3. Memory I
Walks up the memory hierarchy and shows how much compute you can afford per byte loaded from each level. Then asks how DRAM gets its bandwidth in the first place.
The instructions/byte budget at each level
Approximate numbers for a mid-range discrete GPU, reading 32-bit operands:
| Level | Bandwidth | Budget |
|---|---|---|
| Host main memory (over PCIe 4.0) | 32 GB/s | ~420 instr/byte (1 load per 1700 ops) |
| GPU main memory (GDDR6) | 360 GB/s | ~37 instr/byte (1 load per 150 ops) |
| L2 cache | 2.5 TB/s | ~5 instr/byte (1 load per 20 ops) |
| L1 cache | 13.4 TB/s | 1 instr/byte (1 load per 4 ops) |
| Registers | ~50 TB/s | effectively free |
The GPU in this example runs 13.4 T instr/sec with 6144 EUs. “Budget” = peak-ops / peak-bytes-at-that-level: the break-even arithmetic intensity before bandwidth at that level becomes the bottleneck. Operand placement is the lever: one operand crossing PCIe is worth ~100x more math than the same operand crossing L1. This is the same idea as the compute-to-global-memory-access ratio, generalized to every level.
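The budget column, recomputed from the two peaks quoted in the text (13.4 T instr/s over each level's bandwidth; a 32-bit operand is 4 bytes):

```python
peak_instr_per_s = 13.4e12          # whole-GPU instruction rate from the text

levels = {                          # bytes/s, from the table above
    "PCIe 4.0 host memory": 32e9,
    "GDDR6 GPU memory":     360e9,
    "L2 cache":             2.5e12,
    "L1 cache":             13.4e12,
}
for name, bw in levels.items():
    instr_per_byte = peak_instr_per_s / bw      # break-even arithmetic intensity
    ops_per_load = instr_per_byte * 4           # one 32-bit operand = 4 bytes
    print(f"{name:22s} {instr_per_byte:7.1f} instr/byte, "
          f"1 load per {ops_per_load:,.0f} ops")
```

This reproduces the table: ~419 instr/byte (1 load per ~1,700 ops) over PCIe down to 1 instr/byte (1 load per 4 ops) at L1.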
How DRAM bandwidth scales (slides 10-17)
Three knobs, each with tradeoffs. Numbers are per-chip unless stated:
| Knob | DDR5-6400 | GDDR6 | HBM3 (per stack) |
|---|---|---|---|
| Bus width | 32 bits | 32 bits | 2048 bits |
| Speed | 6.4 GT/s | 18 GT/s | 3.2 GT/s |
| Per-chip/module BW | 25.6 GB/s | 72 GB/s | 819 GB/s |
- Faster signaling (DDR5 → GDDR6) costs shorter wires, fewer chips, less capacity
- Wider interface (more chips in parallel): 5 × GDDR6 = 160-bit, 360 GB/s; practical limit ~384–512 bits set by processor pins
- Package-level integration (HBM3): 4 stacks × 2048 bits = 8192-bit, 3.3 TB/s aggregate, 96 GB; only feasible because the memory sits on-package
See HBM for why HBM needs through-silicon vias.
Channels and striping
Address space is striped across N physical DRAM channels so N loads in parallel hit N different chips and get N× the single-channel bandwidth. A warp issuing 32 × 32-bit loads (1024 bits/cycle) is sized to saturate the channel width. This is why Memory Coalescing matters: a coalesced access maps onto the stripe, a scattered access serializes onto one channel.
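A toy model of the striping. The channel count (4) and stripe size (128 B) are made-up parameters for illustration, not the reference GPU's real geometry. Consecutive 128 B transactions round-robin across channels, so a coalesced stream keeps every channel busy, while a power-of-two stride can alias every access onto one channel and serialize.

```python
from collections import Counter

N_CHANNELS, STRIPE = 4, 128   # assumed geometry for the sketch

def channel_load(transaction_addrs):
    """Count how many DRAM transactions land on each channel."""
    return Counter((a // STRIPE) % N_CHANNELS for a in transaction_addrs)

# 8 warps, each one coalesced 128 B transaction at consecutive addresses
coalesced = [128 * w for w in range(8)]
# one scattered warp with a 1 KiB stride: 32 separate transactions
scattered = [1024 * lane for lane in range(32)]

print(channel_load(coalesced))   # 2 transactions per channel: runs in parallel
print(channel_load(scattered))   # all 32 on channel 0: fully serialized
```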
4. Memory II
Picks up from Lec 3. Three concrete mechanisms for turning the hardware bandwidth into usable throughput, and one trade that accelerators make but CPUs don’t.
Coalescing gather/scatter into dense transactions
A warp issues 32 × 32-bit loads (1024 bits) per cycle. If the 32 addresses are contiguous, that is one dense DRAM transaction; if they are scattered, it is 32 serial transactions. Two mechanisms close the gap:
- Memory-controller coalescing: controller detects adjacency in-flight and merges into one transaction. See Memory Coalescing
- Caching as implicit coalescing: a 128 B cache line is fetched once and then 32 sector-sized (32 B) reads hit cache. Block replacement cost is amortized over all accesses to the line while it’s resident
L1 SRAM: the scratchpad, not just a cache
Numbers for the reference GPU:
- 128 KB per SM × 48 SMs = 6 MB L1 SRAM total
- 128 bytes/cycle/SM, one warp-wide load/store every 4 cycles per core
- × 48 SMs × 2.18 GHz ≈ 13.4 TB/s aggregate (the L1 bandwidth from Lec 3)
Same physical SRAM serves two roles: implicit L1 cache (opt-in via PTX variants like ld.global.ca vs ld.global.cg, const __restrict__, __ldg(), textures) and explicit scratchpad (CUDA shared memory). Scratchpad is private to a block on one SM, which is why thread blocks are the grouping that can share through it.
The coherence trade
L1 is incoherent across SMs. Thread 0 on SM A writes a=1; thread 1 on SM B reads a and still sees a=0 until something flushes L2 or DRAM.
Why accelerators drop coherence
Conventional CPUs spend significant silicon on hardware coherence protocols (MESI etc., see Cache Coherency) so every core sees a consistent global memory. Accelerators explicitly forego coherence in exchange for performance and scalability, pushing the synchronization burden onto the programmer (`__syncthreads`, explicit fences, kernel boundaries). This is one of the sharpest "not a PDP-11" departures.
5. Memory III + Overlapping
Finishes the hierarchy by zooming into L1 SRAM and the register file, then pivots to the central question: how do you actually keep compute and memory both busy at the same time?
L1 SRAM is 32 banks + a crossbar
A single monolithic 128 KB SRAM cannot deliver 128 B/cycle. The trick: 32 parallel banks × 4 KB each, with 4-byte words striped across them (word at address a lives in bank (a / 4) mod 32). Contiguous access hits 32 different banks in parallel. Gather/scatter is handled by a 32×32 all-to-all crossbar between lanes and banks. A bank conflict is when two lanes target the same bank; the accesses serialize.
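The bank rule makes conflict-counting mechanical: a warp's access costs as many crossbar passes as the worst-case number of lanes hitting one bank. A small calculator under the 32-bank, 4-byte-word layout described above:

```python
from collections import Counter

N_BANKS, WORD = 32, 4

def conflict_degree(addresses):
    """Max lanes landing on one bank = serialized passes through the crossbar."""
    banks = Counter((a // WORD) % N_BANKS for a in addresses)
    return max(banks.values())

unit_stride = [4 * lane for lane in range(32)]    # a[lane]: conflict-free
stride_2    = [8 * lane for lane in range(32)]    # a[2*lane]: 2-way conflict
stride_32   = [128 * lane for lane in range(32)]  # column of a 32x32 float tile

print(conflict_degree(unit_stride))   # 1
print(conflict_degree(stride_2))      # 2
print(conflict_degree(stride_32))     # 32: fully serialized
```

The stride-32 case is why shared-memory tiles are commonly padded to 33 columns: the padding shifts each row's column onto a different bank.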
Register file is also banked, but per-lane
32 × 2 KB register file per SM, banked per-lane. Two restrictions fall out of this:
- No dynamic indexing: you can write `r3` but not `r[i]`. The register index must be an immediate in the instruction
- Cross-lane moves require an explicit `shfl` through the L1 crossbar
In exchange, the register file offers effectively infinite bandwidth for in-lane operands: >100 TB/s, 3 loads + 1 store per op.
The finished hierarchy:
| Level | BW | Budget | Role |
|---|---|---|---|
| DRAM | 360 GB/s | 1 load / 150 ops | high BW, limited capacity, wide interface |
| L2 | 2.5 TB/s | 1 load / 20 ops | aggregates transactions, streaming access |
| L1 SRAM | 13.4 TB/s | 1 load / 4 ops | shared per-SM, not coherent, banked, cache or scratchpad |
| Registers | >100 TB/s | 3 loads + 1 store / 1 op | per-lane, no dynamic indexing |
Overlapping compute and I/O (slides 9-24)
Every processor has three resources: Control, Compute, Memory. The goal is to saturate Compute and Memory simultaneously. Two strategies:
- CPU-style: wide issue, out-of-order, parallelism within one instruction stream. ILP hoists loads early, branch prediction and OoO reorder around stalls in the miss shadow. Zen 4 throws huge silicon at this
- GPU-style: single issue, in-order, RISC-ish per-core. Parallelism across instruction streams via multithreading. When one warp stalls on memory, the warp scheduler switches to another ready warp. See Latency Hiding
Matmul is the canonical case where neither alone is enough, because the loop structure alternates “load a bunch into scratchpad” and “compute a bunch from scratchpad”. GPUs make the overlap explicit through three patterns:
- ILP within a thread (hoist loads) and multithreading across warps (hide load latency by switching)
- Vectorized loads (`ld.v4.f32`) so bulk transfers do not burn an issue slot per element
- Warp specialization + double buffering: some warps in a block load `A,B → scratchpad[i+1]` while other warps compute from `scratchpad[i]`, then a `__syncthreads` + buffer swap. This is the pattern Hopper's async copies and Tensor Memory Accelerator generalize (previews Lec 8)
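A back-of-envelope timing model for the double-buffering pattern: loader warps fill `scratchpad[i+1]` while compute warps consume `scratchpad[i]`. The tile costs (load L, compute C cycles) are made-up numbers for the sketch; the point is the shape of the formula.

```python
def serial_time(n_tiles, L, C):
    # no overlap: every tile pays load + compute back to back
    return n_tiles * (L + C)

def double_buffered_time(n_tiles, L, C):
    # only the first load is exposed; after that the slower of the two
    # phases sets the pace and the other is fully hidden
    return L + n_tiles * max(L, C)

n, L, C = 64, 100, 120          # assumed tile costs, in cycles
print(serial_time(n, L, C))            # 14080
print(double_buffered_time(n, L, C))   # 7780: memory traffic almost free
```

When C ≥ L the loop runs at pure compute speed, which is the whole point of the pattern.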
6. Energy
Two-topic lecture. First half closes out the I/O overlap thread with streams. Second half is the energy argument for why accelerators keep adding specialized instructions.
Host↔GPU copies run on dedicated DMA
cudaMemcpy does not burn SMs; it runs on a dedicated DMA engine that can execute in parallel with kernels. Operations in one CUDA Stream have implicit ordering barriers between them; multiple streams let memcopy, kernel, and back-copy of different batches overlap. Each stream needs its own buffers. This is the same double-buffering pattern from Lec 5, now at the host-GPU boundary instead of the DRAM-scratchpad boundary.
Energy per operation (Bill Dally, 14nm)
| Operation | Energy |
|---|---|
| Load 32b from DRAM | 640 pJ |
| Load 32b from large SRAM | 50 pJ |
| Move 32b 10mm across chip | 32 pJ (3.2 pJ/word-mm) |
| Load 32b from local SRAM | 5 pJ |
| 64-bit FMA | 5 pJ |
| 32-bit FMA | 1.2 pJ |
| 16-bit IMUL | 0.26 pJ |
| 8-bit IADD | 0.01 pJ |
DRAM to local SRAM is ~128× (640 vs 5 pJ). More pointedly, data supply dwarfs the ALU op itself for primitive operations. Assuming a 1.2 pJ register-file access for a 32-bit operand:
| Op | Data-supply overhead |
|---|---|
| 64-bit FMA | 72% |
| 32-bit FMA | 84% |
| 16-bit FMA | 91% |
| 8-bit IMAC | 96% |
| 8-bit IADD | 99% |
The 8-bit op is 1% useful work, 99% shuffling operands. Reducing data-supply cost is the dominant efficiency lever.
Amortize data supply with complex instructions
The CISC-ish answer: do more ALU work per instruction-fetch + register-read. Primitive (RISC): Reg → ALU → Reg. Complex (“ASIC in an instruction”): Reg → ALU → ALU → ... → ALU → Reg. Examples: AES, video codec, DSP, texture filtering, and above all matrix multiply accumulate (MMA).
- x86 memory operands: `vfmadd231ps zmm2, zmm0, [rax + rcx*4 + 0x1234]`, an FMA fused with a full `base + index×scale + displacement` load
- Bulk memory ops: Hexagon DSP `l2fetch base, width, height, stride` does a 2D box prefetch in one instruction. H100 generalizes this into `cp.async` and `cp.async.bulk.tensor` (the Tensor Memory Accelerator, covered in Lec 8)
- Texture mapping: one instruction computes many addresses, loads, and blends
- MMA: NVIDIA PTX `mma.sync.aligned.m16n8k8` = 1 instruction, 10 register operands, 2048 FLOPs / 8 cycles. H100 HMMA (fp16): 128 ops with 19% control+data overhead; IMMA (int8): 1024 ops, 12% overhead. Vector peak 67 TF vs MMA peak 990 TF = vector path is <7% of peak
The unifying concept: arithmetic intensity
Matmul: 2N³ FLOPs over 3N² matrix entries, so arithmetic intensity grows as O(N) with problem size. This is exactly why matmul is accelerable. The slide's thesis: all "acceleratable" computations have high arithmetic intensity, and the hardware justification is the pJ table above. Low-intensity workloads are energy-bound by memory traffic before you get close to peak FLOPS. See Memory-Bound Program.
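The growth claim, concretely: an N×N×N matmul does 2N³ FLOPs (multiply + add) over 3N² entries touched once each, so intensity is N/6 FLOPs per 32-bit entry-byte and can be pushed past any level's budget by picking N.

```python
def matmul_intensity(n, bytes_per_entry=4):
    flops = 2 * n**3                          # one multiply + one add per term
    traffic = 3 * n**2 * bytes_per_entry      # read A and B, write C, once each
    return flops / traffic                    # = n / 6 for 4-byte entries

for n in (64, 512, 2048):
    print(n, round(matmul_intensity(n), 1), "FLOPs/byte")
```

At N = 2048 this is ~341 FLOPs/byte, comfortably past even the DRAM budget from Lec 3, assuming the blocking actually achieves touch-once traffic.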
7. Data-Parallel Primitives I
Guest lecture by Ahmed Mahmoud. Shifts the course from “how does the hardware work” to “what high-level operations have efficient GPU implementations, and how do you express your program in terms of them?” The thesis: pick a small set of primitives, give each a fast parallel implementation, then rewrite your algorithm as a composition of those primitives.
The running toy example is a bumpiness metric over a mesh: average each node’s neighbors, take the squared difference, sum the nonzero differences. That composes into stencil → map → compaction → reduce.
Five primitives:
- Map: `map(A, f)` applies unary `f` to every element. Embarrassingly parallel
- Stencil: each output element is a function of a small neighborhood of input. Two dual forms: gather (each output pulls from inputs, `p = a[I]`) and scatter (each input pushes to outputs). Scatter needs atomics for collisions; gather is easier but can be load-imbalanced. P2G (particles → grid) is the canonical bad case: variable particle-per-cell count
- Reduce: `a0 ⊕ a1 ⊕ ... ⊕ aₙ₋₁` for an associative + commutative ⊕. Tree reduction gives O(log N) step complexity, O(N) work. That is work-efficient, same ops as the serial version. Standard GPU optimizations apply: stage through shared memory, thread coarsening before the tree starts, minimize global memory traffic
- Stream compaction: `filter(S, P)` keeps elements where the predicate holds. The subtlety is output addressing: each kept element needs to know where to write. Answer: exclusive prefix-sum over the predicate tells each thread its output index. So compaction reduces to scan + scatter
- Scan (prefix sum): `[I, a₀, a₀⊕a₁, ..., a₀⊕...⊕aₙ₋₂]` (exclusive, with I the identity of ⊕) or the inclusive variant. The central primitive; compaction, sort, marching cubes, collision detection, tree building all reduce to it
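The compaction recipe from the list, as a sketch: an exclusive prefix-sum over the 0/1 predicate flags gives each kept element its output slot, so every write is independent (parallel-safe on a GPU; serial here).

```python
from itertools import accumulate

def compact(xs, pred):
    flags = [1 if pred(x) else 0 for x in xs]
    offsets = [0] + list(accumulate(flags))[:-1]   # exclusive scan of flags
    out = [None] * sum(flags)
    for x, f, o in zip(xs, flags, offsets):        # scatter step
        if f:
            out[o] = x                             # each slot written exactly once
    return out

print(compact([3, 0, 5, 0, 0, 8, 2], lambda x: x != 0))   # [3, 5, 8, 2]
```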
Two parallel scan algorithms
| | Kogge-Stone | Brent-Kung |
|---|---|---|
| Steps | O(log N) | O(log N) (2× constants: up + down sweep) |
| Work | O(N log N) | O(N) |
| Trade | step-efficient, not work-efficient | work-efficient, higher latency |
Kogge-Stone doubles the stride each iteration: at step s every active lane combines with the element 2^s back. Brent-Kung does an up-sweep (build partial sums in a binary tree) then a down-sweep (broadcast them back out). See Kogge-Stone.
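Kogge-Stone as described, simulated in lockstep (every "lane" reads, then every lane writes, mimicking a SIMD step): log2(N) steps, at the cost of extra work versus the serial scan.

```python
from itertools import accumulate

def kogge_stone_scan(xs, op=lambda a, b: a + b):
    xs = list(xs)
    stride = 1
    while stride < len(xs):
        # build the whole next generation before writing: lockstep semantics
        xs = [op(xs[i - stride], xs[i]) if i >= stride else xs[i]
              for i in range(len(xs))]
        stride *= 2          # combine with the element 2^s back, doubling each step
    return xs

data = [3, 1, 7, 0, 4, 1, 6, 3]
assert kogge_stone_scan(data) == list(accumulate(data))
print(kogge_stone_scan(data))   # [3, 4, 11, 11, 15, 16, 22, 25]
```

Swapping `op` for `max` gives the running-max scan used by the segmented quicksort in Lec 9.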
Scaling scan beyond one block
Three strategies, each a different point on the read/write vs latency frontier:
- Scan-and-add: local scan per block + scan of block sums + add back = 4n read/write
- Reduce-then-scan: reduce per block first, scan the reductions, then per-block scan using the offset = 3n read/write
- Decoupled look-back (Merrill & Garland 2013): single-pass, each block waits on the predecessor’s partial result via atomic/flag, avoids a second global pass
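The scan-and-add strategy from the list, one level down (block size 4 is an arbitrary stand-in for a CTA's tile): scan each block locally, scan the per-block sums, then add each block's offset back in.

```python
from itertools import accumulate

def scan_and_add(xs, block=4):
    blocks = [xs[i:i + block] for i in range(0, len(xs), block)]
    local = [list(accumulate(b)) for b in blocks]        # pass 1: per-block scan
    sums = [l[-1] for l in local]                        # each block's total
    offsets = [0] + list(accumulate(sums))[:-1]          # pass 2: scan of block sums
    # pass 3: add each block's offset back (the extra read/write the text counts)
    return [v + off for l, off in zip(local, offsets) for v in l]

data = [3, 1, 7, 0, 4, 1, 6, 3]
assert scan_and_add(data, block=4) == list(accumulate(data))
print(scan_and_add(data, block=4))
```

The three passes over the data are where the 4n read/write figure comes from; reduce-then-scan saves one of them, and decoupled look-back saves two.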
8. Evolving Ampere to Hopper
Guest lecture by Vijay Thakkar (NVIDIA Compute Architect, CUTLASS). The story of how NVIDIA strong-scaled the tensor core from A100 to H100 to Blackwell, and every ugly second-order problem that surfaced along the way. This lecture is the best single artifact for understanding why the Hopper programming model looks the way it does.
First principles (the constraints)
- Moore’s law slowing; Dennard scaling is dead, per-transistor energy improves very slowly now, so we are power-limited
- Data movement is sin: 100x–100,000x more expensive than compute (see Lec 6 energy table)
- Matmul has unbounded arithmetic intensity
- Amdahl caps embarrassingly-parallel scaling; strong scaling is the only path forward (2× faster on the same problem, not 2× bigger problem in the same time)
The core tension
A100 tensor core: 8 clocks per 16×8×16 instruction. Make it 4 clocks? Doubles VRF bandwidth, halves SMEM→RMEM load time, exposes more issue-latency hiding. Insanely hard at the wire/parasitic level. Can't just crank the clock either: power scales roughly with the cube of frequency (P ∝ fV², and voltage must rise with clock speed).
Answer: make the instruction bigger and async. H100 WGMMA spans all 4 SM sub-partitions, tile is 64×N×16 with N ∈ {8, 16, …, 256}, reads B directly from SMEM (no register load), and runs asynchronously like a copy engine. That is 8× more data per MMA than Ampere’s 16×8×16, at 2× the clock rate.
The cascade of secondary problems Hopper had to solve
Each fix creates the next bottleneck:
- Bigger MMAs exhaust RMEM → read B from SMEM directly; tensor core is async
- GMEM load latency didn’t improve, just got worse → grow SMEM (A100 164 KiB → H100 256 KiB, +56%), allow 6–7 pipeline stages by giving up 2 CTA/SM occupancy in favor of 1 CTA/SM with all 256 KiB
- LDGSTS burns too many vector registers, issue slots, and predicates → introduce TMA, a single-threaded affine-tensor load/store engine with Uniform Datapath + Uniform Registers. Programmer describes a multi-dim tile; TMA moves it and self-commits via tx-count
- Small-tile perf killed by loop overheads (TMA → MMA → TMA is serial issue) → warp specialization: 1 TMA warp only issues TMAs, 2 MMA warpgroups only issue MMAs. Needs new SMEM barriers (configurable arrive counts, tx-count for TMA, programmable in shared memory)
- Warp-specialized kernel has 384 threads/CTA but CUDA wants equal registers per warp → `round(512/3, 8) = 168` registers/thread, not enough for MMA accumulators. TMA warp uses ~0 registers, so its 168 are wasted → register reconfiguration: start all at 168, donate from the TMA warp to the MMA warps in the prologue
- L2 read bandwidth now the bottleneck for large problems. Can't widen the L2↔GPC crossbar. But many CTAs along A/B projections load the same data → DSMEM (distributed shared memory): CTAs in a cluster treat remote SMEM as local (PGAS model, local crossbar within the GPC). Load once to the GPC, broadcast to SMs. Barriers can `arrive` on a remote SM; `wait` stays local
- We lost 2 CTA/SM so prologue/epilogue no longer hide each other → warp-specialized persistent kernels: launch #CTAs = #SMs, software tile scheduling. Ping-pong: two consumer warpgroups alternate math/epilogue across tiles. Small K benefits more from ping-pong; large K prefers cooperative
- Cross-kernel overhead: 2–6 μs grid-launch kills chains of small kernels → CUDA Graphs and Programmatic Dependent Launch (PDL) which lets kernels launch before stream ordering, with programmer-inserted dep points (e.g., an inference kernel starts loading weights early, only waits on the activation dependency)
Workload codesign (lower precision)
Lower precision has a multiplicative perf/W gain from both fewer data-movement bytes and cheaper ops. 8-bit floats break convergence for most training recipes, so the model authors and the hardware architects co-design the training recipe to match.
Blackwell
Generalizes Hopper ping-pong via TMEM (Tensor Memory). Accumulators live in TMEM as a shared resource across threads, so separate warps can run MMA and epilogue on different work units concurrently. More overlap surface exposed.
The takeaway line (verbatim from the closing slide): “First principles are all you need to know. Everything else is all details. But the details are so so fun :D”
9. Data-Parallel Primitives II
Part 2 with Ahmed Mahmoud. Extends Lec 7 with three primitives that show up everywhere the input has variable shape, plus the punchline that reduction itself is a matmul.
Split (4.5)
Binary partition of input by a predicate. Scan again gives the output addresses: one exclusive scan over the predicate places the true side, one over its complement places the false side, then scatter.
Load-balance search
Expand/contract operations have unequal work per input ([A,B,C,D,E,F] expands by counts [2,1,0,0,7,3] to AABEEEEEEEFFF). Naive mapping of threads to inputs is badly load-imbalanced. Load-balance search assigns work-items to outputs instead, using scan-over-counts + binary search to recover which input spawned each output. ModernGPU’s central idiom.
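The expand example, via load-balance search: scan the counts, then each of the `total` output slots independently binary-searches the scan to find which input spawned it, so work is balanced over outputs regardless of how skewed the counts are.

```python
from bisect import bisect_right
from itertools import accumulate

def expand(items, counts):
    offsets = list(accumulate(counts))    # inclusive scan: [2, 3, 3, 3, 10, 13]
    total = offsets[-1]
    # each output slot k finds its source: first offset strictly greater than k
    return [items[bisect_right(offsets, k)] for k in range(total)]

print("".join(expand(list("ABCDEF"), [2, 1, 0, 0, 7, 3])))   # AABEEEEEEEFFF
```

Note E's 7 copies cost 7 independent lookups, not 7 sequential steps on one thread, which is the whole point versus the naive one-thread-per-input mapping.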
Segmented scan (5.5)
Scan on arbitrary contiguous partitions, tracked by head flags. Same complexity as scan.
| Input | 3 | 1 | 7 | 0 | 4 | 1 | 6 | 3 |
|---|---|---|---|---|---|---|---|---|
| Head flag | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Inclusive scan | 3 | 4 | 11 | 0 | 4 | 5 | 6 | 9 |
Enables variable-length operations in one pass: per-vertex edges, per-particle neighbors, per-document words. Blelloch’s 1990 paper shows a parallel quicksort on top of it: flag heads, broadcast pivot via max-scan, compare, segmented-split, recurse. The whole sort is just scans.
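The example above as a serial reference implementation (head flags restart the running sum; the parallel version carries (flag, value) pairs through an ordinary scan with a modified operator):

```python
def segmented_scan(xs, heads):
    out, run = [], 0
    for x, h in zip(xs, heads):
        run = x if h else run + x   # a head flag resets the accumulator
        out.append(run)
    return out

print(segmented_scan([3, 1, 7, 0, 4, 1, 6, 3],
                     [1, 0, 0, 1, 0, 0, 1, 0]))
# [3, 4, 11, 0, 4, 5, 6, 9]
```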
Histogram (6)
Partition → iterate → increment bin. Four GPU techniques, each addresses atomic contention:
- Atomics (baseline): one
atomicAddper input, serializes on hot bins - Privatization: each block has its own histogram in shared memory, merge at end
- Coarsening: each thread processes N items before syncing
- Aggregation: each thread keeps a private running count for the last seen bin, only commits on change
Drives top-k (rank by bin count) and SpMV row-bucketing.
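Privatization, simulated: each "block" accumulates into its own private histogram and the partials are merged once at the end, so a hot bin contends across a handful of merges instead of on every single input. The block count and data are arbitrary illustrations.

```python
from collections import Counter

def privatized_histogram(data, n_blocks=4):
    chunks = [data[i::n_blocks] for i in range(n_blocks)]  # one chunk per block
    privates = [Counter(chunk) for chunk in chunks]        # no cross-block atomics
    merged = Counter()
    for p in privates:                                     # only n_blocks merges
        merged.update(p)
    return merged

data = [0, 1, 1, 2, 1, 0, 2, 1, 1, 1]   # bin 1 is "hot"
assert privatized_histogram(data) == Counter(data)
print(privatized_histogram(data))
```

On a GPU the private copy lives in shared memory, so even the intra-block atomics are cheap.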
Merge (7)
Two sorted sequences into one. Serial merge is O(N). Naive parallel (each element binary-searches the other sequence) is O(N log N) work.
Merge Path (Odeh et al. IPDPS 2012): the merge path through the A[i] > B[j] matrix is the output sequence. Cross-diagonals partition the path into equal pieces; each processor does one constrained binary search, gets a balanced chunk. Hierarchical across thread/block/grid.
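A sketch of the partition search (following the ModernGPU formulation of Merge Path, with distinct elements assumed): for output diagonal d, binary-search the split point (i, j) with i + j = d, then each worker merges its equal-sized chunk independently.

```python
def merge_path_split(a, b, d):
    """How many elements of a (and of b) precede output position d."""
    lo, hi = max(0, d - len(b)), min(d, len(a))
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < b[d - 1 - mid]:   # path must consume more of a
            lo = mid + 1
        else:
            hi = mid
    return lo, d - lo

def parallel_merge(a, b, workers=4):
    total = len(a) + len(b)
    cuts = [merge_path_split(a, b, (w * total) // workers) for w in range(workers)]
    cuts.append((len(a), len(b)))
    out = []
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):   # each chunk is independent
        out.extend(sorted(a[i0:i1] + b[j0:j1]))      # stand-in for a serial merge
    return out

a, b = [1, 3, 5, 7, 9, 11], [2, 4, 6, 8, 10, 12]
assert parallel_merge(a, b) == sorted(a + b)
print(parallel_merge(a, b))
```

Each worker's chunk is exactly `total / workers` outputs, which is the load balance the paper is after.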
Reduction via tensor cores
Dakkak et al. ICS'19: reduction = matmul with a column of ones: A · 1 = (Σⱼ aᵢⱼ)ᵢ, i.e., each output entry is a row sum. So the same hardware path that gives 990 TF for MMA (Lec 8) also gives you a free reduction if your element count fits a tile. Also expressible as a fused multiply-add.
Library homes for all of this: NVIDIA Thrust and CUB.
10. Beyond NVIDIA + CUDA
Final lecture. Consolidates the course’s mental model into six design levers, then surveys the accelerator landscape to show every vendor is pulling the same levers, just with different weights.
The six levers of a throughput processor
Parallelism and control:
- Explicit parallelism over implicit ILP: multicore, wide-issue SIMD
- Amortize control overhead: SIMD, VLIW (instead of burning silicon on speculation)
- Explicit concurrency (multithreading) to overlap memory and compute
Arithmetic intensity and memory model:
- Complex instructions (MMA) amortize operand fetch across many ops
- Wide memory interfaces: GDDR, HBM, highly-banked SRAM
- Software-managed memory hierarchy: explicit DMA, scratchpad vs cache
Every one of Lectures 2–8 is one of these levers. The accelerators below rebalance them.
Programming non-NVIDIA GPUs
AMD RDNA/CDNA, Intel Arc/DC GPU, Qualcomm Adreno, Arm Mali. GPUs span 100s of mW (mobile) to 10s of kW (DC). Stacks:
- Compute APIs: OpenCL, DirectCompute, Vulkan Compute
- Graphics APIs with compute shaders: Metal
- Higher-level kernel DSLs: Triton (Python-embedded, NumPy-like within a block, SPMD grids+blocks across them), cuTile, CuTe DSL. Triton makes you give up SASS-level peak in exchange for far less effort; the perf/effort curve goes SASS > CUDA > Triton > PyTorch
Google TPU (contrast class)
Same levers, very different weights:
- Few big cores vs GPU’s many small cores
- MXU (Matrix Multiply Unit) as a 128×128 systolic array, even more extreme “complex instruction” than Tensor Core
- 128×8 vector unit, scalar unit, transpose/permute
- HBM with DMA-only access, no implicit loads at all, every byte moved is explicit
- VLIW + static scheduling, no dynamic warp scheduling, compiler or framework handles all overlap
- Scales to 4 chips/board and 9216 chips/pod
- “Built to run ML models, not kernels”. The primary interface is XLA-compiled JAX or TensorFlow; Pallas (Triton-like JAX DSL) is the recent escape hatch for writing custom kernels
The proliferation
- TPU-likes: Amazon Trainium, Intel Gaudi
- NPUs: small embedded TPUs. Apple Neural Engine: 38 TOPS, 16 cores, ~1K MACs/core (32×32), fixed-point only (no training), CoreML-only interface
- DSPs, IPUs, FPGAs, CGRAs (“RDAs”), SmartNICs (“DPUs”)
The closing line summarizes the whole course: change the programming model and specialize the hardware to unlock architectural efficiency. You cannot get the last three orders of magnitude (from Lec 1’s matmul ladder) without abandoning the PDP-11 programming model.