MIT 6.S894 - Accelerated Computing
Fall 2025 graduate course at MIT EECS taught by Jonathan Ragan-Kelley (creator of Halide). Performance engineering for specialized accelerators, primarily NVIDIA GPUs via CUDA.
- Course site: https://accelerated-computing.academy/fall25/
- Slides URL pattern:
https://accelerated-computing.academy/fall25/lectures/slides/Lecture<N>.pdf
The three stated goals of the class (Lec 1, slide 38):
- Better understand hardware
- Practice programming it for high performance
- Reasoning about performance from first principles
Lectures
1. Intro, the PDP-11 Problem
Framing lecture. The thesis: hardware has changed by orders of magnitude since the 1970s, but mainstream programming models still look like they target a PDP-11 (sequential scalar execution, uniform global heap, subroutines + pointers). To use modern accelerators you have to step outside that model.
The motivating workload
Training a 1B-parameter LLaMA on 30B tokens at 6 FLOPs/param/token (forward + backward) is roughly 1.8×10^20 FLOPs (~180 exaFLOPs) of work. On a small DGX node (~8 GPUs), that completes in about a day. See FLOPS for the arithmetic anchor.
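The arithmetic, spelled out. The 6 FLOPs/param/token rule of thumb is from the lecture; the ~312 TFLOP/s bf16 per-GPU peak (roughly an A100) is an assumption of this sketch, not a slide number.

```python
params = 1e9                 # 1B-parameter model
tokens = 30e9                # 30B training tokens
flops = 6 * params * tokens  # forward + backward
assert flops == 1.8e20       # ~180 exaFLOPs total

# assumed node: 8 GPUs x 312 TFLOP/s bf16 each (A100-class)
dgx_flops_per_s = 8 * 312e12
seconds = flops / dgx_flops_per_s
print(f"{flops:.2e} FLOPs, ~{seconds / 86400:.1f} days on one node")  # ~0.8 days
```

The same total divided by the Cray-1's 160 MFLOPS gives ~1.1×10^12 seconds, which is where the ~35,650-year figure below comes from.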
The Cray-1 vs Apple M2 contrast (slides 8-12)
| | Cray-1 (1975) | Apple M2 Air (2022) |
|---|---|---|
| Peak | 160 MFLOPS | ~20 TFLOPS (~125,000×) |
| Memory | 8 MB | 16 GB |
| Power | 115 kW | a few W |
| Price | ~$56M (today's dollars) | ~$1500 |
A Cray-1 running the LLaMA workload above takes ~35,650 years and uses ~1 billion times more energy than the M2.
The matrix multiply performance ladder
The spine of the lecture. All of these compute the same N=2048 matmul on the same M2 chip:
| Implementation | GFLOPS | % of peak |
|---|---|---|
| Python (interpreted) | 0.01 | 0.00005% |
| JavaScript (JIT) | 0.5 | 0.0025% |
| C, clang -O3 -march=native | 1 | 0.005% |
| C with i,k,j loop reorder | 20 | 0.1% |
| C hand-tuned + vectorized + cache-blocked, 1 core | 40 | 0.2% |
| Same + 4+4 cores parallel | 200 | 1% |
| Apple Accelerate (AMX matrix units, bf16) | 1500 | 7.5% |
| Metal compute shaders (GPU) | 3200 | 15% |
| Neural Engine (proprietary ISA, int16) | 16000 | 80% |
Six orders of magnitude on the same chip. The naive C code touches ~0.005% of what the silicon can do. The ladder explains why the rest of the course exists: vectorization, cache blocking, parallelism, matrix units, and GPU kernels each climb one rung. See Matrix Multiplication (Compute) for the detail behind the lower rungs (loop reorder, tiling) and GPU Optimization for why the upper rungs need a different mental model.
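The i,k,j reorder rung, in miniature. Same arithmetic, but the inner loop now walks rows of B and C contiguously (stride-1), which is what a cache line and a vector unit want. Python is far too slow to show the speedup itself; this sketch only shows the two orders compute the same product, term-for-term.

```python
def matmul_ijk(A, B):
    # naive order: for fixed (i, j), the inner loop strides down a column of B
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

def matmul_ikj(A, B):
    # reordered: A[i][k] is held in a scalar; inner loop is stride-1 in B and C
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_ijk(A, B) == matmul_ikj(A, B) == [[19.0, 22.0], [43.0, 50.0]]
```

Each C[i][j] still accumulates its k-terms in ascending order, so the results are bitwise identical, not just close.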
The PDP-11 framing (slides 28-36)
Almost every primitive of modern programming was already in place in the PDP-11 era (mid-1970s): UNIX, C, virtual memory, optimizing compilers, ML with type inference, Smalltalk, networked GUIs. Hardware since then has gained: deep cache hierarchies, SIMD, multicore, heterogeneous accelerators, non-uniform memory. But the dominant programming abstraction (sequential scalar threads over a flat address space, calling subroutines through pointers) has not evolved with it. Ragan-Kelley’s slide meme: “Intel inside, why yes, I am still basically a PDP-11”. The course is about the abstractions and intuitions you need past that model.
2. Throughput Processors
Two data points from this lecture are worth keeping; the rest is the standard GPU-vs-CPU story (SM, SIMT, GPU Optimization).
Raptor Cove diminishing returns: Geekbench 6 shows ~12× core area buys only ~4× single-thread perf, ~3× off linear. The silicon is there; a single OoO core just cannot absorb it. This is the quantitative justification for "many small cores" over "one fat core".
Per-core size vs per-core speed, Ryzen 7 7700 vs RTX 4000 Ada:
| | Ryzen 7 7700 (Zen 4) | RTX 4000 Ada (AD104) |
|---|---|---|
| Cores × exec units | 8 × 16 = 128 | 192 × 32 = 6,144 |
| Peak FP32 | 0.97–1.4 TFLOPS | 26.7 TFLOPS |
| Die area / core | ~8.75 mm² | ~1.05 mm² |
- GPU core is ~8× smaller
- CPU core is ~6-7× faster on single-threaded Mandelbrot
- So the trade is real: you pay ~7× in per-core perf to buy ~8× in core count, and you recover it only if the workload exposes enough parallelism to feed all 192 of them
3. Memory I
Walks up the memory hierarchy and shows how much compute you can afford per byte loaded from each level. Then asks how DRAM gets its bandwidth in the first place.
The instructions/byte budget at each level
Approximate numbers for a mid-range discrete GPU, reading 32-bit operands:
| Level | Bandwidth | Budget |
|---|---|---|
| Host main memory (over PCIe 4.0) | 32 GB/s | ~420 instr/byte (1 load per 1700 ops) |
| GPU main memory (GDDR6) | 360 GB/s | ~37 instr/byte (1 load per 150 ops) |
| L2 cache | 2.5 TB/s | ~5 instr/byte (1 load per 20 ops) |
| L1 cache | 13.4 TB/s | 1 instr/byte (1 load per 4 ops) |
| Registers | ~50 TB/s | effectively free |
The GPU in this example runs 13.4 T instr/sec with 6144 EUs. “Budget” = peak-ops / peak-bytes-at-that-level: the break-even arithmetic intensity before bandwidth at that level becomes the bottleneck. Operand placement is the lever: one operand crossing PCIe is worth ~100x more math than the same operand crossing L1. This is the same idea as the compute-to-global-memory-access ratio, generalized to every level.
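The budget column, recomputed from the two peaks quoted in the text (13.4 T instr/s over each level's bandwidth; a 32-bit operand is 4 bytes):

```python
peak_instr_per_s = 13.4e12          # whole-GPU instruction rate from the text

levels = {                          # bytes/s, from the table above
    "PCIe 4.0 host memory": 32e9,
    "GDDR6 GPU memory":     360e9,
    "L2 cache":             2.5e12,
    "L1 cache":             13.4e12,
}
for name, bw in levels.items():
    instr_per_byte = peak_instr_per_s / bw      # break-even arithmetic intensity
    ops_per_load = instr_per_byte * 4           # one 32-bit operand = 4 bytes
    print(f"{name:22s} {instr_per_byte:7.1f} instr/byte, "
          f"1 load per {ops_per_load:,.0f} ops")
```

This reproduces the table: ~419 instr/byte (1 load per ~1,700 ops) over PCIe down to 1 instr/byte (1 load per 4 ops) at L1.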
How DRAM bandwidth scales (slides 10-17)
Three knobs, each with tradeoffs. Numbers are per-chip unless stated:
| Knob | DDR5-6400 | GDDR6 | HBM3 (per stack) |
|---|---|---|---|
| Bus width | 32 bits | 32 bits | 2048 bits |
| Speed | 6.4 GT/s | 18 GT/s | 3.2 GT/s |
| Per-chip/module BW | 25.6 GB/s | 72 GB/s | 819 GB/s |
- Faster signaling (DDR5 → GDDR6) costs shorter wires, fewer chips, less capacity
- Wider interface (more chips in parallel): 5 × GDDR6 = 160-bit, 360 GB/s; practical limit ~384–512 bits set by processor pins
- Package-level integration (HBM3): 4 stacks × 2048 bits = 8192-bit, 3.3 TB/s aggregate, 96 GB; only feasible because the memory sits on-package
See HBM for why HBM needs through-silicon vias.
Channels and striping
Address space is striped across N physical DRAM channels so N loads in parallel hit N different chips and get N× the single-channel bandwidth. A warp issuing 32 × 32-bit loads (1024 bits/cycle) is sized to saturate the channel width. This is why Memory Coalescing matters: a coalesced access maps onto the stripe, a scattered access serializes onto one channel.
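A toy model of the striping. The channel count (4) and stripe size (128 B) are made-up parameters for illustration, not the reference GPU's real geometry. Consecutive 128 B transactions round-robin across channels, so a coalesced stream keeps every channel busy, while a power-of-two stride can alias every access onto one channel and serialize.

```python
from collections import Counter

N_CHANNELS, STRIPE = 4, 128   # assumed geometry for the sketch

def channel_load(transaction_addrs):
    """Count how many DRAM transactions land on each channel."""
    return Counter((a // STRIPE) % N_CHANNELS for a in transaction_addrs)

# 8 warps, each one coalesced 128 B transaction at consecutive addresses
coalesced = [128 * w for w in range(8)]
# one scattered warp with a 1 KiB stride: 32 separate transactions
scattered = [1024 * lane for lane in range(32)]

print(channel_load(coalesced))   # 2 transactions per channel: runs in parallel
print(channel_load(scattered))   # all 32 on channel 0: fully serialized
```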
4. Memory II
Picks up from Lec 3. Three concrete mechanisms for turning the hardware bandwidth into usable throughput, and one trade that accelerators make but CPUs don’t.
Coalescing gather/scatter into dense transactions
A warp issues 32 × 32-bit loads (1024 bits) per cycle. If the 32 addresses are contiguous, that is one dense DRAM transaction; if they are scattered, it is 32 serial transactions. Two mechanisms close the gap:
- Memory-controller coalescing: controller detects adjacency in-flight and merges into one transaction. See Memory Coalescing
- Caching as implicit coalescing: a 128 B cache line is fetched once and then 32 sector-sized (32 B) reads hit cache. Block replacement cost is amortized over all accesses to the line while it’s resident
L1 SRAM: the scratchpad, not just a cache
Numbers for the reference GPU:
- 128 KB per SM × 48 SMs = 6 MB L1 SRAM total
- 128 bytes/cycle/SM, one warp-wide load/store every 4 cycles per core
- × 48 SMs × 2.18 GHz ≈ 13.4 TB/s aggregate (the L1 bandwidth from Lec 3)
Same physical SRAM serves two roles: implicit L1 cache (opt-in via PTX variants like ld.global.ca vs ld.global.cg, const __restrict__, __ldg(), textures) and explicit scratchpad (CUDA shared memory). Scratchpad is private to a block on one SM, which is why thread blocks are the grouping that can share through it.
The coherence trade
L1 is incoherent across SMs. Thread 0 on SM A writes a=1; thread 1 on SM B reads a and still sees a=0 until something flushes L2 or DRAM.
Why accelerators drop coherence
Conventional CPUs spend significant silicon on hardware coherence protocols (MESI etc., see Cache Coherency) so every core sees a consistent global memory. Accelerators explicitly forego coherence in exchange for performance and scalability, pushing the synchronization burden onto the programmer (`__syncthreads`, explicit fences, kernel boundaries). This is one of the sharpest "not a PDP-11" departures.
5. Memory III + Overlapping
Finishes the hierarchy by zooming into L1 SRAM and the register file, then pivots to the central question: how do you actually keep compute and memory both busy at the same time?
L1 SRAM is 32 banks + a crossbar
A single monolithic 128 KB SRAM cannot deliver 128 B/cycle. The trick: 32 parallel banks × 4 KB each, with 4-byte words striped across them (word at address a lives in bank (a / 4) mod 32). Contiguous access hits 32 different banks in parallel. Gather/scatter is handled by a 32×32 all-to-all crossbar between lanes and banks. A bank conflict is when two lanes target the same bank; the accesses serialize.
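The bank rule makes conflict-counting mechanical: a warp's access costs as many crossbar passes as the worst-case number of lanes hitting one bank. A small calculator under the 32-bank, 4-byte-word layout described above:

```python
from collections import Counter

N_BANKS, WORD = 32, 4

def conflict_degree(addresses):
    """Max lanes landing on one bank = serialized passes through the crossbar."""
    banks = Counter((a // WORD) % N_BANKS for a in addresses)
    return max(banks.values())

unit_stride = [4 * lane for lane in range(32)]    # a[lane]: conflict-free
stride_2    = [8 * lane for lane in range(32)]    # a[2*lane]: 2-way conflict
stride_32   = [128 * lane for lane in range(32)]  # column of a 32x32 float tile

print(conflict_degree(unit_stride))   # 1
print(conflict_degree(stride_2))      # 2
print(conflict_degree(stride_32))     # 32: fully serialized
```

The stride-32 case is why shared-memory tiles are commonly padded to 33 columns: the padding shifts each row's column onto a different bank.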
Register file is also banked, but per-lane
32 × 2 KB register file per SM, banked per-lane. Two restrictions fall out of this:
- No dynamic indexing: you can write `r3` but not `r[i]`. The register index must be an immediate in the instruction
- Cross-lane moves require an explicit `shfl` through the L1 crossbar
In exchange, the register file offers effectively infinite bandwidth for in-lane operands: >100 TB/s, 3 loads + 1 store per op.
The finished hierarchy:
| Level | BW | Budget | Role |
|---|---|---|---|
| DRAM | 360 GB/s | 1 load / 150 ops | high BW, limited capacity, wide interface |
| L2 | 2.5 TB/s | 1 load / 20 ops | aggregates transactions, streaming access |
| L1 SRAM | 13.4 TB/s | 1 load / 4 ops | shared per-SM, not coherent, banked, cache or scratchpad |
| Registers | >100 TB/s | 3 loads + 1 store / 1 op | per-lane, no dynamic indexing |
Overlapping compute and I/O (slides 9-24)
Every processor has three resources: Control, Compute, Memory. The goal is to saturate Compute and Memory simultaneously. Two strategies:
- CPU-style: wide issue, out-of-order, parallelism within one instruction stream. ILP hoists loads early, branch prediction and OoO reorder around stalls in the miss shadow. Zen 4 throws huge silicon at this
- GPU-style: single issue, in-order, RISC-ish per-core. Parallelism across instruction streams via multithreading. When one warp stalls on memory, the warp scheduler switches to another ready warp. See Latency Hiding
Matmul is the canonical case where neither alone is enough, because the loop structure alternates “load a bunch into scratchpad” and “compute a bunch from scratchpad”. GPUs make the overlap explicit through three patterns:
- ILP within a thread (hoist loads) and multithreading across warps (hide load latency by switching)
- Vectorized loads (`ld.v4.f32`) so bulk transfers do not burn an issue slot per element
- Warp specialization + double buffering: some warps in a block load `A,B → scratchpad[i+1]` while other warps compute from `scratchpad[i]`, then a `__syncthreads` + buffer swap. This is the pattern Hopper's async copies and Tensor Memory Accelerator generalize (previews Lec 8)
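A back-of-envelope timing model for the double-buffering pattern: loader warps fill `scratchpad[i+1]` while compute warps consume `scratchpad[i]`. The tile costs (load L, compute C cycles) are made-up numbers for the sketch; the point is the shape of the formula.

```python
def serial_time(n_tiles, L, C):
    # no overlap: every tile pays load + compute back to back
    return n_tiles * (L + C)

def double_buffered_time(n_tiles, L, C):
    # only the first load is exposed; after that the slower of the two
    # phases sets the pace and the other is fully hidden
    return L + n_tiles * max(L, C)

n, L, C = 64, 100, 120          # assumed tile costs, in cycles
print(serial_time(n, L, C))            # 14080
print(double_buffered_time(n, L, C))   # 7780: memory traffic almost free
```

When C ≥ L the loop runs at pure compute speed, which is the whole point of the pattern.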
6. Energy
Two-topic lecture. First half closes out the I/O overlap thread with streams. Second half is the energy argument for why accelerators keep adding specialized instructions.
Host↔GPU copies run on dedicated DMA
cudaMemcpy does not burn SMs; it runs on a dedicated DMA engine that can execute in parallel with kernels. Operations in one CUDA Stream have implicit ordering barriers between them; multiple streams let memcopy, kernel, and back-copy of different batches overlap. Each stream needs its own buffers. This is the same double-buffering pattern from Lec 5, now at the host-GPU boundary instead of the DRAM-scratchpad boundary.
Energy per operation (Bill Dally, 14nm)
| Operation | Energy |
|---|---|
| Load 32b from DRAM | 640 pJ |
| Load 32b from large SRAM | 50 pJ |
| Move 32b 10mm across chip | 32 pJ (3.2 pJ/word-mm) |
| Load 32b from local SRAM | 5 pJ |
| 64-bit FMA | 5 pJ |
| 32-bit FMA | 1.2 pJ |
| 16-bit IMUL | 0.26 pJ |
| 8-bit IADD | 0.01 pJ |
DRAM to local SRAM is ~128× (640 vs 5 pJ). More pointedly, data supply dwarfs the ALU op itself for primitive operations. Assuming a 1.2 pJ register-file access for a 32-bit operand:
| Op | Data-supply overhead |
|---|---|
| 64-bit FMA | 72% |
| 32-bit FMA | 84% |
| 16-bit FMA | 91% |
| 8-bit IMAC | 96% |
| 8-bit IADD | 99% |
The 8-bit op is 1% useful work, 99% shuffling operands. Reducing data-supply cost is the dominant efficiency lever.
Amortize data supply with complex instructions
The CISC-ish answer: do more ALU work per instruction-fetch + register-read. Primitive (RISC): Reg → ALU → Reg. Complex (“ASIC in an instruction”): Reg → ALU → ALU → ... → ALU → Reg. Examples: AES, video codec, DSP, texture filtering, and above all matrix multiply accumulate (MMA).
- x86 memory operands: `vfmadd231ps zmm2, zmm0, [rax + rcx*4 + 0x1234]`, an FMA fused with a full `base + index×scale + displacement` load
- Bulk memory ops: Hexagon DSP `l2fetch base, width, height, stride` does a 2D box prefetch in one instruction. H100 generalizes this into `cp.async` and `cp.async.bulk.tensor` (the Tensor Memory Accelerator, covered in Lec 8)
- Texture mapping: one instruction computes many addresses, loads, and blends
- MMA: NVIDIA PTX `mma.sync.aligned.m16n8k8` = 1 instruction, 10 register operands, 2048 FLOPs / 8 cycles. H100 HMMA (fp16): 128 ops with 19% control+data overhead; IMMA (int8): 1024 ops, 12% overhead. Vector peak 67 TF vs MMA peak 990 TF = vector path is <7% of peak
The unifying concept: arithmetic intensity
Matmul: 2N³ FLOPs over 3N² matrix entries, so arithmetic intensity grows as O(N) with problem size. This is exactly why matmul is accelerable. The slide's thesis: all "acceleratable" computations have high arithmetic intensity, and the hardware justification is the pJ table above. Low-intensity workloads are energy-bound by memory traffic before you get close to peak FLOPS. See Memory-Bound Program.
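The growth claim, concretely: an N×N×N matmul does 2N³ FLOPs (multiply + add) over 3N² entries touched once each, so intensity is N/6 FLOPs per 32-bit entry-byte and can be pushed past any level's budget by picking N.

```python
def matmul_intensity(n, bytes_per_entry=4):
    flops = 2 * n**3                          # one multiply + one add per term
    traffic = 3 * n**2 * bytes_per_entry      # read A and B, write C, once each
    return flops / traffic                    # = n / 6 for 4-byte entries

for n in (64, 512, 2048):
    print(n, round(matmul_intensity(n), 1), "FLOPs/byte")
```

At N = 2048 this is ~341 FLOPs/byte, comfortably past even the DRAM budget from Lec 3, assuming the blocking actually achieves touch-once traffic.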
7. Data-Parallel Primitives I
Guest lecture by Ahmed Mahmoud. Shifts the course from “how does the hardware work” to “what high-level operations have efficient GPU implementations, and how do you express your program in terms of them?” The thesis: pick a small set of primitives, give each a fast parallel implementation, then rewrite your algorithm as a composition of those primitives.
The running toy example is a bumpiness metric over a mesh: average each node’s neighbors, take the squared difference, sum the nonzero differences. That composes into stencil → map → compaction → reduce.
Five primitives:
- Map: `map(A, f)` applies unary `f` to every element. Embarrassingly parallel
- Stencil: each output element is a function of a small neighborhood of input. Two dual forms: gather (each output pulls from inputs, `p = a[I]`) and scatter (each input pushes to outputs). Scatter needs atomics for collisions; gather is easier but can be load-imbalanced. P2G (particles → grid) is the canonical bad case: variable particle-per-cell count
- Reduce: `a0 ⊕ a1 ⊕ ... ⊕ aₙ₋₁` for an associative + commutative ⊕. Tree reduction gives O(log N) step complexity, O(N) work. That is work-efficient, same ops as the serial version. Standard GPU optimizations apply: stage through shared memory, thread coarsening before the tree starts, minimize global memory traffic
- Stream compaction: `filter(S, P)` keeps elements where the predicate holds. The subtlety is output addressing: each kept element needs to know where to write. Answer: exclusive prefix-sum over the predicate tells each thread its output index. So compaction reduces to scan + scatter
- Scan (prefix sum): `[I, a₀, a₀⊕a₁, ..., a₀⊕...⊕aₙ₋₂]` (exclusive, with I the identity of ⊕) or the inclusive variant. The central primitive; compaction, sort, marching cubes, collision detection, tree building all reduce to it
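The compaction recipe from the list, as a sketch: an exclusive prefix-sum over the 0/1 predicate flags gives each kept element its output slot, so every write is independent (parallel-safe on a GPU; serial here).

```python
from itertools import accumulate

def compact(xs, pred):
    flags = [1 if pred(x) else 0 for x in xs]
    offsets = [0] + list(accumulate(flags))[:-1]   # exclusive scan of flags
    out = [None] * sum(flags)
    for x, f, o in zip(xs, flags, offsets):        # scatter step
        if f:
            out[o] = x                             # each slot written exactly once
    return out

print(compact([3, 0, 5, 0, 0, 8, 2], lambda x: x != 0))   # [3, 5, 8, 2]
```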
Two parallel scan algorithms
| | Kogge-Stone | Brent-Kung |
|---|---|---|
| Steps | O(log N) | O(log N) (2× constants: up + down sweep) |
| Work | O(N log N) | O(N) |
| Trade | step-efficient, not work-efficient | work-efficient, higher latency |
Kogge-Stone doubles the stride each iteration: at step s every active lane combines with the element 2^s back. Brent-Kung does an up-sweep (build partial sums in a binary tree) then a down-sweep (broadcast them back out). See Kogge-Stone.
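Kogge-Stone as described, simulated in lockstep (every "lane" reads, then every lane writes, mimicking a SIMD step): log2(N) steps, at the cost of extra work versus the serial scan.

```python
from itertools import accumulate

def kogge_stone_scan(xs, op=lambda a, b: a + b):
    xs = list(xs)
    stride = 1
    while stride < len(xs):
        # build the whole next generation before writing: lockstep semantics
        xs = [op(xs[i - stride], xs[i]) if i >= stride else xs[i]
              for i in range(len(xs))]
        stride *= 2          # combine with the element 2^s back, doubling each step
    return xs

data = [3, 1, 7, 0, 4, 1, 6, 3]
assert kogge_stone_scan(data) == list(accumulate(data))
print(kogge_stone_scan(data))   # [3, 4, 11, 11, 15, 16, 22, 25]
```

Swapping `op` for `max` gives the running-max scan used by the segmented quicksort in Lec 9.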
Scaling scan beyond one block
Three strategies, each a different point on the read/write vs latency frontier:
- Scan-and-add: local scan per block + scan of block sums + add back = 4n read/write
- Reduce-then-scan: reduce per block first, scan the reductions, then per-block scan using the offset = 3n read/write
- Decoupled look-back (Merrill & Garland 2013): single-pass, each block waits on the predecessor’s partial result via atomic/flag, avoids a second global pass
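The scan-and-add strategy from the list, one level down (block size 4 is an arbitrary stand-in for a CTA's tile): scan each block locally, scan the per-block sums, then add each block's offset back in.

```python
from itertools import accumulate

def scan_and_add(xs, block=4):
    blocks = [xs[i:i + block] for i in range(0, len(xs), block)]
    local = [list(accumulate(b)) for b in blocks]        # pass 1: per-block scan
    sums = [l[-1] for l in local]                        # each block's total
    offsets = [0] + list(accumulate(sums))[:-1]          # pass 2: scan of block sums
    # pass 3: add each block's offset back (the extra read/write the text counts)
    return [v + off for l, off in zip(local, offsets) for v in l]

data = [3, 1, 7, 0, 4, 1, 6, 3]
assert scan_and_add(data, block=4) == list(accumulate(data))
print(scan_and_add(data, block=4))
```

The three passes over the data are where the 4n read/write figure comes from; reduce-then-scan saves one of them, and decoupled look-back saves two.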
8. Evolving Ampere to Hopper
Guest lecture by Vijay Thakkar (NVIDIA Compute Architect, CUTLASS). The story of how NVIDIA strong-scaled the tensor core from A100 to H100 to Blackwell, and every ugly second-order problem that surfaced along the way. This lecture is the best single artifact for understanding why the Hopper programming model looks the way it does.
First principles (the constraints)
- Moore’s law slowing; Dennard scaling is dead, per-transistor energy improves very slowly now, so we are power-limited
- Data movement is sin: 100x–100,000x more expensive than compute (see Lec 6 energy table)
- Matmul has unbounded arithmetic intensity
- Amdahl caps embarrassingly-parallel scaling; strong scaling is the only path forward (2× faster on the same problem, not 2× bigger problem in the same time)
The core tension
A100 tensor core: 8 clocks per 16×8×16 instruction. Make it 4 clocks? Doubles VRF bandwidth, halves SMEM→RMEM load time, exposes more issue-latency hiding. Insanely hard at the wire/parasitic level. Can't just crank the clock either: power scales roughly with the cube of frequency (P ∝ fV², and voltage must rise with clock speed).
Answer: make the instruction bigger and async. H100 WGMMA spans all 4 SM sub-partitions, tile is 64×N×16 with N ∈ {8, 16, …, 256}, reads B directly from SMEM (no register load), and runs asynchronously like a copy engine. That is 8× more data per MMA than Ampere’s 16×8×16, at 2× the clock rate.
The cascade of secondary problems Hopper had to solve
Each fix creates the next bottleneck:
- Bigger MMAs exhaust RMEM → read B from SMEM directly; tensor core is async
- GMEM load latency didn’t improve, just got worse → grow SMEM (A100 164 KiB → H100 256 KiB, +56%), allow 6–7 pipeline stages by giving up 2 CTA/SM occupancy in favor of 1 CTA/SM with all 256 KiB
- LDGSTS burns too many vector registers, issue slots, and predicates → introduce TMA, a single-threaded affine-tensor load/store engine with Uniform Datapath + Uniform Registers. Programmer describes a multi-dim tile; TMA moves it and self-commits via tx-count
- Small-tile perf killed by loop overheads (TMA → MMA → TMA is serial issue) → warp specialization: 1 TMA warp only issues TMAs, 2 MMA warpgroups only issue MMAs. Needs new SMEM barriers (configurable arrive counts, tx-count for TMA, programmable in shared memory)
- Warp-specialized kernel has 384 threads/CTA but CUDA wants equal registers per warp → `round(512/3, 8) = 168` registers/thread, not enough for MMA accumulators. TMA warp uses ~0 registers, so its 168 are wasted → register reconfiguration: start all at 168, donate from the TMA warp to the MMA warps in the prologue
- L2 read bandwidth now the bottleneck for large problems. Can't widen the L2↔GPC crossbar. But many CTAs along A/B projections load the same data → DSMEM (distributed shared memory): CTAs in a cluster treat remote SMEM as local (PGAS model, local crossbar within the GPC). Load once to the GPC, broadcast to SMs. Barriers can `arrive` on a remote SM; `wait` stays local
- We lost 2 CTA/SM so prologue/epilogue no longer hide each other → warp-specialized persistent kernels: launch #CTAs = #SMs, software tile scheduling. Ping-pong: two consumer warpgroups alternate math/epilogue across tiles. Small K benefits more from ping-pong; large K prefers cooperative
- Cross-kernel overhead: 2–6 μs grid-launch kills chains of small kernels → CUDA Graphs and Programmatic Dependent Launch (PDL) which lets kernels launch before stream ordering, with programmer-inserted dep points (e.g., an inference kernel starts loading weights early, only waits on the activation dependency)
Workload codesign (lower precision)
Lower precision has a multiplicative perf/W gain from both fewer data-movement bytes and cheaper ops. 8-bit floats break convergence for most training recipes, so the model authors and the hardware architects co-design the training recipe to match.
Blackwell
Generalizes Hopper ping-pong via TMEM (Tensor Memory). Accumulators live in TMEM as a shared resource across threads, so separate warps can run MMA and epilogue on different work units concurrently. More overlap surface exposed.
The takeaway line (verbatim from the closing slide): “First principles are all you need to know. Everything else is all details. But the details are so so fun :D”
9. Data-Parallel Primitives II
Part 2 with Ahmed Mahmoud. Extends Lec 7 with three primitives that show up everywhere the input has variable shape, plus the punchline that reduction itself is a matmul.
Split (4.5)
Binary partition of input by a predicate. Scan again gives the output addresses: one exclusive scan over the predicate places the true side, one over its complement places the false side, then scatter.
Load-balance search
Expand/contract operations have unequal work per input ([A,B,C,D,E,F] expands by counts [2,1,0,0,7,3] to AABEEEEEEEFFF). Naive mapping of threads to inputs is badly load-imbalanced. Load-balance search assigns work-items to outputs instead, using scan-over-counts + binary search to recover which input spawned each output. ModernGPU’s central idiom.
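The expand example, via load-balance search: scan the counts, then each of the `total` output slots independently binary-searches the scan to find which input spawned it, so work is balanced over outputs regardless of how skewed the counts are.

```python
from bisect import bisect_right
from itertools import accumulate

def expand(items, counts):
    offsets = list(accumulate(counts))    # inclusive scan: [2, 3, 3, 3, 10, 13]
    total = offsets[-1]
    # each output slot k finds its source: first offset strictly greater than k
    return [items[bisect_right(offsets, k)] for k in range(total)]

print("".join(expand(list("ABCDEF"), [2, 1, 0, 0, 7, 3])))   # AABEEEEEEEFFF
```

Note E's 7 copies cost 7 independent lookups, not 7 sequential steps on one thread, which is the whole point versus the naive one-thread-per-input mapping.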
Segmented scan (5.5)
Scan on arbitrary contiguous partitions, tracked by head flags. Same complexity as scan.
| Input | 3 | 1 | 7 | 0 | 4 | 1 | 6 | 3 |
|---|---|---|---|---|---|---|---|---|
| Head flag | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| Inclusive scan | 3 | 4 | 11 | 0 | 4 | 5 | 6 | 9 |
Enables variable-length operations in one pass: per-vertex edges, per-particle neighbors, per-document words. Blelloch’s 1990 paper shows a parallel quicksort on top of it: flag heads, broadcast pivot via max-scan, compare, segmented-split, recurse. The whole sort is just scans.
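The example above as a serial reference implementation (head flags restart the running sum; the parallel version carries (flag, value) pairs through an ordinary scan with a modified operator):

```python
def segmented_scan(xs, heads):
    out, run = [], 0
    for x, h in zip(xs, heads):
        run = x if h else run + x   # a head flag resets the accumulator
        out.append(run)
    return out

print(segmented_scan([3, 1, 7, 0, 4, 1, 6, 3],
                     [1, 0, 0, 1, 0, 0, 1, 0]))
# [3, 4, 11, 0, 4, 5, 6, 9]
```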
Histogram (6)
Partition → iterate → increment bin. Four GPU techniques, each addresses atomic contention:
- Atomics (baseline): one
atomicAddper input, serializes on hot bins - Privatization: each block has its own histogram in shared memory, merge at end
- Coarsening: each thread processes N items before syncing
- Aggregation: each thread keeps a private running count for the last seen bin, only commits on change
Drives top-k (rank by bin count) and SpMV row-bucketing.
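Privatization, simulated: each "block" accumulates into its own private histogram and the partials are merged once at the end, so a hot bin contends across a handful of merges instead of on every single input. The block count and data are arbitrary illustrations.

```python
from collections import Counter

def privatized_histogram(data, n_blocks=4):
    chunks = [data[i::n_blocks] for i in range(n_blocks)]  # one chunk per block
    privates = [Counter(chunk) for chunk in chunks]        # no cross-block atomics
    merged = Counter()
    for p in privates:                                     # only n_blocks merges
        merged.update(p)
    return merged

data = [0, 1, 1, 2, 1, 0, 2, 1, 1, 1]   # bin 1 is "hot"
assert privatized_histogram(data) == Counter(data)
print(privatized_histogram(data))
```

On a GPU the private copy lives in shared memory, so even the intra-block atomics are cheap.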
Merge (7)
Two sorted sequences into one. Serial merge is O(N). Naive parallel (each element binary-searches the other sequence) is O(N log N) work.
Merge Path (Odeh et al. IPDPS 2012): the merge path through the A[i] > B[j] matrix is the output sequence. Cross-diagonals partition the path into equal pieces; each processor does one constrained binary search, gets a balanced chunk. Hierarchical across thread/block/grid.
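A sketch of the partition search (following the ModernGPU formulation of Merge Path, with distinct elements assumed): for output diagonal d, binary-search the split point (i, j) with i + j = d, then each worker merges its equal-sized chunk independently.

```python
def merge_path_split(a, b, d):
    """How many elements of a (and of b) precede output position d."""
    lo, hi = max(0, d - len(b)), min(d, len(a))
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] < b[d - 1 - mid]:   # path must consume more of a
            lo = mid + 1
        else:
            hi = mid
    return lo, d - lo

def parallel_merge(a, b, workers=4):
    total = len(a) + len(b)
    cuts = [merge_path_split(a, b, (w * total) // workers) for w in range(workers)]
    cuts.append((len(a), len(b)))
    out = []
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):   # each chunk is independent
        out.extend(sorted(a[i0:i1] + b[j0:j1]))      # stand-in for a serial merge
    return out

a, b = [1, 3, 5, 7, 9, 11], [2, 4, 6, 8, 10, 12]
assert parallel_merge(a, b) == sorted(a + b)
print(parallel_merge(a, b))
```

Each worker's chunk is exactly `total / workers` outputs, which is the load balance the paper is after.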
Reduction via tensor cores
Dakkak et al. ICS'19: reduction = matmul with a column of ones: A · 1 = (Σⱼ aᵢⱼ)ᵢ, i.e., each output entry is a row sum. So the same hardware path that gives 990 TF for MMA (Lec 8) also gives you a free reduction if your element count fits a tile. Also expressible as a fused multiply-add.
Library homes for all of this: NVIDIA Thrust and CUB.
10. Beyond NVIDIA + CUDA
Final lecture. Consolidates the course’s mental model into six design levers, then surveys the accelerator landscape to show every vendor is pulling the same levers, just with different weights.
The six levers of a throughput processor
Parallelism and control:
- Explicit parallelism over implicit ILP: multicore, wide-issue SIMD
- Amortize control overhead: SIMD, VLIW (instead of burning silicon on speculation)
- Explicit concurrency (multithreading) to overlap memory and compute
Arithmetic intensity and memory model:
- Complex instructions (MMA) amortize operand fetch across many ops
- Wide memory interfaces: GDDR, HBM, highly-banked SRAM
- Software-managed memory hierarchy: explicit DMA, scratchpad vs cache
Every one of Lectures 2–8 is one of these levers. The accelerators below rebalance them.
Programming non-NVIDIA GPUs
AMD RDNA/CDNA, Intel Arc/DC GPU, Qualcomm Adreno, Arm Mali. GPUs span 100s of mW (mobile) to 10s of kW (DC). Stacks:
- Compute APIs: OpenCL, DirectCompute, Vulkan Compute
- Graphics APIs with compute shaders: Metal
- Higher-level kernel DSLs: Triton (Python-embedded, NumPy-like within a block, SPMD grids+blocks across them), cuTile, CuTe DSL. Triton makes you give up SASS-level peak in exchange for far less effort; the perf/effort curve goes SASS > CUDA > Triton > PyTorch
Google TPU (contrast class)
Same levers, very different weights:
- Few big cores vs GPU’s many small cores
- MXU (Matrix Multiply Unit) as a 128×128 systolic array, even more extreme “complex instruction” than Tensor Core
- 128×8 vector unit, scalar unit, transpose/permute
- HBM with DMA-only access, no implicit loads at all, every byte moved is explicit
- VLIW + static scheduling, no dynamic warp scheduling, compiler or framework handles all overlap
- Scales to 4 chips/board and 9216 chips/pod
- “Built to run ML models, not kernels”. The primary interface is XLA-compiled JAX or TensorFlow; Pallas (Triton-like JAX DSL) is the recent escape hatch for writing custom kernels
The proliferation
- TPU-likes: Amazon Trainium, Intel Gaudi
- NPUs: small embedded TPUs. Apple Neural Engine: 38 TOPS, 16 cores, ~1K MACs/core (32×32), fixed-point only (no training), CoreML-only interface
- DSPs, IPUs, FPGAs, CGRAs (“RDAs”), SmartNICs (“DPUs”)
The closing line summarizes the whole course: change the programming model and specialize the hardware to unlock architectural efficiency. You cannot get the last three orders of magnitude (from Lec 1’s matmul ladder) without abandoning the PDP-11 programming model.