Single Instruction Multiple Data (SIMD)
SIMD is a form of data parallelism where one instruction performs the same operation on multiple data points simultaneously, using specialized wide CPU registers.
Why use SIMD?
A single control unit drives multiple processing units, cutting overhead in the instruction stream. For vector/matrix math, image processing, and tight numerical loops it delivers parallelism inside one core, complementary to threads.
Accessed via compiler intrinsics or assembly (or by letting the compiler auto-vectorize). Used for vector/matrix operations, image processing, and computationally intensive jobs.
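A minimal sketch of the intrinsics route, using the SSE intrinsics from `std::arch::x86_64` with a scalar fallback for other targets (the function name `add4` is mine, not from the notes):

```rust
// Hypothetical sketch: add four f32s at once with x86 SSE intrinsics.
// Falls back to plain scalar addition on non-x86_64 targets.
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("sse") {
            use std::arch::x86_64::*;
            unsafe {
                let va = _mm_loadu_ps(a.as_ptr()); // load 4 floats into an XMM register
                let vb = _mm_loadu_ps(b.as_ptr());
                let mut out = [0.0f32; 4];
                // one packed instruction adds all four lanes
                _mm_storeu_ps(out.as_mut_ptr(), _mm_add_ps(va, vb));
                return out;
            }
        }
    }
    // scalar fallback
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    let r = add4([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]);
    assert_eq!(r, [11.0, 22.0, 33.0, 44.0]);
}
```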
I ran into this trying to make Eigen matrices constexpr: https://stackoverflow.com/questions/49096618/does-there-exist-information-relating-to-eigenmatrix-constexpr-constructor
Really mastered this at Tesla.
Vector registers by architecture
| Architecture | SIMD Set | Vector Registers | Register Size |
|---|---|---|---|
| x86/x86-64 | MMX | MM0 - MM7 | 64-bit |
| x86/x86-64 | SSE | XMM0 - XMM15 | 128-bit |
| x86/x86-64 | AVX/AVX2 | YMM0 - YMM15 | 256-bit |
| x86/x86-64 | AVX-512 | ZMM0 - ZMM31 | 512-bit |
| ARM | NEON | Q0 - Q15 (also D0 - D31); V0 - V31 on AArch64 | 128-bit |
| ARM | SVE | Z0 - Z31 (scalable) | Variable |
| RISC-V | RVV | v0 - v31 | Variable (128-bit to 2048+ bits) |
| PowerPC | AltiVec | v0 - v31 | 128-bit |
| PowerPC | VSX | vs0 - vs63 | 128-bit |
| SPARC | VIS | f0 - f31 | 64-bit (used in SIMD pairs) |
From ECE459 L17
Origins trace to 1970s supercomputers. Modern examples: GPUs, x86 SSE, SPARC VIS, Power/PowerPC AltiVec.
Downsides
All units do the same thing, which isn’t always useful. Diminishing returns as width grows: the wider the vector, the less likely your problem has that many identical operations [Ton09].
Poor person’s SIMD (also known as SWAR, SIMD Within A Register). Pack several small values into one wide integer so a single instruction modifies many. Pay for bit shifts, sign-extension gotchas, and manual carry handling. Use real SIMD instead.
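A sketch of what that manual carry handling looks like: adding four u8 lanes packed into one u32, masking the high bit of each lane so carries can't cross lane boundaries (this is the classic SWAR byte-add trick, not something from the notes):

```rust
// SWAR sketch: add four u8 lanes packed in a u32 with one integer add.
// Each lane wraps mod 256; carries never cross into the next byte.
fn swar_add_bytes(x: u32, y: u32) -> u32 {
    const H: u32 = 0x8080_8080; // high bit of each byte lane
    // add the low 7 bits of each lane, then patch the high bits back in with xor
    ((x & !H).wrapping_add(y & !H)) ^ ((x ^ y) & H)
}

fn main() {
    let x = u32::from_be_bytes([1, 2, 3, 4]);
    let y = u32::from_be_bytes([10, 20, 30, 40]);
    assert_eq!(swar_add_bytes(x, y).to_be_bytes(), [11, 22, 33, 44]);
}
```

This is exactly the "manual math" cost: a real SIMD `paddb` does the same thing in one instruction with none of the masking.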
Auto-vectorization. For `for ((a, b), c) in a.iter().zip(b).zip(c) { *c = *a + *b; }`, rustc's default build emits scalar `movsd` / `addsd`. With `-O` it emits packed `movupd` / `addpd` touching 128 bits (two f64s) at a time, so the loop body runs half as many iterations. The compiler also emits a scalar remainder loop for slice lengths that aren't a multiple of the vector width.
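The zipped loop spelled out as a complete function (the exact instructions emitted depend on the target and rustc version; odd-length slices exercise the scalar remainder path):

```rust
// With `-O`, rustc typically lowers this body to packed movupd/addpd on x86-64.
pub fn add_assign(a: &[f64], b: &[f64], c: &mut [f64]) {
    for ((a, b), c) in a.iter().zip(b).zip(c) {
        *c = *a + *b;
    }
}

fn main() {
    // length 5 is deliberately not a multiple of the 2-wide f64 vector
    let a = [1.0, 2.0, 3.0, 4.0, 5.0];
    let b = [10.0, 20.0, 30.0, 40.0, 50.0];
    let mut c = [0.0; 5];
    add_assign(&a, &b, &mut c);
    assert_eq!(c, [11.0, 22.0, 33.0, 44.0, 55.0]);
}
```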
SIMD is not thread-level, it is complementary. Good for small data where thread startup cost would dominate. Lemire [Lem18] argues vector instructions are often a more efficient parallelization than threads: less CPU and power, faster runtime.
Using it explicitly (e.g. the simdeez crate):

```rust
simd_runtime_generate!(
    pub fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
        let len = a.len(); // assumes len is a multiple of S::VF32_WIDTH
        let mut result: Vec<f32> = Vec::with_capacity(len);
        result.set_len(len); // fine: every slot is written by the loop below
        for i in (0..len).step_by(S::VF32_WIDTH) {
            let a0 = S::loadu_ps(&a[i]); // unaligned load of one vector from each input
            let b0 = S::loadu_ps(&b[i]);
            S::storeu_ps(&mut result[i], S::add_ps(a0, b0)); // store the sums back at offset i
        }
        result
    }
);
```

Generates scalar / sse2 / sse41 / avx variants. Call via `unsafe { add_sse2(&a, &b) }`.
Alignment. Rust aligns primitives to their sizes by default. `#[repr(packed(N))]` lowers a type's alignment and `#[repr(align(N))]` raises it. `#[repr(C)]` gives C’s layout rules.
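A small sketch of those attributes (type names are mine; the 32-byte alignment matters because aligned AVX loads like `_mm256_load_ps` fault on misaligned addresses):

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
struct Plain {
    a: u8,
    b: u64, // padded out to its natural 8-byte alignment under C rules
}

#[repr(C, packed)]
struct Packed {
    a: u8,
    b: u64, // no padding; taking a reference to this field is disallowed
}

#[repr(align(32))]
struct Avx2Friendly([f32; 8]); // aligned for 256-bit vector loads

fn main() {
    assert_eq!(size_of::<Plain>(), 16);  // 1 + 7 padding + 8
    assert_eq!(size_of::<Packed>(), 9);  // 1 + 8, no padding
    assert_eq!(align_of::<Avx2Friendly>(), 32);
}
```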