Single Instruction Multiple Data (SIMD)

SIMD is a form of data parallelism where one instruction performs the same operation on multiple data points simultaneously, using specialized wide CPU registers.

Why use SIMD?

A single control unit drives multiple processing units, cutting overhead in the instruction stream. For vector/matrix math, image processing, and tight numerical loops it delivers parallelism inside one core, complementary to threads.

Accessed via compiler intrinsics or assembly, or left to the compiler's auto-vectorizer.
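A minimal intrinsics sketch (assuming an x86-64 target, where SSE is part of the baseline): four f32 additions with a single packed add via std::arch. The function name add4 is just illustrative.

```rust
#[cfg(target_arch = "x86_64")]
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    use std::arch::x86_64::*;
    let mut out = [0.0f32; 4];
    // Safety: SSE/SSE2 are part of the x86-64 baseline, so these
    // intrinsics are always available on this target.
    unsafe {
        let va = _mm_loadu_ps(a.as_ptr()); // load 4 floats (unaligned ok)
        let vb = _mm_loadu_ps(b.as_ptr());
        _mm_storeu_ps(out.as_mut_ptr(), _mm_add_ps(va, vb)); // one packed add
    }
    out
}
```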

I ran into this trying to make Eigen matrices constexpr: https://stackoverflow.com/questions/49096618/does-there-exist-information-relating-to-eigenmatrix-constexpr-constructor

Really mastered this at Tesla.

Vector registers by architecture

| Architecture | SIMD Set | Vector Registers | Register Size |
|---|---|---|---|
| x86/x86-64 | MMX | MM0 - MM7 | 64-bit |
| x86/x86-64 | SSE | XMM0 - XMM15 | 128-bit |
| x86/x86-64 | AVX | YMM0 - YMM15 | 256-bit |
| x86/x86-64 | AVX-512 | ZMM0 - ZMM31 | 512-bit |
| ARM | NEON | Q0 - Q31 (also D0 - D31) | 128-bit |
| ARM | SVE | Z0 - Z31 | Scalable (128-bit to 2048-bit) |
| RISC-V | RVV | v0 - v31 | Variable (128-bit to 2048+ bits) |
| PowerPC | AltiVec | v0 - v31 | 128-bit |
| PowerPC | VSX | vs0 - vs63 | 128-bit |
| SPARC | VIS | f0 - f31 | 64-bit (used in SIMD pairs) |

From ECE459 L17

Origins trace to 70s supercomputers. Modern examples: GPUs, x86 SSE, SPARC VIS, Power/PowerPC AltiVec.

Downsides

All units do the same thing, which isn’t always useful. Diminishing returns as width grows: the wider the vector, the less likely your problem has that many identical operations [Ton09].

Poor person’s SIMD. Pack several small values into one wide integer so a single instruction modifies many. Pay for bit shifts, sign-extension gotchas, and manual math. Use real SIMD instead.
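A sketch of the packing trick above: four u8 lanes in one u32, added with a single integer add. The top bit of each lane is masked off so carries cannot cross lane boundaries, then restored with xor. swar_add_bytes is an illustrative name, not from any library.

```rust
// "Poor person's SIMD" (SWAR): add four u8 lanes packed into a u32.
fn swar_add_bytes(x: u32, y: u32) -> u32 {
    // Add the low 7 bits of each lane; sums fit in 8 bits, so no
    // carry can spill into the neighboring lane.
    let low = (x & 0x7F7F_7F7F) + (y & 0x7F7F_7F7F);
    // Recover each lane's top bit with a carry-free xor.
    low ^ ((x ^ y) & 0x8080_8080)
}
```

One add covers four lanes, but as the text says, the masking and the sign/carry gotchas are exactly what real SIMD instructions handle for you.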

Auto-vectorization. For for ((a, b), c) in a.iter().zip(b).zip(c) { *c = *a + *b; }, rustc at the default opt-level emits scalar movsd / addsd. With -O it emits packed movupd / addpd touching 128 bits (two f64s) at a time, so the loop body runs half as many iterations. The compiler also emits a scalar tail loop for lengths that aren't a multiple of the vector width.
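The zip loop above, written out as a function so the effect is easy to inspect with --emit asm or cargo-asm; f64 slices are assumed, matching the movsd/addpd in the text.

```rust
// At opt-level >= 2 rustc typically auto-vectorizes this body into
// packed loads/adds (movupd / addpd on SSE2), plus a scalar tail.
fn add_into(a: &[f64], b: &[f64], c: &mut [f64]) {
    for ((a, b), c) in a.iter().zip(b).zip(c) {
        *c = *a + *b;
    }
}
```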

SIMD is not thread-level, it is complementary. Good for small data where thread startup cost would dominate. Lemire [Lem18] argues vector instructions are often a more efficient parallelization than threads: less CPU and power, faster runtime.

Using it explicitly (e.g. simdeez):

```rust
simd_runtime_generate!(pub fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    let len = a.len();
    let mut result: Vec<f32> = Vec::with_capacity(len);
    result.set_len(len);
    for i in (0..len).step_by(S::VF32_WIDTH) {
        let a0 = S::loadu_ps(&a[i]);
        let b0 = S::loadu_ps(&b[i]);
        // Store at the current offset, not always at index 0.
        S::storeu_ps(&mut result[i], S::add_ps(a0, b0));
    }
    result
});
```

Generates scalar / sse2 / sse41 / avx variants. Call via unsafe { add_sse2(&a, &b) }.

Alignment. Rust aligns primitives to their sizes by default. #[repr(packed(N))] or #[repr(align(N))] override. #[repr(C)] gives C’s layout rules.
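The repr rules above, illustrated with hypothetical structs (the names Plain/Packed/Aligned are made up for the example); std::mem reports the resulting sizes and alignments.

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
struct Plain { a: u8, b: u64 }      // C layout: 7 bytes of padding after `a`

#[repr(C, packed)]
struct Packed { a: u8, b: u64 }     // no padding; `b` may be misaligned

#[repr(C, align(16))]
struct Aligned { a: u8, b: u64 }    // whole struct aligned to 16 bytes
```

On x86-64, Plain is 16 bytes with alignment 8, Packed shrinks to 9 bytes with alignment 1, and Aligned reports alignment 16 (with its size padded to a multiple of 16), which matters because aligned SIMD loads like load_ps require it.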