VByte Encoding

VByte is a variable-length integer encoding that stores small numbers in fewer bytes. Covered in ECE459 L17.

Why use it?

Many real-world integer streams (e.g. delta-encoded doc IDs in an inverted index) are dominated by small values. Fixed 4-byte storage wastes RAM and cache; VByte packs small values tightly for higher throughput.

VByte

The high-order bit of each byte is 0 on the last byte of the integer, and 1 otherwise. UTF-8 uses a similar continuation scheme for its length prefix.

  • 0 ≤ x < 2^7 (e.g. 17 = 0b10001): one byte 0xxxxxxx, i.e. 00010001
  • 2^7 ≤ x < 2^14 (e.g. 1729 = 0b11011000001): two bytes 1xxxxxxx / 0xxxxxxx, i.e. 11000001 / 00001101
  • 2^14 ≤ x < 2^21: three bytes, continuing the pattern

Gain: fewer bits in RAM/cache means higher throughput. Loss: naive decode has a branch per byte, causing lots of mispredictions.

Stream VByte [LKR18]

SIMD-friendly variant. Builds on masked VByte, varint-GB, and varint-G8IU.

Key innovation

Store the control stream and the data stream separately.

  • Control stream: 2 bits per integer indicating size
    • 00 means 1 byte, 01 means 2 bytes, 10 means 3 bytes, 11 means 4 bytes
  • Data stream: all 8 bits of every data byte are data (VByte wasted the top bit on the continuation flag)

Each decode iteration reads one byte from the control stream and 16 bytes from the data stream, then uses a precomputed shuffle mask (indexed by the control byte) to spread the data bytes into a 128-bit SIMD register.

Decoding one iteration

Control byte 0b10001100 encodes sizes (3, 1, 4, 1), consuming 9 of the 16 data bytes (advance the data pointer by 9). For data 0xf823e127 25249748 1b…, the shuffle outputs 0x00f823e1 / 00000027 / 25249748 / 0000001b.

Core of the C++ implementation (three SIMD instructions, also in simdeez):

uint8_t C = lengthTable[control];
__m128i Data = _mm_loadu_si128((__m128i*) databytes);
__m128i Shuf = _mm_loadu_si128(shuffleTable[control]);
Data = _mm_shuffle_epi8(Data, Shuf);
databytes += C; control++;

Why it’s fast [LKR18]:

  • control bytes are sequential, so the processor prefetches the next one
  • data bytes are sequential and load at high throughput
  • shuffle is 1 cycle
  • control flow is regular, no conditional jumps in the hot loop

Not strictly single-threaded (it’s SIMD), but beats prior art on one core thanks to predictable branches and good caching.