VByte Encoding
VByte is a variable-length integer encoding that stores small numbers in fewer bytes. Covered in ECE459 L17.
Why use it?
Many real-world integer streams (e.g. delta-encoded doc IDs in an inverted index) are dominated by small values. Fixed 4-byte storage wastes RAM and cache; VByte packs small values tightly for higher throughput.
VByte
The high-order bit of each byte is 0 on the last byte of the integer, and 1 otherwise. UTF-8 uses a similar continuation scheme for its length prefix.
0 ≤ x < 2^7(e.g. 17 =0b10001): one byte0xxxxxxx, i.e.000100012^7 ≤ x < 2^14(e.g. 1729 =0b11011000001): two bytes1xxxxxxx / 0xxxxxxx, i.e.11000001 / 000011012^14 ≤ x < 2^21: three bytes, continuing the pattern
Gain: fewer bits in RAM/cache means higher throughput. Loss: naive decode has a branch per byte, causing lots of mispredictions.
Stream VByte [LKR18]
SIMD-friendly variant. Builds on masked VByte, varint-GB, and varint-G8IU.
Key innovation
Store the control stream and the data stream separately.
- Control stream: 2 bits per integer indicating size
00means 1 byte,01means 2 bytes,10means 3 bytes,11means 4 bytes
- Data stream: all 8 bits of every data byte are data (VByte wasted the top bit on the continuation flag)
Each decode iteration reads one byte from the control stream and 16 bytes from the data stream, then uses a precomputed shuffle mask (indexed by the control byte) to spread the data bytes into a 128-bit SIMD register.
Decoding one iteration
Control byte
0b10001100encodes sizes(3, 1, 4, 1), consuming 9 of the 16 data bytes (advance the data pointer by 9). For data0xf823e127 25249748 1b…, the shuffle outputs0x00f823e1 / 00000027 / 25249748 / 0000001b.
Core of the C++ implementation (three SIMD instructions, also in simdeez):
uint8_t C = lengthTable[control];
__m128i Data = _mm_loadu_si128((__m128i*) databytes);
__m128i Shuf = _mm_loadu_si128(shuffleTable[control]);
Data = _mm_shuffle_epi8(Data, Shuf);
databytes += C; control++;Why it’s fast [LKR18]:
- control bytes are sequential, so the processor prefetches the next one
- data bytes are sequential and load at high throughput
- shuffle is 1 cycle
- control flow is regular, no conditional jumps in the hot loop
Not strictly single-threaded (it’s SIMD), but beats prior art on one core thanks to predictable branches and good caching.