Table of Contents
Introduction
This post is a collection of notes I took and benchmarks I ran while implementing and optimizing dsi-bitstream-rs.
Theoretical preliminaries
All bit sequences will be read from left to right, from top to bottom. For ease of reading, I will use the convention that all codes start from 0, adding a +1 where needed. This differs from the book definitions of the codes, but it's how you implement them in practice.
Information
Information is defined by the following axioms:
- The information \(I(p)\) of an event with probability \(p\) is a monotone decreasing function of \(p\), i.e. \(p \le q \implies I(p) \ge I(q)\)
- \(I(1) = 0\): The information content of a certain event is 0
- Information is additive, i.e. the information of two independent events is the sum of the information of each event: \(I(pq) = I(p) + I(q)\)
The only function that satisfies these axioms is: \[I(p) = -\log_2 p\]
Example
A simple way to derive the information function for a discrete uniform distribution is the following:
Using \(n\) bits we can store \(2^n\) different values. We can invert this relationship as: a set of \(N = 2^n\) values with uniform probability has \(I(p) = \log_2 N = n\) bits of information.
We use uniform probability because we don't know anything about the data.
By definition, each value in the set has probability \(p = \frac{1}{N}\), so \(N = \frac{1}{p}\) and we can rewrite the information as: \[I(p) = \log_2 \frac{1}{p} = -\log_2 p\]
which is exactly the definition of Information.
Entropy
The entropy is the expected information.
Given a probability distribution \(p: S \to [0, 1]\) with support \(S\), the entropy of \(p\) in bits is defined as: \[\mathbb{H}(p) = -\sum_{s \in S} p(s) \log_2 p(s) = \mathbb{E}_p[-\log_2 p]\]
The entropy is a measure of the uncertainty of a random variable: it's maximized when all the values are equally likely, and minimized when the distribution is a delta function, i.e. one value has probability 1 while the others have probability 0.
For our use cases, the entropy is the minimum number of bits needed to store a value extracted from a distribution.
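As a quick illustration, here is a minimal sketch (not part of the crate) of computing the entropy in bits of a distribution given as a vector of probabilities:
/// Entropy in bits of a discrete distribution given as a vector of
/// probabilities; terms with p = 0 contribute nothing, by convention.
fn entropy(probs: &[f64]) -> f64 {
    probs
        .iter()
        .filter(|&&p| p > 0.0)
        .map(|&p| -p * p.log2())
        .sum()
}
// A fair coin has 1 bit of entropy, a certain event has 0:
// entropy(&[0.5, 0.5]) == 1.0 and entropy(&[1.0, 0.0]) == 0.0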
Instantaneous code
An instantaneous (binary) code (a.k.a. prefix-free code) for the set \(S\) is a function \(c : S \to \{0, 1\}^*\) such that, for all \(x, y \in S\), if \(c(x)\) is a prefix of \(c(y)\), then \(x = y\).
This implies that there is a unique way of decoding a given sequence.
The Kraft-McMillan inequality tells us that there exists an instantaneous code with lengths \(|c(s)|\) (\(s \in S\)) if, and only if: \[\sum_{s \in S} 2^{-|c(s)|} \le 1\]
where \(l_s = |c(s)|\) is the length in bits of the code \(c(s)\) for value \(s \in S\).
A complete code is an instantaneous code where: \[\sum_{s \in S} 2^{-l_s} = 1\]
A complete code means that there are no unused codewords.
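As a sketch, both conditions can be checked numerically from the codeword lengths (this helper is illustrative, not part of the crate):
/// Kraft-McMillan sum of a set of codeword lengths in bits:
/// <= 1.0 means an instantaneous code with these lengths exists,
/// == 1.0 means such a code is complete.
fn kraft_sum(lengths: &[u32]) -> f64 {
    lengths.iter().map(|&l| 2f64.powi(-(l as i32))).sum()
}
// The fixed-length code for N = 4 (four codewords of 2 bits) is complete:
// kraft_sum(&[2, 2, 2, 2]) == 1.0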
Furthermore, the expected length of \(c\) with respect to some probability distribution \(p: S \to [0, 1]\) with support \(S\) is:
\[\mathbb{E}_p[|c|] = \sum_{s \in S} p(s) |c(s)|\]
it follows that the optimal code \(c^*_p\) for a given probability distribution \(p\) is the instantaneous code with minimum expected length.
If \(S\) is finite, Shannon's coding theorem tells us that the minimum expected length is at least the entropy of the distribution \(\mathbb{H}(p)\): \[\mathbb{E}_p[|c^*_p|] \ge \mathbb{H}(p)\] with equality when all the probabilities are negative powers of two;
moreover, the Huffman encoding is optimal.
Intended distribution
Using the same equation for the expected length we can also compute the intended distribution, i.e. the probability \(p\) that minimizes the expected length \(\mathbb{E}_p[|c|]\) for a given code \(c\).
We can derive the intended distribution starting from the equality case: \[\mathbb{E}_p[|c^*_p|] = \mathbb{H}(p)\] by applying the respective definitions: \[\sum_{s \in S} p(s) |c(s)| = \sum_{s \in S} -p(s) \log_2 p(s)\] a way to make this hold is to impose that the inner terms of the sum must be equal: \[\forall s \in S \quad p(s) |c(s)| = -p(s) \log_2 p(s)\] we can factor out the common \(p(s)\) term (assuming it's not 0, in which case the equation holds anyway): \[\forall s \quad |c(s)| = -\log_2 p(s)\] and finally we can derive \(p(s)\): \[\forall s \quad p(s) = 2^{-|c(s)|}\]
Thus, the intended distribution for \(c\) is: \[p(s) = 2^{-|c(s)|}\]
Note that this is a proper distribution if and only if the code is complete, otherwise it won't sum to 1 and we would need some renormalization constant.
Therefore, to achieve the best compression, we have to choose the code whose intended distribution most closely matches the data distribution.
Universal code
A code over the positive integers is universal when its expected length stays within a constant factor of the entropy for every distribution with finite entropy and non-increasing probabilities; Gamma and Delta below are examples.
Codes
Fixed length
It's defined on the set of values between \(0\) and \(N\) (excluded), i.e. \(S = \{0, 1, ..., N - 1\}\); it represents each element \(s \in S\) using \(|c(s)| = \lceil \log_2 N \rceil\) bits.
Thus, the fixed length code is optimal for uniform distributions of values in \(\left[0, N\right)\) if, and only if, \(N\) is a power of two.
Example for N = 4
Number | Code |
---|---|
0 | 00 |
1 | 01 |
2 | 10 |
3 | 11 |
Note that any other permutation of the codewords is also valid.
Example for N = 8
Number | Code |
---|---|
0 | 000 |
1 | 001 |
2 | 010 |
3 | 011 |
4 | 100 |
5 | 101 |
6 | 110 |
7 | 111 |
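A minimal sketch of reading and writing this code, assuming the same hypothetical read_bits / write_bits primitives used by the other snippets:
fn read_fixed_length(&mut self, n: u64) -> u64 {
    // ceil(log2(n)) bits are enough for all values in [0, n)
    let bits = n.next_power_of_two().ilog2() as usize;
    self.read_bits(bits)
}
fn write_fixed_length(&mut self, value: u64, n: u64) {
    debug_assert!(value < n);
    let bits = n.next_power_of_two().ilog2() as u64;
    self.write_bits(value, bits);
}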
Truncated Binary Encoding
Also known as minimal binary encoding, it is used for uniform distributions where \(N\) is not a power of two.
The core idea is that we can write the first \(2^{\lfloor \log_2 N \rfloor}\) values, i.e. up to the biggest power of two smaller than \(N\), with \(\lfloor \log_2 N \rfloor\) bits, and the remaining \(N - 2^{\lfloor \log_2 N \rfloor}\) values using \(\lceil \log_2 N \rceil\) bits.
Example for N = 6
Number | Code |
---|---|
0 | 00 |
1 | 01 |
2 | 10 |
3 | 11 |
4 | 100 |
5 | 101 |
Note: this is not instantaneous! The code for 2 is a prefix of the codes for 4 and 5! We will deal with this later.
This is close to optimal, as the expected code length is: \[\mathbb{E}[|c|] = \frac{2^{\lfloor \log_2 N \rfloor}}{N} \lfloor \log_2 N \rfloor \; + \; \left(1 - \frac{2^{\lfloor \log_2 N \rfloor}}{N} \right) \lceil \log_2 N \rceil \]
while the entropy is: \[\mathbb{H}[p] = \sum_{s \in S} -\frac{1}{N} \log_2 \frac{1}{N} = \log_2 N\]
If we define \(\alpha = \frac{2^{\lfloor \log_2 N \rfloor}}{N}\) we can notice that the expected length is the convex combination of \(\lfloor \log_2 N \rfloor\) and \(\lceil \log_2 N \rceil\):
\[\mathbb{E}[|c|] = \alpha \lfloor \log_2 N \rfloor \; + \; \left(1 - \alpha \right) \lceil \log_2 N \rceil \]
When \(N\) is a power of two this is equivalent to the fixed length and is optimal because: \[\mathbb{E}[|c|] = \lfloor \log_2 N \rfloor = \log_2 N = \lceil \log_2 N \rceil\]
Otherwise, by definition, a convex combination only takes values between its extremes \(\lfloor \log_2 N \rfloor\) and \(\lceil \log_2 N \rceil\), so the truncated binary encoding is always smaller than or equal to the fixed length.
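Here is a sketch of the instantaneous variant (the minimal binary code used by the Golomb and Zeta snippets below), assuming MSB-first bit order and the same hypothetical peek_bits / skip_bits / read_bits / write_bits primitives as the other snippets: the k-bit codewords come first, and the longer codewords are offset so that no short codeword is a prefix of a long one.
fn read_minimal_binary(&mut self, n: u64) -> u64 {
    let k = n.ilog2() as usize;       // floor(log2(n))
    let short = (1 << (k + 1)) - n;   // how many values use only k bits
    let prefix = self.peek_bits(k);   // look at k bits without consuming
    if prefix < short {
        self.skip_bits(k);
        prefix
    } else {
        // long codeword: take all k + 1 bits and remove the offset
        self.read_bits(k + 1) - short
    }
}
fn write_minimal_binary(&mut self, value: u64, n: u64) {
    let k = n.ilog2() as usize;
    let short = (1 << (k + 1)) - n;
    if value < short {
        self.write_bits(value, k);
    } else {
        self.write_bits(value + short, k + 1);
    }
}
For N = 6 this yields the codewords 00, 01, 100, 101, 110, 111, which, unlike the table above, is instantaneous.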
Unary
We can write any natural number \(s \in \mathbb{N}\) as a sequence of \(s\) zeros followed by a one, which acts as a terminator.
\[s \to \overbrace{00...00}^\text{s} 1 = 0^s 1\]
Number | Code |
---|---|
0 | 1 |
1 | 01 |
2 | 001 |
3 | 0001 |
4 | 00001 |
5 | 000001 |
6 | 0000001 |
7 | 00000001 |
8 | 000000001 |
This code has length \(s + 1\); therefore, its intended distribution is: \[p(s) = \left(\frac{1}{2}\right)^{s + 1}\]
Thus, it's optimal for the geometric distribution with \(p = \frac{1}{2}\).
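Indeed, the intended distribution sums to 1, so the unary code is complete: \[\sum_{s=0}^{\infty} 2^{-(s+1)} = \frac{1}{2} \cdot \frac{1}{1 - \frac{1}{2}} = 1\]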
Most modern CPU architectures, like x86_64 and Aarch64 (Arm), have instructions to count the leading or trailing zeros in a register.
x86_64 has LZCNT and TZCNT instructions, while Aarch64 has only CLZ.
Using these we can decode unary codes in 1 ns on average.
fn read_unary(&mut self) -> u64 {
    let value = self.peek_u64();
    let result = value.leading_zeros() as u64; // let rust do the magic
    self.skip_bits(result as usize + 1);
    result
}
fn write_unary(&mut self, value: u64) {
    // a word with `value` zero bits below a single 1: writing its
    // lowest n_bits emits the zeros followed by the terminator
    let bits = 1 << value;
    let n_bits = value + 1;
    self.write_bits(bits, n_bits);
}
Elias-Gamma
The idea behind Elias-Gamma is to write a number \(s\) in binary and prefix it with its length in unary.
Moreover, if we encode \(s + 1\) instead of \(s\), the binary representation always starts with a 1, so we can save 1 bit by using the unary terminator 1 as the highest bit of the binary part.
So the final Elias-Gamma encodes a value \(s \in \mathbb{N}\) by writing \(\lfloor \log_2 (s + 1) \rfloor\) in unary (whose terminating 1 doubles as the highest bit of \(s + 1\)), followed by the lowest \(\lfloor \log_2 (s + 1) \rfloor\) bits of \(s + 1\) in binary.
Or equivalently, it encodes a value \(s \in \mathbb{N}\) writing \(\lfloor \log_2 (s + 1) \rfloor\) in unary and then \(s + 1 - 2^{\lfloor \log_2 (s + 1) \rfloor}\) in binary, using \(\lfloor \log_2 (s + 1) \rfloor\) bits.
\[s \to \text{Unary}\left(\lfloor \log_2 (s + 1) \rfloor\right) \text{Binary}\left(s + 1 - 2^{\lfloor \log_2 (s + 1) \rfloor}\right)\]
Therefore, the length of the gamma code for \(s\) is \(1 + 2\lfloor \log_2 (s + 1) \rfloor\).
Finally, its intended distribution is:
\[p(s) \propto 2^{-1-2\lfloor \log_2 (s + 1) \rfloor} \approx \frac{1}{2(s + 1)^2}\]
Number | Gamma Code |
---|---|
0 | 1 |
1 | 01 0 |
2 | 01 1 |
3 | 001 00 |
4 | 001 01 |
5 | 001 10 |
6 | 001 11 |
7 | 0001 000 |
8 | 0001 001 |
While the math is a bit complicated, the code is really simple to implement:
fn read_gamma(&mut self) -> u64 {
    let len = self.read_unary();
    self.read_bits(len as usize) + (1 << len) - 1
}
fn write_gamma(&mut self, mut value: u64) {
    value += 1;
    let number_of_bits_to_write = value.ilog2() as u64; // floor(log2(value))
    let short_value = value - (1 << number_of_bits_to_write);
    self.write_unary(number_of_bits_to_write);
    self.write_bits(short_value, number_of_bits_to_write);
}
Elias-Delta
Elias-Delta just writes the length part of the gamma code using a gamma code, and then the value in binary.
\[s \to \text{Gamma}\left(\lfloor \log_2 (s + 1) \rfloor\right) \text{Binary}\left(s + 1 - 2^{\lfloor \log_2 (s + 1) \rfloor}\right)\]
Therefore, its length is \(1 + 2\lfloor \log_2 (\lfloor \log_2 (s + 1) \rfloor + 1) \rfloor + \lfloor \log_2 (s + 1) \rfloor\).
Finally, its intended distribution is:
\[p(s) \propto 2^{-1 - 2\lfloor \log_2 (\lfloor \log_2 (s + 1) \rfloor + 1) \rfloor - \lfloor \log_2 (s + 1) \rfloor} \approx \frac{1}{2(s + 1) (\log_2(s + 1))^2}\]
Number | Delta Code |
---|---|
0 | 1 |
1 | 01 0 0 |
2 | 01 0 1 |
3 | 01 1 00 |
4 | 01 1 01 |
5 | 01 1 10 |
6 | 01 1 11 |
7 | 001 00 000 |
8 | 001 00 001 |
While the math is a bit complicated, the code is really simple to implement:
fn read_delta(&mut self) -> u64 {
    let len = self.read_gamma();
    self.read_bits(len as usize) + (1 << len) - 1
}
fn write_delta(&mut self, mut value: u64) {
    value += 1;
    let number_of_bits_to_write = value.ilog2() as u64; // floor(log2(value))
    let short_value = value - (1 << number_of_bits_to_write);
    self.write_gamma(number_of_bits_to_write);
    self.write_bits(short_value, number_of_bits_to_write);
}
Golomb
For a given \(b\), write \(\lfloor \frac{s}{b} \rfloor\) in unary and \(s \bmod b\) using truncated (minimal) binary encoding.
The length is \(|c(s)| \le \left \lfloor \frac{s}{b} \right \rfloor + 1 + \lceil \log_2 b \rceil\); thus, its intended distribution is approximately geometric.
For a geometric distribution of ratio \(p\) we can compute the \(b\) so that the Golomb code is optimal: \[b = \left \lceil \frac{\log_2 (2 - p)}{-\log_2(1 - p)} \right \rceil\]
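As a sketch, this formula translates directly to code (the floating-point version here is just illustrative):
/// Optimal Golomb parameter b for a geometric distribution with
/// success probability p (0 < p < 1).
fn golomb_optimal_b(p: f64) -> u64 {
    ((2.0 - p).log2() / -(1.0 - p).log2()).ceil() as u64
}
// For p = 1/2 this gives b = 1, i.e. the plain unary code.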
fn read_golomb(&mut self, b: u64) -> u64 {
    let slot = self.read_unary();
    let rem = self.read_minimal_binary(b);
    slot * b + rem
}
fn write_golomb(&mut self, value: u64, b: u64) {
    let slot = value / b;
    let rem = value % b;
    self.write_unary(slot);
    self.write_minimal_binary(rem, b);
}
VBytes
This code, in its LEB128 variation, is really common, as it's present in both the DWARF format and the WebAssembly binary format.
This code is byte-aligned: the core idea is to use the highest bit of every byte as a continuation bit. We write 7 bits of payload per byte and set the highest bit to 0 to continue or to 1 to stop.
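As a sketch, the classic decoding loop for this convention (assuming a hypothetical read_byte primitive and, as in LEB128, least-significant payload bits first) looks like:
fn read_vbyte_loop(&mut self) -> u64 {
    let mut value = 0;
    let mut shift = 0;
    loop {
        let byte = self.read_byte() as u64;
        // the low 7 bits are payload; the high bit is 1 on the last byte
        value |= (byte & 0x7F) << shift;
        if byte & 0x80 != 0 {
            return value;
        }
        shift += 7;
    }
}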
A faster variant is to have all the continuation bits at the start: we write the number of bytes minus 1 in unary at the start, and then we can read all the payload bits contiguously. This avoids iterating over the bytes to check the continuation bit, as we can compute the length at once with a single LZCNT.
LEB128 is not complete, as it has infinitely many encodings for each number: we can just add 0 padding on top.
0b1000_0000 = 0
0b0000_0000 0b1000_0000 = 0
0b0000_0000 0b0000_0000 0b1000_0000 = 0
0b0000_0000 0b0000_0000 0b0000_0000 0b1000_0000 = 0
Moreover, if a value uses 2 bytes we know that it must be bigger than 127, so we can subtract 128 to fit more values; doing this recursively for each length makes the code complete.
So, if \(n\) is the number of bytes used, our final code looks like:
\[s \to \text{Unary}(n - 1)\ \text{Binary}\left(s - \text{offset}_n,\ 7n \text{ bits}\right) \qquad \text{offset}_n = \sum_{i=1}^{n-1} 2^{7i}\]
Number | VBytes code |
---|---|
0 | 0b1000_0000 = 128 |
1 | 0b1000_0001 = 129 |
2 | 0b1000_0010 = 130 |
... | ... |
127 | 0b1111_1111 = 255 |
128 | 0b0100_0000 0b0000_0000 = 64 0 |
129 | 0b0100_0000 0b0000_0001 = 64 1 |
130 | 0b0100_0000 0b0000_0010 = 64 2 |
... | ... |
16511 | 0b0111_1111 0b1111_1111 = 127 255 |
16512 | 0b0010_0000 0b0000_0000 0b0000_0000 = 32 0 0 |
Example implementation:
fn read_vbytes(&mut self) -> u64 {
let len = self.read_unary() + 1;
let mut value = self.read_bits(len as usize * 7);
let mut max = 1 << 7;
for _ in 1..len {
value += max;
max <<= 7;
}
value
}
fn write_vbytes(&mut self, mut value: u64) {
    let mut max = 1 << 7;
    let mut len = 0;
    // find how many extra bytes we need, subtracting each level's offset
    while value >= max {
        value -= max;
        max <<= 7;
        len += 1;
    }
    self.write_unary(len);
    self.write_bits(value, (len + 1) * 7);
}
This implementation is not optimal: unrolling the while into a series of ifs will make it faster. Moreover, everything is byte-aligned, so the generic unary and bit reads are overkill. For a proper implementation see here.
Zeta
TODO!: Unary + Truncated Encoding
fn read_zeta(&mut self, k: u64) -> u64 {
    let h = self.read_unary();
    let u = 1 << ((h + 1) * k);
    let l = 1 << (h * k);
    let res = self.read_minimal_binary(u - l);
    l + res - 1
}
fn write_zeta(&mut self, k: u64, mut value: u64) {
    value += 1;
    let h = value.ilog2() as u64 / k;
    let u = 1 << ((h + 1) * k);
    let l = 1 << (h * k);
    self.write_unary(h);
    self.write_minimal_binary(value - l, u - l);
}
Number | Zeta1 | Zeta2 | Zeta3 | Zeta4 |
---|---|---|---|---|
0 | 1 | 1 0 | 1 00 | 1 000 |
1 | 01 0 | 1 10 | 1 010 | 1 0010 |
2 | 01 1 | 1 11 | 1 011 | 1 0011 |
3 | 001 00 | 01 000 | 1 100 | 1 0100 |
4 | 001 01 | 01 001 | 1 101 | 1 0101 |
5 | 001 10 | 01 010 | 1 110 | 1 0110 |
6 | 001 11 | 01 011 | 1 111 | 1 0111 |
7 | 0001 000 | 01 1000 | 01 00000 | 1 1000 |
Note that Zeta1 == Gamma
Encoding of negative numbers
All the discussed codes are defined over some subset of \(\mathbb{N}\); to encode over \(\mathbb{Z}\), we have to define a bijection \(\phi: \mathbb{Z} \to \mathbb{N}\).
If we can assume that the values form some kind of bell shape centered around zero, as we often do, it's best to map values with small absolute value to small integers, so it's natural to use the following bijection:
\[\phi(x) = \begin{cases} 2 x & \text{if } x \ge 0\\ -2 x - 1 & \text{otherwise}\end{cases}\] which has inverse: \[\phi^{-1}(x) = \begin{cases} x / 2 & \text{if } x \equiv 0 \pmod 2\\ -(x + 1) / 2 & \text{otherwise}\end{cases}\]
\(x \in \mathbb{N}\) | \(\phi^{-1}(x) \in \mathbb{Z}\) |
---|---|
0 | 0 |
1 | -1 |
2 | 1 |
3 | -2 |
4 | 2 |
5 | -3 |
6 | 3 |
this can be efficiently implemented as:
#[inline]
pub const fn int2nat(x: i64) -> u64 {
    // x >> 63 is the sign mask: all ones for negative x, all zeros otherwise
    (x << 1 ^ (x >> 63)) as u64
}
#[inline]
pub const fn nat2int(x: u64) -> i64 {
    // !((x & 1) - 1) is all ones for odd x, so odd values map back
    // to the negative numbers
    ((x >> 1) ^ !((x & 1).wrapping_sub(1))) as i64
}
which in x86_64 assembly should look like:
int2nat:
lea rax, [rdi + rdi]
sar rdi, 63
xor rax, rdi
ret
nat2int:
mov rax, rdi
and edi, 1
shr rax
neg rdi
xor rax, rdi
ret
This is also called zigzag encoding (another reference).
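A quick sanity check of the mapping (matching the table above):
#[test]
fn test_zigzag() {
    assert_eq!(int2nat(0), 0);
    assert_eq!(int2nat(-1), 1);
    assert_eq!(int2nat(1), 2);
    assert_eq!(int2nat(-2), 3);
    // the two functions are inverses of each other
    for x in -1000_i64..1000 {
        assert_eq!(nat2int(int2nat(x)), x);
    }
}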
Reading codes fast
In this section, we'll see the core ideas we had while implementing our Rust crate for reading and writing bitstreams: dsi-bitstream-rs.
Buffering?
Should we read from the bitstream every time we need to decode a code, or should we keep a local buffer?
By definition, most codes we read will take just a few bits, so the idea is to keep a buffer of 64 bits and read from it. This avoids a memory access for each read, which is a common bottleneck.
But buffering needs more logic, and it's useful only if the buffer is kept in a register: if it's spilled to the stack, we still need a memory access.
Bitorder, Big or Little endian?
Now we have two possible conventions: we can read the bits in a byte either from the Most Significant Bit to the Least Significant Bit, or vice versa.
It's not clear a priori which is the best choice, but we can easily implement both and benchmark them. In the following examples we will only discuss the Big Endian (MSB to LSB) version, as it is the default in the Java implementation of webgraph.
Given a byte, the MSB and LSB orders number the bits as:
M L
0000_0000
MSB: 0123 4567
LSB: 7654 3210
Note that for the MSB order we will have to read bytes in big-endian order, thus we also call it the BE order, while for the LSB order we will have to read bytes in little-endian order, thus the LE order.
A simple example is the table for unary codes: for LE, every other entry will be 0 (all odd indices), as the values are "interleaved", while for BE, the 0s are at the start. Therefore, LE will use twice as much L1 cache to store the 0s (which are the most frequently accessed entries).
const LE_TABLE: &[u16] = &[
u16::MAX,
0,
1,
0,
2,
0,
1,
0,
...
];
const BE_TABLE: &[u16] = &[
u16::MAX,
15,
14,
14,
13,
13,
13,
13,
...
];
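These tables don't need to be written by hand; a build-script-style sketch that generates the BE table for 16-bit peeks (hypothetical, not the crate's actual generator) might look like:
/// BE unary decode table for every possible 16-bit peek: the entry is the
/// decoded value (the number of leading zeros), or u16::MAX when the
/// terminating 1 is beyond the peeked bits.
fn gen_be_unary_table() -> Vec<u16> {
    (0..=u16::MAX)
        .map(|peek| {
            if peek == 0 {
                u16::MAX
            } else {
                peek.leading_zeros() as u16
            }
        })
        .collect()
}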
How big should the buffer be?
We can choose a buffer size that fits in a register, so 8, 16, 32 or 64 bits, or 128, 256 and 512 bits with the SSE, AVX and AVX512 extensions.
512 bits buffer?
If we want to optimize memory throughput, we can use a 64-byte buffer, as it's the size of a cache line on most modern CPUs, so we would have a single memory access every 64 bytes.
However, to handle 64-byte values we need AVX512 instructions, which are not available on all CPUs.
AVX512 has an extension, AVX512CD, that has an LZCNT equivalent, _mm512_lzcnt_epi64, which computes the number of leading zeros of each 64-bit word inside the 512-bit register. _mm512_lzcnt_epi64 takes 4 cycles to execute, while _lzcnt_u64 takes 3 cycles. Moreover, to get the total number of leading zeros, we have to do a masked horizontal sum of the 8 64-bit words, which takes even more cycles.
Finally, to put the 256 bits we read into the 512-bit register we can't use a single alignr instruction: we have to permute the values twice, do the alignr, and finally merge them with a xor.
Therefore, while benchmarks would be useful, we can assume that a 64-byte buffer will not be optimal.
256 bits buffer?
256-bit registers have the same problems as 512-bit registers; they are more common and AVX2 covers most operations, but they still need AVX512 for LZCNT.
128 bits buffer?
These buffers are supported basically everywhere, and we have a good shifting instruction, _mm_srli_epi16, that takes 1 cycle to execute, but we don't have an LZCNT (except with AVX512 extensions), so we would have to resort either to broadword programming, or to splitting the 128 bits into two 64-bit registers and using LZCNT on them.
This could be explored, but it doesn't seem promising, as we would do many more LZCNTs than fills from memory.
64 bits buffer?
This is the biggest buffer that fits in a "common" register, so we can use the lzcnt instruction on x86_64 and clz on Aarch64.
In our benchmarks we see that from u8 to u64 the speed increase is linear, but u128 is slower than u64.
Note: the u128 implementation uses Rust primitive types, which do not use SSE instructions, so it's suboptimal.
Therefore, we are going to use u64 buffers.
Buffering details
If we don't allow reads over 32 bits, we can use a 64-bit buffer and know that there will always be space for the next read; operations on 64-bit words are fast on most modern CPUs.
pub struct Reader<'a> {
/// The buffer of bits we will read from
buffer: u64,
/// how many bits are valid in the buffer
valid_bits: usize,
/// The data to read from
data: &'a [u32]
}
Now let's implement a method that reads a given number of bits from the buffer:
impl<'a> Reader<'a> {
    pub fn read_bits(&mut self, bits: usize) -> u32 {
        debug_assert!(bits <= 32);
        // if there aren't enough bits in the buffer
        if self.valid_bits < bits {
            // read a new word and add it to the buffer
            // (here the next bit to read is the lowest bit of the buffer)
            self.buffer |= (self.data[0] as u64) << self.valid_bits;
            self.valid_bits += 32;
            self.data = &self.data[1..];
        }
        // extract the lowest `bits` bits
        let res = (self.buffer & ((1_u64 << bits) - 1)) as u32;
        self.buffer >>= bits;
        self.valid_bits -= bits;
        res
    }
}
And now, to read unary codes we can use Rust's leading_zeros, which uses the lzcnt instruction on x86_64 and clz on Aarch64, if the target feature is enabled. Otherwise, Rust will generate broadword code for it.
impl<'a> Reader<'a> {
    pub fn read_unary(&mut self) -> usize {
        // here the next bit to read is the *highest* bit of the buffer
        let zeros = self.buffer.leading_zeros() as usize;
        // the unary code fits inside the valid bits
        if zeros < self.valid_bits {
            self.buffer <<= zeros + 1;
            self.valid_bits -= zeros + 1;
            return zeros;
        }
        // otherwise, all the valid bits are zeros of the code
        let mut res = self.valid_bits;
        // and we have to read another word
        self.buffer = (self.data[0] as u64) << 32;
        self.valid_bits = 32;
        self.data = &self.data[1..];
        let zeros = self.buffer.leading_zeros() as usize;
        // our assumption is that all unary and bits are under 32 bits
        // this is for simplicity, but it can be generalized
        debug_assert!(zeros < 32);
        res += zeros;
        self.valid_bits -= zeros + 1;
        self.buffer <<= zeros + 1;
        res
    }
}
Should we use tables?
We can use tables to speed up the decoding of codes: we can just peek (read without seeking forward) the first x bits, and then use a table to decode them.
fn read_gamma(&mut self) -> u64 {
    let value = self.peek_u16();
    let len = GAMMA_LEN_TABLE[value as usize];
    // if the length is not u16::MAX, the code was short enough for the table
    if len != u16::MAX {
        self.skip_bits(len as usize);
        return GAMMA_VALUE_TABLE[value as usize] as u64;
    }
    // otherwise, decode normally
    let len = self.read_unary();
    self.read_bits(len as usize) + (1 << len) - 1
}
This might speed up the more complex codes, but it introduces a branch and generally more logic, which might be expensive.
So we need benchmarks to see when it's worth it.
How big should we make the tables?
Since all our codes target skewed distributions, the most common values are encoded with the shortest codes, so the most impactful values to store are those with the shortest codes.
Bigger tables might avoid some computation, but they use more cache, which might hurt the performance of the other codes. Moreover, if the table entry is not in cache, accessing the table might be slower than computing the code.
The only way to know is to benchmark.
Merged or separated tables?
In the table/s we have to store the decoded value and the length of the code. Therefore, we can choose whether to store them in a single table or in two separated tables.
// Merged
const MERGED_TABLE: &[(u16, u8)] = &[...];
// Separated
const LEN_TABLE: &[u8] = &[...];
const VALUE_TABLE: &[u16] = &[...];
Merged tables have better cache locality and need a single memory access, but they are bigger than separated ones due to padding: (u16, u8) is actually stored as 4 bytes, as (u16, u8, u8), where the last byte is unused.
Separated tables need two memory accesses but use 3/4 of the memory of merged tables. If we assume that both tables' entries are in cache, the second access will cost effectively 0 cycles, as both accesses will be executed in parallel.
Thanks to out-of-order execution, since the two loads are independent, both will just wait in the load buffer. On relatively modern CPUs such as the Skylake microarchitecture, we can have up to 72 in-flight loads, and the throughput from the L1 data cache is 32 bytes / cycle, so we can easily fetch the ~3 bytes we need in a single cycle.
Therefore, my hypothesis is that separated tables will perform as well as merged ones, and will work better with bigger tables, as they are smaller.
Benchmarks
The answer to all these questions is to benchmark!
These benchmarks were run on an Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz; the CSVs with the raw data can be found here for reads and here for writes.
To run these benchmarks on your own machine you can use the following command:
# Clone the repo
git clone https://github.com/vigna/dsi-bitstream-rs.git
cd dsi-bitstream-rs
# Run the read benchmarks
python3 ./python/bench_code_tables_read.py > read.csv
# Make the plots
cat read.csv | python3 ./python/plot_code_tables_read.py
# Run the write benchmarks
python3 ./python/bench_code_tables_write.py > write.csv
# Make the plots
cat write.csv | python3 ./python/plot_code_tables_write.py
Read
Fixed Length
TODO!:
Truncated
TODO!:
Unary
Gamma
Delta
Since Delta contains a gamma code, we can choose whether the gamma code is read using a table or not.
Without gamma table: With gamma table (8 bits):
Zeta 3
Write
Fixed Length
TODO!:
Truncated
TODO!:
Unary
Gamma
Delta
Since Delta contains a gamma code, we can choose whether the gamma code is written using a table or not.
Without gamma table: With gamma table (TODO bits):
Zeta 3
Summary
Code | Distribution | Avg. Read (ns) | Avg. Write (ns) |
---|---|---|---|
Fixed Length | \(\frac{1}{|S|}\), uniform, where \(|S|\) is a power of two | TODO | TODO |
Truncated | \(\frac{1}{|S|}\), Uniform | TODO | TODO |
Unary | \((1 / 2)^s\), Geometric with p = \(\frac{1}{2}\) | 1.58 | 1.49 |
Gamma | Universal, \(\approx \frac{1}{2s^2}\), zipfian with \(\alpha = 2\) | 1.90 | 1.59 |
Delta | Universal, \(\approx \frac{1}{2s(\log (s + 1))^2}\), zipfian with \(\alpha \approx 1\) | 3.11 | 2.77 |
Zeta (k = 3) | \(\approx \frac{1}{s^{1 + 1 / k}}\), close to an arbitrary zipfian | 5.20 | 3.76 |
From all the benchmarks we learn that for reads:
- Unary: no table is needed.
- Gamma: with or without a table it's pretty close, so we can skip the table to leave more cache space for the other codes.
- Delta: it's fastest with no delta table but with the gamma table enabled.
- Zeta: It's fastest with a 12-bit table.
Gamma is fast and universal, so it's a good default choice.
Writes are generally faster than reads, and all codes get faster with bigger tables.
Specifically, for writes it is optimal to use:
- Unary: 3 bits table.
- Gamma: 6 bits table.
- Delta: 13 bits table, with gamma tables.
- Zeta3: 15 bits table.
LittleEndian is slightly faster than BigEndian when reading with no tables, but with tables they are close, due to the better cache locality of BigEndian tables. Writes, on the other hand, are faster with BigEndian.
Generally, BigEndian seems the better choice, as LZCNT is supported on more CPUs than TZCNT.
Separated tables are faster than merged tables when the table is bigger than the L1 cache (this CPU has 32KiB \(\approx 2^{15}\) bytes of L1D per core), and basically the same otherwise.
Benchmarks are useful :)