Table of Contents
HyperLogLog
HyperLogLog counters are a smart way to estimate the number of uniques elements seen in a stream. They exploit the fact that the number (N) of leading (or trailing) zeros (or ones), on a random number follows a geometric distribution of parameter \(p = \frac{1}{2}\), i.e.:
\[P(N = n) \propto \frac{1}{2^{n}}\]
The jist of the HyperLogLog counters is to use an hash function on the values and keep track of the maximum number of leading zeros. To be more resiliant to outliers, multiple buckets are used. The first few bits are used to pick in which bucket each hashed value belongs to. In order to compute the caradinality estimate, we take the harmonic mean of the value of each bucket:
\[|C| \approx \left( \sum_i \frac{1}{2^{x_i}} \right)^{-1}\]
In this post I'll explore the different implementations I experimented with to optimize this simple but fundamental operation.
Naive implementation
This is the simplest way to compute it.
let mut total: f32 = 0.0;
for value in values {
total += 1.0 / 2.0.powf(value as f32);
}
The assembly inside the loop looks like this:
movzx edi, byte ptr [r14 + r15], ; Load `value` from `values`
vmovss xmm0, dword ptr [rip + .TWO] ; Load 2.0
call r12 ; Call ldexpf -> xmm0 = 2.0^values
vmovss xmm1, dword ptr [rip + .ONE] ; Load 1.0
vdivss xmm0, xmm1, xmm0 ; 1.0 / 2.0^value
vadds xmm2, xmm1, xmm0 ; total +=
Avoiding the call to ldexpf
Having a call in the function adds a lot of overhead so we can avoid it this way:
let mut total: f32 = 0.0;
for value in values {
total += 1.0 / ((1 << value) as f32);
}
The assembly inside the loop looks like this:
movzx r8d, byte ptr [rax] ; Load `value` from `values`
mov edx, 1 ; 1
shlx r8d, edx, e8d ; 1 << value ; 3 cycles
vcvtsi2ss xmm2, xmm3, r8d ; as f32 ; 8 cycles
vdivss xmm2, xmm1, xmm2 ; 1.0 / ; 18 cycles
vaddss xmm0, xmm0, xmm2 ; total += ; 3 cycles
Avoiding Divisions
One way to avoid division is to pre-compute all the values and look them. Assuming that the table in in the L1 Cache we can avoid the division and the conversion to f32, trading 18 + 8 = 26 cycles for a memory read which is about 10 cycles.
let mut total: f32 = 0.0;
for value in values {
total += RECIPROCALS_TABLE[value as usize];
}
The assembly inside the loop looks like this:
movzx r8d, byte ptr [rax] ; Load `value` from `values`
movd xmm2, ptr [r12 + r8d] ; RECIPROCALS_TABLE ; ~10 cycles
vaddss xmm0, xmm0, xmm2 ; total += ; 3 cycles
IEEE 754 tricks
The IEEE 754 floats, we use and love everyday, represents values as:
where a value is decoded as: \[\text{value} = \text{sign}() \cdot (1 + \text{fraction}) \cdot 2^{\text{exponent} - 127}\]
So with a bit of bit trickery we can directly build the value with just a sum and a shift:
let mut total: f32 = 0.0;
for value in values {
let hack = (127 + value as u32) << 23;
total += f32::from_ne_bytes(hack.to_ne_bytes());
}
The assembly inside the loop looks like this:
movzx edx, byte ptr [rax] ; Load `value` from `values`
shl edx, 23 ; << 23 ; 1 cycle
add ecx, 1065353216 ; + (127 << 23) ; 1 cycle
movd xmm1, ecx ; as f32 ; 3 cycles
vaddss xmm0, xmm0, xmm1 ; total += ; 3 cycles
Fixed point for precision and profit
If we assume that the values are less or equal than 32, we can use an u64 word as a fixed point precision value with 32 decimal bits. This allows us to compute the exact result with no loss of resolution and further speed up the code.
let mut counter: u64 = 0;
for value in values {
counter += (1_u64 << 32) >> value;
}
let exp = counter.leading_zeros() + 1;
counter <<= exp;
counter >>= 64 - 23;
let res = (((127 + 32 - exp) as u32) << 23) | (counter as u32);
let estimate = f32::from_ne_bytes(res.to_ne_bytes());
The assembly inside the loop looks like this:
movzx esi, byte ptr [rax] ; Load `value` from `values`
shrx rsi, rcx, rsi ; << ; 3 cycles
add rdx, rsi ; counter += ; 1 cycle
Benchmarks
Computing the sum of 10'000 elements (between 0 and 32) on my Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz
Impl | ns / element | speedup |
---|---|---|
naive | 9.2354 | / |
no call | 1.4025 | 6.58 |
table | 1.1763 | 7.85 |
IEE 754 | 1.1586 | 7.97 |
Fixed Point | 0.5693 | 16.22 |
And compiling with RUSTFLAGS="-C target-cpu=native"
we get:
Impl | ns / element | speedup |
---|---|---|
naive | 8.3358 | / |
no call | 1.2977 | 6.42 |
table | 1.1572 | 7.20 |
IEE 754 | 1.1303 | 7.37 |
Fixed Point | 0.1351 | 61.7 |