HyperLogLog

HyperLogLog counters are a smart way to estimate the number of uniques elements seen in a stream. They exploit the fact that the number (N) of leading (or trailing) zeros (or ones), on a random number follows a geometric distribution of parameter \(p = \frac{1}{2}\), i.e.:

\[P(N = n) \propto \frac{1}{2^{n}}\]

The jist of the HyperLogLog counters is to use an hash function on the values and keep track of the maximum number of leading zeros. To be more resiliant to outliers, multiple buckets are used. The first few bits are used to pick in which bucket each hashed value belongs to. In order to compute the caradinality estimate, we take the harmonic mean of the value of each bucket:

\[|C| \approx \left( \sum_i \frac{1}{2^{x_i}} \right)^{-1}\]

In this post I'll explore the different implementations I experimented with to optimize this simple but fundamental operation.

Naive implementation

This is the simplest way to compute it.

let mut total: f32 = 0.0;
for value in values {
    total += 1.0 / 2.0.powf(value as f32);
}

The assembly inside the loop looks like this:

movzx  edi, byte ptr [r14 + r15],    ; Load `value` from `values`
vmovss xmm0, dword ptr [rip + .TWO]  ; Load 2.0
call   r12                           ; Call ldexpf -> xmm0 = 2.0^values
vmovss xmm1, dword ptr [rip + .ONE]  ; Load 1.0
vdivss xmm0, xmm1, xmm0              ; 1.0 / 2.0^value
vadds  xmm2, xmm1, xmm0              ; total +=

Avoiding the call to ldexpf

Having a call in the function adds a lot of overhead so we can avoid it this way:

let mut total: f32 = 0.0;
for value in values {
    total += 1.0 / ((1 << value) as f32);
}

The assembly inside the loop looks like this:

movzx r8d, byte ptr [rax] ; Load `value` from `values`
mov   edx, 1              ; 1
shlx  r8d, edx, e8d       ; 1 << value ;  3 cycles
vcvtsi2ss xmm2, xmm3, r8d ; as f32     ;  8 cycles
vdivss xmm2, xmm1, xmm2   ; 1.0 /      ; 18 cycles
vaddss  xmm0, xmm0, xmm2  ; total +=   ;  3 cycles

Avoiding Divisions

One way to avoid division is to pre-compute all the values and look them. Assuming that the table in in the L1 Cache we can avoid the division and the conversion to f32, trading 18 + 8 = 26 cycles for a memory read which is about 10 cycles.

let mut total: f32 = 0.0;
for value in values {
    total += RECIPROCALS_TABLE[value as usize];
}

The assembly inside the loop looks like this:

movzx r8d, byte ptr [rax]  ; Load `value` from `values`
movd xmm2, ptr [r12 + r8d] ; RECIPROCALS_TABLE ; ~10 cycles 
vaddss  xmm0, xmm0, xmm2   ; total +=          ;   3 cycles

IEEE 754 tricks

The IEEE 754 floats, we use and love everyday, represents values as:

where a value is decoded as: \[\text{value} = \text{sign}() \cdot (1 + \text{fraction}) \cdot 2^{\text{exponent} - 127}\]

So with a bit of bit trickery we can directly build the value with just a sum and a shift:

let mut total: f32 = 0.0;
for value in values {
    let hack = (127 + value as u32) << 23;
    total += f32::from_ne_bytes(hack.to_ne_bytes());
}

The assembly inside the loop looks like this:

movzx edx, byte ptr [rax] ; Load `value` from `values`
shl   edx, 23             ; << 23         ; 1 cycle
add ecx, 1065353216       ; + (127 << 23) ; 1 cycle
movd xmm1, ecx            ; as f32        ; 3 cycles
vaddss xmm0, xmm0, xmm1   ; total +=      ; 3 cycles

Fixed point for precision and profit

If we assume that the values are less or equal than 32, we can use an u64 word as a fixed point precision value with 32 decimal bits. This allows us to compute the exact result with no loss of resolution and further speed up the code.

let mut counter: u64 = 0;
for value in values {
    counter += (1_u64 << 32) >> value;
}
let exp = counter.leading_zeros() + 1;
counter <<= exp;
counter >>= 64 - 23;
let res = (((127 + 32 - exp) as u32) << 23) | (counter as u32);
let estimate = f32::from_ne_bytes(res.to_ne_bytes());

The assembly inside the loop looks like this:

movzx esi, byte ptr [rax]   ; Load `value` from `values`
shrx rsi, rcx, rsi          ; <<         ; 3 cycles
add rdx, rsi                ; counter += ; 1 cycle

Benchmarks

Computing the sum of 10'000 elements (between 0 and 32) on my Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz

Impl	ns / element	speedup
naive	9.2354	/
no call	1.4025	6.58
table	1.1763	7.85
IEE 754	1.1586	7.97
Fixed Point	0.5693	16.22

And compiling with RUSTFLAGS="-C target-cpu=native" we get:

Impl	ns / element	speedup
naive	8.3358	/
no call	1.2977	6.42
table	1.1572	7.20
IEE 754	1.1303	7.37
Fixed Point	0.1351	61.7

Table of Contents

HyperLogLog

Naive implementation

Avoiding the call to ldexpf

Avoiding Divisions

IEEE 754 tricks

Fixed point for precision and profit

Benchmarks

Reference