Posted on

Table of Contents

This post assumes familiarity with the instantaneous codes we discussed in the previous post

BVGraph

BVGraph is a graph compression data-structure which exploits some common properties of real-world graphs. The basic idea is to use instantaneous codes to represent an adjacency list (a.k.a. CSR).

The BVGraph format has three core compression methods:

  • A node can copy successors from a previous node
  • Intervals of nodes with consecutive ids are written as the tuple (start, len)
  • The remaning nodes are sorted and encoded as a sequence of gaps (differences between successive nodes).

We can already notice that this compression depends from the node ids. Finding the optimal order is most likely NP Complete, so later in this post we will discuss some heuristic methods to find good orders.

Example

Let's start with a simple graph:

I've compressed it with the default codes, which are:

ValueCode
OutdegreeGamma
ReferenceOffsetUnary
BlockCountGamma
BlocksGamma
IntervalCountGamma
IntervalStartGamma
IntervalLenGamma
FirstResidualZeta(3)
ResidualGapZeta(3)

This graph has 9 nodes and 12 arcs, once compressed with BVGraph we obtain two files, the test.properties:

#BVGraph properties
version=0
graphclass=it.unimi.dsi.webgraph.BVGraph
nodes=9
arcs=12
minintervallength=3
maxrefcount=3
windowsize=7
zetak=3
compressionflags=

and the test.graph file which contains the compressed graph as the following bytes:

7d c5 ea 64 a7 27 72 97 a9 e0

which is equivalent to the bits:

01111101 11000101 11101010 01100100 10100111 00100111 01110010 10010111 10101001 11100000

and is equivalent to the bitstream:

011111011100010111101010011001001010011100100111011100101001011110101001111

With syntax highlighting:

011111011100|010111101|010011|0010010100111|001001110111001010|010111101|010011|1|1
Reading the codes we obtain:
20020|1004|110|30120|300201|1004|110|0|0

Finally, we can start iterating on the graph:

Node 0

Bits: 011111011100
Outdegree: 2
Reference Offset: 0
Number of Intervals: 0
First Residual: 2
Residual Gap: 0

This node has outdegree 2, doesn't copy any node and has no intervals, so it has only residuals.

The first residual is a signed offset from the current node, so the first successor is \(0 + \phi^{-1}(2) = 0 + 1 = 1\).

The second residual is the gap from the previous residual, minus one, so: \(1 + (0 + 1) = 2\).

So the successors of node 0 are \(\{1, 2\}\).

Node 1

Bits: 010111101
Outdegree: 1
Reference Offset: 0
Number of Intervals: 0
First Residual: 4

This node has outdegree 1, doesn't copy any node and has no intervals, so it has only residuals.

The first residual is a signed offset from the current node, so the first successor is \(0 + \phi^{-1}(4) = 1 + 2 = 3\).

So the successors of node 1 are \(\{3\}\).

Compression tradeoffs

Compression windows, Reference chain length

Offsets list

To have have random-access on a BVGraph bitstream we need to store the bitoffsets at which each node codes start, analogously to the offsets list of a CSR matrix.

While this could just be a vector of integers, we can notice that the offsets are a monotone non decreasing sequence of positive integers, thus they can be represented using Elias-Fano!

For a BVGraph bitstream of length (l) of a graph with (|V|) nodes, Elias-Fano stores the offsets using at most \(|V| (2 + \lfloor \log2 \frac{l}{|V|} \rfloor)\) bits while having \(O(1)\) access (select) if paired with an appropriate index.

See my other post for more info on Elias-Fano

Benchmarks

How to choose the best codes?

Use universal codes, gamma is really fast so it's a good default choice. Otherwise we can "simulate" writing every piece with all the codes, this requires just a complete iteration over the nodes so it's feasible also on large graphs.

Webgraph-rs has an optimized binary utility for this.

In my practical application we don't save much memory, about 3 GB over a 126GB graph, but we found out that zeta3 and delta have similar compressions, so we use delta which is 4 times faster than zeta3.

Computing an order

BFS

Layered Labels Propagation

Benchmarks

References