The algorithm takes a stream of data and chunks it with the following constraints:
The algorithm produces a set of chunks. Each of the chunks can either be:
The chunks are produced using the rolling hash algorithm H(), defined at //peridot/third_party/bup/bupsplit.h. Moreover, the result of this hashing function is passed through a user defined permutation so that the result of this split algorithm is not a signature of the initial file.
Given a sequence of bytes [b_0, b_1, b_2, ..., b_N], the algorithm computes the smallest index i_0, such that: i >= 4k and H([b_0, ..., b_{i_0}]) 13 least significant bits are 0.
If no such i_0 exist, and N is lesser than 2^16, the algorithm just returns the original data as the unique chunk, otherwise, i_0 is choosen to be equals to 2^16-1.
The first chunk is set to be [b_0, b_1, ..., b_{i_0}].
The algorithm continues to find the other indices with the following properties: i_k is the smallest index such that:
If so such i_k exists, and N-i_{k-1} is lesser than 2^16, i_k is choosen to be equals to N, and the index selection stops, otherwise, i_k is choosen to be i_{k-1} + 2^16 -1
Once all indices are chosen, this indicates where the initial stream must be chunked. Each chunk of data is then given an identifier, and the algorithm produces chunks to encode the sequence of indices.
Given a sequence of indices [i_0, i_1, ..., i_k], the algorithm needs to decide where a sub-sequence of index must be turned into a chunk, indexed and referenced by a higher level index. To decide this, the algorithm reuses the value of the Hash that determined where to cut the chunk. The hash ends with its 13 least significant bits as 0, the algorithm examines the remaininig bits. For each 4 least significant bits equals to 0, it increase the level of the index by 1. For example:
For this algorithm to be useful and efficient, the hash is chosen such that:
These properties ensure that: