tree: 9677c71f96bfe488800db26584cac38b17ca4733 [path history] [tgz]

std/adler32/README.md

Adler-32 is a checksum algorithm that hashes byte sequences to 32 bit values. It is named after its inventor, Mark Adler, who also co-invented the Gzip and Zlib compressed file formats. Amongst other differences, Gzip uses CRC-32 as its checksum and Zlib uses Adler-32.

The algorithm, described in RFC 1950, is simple. Conceptually, there are two unsigned integers `s1`

and `s2`

of infinite precision, initialized to `0`

and `1`

. These two accumulators are updated for every input byte `src[i]`

. At the end of the loop, `s1`

is `1`

plus the sum of all source bytes and `s2`

is the sum of all (intermediate and final) `s1`

values:

var s1 = 1; var s2 = 0; for_each i in_the_range_of src { s1 = s1 + src[i]; s2 = s2 + s1; } return ((s2 % 65521) << 16) | (s1 % 65521);

The final `uint32_t`

hash value is composed of two 16-bit values: `(s1 % 65521)`

in the low 16 bits and `(s2 % 65521)`

in the high 16 bits. `65521`

is the largest prime number less than `(1 << 16)`

.

Infinite precision arithmetic requires arbitrarily large amounts of memory. In practice, computing the Adler-32 hash instead uses a `uint32_t`

typed `s1`

and `s2`

, modifying the algorithm to be concious of overflow inside the loop:

uint32_t s1 = 1; uint32_t s2 = 0; for_each i in_the_range_of src { s1 = (s1 + src[i]) % 65521; s2 = (s2 + s1) % 65521; } return (s2 << 16) | s1;

The loop can be split into two levels, so that the relatively expensive modulo operation can be hoisted out of the inner loop:

uint32_t s1 = 1; uint32_t s2 = 0; for_each_sub_slice s of_length_up_to M partitioning src { for_each i in_the_range_of s { s1 = s1 + s[i]; s2 = s2 + s1; } s1 = s1 % 65521; s2 = s2 % 65521; } return (s2 << 16) | s1;

We just need to find the largest `M`

such that the inner loop cannot overflow. The worst case scenario is that `s1`

and `s2`

both start the inner loop at `65520`

and every subsequent `src[i]`

byte equals `0xFF`

. A simple computation finds that the largest non-overflowing `M`

is 5552.

In a happy coincidence, 5552 is an exact multiple of 16, which often works well with loop unrolling and with SIMD alignment.

Adler-32 is a very simple hashing algorithm. While its output is nominally a `uint32_t`

value, it isn't uniformly distributed across the entire `uint32_t`

range. The `[65521, 65535]`

range of each 16-bit half of an Adler-32 checksum is never touched.

While neither Adler-32 or CRC-32 are cryptographic hash functions, there is still a stark difference in the patterns (or lack of) in their hash values of the `N`

-byte string consisting entirely of zeroes, as this Go program shows:

N Adler-32 CRC-32 Input 0 0x00000001 0x00000000 "" 1 0x00010001 0xD202EF8D "\x00" 2 0x00020001 0x41D912FF "\x00\x00" 3 0x00030001 0xFF41D912 "\x00\x00\x00" 4 0x00040001 0x2144DF1C "\x00\x00\x00\x00" 5 0x00050001 0xC622F71D "\x00\x00\x00\x00\x00" 6 0x00060001 0xB1C2A1A3 "\x00\x00\x00\x00\x00\x00" 7 0x00070001 0x9D6CDF7E "\x00\x00\x00\x00\x00\x00\x00"

Adler-32 is a simpler algorithm than CRC-32. At the time Adler-32 was invented, it had noticably higher throughput. With modern SIMD implementations, that performance difference has largely disappeared.

A worked example for calculating the Adler-32 hash of the three byte input “Hi\n”, starting from the initial state `(s1 = 1)`

and `(s2 = 0)`

:

src[i] ((s2 << 16) | s1) ---- 0x00000001 0x48 0x00490049 0x69 0x00FB00B2 0x0A 0x01B700BC