Deflate is a general purpose compression format, using Huffman codes and Lempel-Ziv style length-distance back-references. It is specified in RFC 1951.
Gzip (RFC 1952) and zlib (RFC 1950) are two thin wrappers (with similar purpose, but different wire formats) around the raw deflate encoding. Gzip (the file format) is used by the gzip
and tar
command line tools and by the HTTP network protocol. Zlib is used by the ELF executable and PNG image file formats.
Zip (also known as PKZIP) is another wrapper, similar to combining gzip with tar, that can compress multiple files into a single archive. Zip is widely used by the ECMA Office Open XML format, the OASIS Open Document Format for Office Applications and the Java JAR format.
Wrangling those formats that build on deflate (gzip, zip and zlib) is not provided by this package. For zlib, look at the std/zlib
package instead. The other formats are TODO.
For example, look at test/data/romeo.txt*
. First, the uncompressed text:
$ xxd test/data/romeo.txt 00000000: 526f 6d65 6f20 616e 6420 4a75 6c69 6574 Romeo and Juliet 00000010: 0a45 7863 6572 7074 2066 726f 6d20 4163 .Excerpt from Ac etc 00000390: 740a 536f 2073 7475 6d62 6c65 7374 206f t.So stumblest o 000003a0: 6e20 6d79 2063 6f75 6e73 656c 3f0a n my counsel?.
The raw deflate encoding:
$ xxd test/data/romeo.txt.deflate 00000000: 4d53 c16e db30 0cbd f32b d853 2e46 0e3d MS.n.0...+.S.F.= 00000010: 2e87 20ed 0234 c5ba 0049 861e 861d 149b .. ..4...I...... etc 00000200: 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 610d }.....$.h..0zYa. 00000210: 7f01 ..
The gzip format wraps a variable length header (in this case, 20 bytes) and 8 byte footer around the raw deflate data. The header contains the NUL-terminated C string name of the original, uncompressed file, “romeo.txt”, amongst other data. The footer contains a 4 byte CRC32 checksum and a 4 byte length of the uncompressed file (0x3ae = 942 bytes).
$ xxd test/data/romeo.txt.gz 00000000: 1f8b 0808 26d8 5d59 0003 726f 6d65 6f2e ....&.]Y..romeo. 00000010: 7478 7400 4d53 c16e db30 0cbd f32b d853 txt.MS.n.0...+.S etc 00000210: de5d 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 .],g}.....$.h..0 00000220: 7a59 610d 7f01 ef07 e5ab ae03 0000 zYa...........
The zlib format wraps a 2 byte header and 4 byte footer around the raw deflate data. The footer contains a 4 byte Adler32 checksum. TODO: move this to std/zlib/README.md.
$ xxd test/data/romeo.txt.zlib 00000000: 789c 4d53 c16e db30 0cbd f32b d853 2e46 x.MS.n.0...+.S.F 00000010: 0e3d 2e87 20ed 0234 c5ba 0049 861e 861d .=.. ..4...I.... etc 00000200: 2c67 7d13 8ffd b9a3 24bb 68f4 eb30 7a59 ,g}.....$.h..0zY 00000210: 610d 7f01 57bb 3ede a...W.>.
Consider test/data/romeo.txt.deflate
. The relevant spec is RFC 1951.
offset xoffset ASCII hex binary 000000 0x0000 M 0x4D 0b_0100_1101 000001 0x0001 S 0x53 0b_0101_0011 000002 0x0002 . 0xC1 0b_1100_0001 000003 0x0003 n 0x6E 0b_0110_1110 000004 0x0004 . 0xDB 0b_1101_1011 000005 0x0005 0 0x30 0b_0011_0000 000006 0x0006 . 0x0C 0b_0000_1100 000007 0x0007 . 0xBD 0b_1011_1101 000008 0x0008 . 0xF3 0b_1111_0011 000009 0x0009 + 0x2B 0b_0010_1011 000010 0x000A . 0xD8 0b_1101_1000 000011 0x000B S 0x53 0b_0101_0011 000012 0x000C . 0x2E 0b_0010_1110 000013 0x000D F 0x46 0b_0100_0110 000014 0x000E . 0x0E 0b_0000_1110 000015 0x000F = 0x3D 0b_0011_1101 000016 0x0010 . 0x2E 0b_0010_1110 000017 0x0011 . 0x87 0b_1000_0111 000018 0x0012 0x20 0b_0010_0000 000019 0x0013 . 0xED 0b_1110_1101 etc 000522 0x020A . 0xEB 0b_1110_1011 000523 0x020B 0 0x30 0b_0011_0000 000524 0x020C z 0x7A 0b_0111_1010 000525 0x020D Y 0x59 0b_0101_1001 000526 0x020E a 0x61 0b_0110_0001 000527 0x020F . 0x0D 0b_0000_1101 000528 0x0210 . 0x7F 0b_0111_1111 000529 0x0211 . 0x01 0b_0000_0001
Starting at the top:
offset xoffset ASCII hex binary 000000 0x0000 M 0x4D 0b_0100_1101
As per the RFC section 3.2.3. Details of block format:
There are 3 block types: uncompressed, compressed with fixed Huffman codes and compressed with dynamic Huffman codes. The first type is straightforward: containing a uint16 length N
(and its complement, for error detection) and then N
literal bytes. The second type is the same as the third type but with built-in lcode
and dcode
tables (see below). The third type is the interesting part of the deflate format, and its deconstruction continues below.
The bit stream is now at:
offset xoffset ASCII hex binary 000000 0x0000 M 0x4D 0b_0100_1... 000001 0x0001 S 0x53 0b_0101_0011 000002 0x0002 . 0xC1 0b_1100_0001
As per the RFC section 3.2.7. Compression with dynamic Huffman codes, three variables (5, 5 and 4 bits) are read in the same natural (MSB to LSB) order:
Adding the implicit biases give:
nlit = 9 + 257 = 266 ndist = 19 + 1 = 20 nclen = 10 + 4 = 14
The bit stream is now at:
offset xoffset ASCII hex binary 000002 0x0002 . 0xC1 0b_1100_000. 000003 0x0003 n 0x6E 0b_0110_1110 000004 0x0004 . 0xDB 0b_1101_1011 000005 0x0005 0 0x30 0b_0011_0000 000006 0x0006 . 0x0C 0b_0000_1100 000007 0x0007 . 0xBD 0b_1011_1101
There are 19 possible code length code lengths (this is not a typo), but the wire format order is shuffled (in the idiosyncratic order: 16, 17, 18, 0, 8, ..., 15) so that later elements are more likely to be zero (and hence compressible). There are nclen
(in this case, 14) explicit code length code lengths, 3 bits each.
clcode_lengths[16] = 0b000 = 0 clcode_lengths[17] = 0b100 = 4 clcode_lengths[18] = 0b101 = 5 clcode_lengths[ 0] = 0b011 = 3 clcode_lengths[ 8] = 0b011 = 3 clcode_lengths[ 7] = 0b011 = 3 clcode_lengths[ 9] = 0b011 = 3 clcode_lengths[ 6] = 0b011 = 3 clcode_lengths[10] = 0b000 = 0 clcode_lengths[ 5] = 0b011 = 3 clcode_lengths[11] = 0b000 = 0 clcode_lengths[ 4] = 0b011 = 3 clcode_lengths[12] = 0b000 = 0 clcode_lengths[ 3] = 0b101 = 5
The remaining (19 - nclen
) = 5 entries are implicitly zero:
clcode_lengths[13] = 0 clcode_lengths[ 2] = 0 clcode_lengths[14] = 0 clcode_lengths[ 1] = 0 clcode_lengths[15] = 0
Undoing the shuffle gives:
clcode_lengths[ 0] = 3 clcode_lengths[ 1] = 0 clcode_lengths[ 2] = 0 clcode_lengths[ 3] = 5 clcode_lengths[ 4] = 3 clcode_lengths[ 5] = 3 clcode_lengths[ 6] = 3 clcode_lengths[ 7] = 3 clcode_lengths[ 8] = 3 clcode_lengths[ 9] = 3 clcode_lengths[10] = 0 clcode_lengths[11] = 0 clcode_lengths[12] = 0 clcode_lengths[13] = 0 clcode_lengths[14] = 0 clcode_lengths[15] = 0 clcode_lengths[16] = 0 clcode_lengths[17] = 4 clcode_lengths[18] = 5
For clcode
s, there are 7 + 1 + 2 = 10 non-zero entries: 7 3-bit codes, 1 4-bit code and 2 5-bit codes. The canonical Huffman encoding's map from bits to clcode
s is:
bits clcode 0b000 0 0b001 4 0b010 5 0b011 6 0b100 7 0b101 8 0b110 9 0b1110 17 0b11110 3 0b11111 18
Call that Huffman table H-CL
.
The bit stream is now at:
offset xoffset ASCII hex binary 000007 0x0007 . 0xBD 0b_1011_1... 000008 0x0008 . 0xF3 0b_1111_0011 000009 0x0009 + 0x2B 0b_0010_1011 000010 0x000A . 0xD8 0b_1101_1000 000011 0x000B S 0x53 0b_0101_0011 000012 0x000C . 0x2E 0b_0010_1110 000013 0x000D F 0x46 0b_0100_0110
When decoding via a Huffman table, bits are read in LSB to MSB order, right-to-left in this debugging output. Extra bits, if any, are then read in the other, natural order (MSB to LSB). Decoding via H-CL
gives:
bits clcode 0b1110 17, 3 extra bits = 0b111 = 7 0b001 4 0b11111 18, 7 extra bits = 0b0001010 = 10 0b001 4 0b101 8 0b1110 17, 3 extra bits = 0b010 = 2 0b100 7 0b1110 17, 3 extra bits = 0b001 = 1 0b011 6 0b000 0
Still in the RFC section 3.2.7., this means:
lcode_lengths has 3 + 7 = 10 consecutive zeroes lcode_lengths[ 10] = 4 lcode_lengths has 11 + 10 = 21 consecutive zeroes lcode_lengths[ 32] = 4 lcode_lengths[ 33] = 8 lcode_lengths has 3 + 2 = 5 consecutive zeroes lcode_lengths[ 39] = 7 lcode_lengths has 3 + 1 = 4 consecutive zeroes lcode_lengths[ 44] = 6 lcode_lengths[ 45] = 0
After decoding the first 10 code lengths, the bit stream is now at:
offset xoffset ASCII hex binary 000013 0x000D F 0x46 0b_01.._....
Continuing to read all nlit
(266) lcode
lengths and ndist
(20) dcode
lengths (zeroes are not shown here):
lcode_lengths[ 10] = 4 lcode_lengths[ 32] = 4 lcode_lengths[ 33] = 8 lcode_lengths[ 39] = 7 lcode_lengths[ 44] = 6 lcode_lengths[ 46] = 7 lcode_lengths[ 50] = 8 lcode_lengths[ 58] = 9 lcode_lengths[ 59] = 7 lcode_lengths[ 63] = 7 etc lcode_lengths[264] = 9 lcode_lengths[265] = 9
and
dcode_lengths[ 5] = 6 dcode_lengths[ 6] = 5 dcode_lengths[ 8] = 4 dcode_lengths[ 9] = 5 dcode_lengths[10] = 3 etc dcode_lengths[18] = 3 dcode_lengths[19] = 6
The H-CL
table is no longer used from here onwards.
For lcode
s, there are 6 + 9 + 10 + 16 + 10 + 12 = 63 non-zero entries: 6 4-bit codes, 9 5-bit codes, 10 6-bit codes, 16 7-bit codes, 10 8-bit codes and 12 9-bit codes. The canonical Huffman encoding's map from bits to lcode
values is:
bits lcode 0b0000 10 0b0001 32 0b0010 101 0b0011 116 0b0100 257 0b0101 259 0b01100 97 0b01101 104 0b01110 105 0b01111 110 0b10000 111 0b10001 114 0b10010 115 0b10011 258 0b10100 260 0b101010 44 0b101011 99 0b101100 100 0b101101 102 0b101110 108 0b101111 109 0b110000 112 0b110001 117 0b110010 119 0b110011 121 0b1101000 39 0b1101001 46 0b1101010 59 0b1101011 63 0b1101100 65 0b1101101 69 0b1101110 73 0b1101111 79 0b1110000 82 0b1110001 83 0b1110010 84 0b1110011 98 0b1110100 103 0b1110101 261 0b1110110 262 0b1110111 263 0b11110000 33 0b11110001 50 0b11110010 66 0b11110011 67 0b11110100 72 0b11110101 74 0b11110110 77 0b11110111 87 0b11111000 107 0b11111001 118 0b111110100 58 0b111110101 68 0b111110110 76 0b111110111 78 0b111111000 85 0b111111001 91 0b111111010 93 0b111111011 120 0b111111100 122 0b111111101 256 0b111111110 264 0b111111111 265
Call that Huffman table H-L
.
For dcode
s, there are 5 + 4 + 3 + 2 = 14 non-zero entries: 5 3-bit codes, 4 4-bit codes, 3 5-bit codes and 2 6-bit codes. The canonical Huffman encoding's map from bits to dcode
values is:
bits dcode 0b000 10 0b001 14 0b010 16 0b011 17 0b100 18 0b1010 8 0b1011 12 0b1100 13 0b1101 15 0b11100 6 0b11101 9 0b11110 11 0b111110 5 0b111111 19
Call that Huffman table H-D
.
The bit stream is now at:
offset xoffset ASCII hex binary 000052 0x0034 . 0xE7 0b_1..._.... 000053 0x0035 C 0x43 0b_0100_0011 000054 0x0036 . 0xE8 0b_1110_1000 000055 0x0037 ) 0x29 0b_0010_1001 000056 0x0038 . 0xA0 0b_1010_0000 000057 0x0039 . 0xF1 0b_1111_0001
Decoding from that bit stream via H-L
gives some literal bytes (where the lcode
value is < 256):
bits lcode 0b1110000 82 (ASCII 'R') 0b10000 111 (ASCII 'o') 0b101111 109 (ASCII 'm') 0b0010 101 (ASCII 'e') 0b10000 111 (ASCII 'o') 0b0001 32 (ASCII ' ') 0b01100 97 (ASCII 'a') 0b01111 110 (ASCII 'n')
This continues, up until we get to the “O Romeo, Romeo! wherefore art thou Romeo?” line.
The bit stream is now at:
offset xoffset ASCII hex binary 000089 0x0059 . 0xC1 0b_11.._.... 000090 0x005A . 0x1E 0b_0001_1110 000091 0x005B . 0x0F 0b_0000_1111 000092 0x005C . 0xF9 0b_1111_1001
Decoding from that bit stream via H-L
gives:
bits lcode 0b1101111 79 (ASCII 'O') 0b0001 32 (ASCII ' ') 0b1110000 82 (ASCII 'R') 0b10011 258
This 258 is our first non-literal lcode
, the start of a length-distance back-reference. We first need to decode the length. The RFC section 3.2.5. Compressed blocks (length and distance codes) says that an lcode
of 258 means that length=4 with no extra bits.
The bit stream is now at:
offset xoffset ASCII hex binary 000092 0x005C . 0xF9 0b_111._.... 000093 0x005D Y 0x59 0b_0101_1001
We next decode via H-D
instead of H-L
.
bits dcode 0b11110 11
Again, from section 3.2.5., a dcode
of 11 means a distance in the range [49, 64], 49 plus 4 extra bits. The next 4 bits are 0b0110=6, so the distance is 55. We therefore copy length=4 bytes from distance=55 bytes ago: “omeo”.
The bit stream is now at:
offset xoffset ASCII hex binary 000093 0x005D Y 0x59 0b_01.._.... 000094 0x005E U 0x55 0b_0101_0101 000095 0x005F > 0x3E 0b_0011_1110
We go back to decoding via H-L
, which gives
bits lcode 0b101010 44 (ASCII ',') 0b10100 260
This is another non-literal. Section 3.2.5. says that an lcode
of 260 means length=6 with no extra bits.
The bit stream is now at:
offset xoffset ASCII hex binary 000095 0x005F > 0x3E 0b_0011_111.
Decode with H-D
.
bits dcode 0b111110 5
Again, from section 3.2.5., a dcode
of 5 means a distance in the range [7, 8], 7 plus 1 extra bit. The next 1 bits are 0b0=0, so the distance is 7. We therefore copy length=6 bytes from distance=7 bytes ago: " Romeo".
The bit stream is now at:
offset xoffset ASCII hex binary 000096 0x0060 . 0x0F 0b_0000_1111
We go back to decoding via H-L
, which gives
bits lcode 0b11110000 33 (ASCII '!')
And on we go.
The block finishes with the bit stream at:
offset xoffset ASCII hex binary 000522 0x020A . 0xEB 0b_1110_101. 000523 0x020B 0 0x30 0b_0011_0000 000524 0x020C z 0x7A 0b_0111_1010 000525 0x020D Y 0x59 0b_0101_1001 000526 0x020E a 0x61 0b_0110_0001 000527 0x020F . 0x0D 0b_0000_1101 000528 0x0210 . 0x7F 0b_0111_1111 000529 0x0211 . 0x01 0b_0000_0001
The decoding is:
bits lcode 0b101011 99 (ASCII 'c') 0b10000 111 (ASCII 'o') 0b110001 117 (ASCII 'u') 0b01111 110 (ASCII 'n') 0b0100 257 (length=3 + 0_extra_bits...) bits dcode 0b1101 15 (distance=193 + 6_extra_bits, 0b000010 in MSB to LSB order; length=3, distance=195: copy "sel") bits lcode 0b1101011 63 (ASCII '?') 0b0000 10 (ASCII '\n') 0b111111101 256 (end of block)
In other words, the block ends with “counsel?\n”.
See test/data/artificial/deflate-*.commentary.txt