LZW is a general purpose compression algorithm, not specific to GIF, PDF, TIFF or even to graphics. In practice, there are two incompatible implementations, LSB (Least Significant Bits) and MSB (Most Significant Bits) first.
The GIF format uses LSB first. The literal width can vary between 2 and 8 bits, inclusive. There is no EarlyChange option.
The PDF and TIFF formats use MSB first. The literal width is fixed at 8 bits. There is an EarlyChange option that PDF sometimes uses (see Table 3.7 in the PDF 1.4 spec) and TIFF always uses.
An LZW encoding is a stream of codes, each code emitting zero or more bytes. There are four types of codes: literals, copies, clear and end.
Literal codes emit exactly one byte. For a given literal width L, there are (1 << L) literal codes. For example, if L is at least 7, then the 0x41 code emits the ASCII ‘A’ byte. If L is exactly 7, the highest literal code is 0x7F.
There is exactly one clear code and one end code, whose encoded values are right after the literal codes: CLEAR = ((1 << L) + 0) and END = ((1 << L) + 1). For 8 bit wide literals, CLEAR is 0x100 and END is 0x101. The clear code is discussed below. The end code means no more data. Both emit zero bytes.
Copy codes (with encoded values ranging from (END + 1) to 0xFFF, or equivalently, up to 4095 or up to 12 bits) emit two or more bytes, bytes that were previously emitted in the decoded stream. This is essentially a lookup in a key-value table. For example, the table could state that the 0x200 code emits “foo” and the 0x300 code emits “bar”.
Valid copy codes are in the range [END + 1, N], where N starts at END (i.e. the table's key range starts empty) and grows over time, up to a maximum value of 0xFFF. A clear code resets N to its starting value, END. In effect, a clear code clears the key-value table.
There is no explicit way to “increase N” or “add this key-value to the table”. Instead, as long as the table isn't full (i.e. that N < 0xFFF), every literal or copy code, other than the first one in the stream or the first one after a clear code, implicitly adds a key-value pair (with a key of N) and then increments N by 1.
The first literal or copy code in the stream (or after a clear code) merely increments N to (END + 1) without adding a key-value pair. In practice, it is equivalent to have that first code set a fake key-value pair for N == END, provided that table lookup algorithm ensures that “END means an end code” takes precendence.
The value in each key-value pair consists of a prefix (being 1 or more bytes) and a suffix (exactly 1 byte). The prefix equals the output emitted by the previous code: the one that was processed immediately before the current code. The suffix is the first byte of the output emitted by the current code.
For example, if a literal T code (which emits “T”) is followed by a literal O (which emits “O”), then the value in the implicit key-value pair is the entire “T” concatenated with the first byte of “O”, so the value is “TO”.
For example, if a copy 0x104 code (which emits “BE”) is followed by a copy 0x106 code (which emits “OR”), then the value is “BEO”.
For example, if a copy 0x10E code (which emits “TOBE”) is followed by a literal Y (which emits “Y”), then the value is “TOBEY”.
It is possible for a copy code to equal N, the very key to be implicitly added to the table. The prefix is, once again, the output emitted by the previous code. The suffix byte is, once again, the first byte of the output emitted by the current code, which in this case is the first byte of the prefix.
For example, if the previous code emitted “OTX”, then a copy code of N would both emit “OTXO” and add a (key=N, value=“OXTO”) pair to the table.
One possible encoding of “TOBEORNOTTOBEORTOBEORNOTXOTXOTXOOTXOOOTXOOOTOBEY” is in the table below. With dots to punctuate each code's emitted output, that is “T.O.B.E.O.R.N.O.T.TO.BE.OR.TOB.EO.RN.OT.X.OTX.OTXO.OTXOO.OTXOOO.TOBE.Y”.
Code Emits Key Value Pre1+Suf1 LM1 /Q %Q PreQ+SufQ T T O O 0x102 TO T O 1 0 1 - TO. B B 0x103 OB O B 1 0 1 - OB. E E 0x104 BE B E 1 0 1 - BE. O O 0x105 EO E O 1 0 1 - EO. R R 0x106 OR O R 1 0 1 - OR. N N 0x107 RN R N 1 0 1 - RN. O O 0x108 NO N O 1 0 1 - NO. T T 0x109 OT O T 1 0 1 - OT. 0x102 TO 0x10A TT T T 1 0 1 - TT. 0x104 BE 0x10B TOB 0x102 B 2 0 2 - TOB 0x106 OR 0x10C BEO 0x104 O 2 0 2 - BEO 0x10B TOB 0x10D ORT 0x106 T 2 0 2 - ORT 0x105 EO 0x10E TOBE 0x10B E 3 1 0 0x10B E.. 0x107 RN 0x10F EOR 0x105 R 2 0 2 - EOR 0x109 OT 0x110 RNO 0x107 O 2 0 2 - RNO X X 0x111 OTX 0x109 X 2 0 2 - OTX 0x111 OTX 0x112 XO X O 1 0 1 - XO. 0x113 OTXO 0x113 OTXO 0x111 O 3 1 0 0x111 O.. 0x114 OTXOO 0x114 OTXOO 0x113 O 4 1 1 0x111 OO. 0x115 OTXOOO 0x115 OTXOOO 0x114 O 5 1 2 0x111 OOO 0x10E TOBE 0x116 OTXOOOT 0x115 T 6 2 0 0x115 T.. Y Y 0x117 TOBEY 0x10E Y 4 1 1 0x10B EY. 0x101 -end-
See script/print-lzw-example.go
for the code that generated that table, including three implementations of a simplified core of the LZW algorithm.
The first four columns (Code, Emits, Key, Value) are discussed above. The remaining columns are discussed below.
A naive implementation of the key-value table, with the complete value stored contiguously for each key, would have quadratic worst case memory requirements, as each successive value can be up to 1 byte longer than the previous longest.
In contrast, the Pre1+Suf1 columns are a compact (3 bytes per key-value pair for a uint16 prefix and uint8 suffix) representation of the values. Each value is the concatenation of a variable length prefix (stored as a table key) and a 1-byte suffix. For example, the value for key 0x10C is “BE” + “O”, and “BE” is the value for the key 0x104.
Note that the Pre1 column is exactly the same as the Code column, shifted vertically by one row. Note also that the Suf1 column is exactly the same as the final byte of the Value column.
To reconstruct a code‘s value, start with the suffix and work backwards, walking the prefix chain producing one byte at a time. Either reconstruct the value in an intermediate buffer and then once fully reconstructed (and its length is known) copy it to the output buffer, or walk the chain twice (the first pass calculates the value length, the second pass reconstructs the value directly in the output buffer, skipping a copy) or store each value’s length along with its prefix and suffix (an additional uint16 stored per key-value pair) and reconstruct the value directly in the output buffer in one pass.
Some of the computations are slightly easier, avoiding multiple “+1”s and “-1”s in the program, if we store the “length minus 1” of each value, not the length. This is the LM1 column. Note that the LM1 of all literal codes are 0.
It can be faster (albeit with higher memory requirements) to store up-to-Q-byte suffixes instead of 1-byte suffixes, for some positive integer Q: the quantum of bytes per read or write. In effect, changing Q (usually at compile time, not at run time) is a classic performance versus memory trade-off.
On modern CPUs that can do unaligned 32 or 64 bit loads and stores, a Q of 4 or 8 can perform very well. To keep this worked example short and simple, we use a Q of 3, meaning that we store 1, 2 or 3 byte suffixes per key, and the prefix length is a multiple of 3. For example:
The number of steps (or equivalently, Q-byte chunks) in the prefix chain is (LM1 / Q), which is the “/Q” column in the example above.
The suffix length is ((LM1 % Q) + 1). The first part, (LM1 % Q), is the “%Q” column in the example above. When Q is 8, the “/Q” and “%Q” operations are simple bitwise operations, such as “>> 3” or “& 7”. When Q is 1 (a degenerate case), the “%Q” column is all zeroes and all suffixes are 1 byte long, which is unsurprisingly equivalent to the 1-byte suffix representation described above.
When copying the suffix to a buffer (either an intermediate buffer or a final buffer), it is often unnecessary to copy exactly suffix length bytes, provided that there is enough destination buffer space for all Q bytes. Even if we write excess bytes, decoding subsequent codes will overwrite the excess, or else we'll get an end code and the excess will be ignored. Again, on modern CPUs, it can be faster to write at least 7 bytes (i.e. exactly 8) than to write exactly 7.
Even so, the (LM1 % Q) value is still necessary to calculate where to start writing those Q-byte chunk. After that, reconstructing the value proceeds in the same way as in the 1-byte suffix algorithm. It just involves larger chunks. For example, the value for the key 0x116 is the suffix “T” preceded by the value for 0x115, which is the suffix “OOO” preceded by the value for 0x111, which is the suffix “OTX” with no prefix: the value is “OTX”+“OOO”+“T”.
The PreQ column is more complicated than the Pre1 column. It is no longer just a vertically shifted Code column, and can now contain “no prefix” values (or equivalently, the “/Q” column is 0). When inserting a new key-value pair, we only set the N‘th prefix equal to the previously seen code (the Code column shifted by one row) if the suffix for that previous code was a full Q bytes (or equivalently, if the current row’s “%Q” value is 0). Otherwise, we copy the previous code's prefix and suffix, and then extend the suffix with one more byte. For example, the PreQ+SufQ value for the 0x117 key is based on the PreQ+SufQ value (the same PreQ, an extended SufQ) of the previously seen code, namely 0x10E.
The description above has discussed an LZW stream as a stream of codes. In the actual wire format, those codes are packed into a stream of bytes, and a code can straddle byte boundaries.
Specifically, the next code in the stream takes B bits, where B is the smallest number of that bits that can distinguish all valid codes: those codes in the range [0, N]. If N is 0x101, B is 9. If N is 0x3FF, B is 10 bits. If N is 0x400, B is 11. Remember that each successive code increments N, up to a maximum N of 0xFFF, or 12 bits.
The literal width can also vary in the range [2..8] for LZW as used by GIF. In that case, B starts off as 1 more than the literal width. For example, with a literal width of 2, the four literal codes are 0x0, 0x1, 0x2 and 0x3, which are followed by a clear code of 0x4 and an end code of 0x5, so N is 0x5 and B is 3.
Unfortunately, some implementations misinterpreted the algorithm, and incremented B before instead of after emitting a bit-packed code for N == ((1 << B) - 1). Accordingly, LZW as used in TIFF and sometimes in PDF (depending on the EarlyChange bit) increment B one iteration earlier. The maximum value of B is still 12.
This misinterpretation is unfixable due to backwards compatibility,
The bits are then packed into bytes in a predetermined ordering. For GIF, the bits are packed Least Significant Bits first. For PDF and TIFF, Most Significant Bits first. With a 8 bit literal width (and therefore a 9 bit initial code width), the two literal codes T (0x54) and O (0x4F) followed by an end code 0x101 would be encoded as 27 bits, or 4 bytes (with 5 padding bits, which are ignored but are conventionally set to zero).
In LSB first, it would be four bytes, 0x54 0x9E 0x04 0x04:
offset xoffset ASCII hex binary 000000 0x0000 T 0x54 0b_0101_0100 000001 0x0001 . 0x9E 0b_1001_1110 000002 0x0002 . 0x04 0b_0000_0100 000003 0x0003 . 0x04 0b_0000_0100
These four bytes contain three 9-bit codes:
binary width code 0b_...._...._...._...0_0101_0100 9 0x0054 0b_...._...._...._..00_1001_111. 9 0x004F 0b_...._...._...._.100_0000_01.. 9 0x0101
In MSB first, it would again be four bytes, 0x2A 0x13 0xE0 0x20:
offset xoffset ASCII hex binary 000000 0x0000 * 0x2A 0b_0010_1010 000001 0x0001 . 0x13 0b_0001_0011 000002 0x0002 . 0xE0 0b_1110_0000 000003 0x0003 0x20 0b_0010_0000
Again, these four bytes contain three 9-bit codes:
binary width code 0b_0010_1010_0..._...._...._.... 9 0x0054 0b_.001_0011_11.._...._...._.... 9 0x004F 0b_..10_0000_001._...._...._.... 9 0x0101
See std/gif/README.md
for a complete worked example of LSB-first encoded LZW data, including a varying B width.
See test/data/artificial/gif-*.commentary.txt