Random Access Compression: RAC

Status: Draft (as of September 2019). There is no compatibility guarantee yet.

Overview

RAC is a compressed file format that allows random access (not just sequential access) to the decompressed contents. In comparison to some other popular compression formats, all four of the Zlib, Brotli, LZ4 and Zstandard specifications explicitly contain the identical phrase: “the data format defined by this specification does not attempt to allow random access to compressed data”.

Compression means that the derived file is typically smaller than the original file. Random access means that, starting from the compressed file, it is possible to reconstruct the half-open byte range [di .. dj) of the decompressed file without always having to first decompress all of [0 .. di).

Conceptually, the decompressed file is partitioned into non-overlapping chunks. Each compressed chunk can be decompressed independently (although possibly sharing additional context, such as a LZ77 prefix dictionary). A RAC file also contains a hierarchical index of those chunks.

RAC is a container format, and while it supports common compression codecs like Zlib, Brotli, LZ4 and Zstandard, it is not tied to any particular compression codec.

Non-goals for version 1 include:

  • Filesystem metadata like file names, file sizes and modification times.
  • Multiple source files.

There is the capability (see attributes and reserved TTags, below) but no promise to address these in a future RAC version. There might not be a need to, as other designs such as SquashFS and EROFS (Extendable Read-Only File System) already exist.

Non-goals in general include:

  • Encryption.
  • Streaming decompression.

A companion document has further discussion of RAC related work.

Glossary

  • CBias is the delta added to a CPointer to produce a COffset.
  • DBias is the delta added to a DPointer to produce a DOffset.
  • CFile is the compressed file.
  • DFile is the decompressed file.
  • CFileSize is the size of the CFile.
  • DFileSize is the size of the DFile.
  • COffset is a byte offset in CSpace.
  • DOffset is a byte offset in DSpace.
  • CPointer is a relative COffset, prior to bias-correction.
  • DPointer is a relative DOffset, prior to bias-correction.
  • CRange is a Range in CSpace.
  • DRange is a Range in DSpace.
  • CSpace means that byte offsets refer to the CFile.
  • DSpace means that byte offsets refer to the DFile.

Range is a pair of byte offsets [i .. j), in either CSpace or DSpace. It is half-open, containing every byte offset x such that (i <= x) and (x < j). It is invalid to have (i > j). The size of a Range equals (j - i).

All bytes are 8 bits and unless explicitly specified otherwise, all fixed-size unsigned integers (e.g. uint32_t, uint64_t) are encoded little-endian. Within those unsigned integers, bit 0 is the least significant bit and e.g. bit 31 is the most significant bit of a uint32_t.

The maximum supported CFileSize and the maximum supported DFileSize are the same number: 0x0000_FFFF_FFFF_FFFF, which is ((1 << 48) - 1).

File Structure

A RAC file (the CFile) must be at least 32 bytes long, and start with the 3 byte Magic (see below), so that no valid RAC file can also be e.g. a valid JPEG file.

The CFile contains a tree of Nodes. Each Node is either a Branch Node (pointing to between 1 and 255 child Nodes) or a Leaf Node. There must be at least one Branch Node, called the Root Node. Parsing a CFile requires knowing the CFileSize in order to identify the Root Node, which is either at the start or the end of the CFile.

Each Node has a DRange, which can be empty. Branch Nodes can also have attributes: elements that aren't child Nodes, which must have an empty DRange. Empty elements contain metadata or other decompression context such as a shared dictionary.

Each Leaf Node also has 3 CRanges (Primary, Secondary and Tertiary), any or all of which may be empty. The contents of the CFile, within those CRanges, are decompressed according to the Codec (see below) to reconstruct that part of the DFile within the Leaf Node's DRange.

Branch Nodes

A Branch Node's encoding in the CFile has a variable byte size, between 32 and 4096 inclusive, depending on its number of children. Specifically, it occupies ((Arity * 16) + 16) bytes, grouped into 8 byte segments (but not necessarily 8 byte aligned), starting at a COffset called its Branch COffset:

+-+-+-+-+-+-+-+-+
|0|1|2|3|4|5|6|7|
+-+-+-+-+-+-+-+-+
|Magic|A|Che|0|T|  Magic, Arity, Checksum,     Reserved (0),  TTag[0]
| DPtr[1]   |0|T|  DPtr[1],                    Reserved (0),  TTag[1]
| DPtr[2]   |0|T|  DPtr[2],                    Reserved (0),  TTag[2]
| ...       |0|.|  ...,                        Reserved (0),  ...
| DPtr[A-2] |0|T|  DPtr[Arity-2],              Reserved (0),  TTag[Arity-2]
| DPtr[A-1] |0|T|  DPtr[Arity-1],              Reserved (0),  TTag[Arity-1]
| DPtr[A]   |0|C|  DPtr[Arity] a.k.a. DPtrMax, Reserved (0),  CodecByte
| CPtr[0]   |L|S|  CPtr[0],                    CLen[0],       STag[0]
| CPtr[1]   |L|S|  CPtr[1],                    CLen[1],       STag[1]
| CPtr[2]   |L|S|  CPtr[2],                    CLen[2],       STag[2]
| ...       |L|.|  ...,                        ...,           ...
| CPtr[A-2] |L|S|  CPtr[Arity-2],              CLen[Arity-2], STag[Arity-2]
| CPtr[A-1] |L|S|  CPtr[Arity-1],              CLen[Arity-1], STag[Arity-1]
| CPtr[A]   |V|A|  CPtr[Arity] a.k.a. CPtrMax, Version,       Arity
+-+-+-+-+-+-+-+-+

See the “Examples” section below for example RAC files and, in particular, example Branch Nodes.

For the (XPtr | Other6 | Other7) 8 byte fields, the XPtr occupies the low 48 bits (as a little-endian uint64_t) and the Other fields occupy the high 16 bits.

The CPtr and DPtr values are what is explicitly written in the CFile's bytes. These are added to a Branch Node's implicit Branch CBias and Branch DBias values to give the implicit COff and DOff values: COff[i] and DOff[i] are defined to be (Branch_CBias + CPtr[i]) and (Branch_DBias + DPtr[i]).

CPtrMax is another name for CPtr[Arity], and COffMax is defined to be (Branch_CBias + CPtrMax). Likewise for DPtrMax and DOffMax.

The DPtr[0] value is implicit, and always equals zero, so that DOff[0] always equals the Branch DBias.

  • For the Root Node, the DPtrMax also sets the DFileSize. The Branch CBias and Branch DBias are both zero. The Branch COffset is determined by the “Root Node” section below.
  • For a child Branch Node, the Branch COffset, Branch CBias and Branch DBias are given by the parent Branch Node. See the “Search Within a Branch Node” section below.

Magic

Magic is the three bytes "\x72\xC3\x63", which is invalid UTF-8 but is "rÃc" in ISO 8859-1. The tilde isn't particularly meaningful, other than "rÃc" being a nonsensical word (with nonsensical capitalization) that is unlikely to appear in other files.

Every Branch Node must start with these Magic bytes, not just a Branch Node positioned at the start of the CFile.

Arity

Arity is the Branch Node's number of elements: the number of child Nodes plus the number of attributes. Having zero children is invalid.

The Arity byte is given twice: the fourth byte and the final byte of the Branch Node. The two values must match.

The repetition lets a RAC reader determine the size of the Branch Node data (as the size depends on the Arity), given either its start or its end offset in CSpace. For almost all Branch Nodes, we will know its start offset (its Branch COffset), but for a Root Node at the end of a CFile, we will only know its end offset.

Checksum

Checksum is a checksum of the Branch Node's bytes. It is not a checksum of the CFile or DFile contents pointed to by a Branch Node. Content checksums are a Codec-specific consideration.

The little-endian uint16_t Checksum value is the low 16 bits XOR'ed with the high 16 bits of the uint32_t CRC-32 IEEE checksum of the ((Arity * 16) + 10) bytes immediately after the Checksum. The 4 bytes immediately before the Checksum are not considered: the Magic bytes have only one valid value and the Arity byte near the start is replicated by the Arity byte at the end.

Reserved (0)

The Reserved (0) bytes must have the value 0x00.

COffs and DOffs, STags and TTags

For every a in the half-open range [0 .. Arity), the a'th element has two tags, STag[a] and TTag[a], and a DRange of [DOff[a] .. DOff[a+1]). The DOff values must be non-decreasing: see the “Branch Node Validation” section below.

A TTag[a] of 0xFE means that that element is a Branch Node child.

A TTag[a] of 0xFD means that there is no child, but is instead a Codec Element attribute, whose DRange must be empty, and the rest of this section does not apply: the STag is ignored.

A TTag[a] in the half-open range [0xC0 .. 0xFD) is reserved. Otherwise, the element is a Leaf Node child.

A child Branch Node's SubBranch COffset is defined to be COff[a]. Its SubBranch DBias and SubBranch DOffMax are defined to be DOff[a] and DOff[a+1].

  • When (STag[a] < Arity), it is a CBiasing Branch Node. The SubBranch CBias is defined to be (Branch_CBias + CPtr[STag[a]]). This expression is equivalent to COff[STag[a]].
  • When (STag[a] >= Arity), it is a CNeutral Branch Node. The SubBranch CBias is defined to be (Branch_CBias).

A child Leaf Node's STag[a] and TTag[a] values are also called its Leaf STag and Leaf TTag. It also has:

  • A Primary CRange, equal to MakeCRange(a).
  • A Secondary CRange, equal to MakeCRange(STag[a]).
  • A Tertiary CRange, equal to MakeCRange(TTag[a]).

The MakeCRange(i) function defines a CRange. If (i >= Arity) then that CRange is the empty range [COffMax .. COffMax). Otherwise, the lower bound is COff[i] and the upper bound is:

  • COffMax when CLen[i] is zero.
  • The minimum of COffMax and (COff[i] + (CLen[i] * 1024)) when CLen[i] is non-zero.

In other words, the COffMax value clamps the CRange upper bound. The CLen value, if non-zero, combines with the COff value to apply another clamp. The CLen is given in units of 1024 bytes, but the (COff[i] + (CLen[i] * 1024)) value is not necessarily quantized to 1024 byte boundaries.

Note that, since Arity is at most 255, an STag[a] of 0xFF always results in a CNeutral Branch Node or an empty Secondary CRange. Likewise, a TTag[a] of 0xFF always results in an empty Tertiary CRange.

COffMax

COffMax is an inclusive upper bound on every COff in a Branch Node and in its descendent Branch Nodes. A child Branch Node must not have a larger COffMax than the parent Branch Node's COffMax, and the Root Node's COffMax must equal the CFileSize. See the “Branch Node Validation” section below.

A RAC file can therefore be incrementally modified, if the RAC writer only appends new CFile bytes and does not re-write existing CFile bytes, so that the CFileSize increases. Even if the old (smaller) RAC file's Root Node was at the CFile start, the new (larger) CFileSize means that those starting bytes are an obsolete Root Node (but still a valid Branch Node). The new Root Node is therefore located at the end of the new RAC file.

Concatenating RAC files (concatenating in DSpace) involves concatenating the RAC files in CSpace and then appending a new Root Node with CBiasing Branch Nodes pointing to each source RAC file's Root Node.

Version

Version must have the value 0x01, indicating version 1 of the RAC format. The 0x00 value is reserved, although future editions may use other positive values.

Codec

Codecs define specializations of RAC, such as “RAC + Zlib” or “RAC + Brotli”. It is valid for a “RAC + Zstandard” only decoder to reject a “RAC + Brotli” file, even if it is a valid RAC file. Recall that RAC is just a container, and not tied to any particular compression codec.

There are two categories of Codecs: the high 0x80 bit of the Codec Byte being 0 or 1 denotes a Short or Long codec respectively. Short Codecs are represented by a single byte (the Codec Byte). Long Codecs use that Codec Byte to locate 7 additional bytes: a Codec Element.

For both Short Codecs and Long Codecs, the second-highest 0x40 bit of the Codec Byte is the Mix Bit. A Mix Bit of 0 means that all of the Node's descendents have exactly the same Codec (not just the same Codec Byte; Short Codecs and Long Codecs are not considered exactly the same even if they represent the same compression algorithm). A Mix Bit of 1 means that descendents may have a different Codec.

For Short Codecs, the remaining low 6 bits correspond to a specific compression algorithm:

  • 0x00 means “RAC + Zeroes”.
  • 0x01 means “RAC + Zlib”.
  • 0x02 means “RAC + LZ4”.
  • 0x03 means “RAC + Zstandard”.
  • All other values are reserved.

For Long Codecs, the remaining low 6 bits of the Codec Byte define a number c64. The lowest index i out of (c64 + (64 * 0)), (c64 + (64 * 1)), (c64 + (64 * 2)) and (c64 + (64 * 3)) such that TTag[i] is 0xFD locates the 7 bytes that identifies the Codec. The location is what would otherwise occupy the ith element's CPtr | CLen space. It is invalid for no such i to exist.

Long Codec values are not reserved by this specification, other than 7 NUL bytes also means “RAC + Zeroes”. Users may define their own values for arbitrary compression algorithms. Maintaining a centralized registry mapping values to algorithms is out of scope of this specification, although we suggest a convention that the 7 bytes form a human-readable (ASCII) string, padded with trailing NUL bytes. For example, a hypothetical “Middle Out version 2” compression algorithm that typically used the “.mdo2” file extension might be represented by the 7 bytes "mdo2\x00\x00\x00".

Branch Node Validation

The first time that a RAC reader visits any particular Branch Node, it must check that the Magic matches, the two Arity values match and are non-zero, there is at least one child Node (not just non-Node attributes), the computed checksum matches the listed Checksum and that the RAC reader accepts the Version. For Long Codecs, there must exist an 0xFD TTag as per the previous section.

It must also check that all of its DOff values are sorted: (DOff[a] <= DOff[a+1]) for every a in the half-open range [0 .. Arity). By induction, this means that all of its DOff values do not exceed DOffMax, and again by induction, therefore do not exceed DFileSize.

It must also check that, other than Codec Element attributes, all of its COff values do not exceed COffMax (and again by induction, therefore do not exceed CFileSize). Other than that, COff values do not have to be sorted: successive Nodes (in DSpace) can be out of order (in CSpace), allowing for incrementally modified RAC files.

For the Root Node, its COffMax must equal the CFileSize. Recall that parsing a CFile requires knowing the CFileSize, and also that a Root Node's Branch CBias is zero, so its COffMax equals its CPtrMax.

For a child Branch Node, if the parent‘s Codec does not have the Mix Bit set then the child’s Codec must equal its parent‘s. Furthermore, its Version must be less than or equal to its parent’s Version, its COffMax must be less than or equal to its parent‘s COffMax, and its DOffMax must equal its parent’s SubBranch DOffMax. The DOffMax condition is equivalent to checking that the parent and child agree on the child's size in DSpace. The parent states that it is its (DPtr[a+1] - DPtr[a]) and the child states that it is its DPtrMax.

One conservative way to check Branch Nodes' validity on first visit is to check them on every visit, as validating any particular Branch Node is idempotent, but other ways are acceptable.

Root Node

The Root Node might be at the start of the CFile, as this might optimize alignment of Branch Nodes and of CRanges. All Branch Nodes' sizes are multiples of 16 bytes, and a maximal Branch Node is exactly 4096 bytes.

The Root Node might be at the end of the CFile, as this allows one-pass (streaming) encoding of a RAC file. It also allows appending to, concatenating or incrementally modifying existing RAC files relatively cheaply.

To find the Root Node, first look for a valid Root Node at the CFile start. If and only if that fails, look at the CFile end. If that also fails, it is not a valid RAC file.

Root Node at the CFile Start

The fourth byte of the CFile gives the Arity, assuming the Root Node is at the CFile start. If it is zero, fail over to the CFile end. A RAC writer that does not want to place the Root Node at the CFile start should intentionally write a zero to the fourth byte.

Otherwise, that Arity defines the size in bytes of any starting Root Node: ((Arity * 16) + 16). If the CFileSize is less than this size, fail over to the CFile end.

If those starting bytes form a valid Root Node (as per the “Branch Node Validation” section), including having a CPtrMax equal to the CFileSize, then we have indeed found our Root Node: its Branch COffset is zero. Otherwise, fail over to the CFile end.

Root Node at the CFile End

If there is no valid Root Node at the CFile start then the last byte of the CFile gives the Root Node's Arity. This does not necessarily need to match the fourth byte of the CFile.

As before, that Arity defines the size in bytes of any ending Root Node: ((Arity * 16) + 16). If the CFileSize is less than this size, it is not a valid RAC file.

If those ending bytes form a valid Root Node (as per the “Branch Node Validation” section), including having a CPtrMax equal to the CFileSize, then we have indeed found our Root Node: its Branch COffset is the CFileSize minus the size implied by the Arity. Otherwise, it is not a valid RAC file.

DRange Reconstruction

To reconstruct the DRange [di .. dj), if that DRange is empty then the request is trivially satisfied.

Otherwise, if (dj > DFileSize) then reject the request.

Otherwise, start at the Root Node and continue to the next section to find the Leaf Node containing the DOffset di.

Search Within a Branch Node

Load (and validate) the Branch Node given its Branch COffset, Branch CBias and Branch DBias.

Find the largest child index a such that (a < Arity) and (DOff[a] <= di) and (DOff[a+1] > di), then examine TTag[a] to see if the child is a Leaf Node. If so, skip to the next section.

For a Branch Node child, let CRemaining equal this Branch Node‘s (the parents’) COffMax minus the SubBranch COffset. It invalid for CRemaining to be less than 4, or to be less than the child‘s size implied by the child’s Arity byte at a COffset equal to (SubBranch_COffset + 3).

The SubBranch COffset, SubBranch CBias and SubBranch DBias from the parent Branch Node become the Branch COffset, Branch CBias and Branch DBias of the child Branch Node. In order to rule out infinite loops, at least one of these two conditions must hold:

  • The child‘s Branch COffset is less than the parent’s Branch COffset.
  • The child‘s DPtrMax is less than the parent’s DPtrMax.

It is valid for one of those conditions to hold between a parent-child pair and the other condition to hold between the next parent-child pair.

Repeat this “Search Within a Branch Node” section with the child Branch Node.

Decompressing a Leaf Node

If a Leaf Node‘s DRange is empty, decompression is a no-op and skip the rest of this section. Specifically, a low-level library that iterates over a RAC file’s chunks, without actually performing decompression, should skip over empty chunks instead of yielding them to its caller.

Otherwise, decompression combines the Primary CRange, Secondary CRange and Tertiary CRange slices of the CFile, and the Leaf STag and Leaf TTag values, in a Codec-specific way to produce decompressed data.

There are two general principles, although specific Codecs can overrule:

  • The Codec may produce fewer bytes than the DRange size. In that case, the remaining bytes (in DSpace) are set to NUL (memset to zero).
  • The Codec may consume fewer bytes than each CRange size, and the compressed data typically contains an explicit “end of data” marker. In that case, the remaining bytes (in CSpace) are ignored. Padding allows COff values to optionally be aligned.

It is invalid to produce more bytes than the DRange size or to consume more bytes than each CRange size.

Continuing the Reconstruction

If decompressing that Leaf Node did not fully reconstruct [di .. dj), advance through the Node tree in depth first search order, decompressing every Leaf Node along the way, until we have gone up to or beyond dj.

During that traversal, Nodes with an empty DRange should be skipped, even if they are Branch Nodes.

Specific Codecs

Common Dictionary Format

Unless otherwise noted, codecs use this common dictionary format.

If a Leaf Node's Secondary CRange is empty then there is no dictionary. Otherwise, the Secondary CRange must be at least 8 bytes long:

  • 4 byte little-endian uint32_t Dictionary Length. The high 2 bits are reserved and must be zero. The maximum (inclusive) Dictionary Length is therefore ((1 << 30) - 1), or 1_073_741_823 bytes.
  • Dictionary Length bytes Dictionary.
  • 4 byte little-endian uint32_t Dictionary Checksum, the Checksumis CRC-32 IEEE checksum over theDictionary`'s bytes (excluding both the 4 byte prefix and the 4 byte suffix).
  • Padding (ignored).

The Leaf TTag must be 0xFF. All other Leaf TTag values (below 0xC0) are reserved. The empty Tertiary CRange is ignored. The Leaf STag value is also ignored, other than deriving the Secondary CRange.

RAC + Zeroes

The CRanges are ignored. The DRange is filled with NUL bytes (memset to zero).

RAC + Zlib

The CFile data in the Leaf Node‘s Primary CRange is decompressed as Zlib (RFC 1950), possibly referencing a LZ77 prefix dictionary (what the RFC calls a “preset dictionary”) wrapped in RAC’s common dictionary format, described above.

RAC + LZ4

TODO.

RAC + Zstandard

The CFile data in the Leaf Node‘s Primary CRange is decompressed as Zstandard (RFC 8478), possibly referencing a dictionary wrapped in RAC’s common dictionary format, described above. After unwrapping, the dictionary's bytes can be either a “raw” or “trained” dictionary, as per RFC 8478 section 5.

Examples

These examples display RAC files in the format of the hexdump -C command line tool. They are deliberately very short, for ease of understanding. Realistic RAC files, with larger chunk sizes, would typically exhibit much better compression ratios.

Zlib Example (Root Node at End)

The first example is relatively simple. The root node (located at the CFile end) only has one child: a leaf node whose compressed contents starts at position 0x04. Decompressing that chunk produces the 6 bytes “More!\n”.

00000000  72 c3 63 00 78 9c 01 06  00 f9 ff 4d 6f 72 65 21  |r.c.x......More!|
00000010  0a 07 42 01 bf 72 c3 63  01 65 a9 00 ff 06 00 00  |..B..r.c.e......|
00000020  00 00 00 00 01 04 00 00  00 00 00 01 ff 35 00 00  |.............5..|
00000030  00 00 00 01 01                                    |.....|

Zlib Example (Root Node at Start)

The second example consists of a root node with four children: one metadata node (a shared dictionary) and three data nodes. The shared dictionary, “\x20sheep.\n” is 0x00000008 bytes long and its CRC-32 IEEE checksum is 0x477A8DD0. The third child's (the second data node)'s compressed contents starts at position 0x75. Decompressing that chunk, together with that shared dictionary, produces the 11 bytes “Two sheep.\n”. The complete decoding of all three data chunks is “One sheep.\nTwo sheep.\nThree sheep.\n”.

00000000  72 c3 63 04 37 39 00 ff  00 00 00 00 00 00 00 ff  |r.c.79..........|
00000010  0b 00 00 00 00 00 00 ff  16 00 00 00 00 00 00 ff  |................|
00000020  23 00 00 00 00 00 00 01  50 00 00 00 00 00 01 ff  |#.......P.......|
00000030  60 00 00 00 00 00 01 00  75 00 00 00 00 00 01 00  |`.......u.......|
00000040  8a 00 00 00 00 00 01 00  a1 00 00 00 00 00 01 04  |................|
00000050  08 00 00 00 20 73 68 65  65 70 2e 0a d0 8d 7a 47  |.... sheep....zG|
00000060  78 f9 0b e0 02 6e f2 cf  4b 85 31 01 01 00 00 ff  |x....n..K.1.....|
00000070  ff 17 21 03 90 78 f9 0b  e0 02 6e 0a 29 cf 87 31  |..!..x....n.)..1|
00000080  01 01 00 00 ff ff 18 0c  03 a8 78 f9 0b e0 02 6e  |..........x....n|
00000090  0a c9 28 4a 4d 85 71 00  01 00 00 ff ff 21 6e 04  |..(JM.q......!n.|
000000a0  66                                                |f|

Zlib Example (Concatenation)

The third example demonstrates concatenating two RAC files: the two examples above. The decompressed form of the resultant RAC file is the concatenation of the two decompressed forms: “One sheep.\nTwo sheep.\nThree sheep.\nMore!\n”. Its 41 decompressed bytes consists of the “sheep” RAC file‘s 35 bytes and then the “more” RAC file’s 6 bytes. In hexadecimal, 0x29 = 0x23 + 0x06, and the 0x29 and 0x23 numbers can be seen in the compressed form's bytes.

The compressed form (what's shown in hexdump -C format below) is the concatenation of the two compressed forms, plus a new root node (64 bytes starting at position 0xD6). Even though the overall RAC file starts with the “sheep” RAC file, whose root node was at its start, those opening bytes are no longer a valid root node for the larger file.

That new root node has 3 children: 1 metadata node and 2 branch nodes. The metadata node (one whose DRange is empty) is required because one of the original RAC files' root node is not located at its start. Walking to that child branch node needs two COffset values: one for the embedded RAC file‘s start and one for the embedded RAC file’s root node.

00000000  72 c3 63 04 37 39 00 ff  00 00 00 00 00 00 00 ff  |r.c.79..........|
00000010  0b 00 00 00 00 00 00 ff  16 00 00 00 00 00 00 ff  |................|
00000020  23 00 00 00 00 00 00 01  50 00 00 00 00 00 01 ff  |#.......P.......|
00000030  60 00 00 00 00 00 01 00  75 00 00 00 00 00 01 00  |`.......u.......|
00000040  8a 00 00 00 00 00 01 00  a1 00 00 00 00 00 01 04  |................|
00000050  08 00 00 00 20 73 68 65  65 70 2e 0a d0 8d 7a 47  |.... sheep....zG|
00000060  78 f9 0b e0 02 6e f2 cf  4b 85 31 01 01 00 00 ff  |x....n..K.1.....|
00000070  ff 17 21 03 90 78 f9 0b  e0 02 6e 0a 29 cf 87 31  |..!..x....n.)..1|
00000080  01 01 00 00 ff ff 18 0c  03 a8 78 f9 0b e0 02 6e  |..........x....n|
00000090  0a c9 28 4a 4d 85 71 00  01 00 00 ff ff 21 6e 04  |..(JM.q......!n.|
000000a0  66 72 c3 63 00 78 9c 01  06 00 f9 ff 4d 6f 72 65  |fr.c.x......More|
000000b0  21 0a 07 42 01 bf 72 c3  63 01 65 a9 00 ff 06 00  |!..B..r.c.e.....|
000000c0  00 00 00 00 00 01 04 00  00 00 00 00 01 ff 35 00  |..............5.|
000000d0  00 00 00 00 01 01 72 c3  63 03 83 16 00 ff 00 00  |......r.c.......|
000000e0  00 00 00 00 00 fe 23 00  00 00 00 00 00 fe 29 00  |......#.......).|
000000f0  00 00 00 00 00 01 a1 00  00 00 00 00 00 ff 00 00  |................|
00000100  00 00 00 00 04 01 b6 00  00 00 00 00 04 00 16 01  |................|
00000110  00 00 00 00 01 03                                 |......|

Focusing on that new root node:

000000d0                    72 c3  63 03 83 16 00 ff 00 00
000000e0  00 00 00 00 00 fe 23 00  00 00 00 00 00 fe 29 00
000000f0  00 00 00 00 00 01 a1 00  00 00 00 00 00 ff 00 00
00000100  00 00 00 00 04 01 b6 00  00 00 00 00 04 00 16 01
00000110  00 00 00 00 01 03

Re-formatting to highlight the groups-of-8-bytes structure and reprise the “Branch Nodes” section's diagram:

000000d6    72 c3 63 03 83 16 00 ff    |Magic|A|Che|0|T|
000000de    00 00 00 00 00 00 00 fe    | DPtr[1]   |0|T|
000000e6    23 00 00 00 00 00 00 fe    | DPtr[2]   |0|T|
000000ee    29 00 00 00 00 00 00 01    | DPtr[A]   |0|C|
000000f6    a1 00 00 00 00 00 00 ff    | CPtr[0]   |L|S|
000000fe    00 00 00 00 00 00 04 01    | CPtr[1]   |L|S|
00000106    b6 00 00 00 00 00 04 00    | CPtr[2]   |L|S|
0000010e    16 01 00 00 00 00 01 03    | CPtr[A]   |V|A|

Its CodecByte (and therefore its Short Codec) is 0x01, “RAC + Zlib”, its Version is 0x01 and its Arity is 0x03. The DPtr values are 0x0000 (implicit), 0x0000, 0x0023 and 0x0029. The CPtr values are 0x00A1, 0x0000, 0x00B6 and 0x0116 (the size of the whole RAC file). Note that the CPtr values are not sorted. The last two children's TTags are 0xFE and 0xFE and their STags are 0x01 and 0x00, which means that they are both CBiasing Branch Nodes.

Reference Implementation

C programming language libraries:

Go programming language libraries:

Command line tool, installable via go install github.com/google/wuffs/cmd/ractool:

Acknowledgements

I (Nigel Tao) thank Robert Obryk, Sanjay Ghemawat and Sean Klein for their review.


Updated on September 2019.