Design of the Emboss Compiler (embossc)

This document describes the internals of the Emboss compiler. End users do not need to read this document.

Overall Design

The Emboss compiler follows a reasonably standard compiler design, where the input source text is first converted to an intermediate representation (IR), then various operations are performed on the IR, and finally the IR is used to construct the final output — at the time of writing, C++ source code:

%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
    diskstart@{ shape: doc, label: "example.emb" }
    parser["Parser"]
    processing@{ shape: procs, label: "IR processing" }
    backend["C++ Code Generator"]
    diskend@{ shape: doc, label: "example.emb.h" }
    diskstart-- ".emb" -->parser
    parser-- IR -->processing
    processing-- IR -->backend
    backend-- "C++" -->diskend

Currently, the Emboss compiler is split into three programs: the front end, which parses the input and does almost all of the IR processing; the C++ back end, which does a minimal amount of C++-specific IR processing and generates the final C++ code; and the embossc driver, which combines the front end and back end. This split makes it straightforward to add new back ends later:

%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
    diskstart@{ shape: doc, label: "example.emb" }
    parser["Parser"]
    processing@{ shape: procs, label: "IR processing" }
    cppbackend["C++ Code Generator"]
    cppdiskend@{ shape: doc, label: "example.emb.h" }
    rustbackend["Rust Code Generator"]
    rustdiskend@{ shape: doc, label: "example.emb.rs" }
    mdbackend["Documentation Generator"]
    mddiskend@{ shape: doc, label: "example.emb.md" }
    diskstart-- ".emb" -->parser
    parser-- IR -->processing
    processing-- IR -->cppbackend
    cppbackend-- "C++" -->cppdiskend
    processing-- IR -->rustbackend
    rustbackend-- Rust -->rustdiskend
    processing-- IR -->mdbackend
    mdbackend-- Markdown -->mddiskend

IR (Intermediate Representation)

“IR” is a general term: each compiler has its own IR. The Emboss IR is a tree, with node types defined in compiler/util/ir_data.py.
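
The node types are defined as plain data objects. A minimal, hypothetical sketch of their flavor (the real ir_data.py definitions carry many more fields and their own supporting machinery):

import dataclasses
from typing import List, Optional

# Hypothetical, simplified node types, for illustration only.
@dataclasses.dataclass
class Word:
    text: Optional[str] = None
    source_location: Optional[str] = None

@dataclasses.dataclass
class Reference:
    source_name: List[Word] = dataclasses.field(default_factory=list)
    source_location: Optional[str] = None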

The first stage of the compiler — the parser — generates an “initial” IR, which contains only information that is directly available in the source text. Even without any added information, the initial IR can be quite large, even for a very short .emb file. For example, this line:

  0  [+4]   UInt        file_state

turns into this IR (in JSON serialization) immediately after parsing:

{
  "location": {
    "start": {
      "constant": {
        "value": "0",
        "source_location": "22:3-22:4"
      },
      "source_location": "22:3-22:4"
    },
    "size": {
      "constant": {
        "value": "4",
        "source_location": "22:8-22:9"
      },
      "source_location": "22:8-22:9"
    },
    "source_location": "22:3-22:10"
  },
  "type": {
    "atomic_type": {
      "reference": {
        "source_name": [
          {
            "text": "UInt",
            "source_location": "22:13-22:17"
          }
        ],
        "source_location": "22:13-22:17"
      },
      "source_location": "22:13-22:17"
    },
    "source_location": "22:13-22:17"
  },
  "name": {
    "name": {
      "text": "file_state",
      "source_location": "22:25-22:35"
    },
    "source_location": "22:25-22:35"
  },
  "existence_condition": {
    "boolean_constant": {
      "value": true,
      "source_location": "22:3-22:35"
    },
    "source_location": "22:3-22:35"
  },
  "source_location": "22:3-22:35"
}

In graphical form (with source_location nodes omitted for clarity):

graph TD
    n0@{ shape: diamond, label: "field" }
    n0 --> n1
    n0 --> n4
    n0 --> n9
    n0 --> n10
    n1@{ shape: diamond, label: "location" }
    n1 --> n2
    n1 --> n3
    n2@{ shape: diamond, label: "start" }
    n2 --> l0
    n3@{ shape: diamond, label: "size" }
    n3 --> l1
    n4@{ shape: diamond, label: "type" }
    n4 --> n5
    n5@{ shape: diamond, label: "atomic_type" }
    n5 --> n6
    n6@{ shape: diamond, label: "reference" }
    n6 --> n7
    n7@{ shape: diamond, label: "source_names" }
    n7 --> n8
    n8@{ shape: diamond, label: "source_name" }
    n8 --> l2
    n9@{ shape: diamond, label: "name" }
    n9 --> l3
    n10@{ shape: diamond, label: "existence_condition" }
    n10 --> l4
    l0@{ shape: rect, label: "constant: 0" }
    l1@{ shape: rect, label: "constant: 4" }
    l2@{ shape: rect, label: "text: UInt" }
    l3@{ shape: rect, label: "name: file_state" }
    l4@{ shape: rect, label: "boolean_constant: True" }

This initial IR then goes through a series of elaborations, which annotate the IR, and validations, which check various properties of the IR. These are implemented as stages in the compiler:

%%{init: {"flowchart": {"htmlLabels": false}} }%%
graph LR
    diskstart@{ shape: doc, label: "example.emb" }
    parser["Parser"]
    stage1@{ shape: proc, label: "Stage 1" }
    stage2@{ shape: proc, label: "Stage 2" }
    stages@{ shape: text, label: "..." }
    stagenm1@{ shape: proc, label: "Stage N-1" }
    stagen@{ shape: proc, label: "Stage N" }
    backend["C++ Code Generator"]
    diskend@{ shape: doc, label: "example.emb.h" }
    diskstart-->parser
    parser-->stage1
    stage1-->stage2
    stage2-->stages
    stages-->stagenm1
    stagenm1-->stagen
    stagen-->backend
    backend-->diskend

In many cases, elaborations and validations are mixed together — for example, in the symbol resolution stage, names in the IR (field) are elaborated with the absolute symbol to which they resolve (module.Type.field), and, at the same time, the symbol resolver validates that every name resolves to exactly one absolute symbol.

At the end of this process, the IR is much larger:

graph TD
    n0@{ shape: diamond, label: "field" }
    n0 --> n1
    n0 --> n8
    n0 --> n15
    n0 --> n16
    n0 --> n19
    n0 --> n22
    n1@{ shape: diamond, label: "location" }
    n1 --> n2
    n1 --> n5
    n2@{ shape: diamond, label: "start" }
    n2 --> l0
    n2 --> n3
    n3@{ shape: diamond, label: "type" }
    n3 --> n4
    n4@{ shape: diamond, label: "integer" }
    n4 --> l1
    n4 --> l2
    n4 --> l3
    n4 --> l4
    n5@{ shape: diamond, label: "size" }
    n5 --> l5
    n5 --> n6
    n6@{ shape: diamond, label: "type" }
    n6 --> n7
    n7@{ shape: diamond, label: "integer" }
    n7 --> l6
    n7 --> l7
    n7 --> l8
    n7 --> l9
    n8@{ shape: diamond, label: "type" }
    n8 --> n9
    n9@{ shape: diamond, label: "atomic_type" }
    n9 --> n10
    n10@{ shape: diamond, label: "reference" }
    n10 --> n11
    n10 --> n13
    n11@{ shape: diamond, label: "canonical_name" }
    n11 --> l10
    n11 --> n12
    n12@{ shape: diamond, label: "object_paths" }
    n12 --> l11
    n13@{ shape: diamond, label: "source_names" }
    n13 --> n14
    n14@{ shape: diamond, label: "source_name" }
    n14 --> l12
    n15@{ shape: diamond, label: "write_method" }
    n15 --> l13
    n16@{ shape: diamond, label: "name" }
    n16 --> l14
    n16 --> n17
    n17@{ shape: diamond, label: "canonical_name" }
    n17 --> l15
    n17 --> n18
    n18@{ shape: diamond, label: "object_paths" }
    n18 --> l16
    n18 --> l17
    n19@{ shape: diamond, label: "attributes" }
    n19 --> n20
    n20@{ shape: diamond, label: "attribute" }
    n20 --> l18
    n20 --> n21
    n20 --> l20
    n21@{ shape: diamond, label: "value" }
    n21 --> l19
    n22@{ shape: diamond, label: "existence_condition" }
    n22 --> l21
    n22 --> n23
    n23@{ shape: diamond, label: "type" }
    n23 --> l22
    l0@{ shape: rect, label: "constant: 0" }
    l1@{ shape: rect, label: "modulus: infinity" }
    l2@{ shape: rect, label: "modular_value: 0" }
    l3@{ shape: rect, label: "minimum_value: 0" }
    l4@{ shape: rect, label: "maximum_value: 0" }
    l5@{ shape: rect, label: "constant: 4" }
    l6@{ shape: rect, label: "modulus: infinity" }
    l7@{ shape: rect, label: "modular_value: 4" }
    l8@{ shape: rect, label: "minimum_value: 4" }
    l9@{ shape: rect, label: "maximum_value: 4" }
    l10@{ shape: rect, label: "module_file: " }
    l11@{ shape: rect, label: "object_path: UInt" }
    l12@{ shape: rect, label: "text: UInt" }
    l13@{ shape: rect, label: "physical: True" }
    l14@{ shape: rect, label: "name: file_state" }
    l15@{ shape: rect, label: "module_file: testdata/golden/span" }
    l16@{ shape: rect, label: "object_path: LogFileStatus" }
    l17@{ shape: rect, label: "object_path: file_state" }
    l18@{ shape: rect, label: "name: byte_order" }
    l19@{ shape: rect, label: "string_constant: LittleEndian" }
    l20@{ shape: rect, label: "is_default: False" }
    l21@{ shape: rect, label: "boolean_constant: True" }
    l22@{ shape: rect, label: "boolean: True" }

Front End vs Back End(s)

The Emboss compiler is divided into a front end, which does most of the work, and back ends, which do language-specific validations and translate the final IR to the final output format. Currently, only a C++ back end exists.

The compiler is structured so that the front end and back end can run as separate programs, and when building with Bazel they do run separately. For efficiency, the embossc driver just imports the front end and C++ back end directly, so that it can skip the JSON serialization and deserialization steps.

The front end consists of (as of the time of writing) 14 stages:

  1. Tokenization
  2. Parse Tree Generation
  3. Parse Tree → IR
  4. Desugaring + Built-In Field Synthesis
  5. Symbol Resolution Part 1: Head Symbols
  6. Dependency Cycle Checking
  7. Dependency Order Computation
  8. Symbol Resolution Part 2: Field Access
  9. Type Annotation
  10. Type Checking
  11. Bounds Computation
  12. Front End Attribute Normalization + Verification
  13. Miscellaneous Constraint Checking
  14. Inferred Write Method Generation

Each of these stages will be explained in more detail later in this document.

The C++ back end is much simpler, with only 2 stages:

  1. Back-End Attribute Normalization + Verification
  2. C++ Header Generation

This lopsidedness is intentional: work done in the front end can be shared by all back ends, but work in a back end only benefits that specific back end.

IR Traversal

Most stages walk the IR one (or more) times, processing or checking specific types of nodes. The Emboss compiler uses a special fast_traverse_ir_top_down() function that takes a pattern and calls a handler on all nodes that match that pattern. fast_traverse_ir_top_down() has been optimized to walk trees efficiently: it will skip branches if no node in the branch could possibly match the pattern.

For example, this will call print() on every Word in the IR:

fast_traverse_ir_top_down(ir, [ir_data.Word], print)

Or you could limit the scope to only Word nodes inside of Import nodes:

fast_traverse_ir_top_down(ir, [ir_data.Import, ir_data.Word], print)

fast_traverse_ir_top_down() has a lot of options. For full usage, see its documentation in the source code.
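
To give a feel for the mechanism, here is a highly simplified sketch of a pattern-driven top-down traversal over dataclass-style nodes. This is not the real implementation, which, among other optimizations, precomputes which node types can appear under which others so that it can skip entire branches:

import dataclasses

def _children(node):
    # Yield the direct child nodes of a dataclass-based IR node.
    for field in dataclasses.fields(node):
        value = getattr(node, field.name)
        for child in value if isinstance(value, list) else [value]:
            if dataclasses.is_dataclass(child):
                yield child

def simple_traverse_top_down(node, pattern, action):
    # Call `action` on nodes matching `pattern`: the final element is
    # the target type; earlier elements must appear, in order, on the
    # path above it.
    if isinstance(node, pattern[0]):
        if len(pattern) == 1:
            action(node)
            return
        pattern = pattern[1:]
    for child in _children(node):
        simple_traverse_top_down(child, pattern, action)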

Error Handling

The general scheme for running stages is:

def run_stages(initial_ir, stages):
    # Each stage takes an IR and returns (new_ir, errors); the first
    # stage that reports errors stops the pipeline.
    ir = initial_ir
    for stage in stages:
        ir, errors = stage(ir)
        if errors:
            return None, errors
    return ir, None

That is: run each stage, and if the stage returns any errors, stop processing and return the error(s).

Errors are represented by lists of messages (error, warning or note), where, generally, the first message will be an error or warning, and any subsequent messages are note messages that provide additional information.
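
A hedged sketch of the shape of this data, using hypothetical names (the real message type lives in the compiler's error-handling module):

import dataclasses

@dataclasses.dataclass
class Message:
    severity: str   # "error", "warning", or "note"
    location: str   # e.g. "example.emb:10:5"
    text: str

# One reported problem is a list of messages: the first is the error
# or warning itself, and the rest are notes adding context.
ambiguous_name = [
    Message("error", "example.emb:10:5", "Ambiguous name 'field'"),
    Message("note", "example.emb:4:3", "Candidate 'field' defined here"),
    Message("note", "other.emb:7:3", "Candidate 'field' defined here"),
]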

Errors on Synthetic Nodes

One important caveat here is that errors on synthetic IR nodes — nodes that did not come directly from the input .emb — are deferred: the compiler will continue processing to try to find an error in a “natural” node to display to the user. This is covered in more detail in Built-In Field Synthesis.

Front End

compiler/front_end/...

The front end is responsible for reading in Emboss definitions and producing a normalized intermediate representation (IR). It is divided into several steps: roughly, parsing, import resolution, symbol resolution, and validation.

The front end is orchestrated by glue.py, which runs each front end component in the proper order to construct an IR suitable for consumption by the back end.

The front end driver program is emboss_front_end.py, which calls glue.parse_emboss_file and prints the results.

Tokenization and Parsing

The first part of the front end translates the text of the .emb input into an IR structure.

Tokenization

tokenizer.py

The very first stage is tokenization, also called lexical analysis or lexing. Tokenization breaks the input text into tokens of one or more characters. For example, take the string abc+def:

+---+---+---+---+---+---+---+
| a | b | c | + | d | e | f |
+---+---+---+---+---+---+---+

This will be grouped into three tokens, abc, +, and def:

+-----------+-----+-----------+
|    abc    |  +  |    def    |
| SnakeWord | "+" | SnakeWord |
+-----------+-----+-----------+

In addition, each token is labeled with a category, such as Documentation, Number, or "*": by convention, tokens that are always a specific sequence of characters are given a label that is just the token in double quotes, such as "+" for +, "==" for ==, "if" for if, and so on.

These category labels match names in the grammar in the next step.
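
One way to implement this kind of tokenizer is to repeatedly match a prioritized list of regular expressions at the current position. A minimal sketch with just a few categories (the real tokenizer.py handles many more, plus the indentation logic described below):

import re

_TOKEN_PATTERNS = [
    ("SnakeWord", re.compile(r"[a-z][a-z0-9_]*")),
    ("Number", re.compile(r"[0-9]+")),
    ('"+"', re.compile(r"\+")),
]

def tokenize_line(text):
    tokens, pos = [], 0
    while pos < len(text):
        for category, pattern in _TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.append((category, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"Unrecognized character at column {pos}.")
    return tokens

# tokenize_line("abc+def") ==
#     [("SnakeWord", "abc"), ('"+"', "+"), ("SnakeWord", "def")]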

There is one extra bit of logic in the Emboss tokenizer, which mimics Python: indentation is tracked and translated into special Indent and Dedent tokens when it changes. This fragment:

struct Foo:
  -- word

will be tokenized as:

+----------+-----------+-----+------+--------+---------------+------+--------+
|  struct  |    Foo    |  :  |  \n  |        |    -- word    |  \n  |        |
| "struct" | CamelWord | ":" | "\n" | Indent | Documentation | "\n" | Dedent |
+----------+-----------+-----+------+--------+---------------+------+--------+
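
The indentation logic can be sketched as a stack of indent widths, much like Python's own tokenizer (hypothetical, simplified code; the real tokenizer also handles blank lines, comments, and inconsistent-indentation errors):

def indentation_tokens(lines):
    # Emit "Indent" when indentation grows past the top of the stack,
    # and one "Dedent" for each stack level popped when it shrinks.
    stack = [0]
    for line in lines:
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            stack.append(width)
            yield "Indent"
        while width < stack[-1]:
            stack.pop()
            yield "Dedent"
        yield line.strip()
    while len(stack) > 1:
        stack.pop()
        yield "Dedent"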

Parse Tree Generation

lr1.py

The list of tokens is then passed to a parser, which produces a parse tree. The point of this stage is to take the flat list of tokens and overlay a tree structure on top of it. Starting with the tokens from the previous example:

graph LR
    colon@{label: "\":\""}
    newline@{label: "\"\\n\""}
    newline2@{label: "\"\\n\""}
    struct ~~~ Foo ~~~ colon ~~~ newline ~~~ Indent ~~~ Documentation ~~~ newline2 ~~~ Dedent

the parser infers the parse tree:

graph TD
        r0@{shape: diamond, label: "module"}
        r1@{shape: diamond, label: "comment-line*"}
        r2@{shape: diamond, label: "doc-line*"}
        r3@{shape: diamond, label: "import-line*"}
        r4@{shape: diamond, label: "attribute-line*"}
        r5@{shape: diamond, label: "type-definition*"}
        r6@{shape: diamond, label: "type-definition"}
        r7@{shape: diamond, label: "struct"}
        r8@{shape: diamond, label: "type-name"}
        r9@{shape: diamond, label: "type-word"}
        r10@{shape: diamond, label: "delimited-parameter-definition-list?"}
        r11@{shape: diamond, label: "Comment?"}
        r12@{shape: diamond, label: "eol"}
        r13@{shape: diamond, label: "comment-line*"}
        r14@{shape: diamond, label: "struct-body"}
        r15@{shape: diamond, label: "doc-line*"}
        r16@{shape: diamond, label: "doc-line"}
        r17@{shape: diamond, label: "doc"}
        r18@{shape: diamond, label: "Comment?"}
        r19@{shape: diamond, label: "eol"}
        r20@{shape: diamond, label: "comment-line*"}
        r21@{shape: diamond, label: "doc-line*"}
        r22@{shape: diamond, label: "attribute-line*"}
        r23@{shape: diamond, label: "type-definition*"}
        r24@{shape: diamond, label: "struct-field-block"}
        r25@{shape: diamond, label: "type-definition*"}
    subgraph tokens
        direction LR
        t0@{shape: rect, label: "struct"}
        t1@{shape: rect, label: "Foo"}
        t2@{shape: rect, label: ":"}
        t3@{shape: rect, label: "\\n"}
        t4@{shape: rect, label: "Indent"}
        t5@{shape: rect, label: "-- word"}
        t6@{shape: rect, label: "\\n"}
        t7@{shape: rect, label: "Dedent"}
    end
    r0 --> r1
    r0 --> r2
    r0 --> r3
    r0 --> r4
    r0 --> r5
    r5 --> r6
    r5 --> r25
    r6 --> r7
    r7 ----> t0
    r7 --> r8
    r7 --> r10
    r7 ----> t2
    r7 --> r11
    r7 --> r12
    r7 --> r14
    r8 --> r9
    r9 ----> t1
    r12 ----> t3
    r12 --> r13
    r14 ----> t4
    r14 --> r15
    r14 --> r22
    r14 --> r23
    r14 --> r24
    r14 ----> t7
    r15 --> r16
    r15 --> r21
    r16 --> r17
    r16 --> r18
    r16 --> r19
    r17 ----> t5
    r19 ----> t6
    r19 --> r20

Parsing is a large subject with many good tutorials and references available, so only the embossc specifics are covered here.

The Emboss compiler uses a shift-reduce parser (specifically, a canonical LR(1) parser), built on a custom engine that has some features that are not available in most other shift-reduce engines — and also some limitations.

The biggest limitation is that the table generator only implements the canonical LR(1) table generation algorithm, which means that there is no “side channel” way of specifying precedence or otherwise resolving conflicts — shift/reduce and reduce/reduce conflicts must be resolved by removing them from the grammar. (However, Emboss's grammar uses partially-ordered precedence, which — as far as the author is aware — is not supported by any off-the-shelf precedence system.)

The biggest unusual feature is the incorporation of a Merr-like system, which derives specific error messages from example erroneous inputs.

The output from this stage is a parse tree, not an IR.
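
For readers unfamiliar with shift-reduce parsing, the core driver loop is small. Here is a generic, hypothetical sketch — not Emboss's actual engine, which lives in lr1.py and also carries source locations and the error-message machinery:

def lr_parse(tokens, actions, gotos):
    # `actions[state, terminal]` is ("shift", next_state),
    # ("reduce", lhs, rhs_length), or ("accept",);
    # `gotos[state, nonterminal]` is the state after a reduction.
    stack = [(0, None)]  # (state, partial parse tree) pairs
    remaining = list(tokens) + ["$"]
    while True:
        state = stack[-1][0]
        action = actions[state, remaining[0]]
        if action[0] == "shift":
            stack.append((action[1], remaining.pop(0)))
        elif action[0] == "reduce":
            _, lhs, rhs_length = action
            children = [tree for _, tree in stack[len(stack) - rhs_length:]]
            del stack[len(stack) - rhs_length:]
            stack.append((gotos[stack[-1][0], lhs], (lhs, children)))
        else:  # accept
            return stack[-1][1]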

Parse Tree → IR

module_ir.py

Finally, the parse tree is transformed into an IR. This is mostly a straightforward translation, but there are a few places where the parse tree is noticeably different from the IR. For example, in the parse tree, lists are explicit cons lists, like:

graph TD
    r1@{label: "struct-field-block"}
    r2@{label: "struct-field*"}
    r3@{label: "struct-field"}
    r4@{label: "struct-field*"}
    r5@{label: "struct-field"}
    r6@{label: "struct-field*"}
    r7@{label: "struct-field"}
    r8@{label: "struct-field*"}
    r1 --> r2
    r2 --> r3
    r2 --> r4
    r4 --> r5
    r4 --> r6
    r6 --> r7
    r6 --> r8

while in the IR these lists are flattened:

graph TD
    n0@{label: "structure"}
    n1@{label: "fields"}
    n2@{label: "field"}
    n3@{label: "field"}
    n4@{label: "field"}
    n0 --> n1
    n1 --> n2
    n1 --> n3
    n1 --> n4

One bit of pure syntax sugar that is handled at this stage is chained comparisons: a == b == c will be translated to the IR equivalent of a == b && b == c.
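
A sketch of that rewrite, operating on plain strings rather than real IR nodes: each adjacent pair of operands becomes one comparison, and the comparisons are joined with &&.

def desugar_chain(operands, operators):
    # desugar_chain(["a", "b", "c"], ["==", "=="])
    #     == "(a == b) && (b == c)"
    comparisons = [
        f"({left} {op} {right})"
        for left, op, right in zip(operands, operators, operands[1:])
    ]
    return " && ".join(comparisons)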

IR Processing

Desugaring + Built-In Field Synthesis

synthetics.py

The first stage that operates on the IR adds built-in fields like $size_in_bytes and rewrites some constructs to a more regular form. Specifically, anonymous bits will be rewritten into non-anonymous bits, and $next will be replaced with the offset + size of the previous field. Starting with this:

struct Foo:
    0 [+4]  bits:
        0  [+1]  Flag  low
        31 [+1]  Flag  high
    if low:
        $next [+4]  UInt  field

synthetics.py will rewrite the IR into (the equivalent of):

struct Foo:
    bits EmbossReservedAnonymous0:
        [text_output: "Skip"]
        0  [+1]  Flag  low
        31 [+1]  Flag  high
    let $size_in_bytes = $max(0, true ? 0+4 : 0, low ? (0+4)+4 : 0)
    let $max_size_in_bytes = $upper_bound($size_in_bytes)
    let $min_size_in_bytes = $lower_bound($size_in_bytes)
    0 [+4]  EmbossReservedAnonymous0  emboss_reserved_anonymous_1
    if low:
        0+4 [+4]  UInt  field
    let low = emboss_reserved_anonymous_1.low
    let high = emboss_reserved_anonymous_1.high

One important detail is that the newly-created nodes are marked as synthetic. This makes a difference for error handling: because some synthetic nodes are constructed from copied pieces of existing IR, errors can sometimes be detected in a synthetic node before they are detected in the original source node it was copied from. Such errors tend to be very confusing to end users, since they reference parts of the IR that the user never wrote. For example, the incorrect:

struct Foo:
    false [+4]  UInt  field

will have a synthesized $size_in_bytes field:

struct Foo:
    let $size_in_bytes = $max(0, true ? false+4 : 0)
    false [+4]  UInt  field

The erroneous expression false+4 will be detected before the top-level type error of using false as the field offset, with a message like:

example.emb:2:5: error: Left argument of operator '+' must be an integer.
    false [+4]  UInt  field
    ^^^^^

To avoid confusing end users, any error with a synthetic location will be deferred, and only shown to an end user if no non-synthetic errors are encountered during IR processing. In this case, the correct error message will be found and displayed to the end user:

example.emb:2:5: error: Start of field must be an integer.
    false [+4]  UInt  field
    ^^^^^

Symbol Resolution Part 1: Head Symbols

symbol_resolver.py

After desugaring, the first part of symbol resolution occurs: resolving head symbols. Head symbols are any symbols that do not follow a dot (.): for example, head or Head in head, Head.field, and head.Type.field. Resolution of symbols after a dot happens in a later stage, after some other checking.

Head symbol resolution follows lexical scoping rules: a symbol may refer to entities in a set of scopes determined by its position in the source text.

Head symbol resolution is straightforward: while walking down the IR tree, the resolver builds up a list of available scopes. When a symbol is encountered, it checks whether that symbol is defined in any of the available scopes. If the symbol is defined in exactly one available scope, it is bound to that definition; if it is not defined in any scope, an error is raised. Emboss takes the unusual step of also raising an error if the symbol is defined in two or more scopes. In most computer languages, there is some scope precedence, and the highest-precedence definition is used; however, this can lead to surprising, and in some cases silently incorrect, results when the desired definition is shadowed by a higher-precedence definition the user is unaware of. Emboss therefore raises an error and requires the user to explicitly pick the correct definition.
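
A sketch of the scope check at the heart of this stage, assuming scopes are simple name-to-definition dictionaries (hypothetical code; the real resolver also builds the scope list and records errors rather than raising):

class ResolutionError(Exception):
    pass

def resolve_head_symbol(name, visible_scopes):
    matches = [scope[name] for scope in visible_scopes if name in scope]
    if not matches:
        raise ResolutionError(f"'{name}' is not defined.")
    if len(matches) > 1:
        # Unlike most languages, Emboss rejects the ambiguity rather
        # than silently picking the highest-precedence definition.
        raise ResolutionError(f"'{name}' is ambiguous.")
    return matches[0]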

Dependency Cycle Checking

dependency_checker.py

Once head symbols have been resolved, there is enough information in the IR to check for cycles in the dependency graph: cases where object_1 depends on object_2 and object_2 also depends on object_1, sometimes through a chain of multiple dependencies.

The Emboss compiler accomplishes this by first building the complete dependency graph of the IR, and then running Tarjan's strongly-connected-components algorithm on that graph. If any strongly-connected component contains more than one node, or any node has an edge to itself, then there is a dependency cycle.
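
A compact sketch of this check, where the dependency graph is a dict mapping each object to the objects it depends on:

def strongly_connected_components(graph):
    # Standard Tarjan's algorithm: returns a list of strongly-connected
    # components, each a list of nodes.
    counter = [0]
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    components = []

    def connect(node):
        index[node] = lowlink[node] = counter[0]
        counter[0] += 1
        stack.append(node)
        on_stack.add(node)
        for dep in graph.get(node, ()):
            if dep not in index:
                connect(dep)
                lowlink[node] = min(lowlink[node], lowlink[dep])
            elif dep in on_stack:
                lowlink[node] = min(lowlink[node], index[dep])
        if lowlink[node] == index[node]:
            component = []
            while True:
                top = stack.pop()
                on_stack.discard(top)
                component.append(top)
                if top == node:
                    break
            components.append(component)

    for node in graph:
        if node not in index:
            connect(node)
    return components

def has_cycle(graph):
    return any(
        len(component) > 1 or component[0] in graph.get(component[0], ())
        for component in strongly_connected_components(graph)
    )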

Dependency Order Computation

dependency_checker.py

Every structure in the IR is also annotated with a ‘dependency ordering’: an ordering of its fields such that any field A that depends on another field B is later in the ordering than B.

This ordering is not used until the back end (where it is used to order fields for text output), but it is convenient to build it here, where the dependency graph has already been built.
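
Because cycle checking has already passed, this ordering is a plain topological sort; a minimal sketch:

def dependency_order(fields, dependencies):
    # Post-order DFS: every field is appended only after everything it
    # depends on.  Assumes cycle checking has already passed.
    ordered, placed = [], set()

    def place(field):
        if field not in placed:
            placed.add(field)
            for dependency in dependencies.get(field, ()):
                place(dependency)
            ordered.append(field)

    for field in fields:
        place(field)
    return ordered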

Symbol Resolution Part 2: Field Access

symbol_resolver.py

Next, “tail” symbols are resolved: any symbol immediately after a dot (.). These are resolved after checking for dependency cycles because the resolution algorithm can get stuck in infinite recursion if there is a dependency cycle.

Type Annotation

type_check.py

With symbols resolved, it is possible to annotate expression types. The Emboss compiler annotates every expression and subexpression with a type. At this stage, those types are coarse: integer, enum, boolean, string, “opaque”, etc.

This stage also checks expressions for internal type consistency. For example, it will find the error in:

struct Foo:
    let a = 10 + true

but not in:

struct Foo:
    true [+4]  UInt  a

Type Checking

type_check.py

After annotating all types, the compiler checks that expressions have the correct top-level types: that is, that their types are correct for their positions within the IR. This is the stage that detects the use of a boolean as a field offset from the previous example:

struct Foo:
    true [+4]  UInt  a

Bounds Computation

expression_bounds.py

With coarse type checking done, the expression types can be refined into narrower types: for example, instead of just “integer,” “integer between 0 and 65535, inclusive.”

Fully understanding this stage involves (just a tiny bit of!) type theory: mainly, the idea that a “type” in the computer science sense is just a set (in the mathematical sense) of all the values that a variable of that type could possibly hold. This sort of “type” is conflated with physical layout in most industrial computing languages (including Emboss), but is conceptually separate.

The boolean type is the set $\left\{ \scriptstyle{FALSE}, \scriptstyle{TRUE} \right\}$, so its only refinements (nonempty subsets) are $\left\{ \scriptstyle{FALSE} \right\}$ and $\left\{ \scriptstyle{TRUE} \right\}$. This stage will try to make those refinements, where possible.

The base integer type in the Emboss expression language is the set of all integers, normally written as $\mathbb{Z}$. It is obviously impossible for any real-world computer to have a variable of type $\mathbb{Z}$ — there is no computer with infinite memory — so this type needs to be refined.

In Emboss, currently, integer types are refined into the form $\left\{ x | x \in \mathbb{Z} \wedge a \le x \le b \wedge x \equiv n \pmod m \right\}$: that is, $x$ is an integer ($x \in \mathbb{Z}$), $x$ is at least $a$ and at most $b$ ($a \le x \le b$), and $x$ % $m$ == $n$ ($x \equiv n \pmod m$). The values $a$ and $b$ are lower and upper bounds, respectively, while $m$ and $n$ are used for alignment tracking — $m$ tracks the “level” of alignment, and $n$ tracks the “offset” within the alignment.

This integer type refinement starts with known types. A value read from an Int or UInt field will have known bounds (based on the bit width of the field), $m = 1$ and $n = 0$. A constant $c$ has $a = b = n = c$ and $m = \infty$.

From this basis, the refinements for any operation can be determined. For example, for x + y, the resulting lower and upper bounds are $a = a_x + a_y$ and $b = b_x + b_y$. The resulting alignment is a little more complex, involving the greatest common divisor: $m = \gcd(m_x, m_y)$ and $n = (n_x + n_y) \bmod m$, where $\gcd(\infty, k)$ is taken to be $k$.
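
A sketch of the addition rule over these refinements, representing the infinite modulus of a constant as None (a hypothetical representation; the real code lives in expression_bounds.py):

import math

def _gcd(a, b):
    # gcd where None stands for the "infinity" modulus of a constant.
    if a is None:
        return b
    if b is None:
        return a
    return math.gcd(a, b)

def add_refinements(x, y):
    # Each refinement has minimum_value, maximum_value, modulus, and
    # modular_value, as in the elaborated IR shown earlier.
    modulus = _gcd(x["modulus"], y["modulus"])
    modular_value = x["modular_value"] + y["modular_value"]
    if modulus is not None:
        modular_value %= modulus
    return {
        "minimum_value": x["minimum_value"] + y["minimum_value"],
        "maximum_value": x["maximum_value"] + y["maximum_value"],
        "modulus": modulus,
        "modular_value": modular_value,
    }

# Example: a constant 4 (modulus None, modular_value 4) plus a value
# that is a multiple of 8 (modulus 8, modular_value 0) yields
# modulus 8, modular_value 4.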

The Emboss functions $lower_bound(x) and $upper_bound(x) are a little strange: their result is a constant that is the lower or upper bound (respectively) of their argument! That is, they extract a value from the type system and return it as an expression. There aren't a lot of human-authored uses of these functions, but they are used in the definitions of fields like $min_size_in_bytes and $max_size_in_bits.

The refinements that Emboss finds are not always perfectly optimal, for various reasons. For example, x * x can never be negative, even when x itself may be negative, but Emboss does not try to detect cross-argument constraints (here, that both arguments are the same value), so it will not figure this out.

It might be tempting to add better refinement to Emboss, but there are downsides. First, new refinements take up compiler time, which can contribute to slow builds. Many possible refinements would only be found on a tiny fraction of all expressions, so it is not worth the compiler time to search for them.

Second, and probably more importantly, any new refinements become part of the Emboss contract with end users, forever: $upper_bound() must never return a larger value than it did in previous versions, and $lower_bound() must never return a smaller value. Similarly, weakened alignment guarantees can cause performance regressions in the highest-performance code. This has direct implications for maintenance of the current Emboss compiler, but it also makes it even harder for any third-party compiler to fully implement Emboss — which, in turn, means that Emboss is even more likely to be a single-vendor language, like Python or Ruby, instead of a multi-vendor language like C++ or JavaScript.

Front End Attribute Normalization + Verification

attribute_checker.py

This stage checks the attributes (annotations like [byte_order: "LittleEndian"]) that are used by the Emboss front end, and adds missing attributes where needed.

Attributes were originally intended to be a way to add functionality without having to add new syntax, so their semantics are not conceptually cohesive. Nonetheless, they are all handled together in the same stage.

This stage checks that all attributes have the correct argument type (no [byte_order: 10]) and that their values are valid (no [byte_order: "scrambled"]). $default'ed attributes have their values copied into every place in the IR where they make sense and were not overridden (so [$default byte_order: "BigEndian"] will cause [byte_order: "BigEndian"] to be attached to most fields that do not already have a [byte_order] attribute).
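
The $default handling ultimately amounts to a per-node dictionary merge; a minimal sketch with hypothetical names:

def effective_attributes(inherited_defaults, explicit_attributes):
    # Explicitly-written attributes always win; any remaining defaults
    # are materialized onto the node as if they had been written there.
    merged = dict(inherited_defaults)
    merged.update(explicit_attributes)
    return merged

# effective_attributes({"byte_order": "BigEndian"}, {}) applies the
# default; effective_attributes({"byte_order": "BigEndian"},
# {"byte_order": "LittleEndian"}) keeps the explicit value.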

Miscellaneous Constraint Checking

constraints.py

Even more conceptually incoherent than the previous stage, constraint checking verifies that certain things are true about the IR:

  • Fields in bits have types that are allowed in bits.
  • Arrays have constant sizes for all dimensions other than the last one, which may be empty.
  • Arrays in struct are arrays of types that are a whole number of bytes long (no UInt:4[]).
  • In T[], T has a constant size.
  • Fields with external types like Int, Flag, etc. meet the size requirements for those types.
  • enum, field, and type names are not reserved words in any mainstream programming language existing as of 2014, even if they are not Emboss keywords. For example, no field may be named class, even though Emboss itself does not care.
  • Constant references (references of the form Type.field) are actually constant.
  • All enum values can be represented as 64-bit unsigned or signed 2's-complement integers (or narrower, if the enum has an explicit, narrower width).
  • All expression values that might be computed at runtime fit within 64-bit unsigned or signed 2's-complement integers.

The common theme of these checks is that the front end does not need their results for its own processing; they exist to simplify the back ends by reducing the number of cases that have to be handled.

Most of these constraints could be lifted at some point.

Inferred Write Method Generation

write_inference.py

This stage — the last before the IR goes to the back end — figures out how to write through virtual fields. There are three possibilities:

If the virtual field is a direct alias (let field = some_other_field), then writing through field is exactly the same as writing through some_other_field. In fact, it is so exactly the same that the C++ back end handles this by making the field() method call just return the view of some_other_field() directly. This works even if some_other_field is actually a nested field, like some.other.field.

This method is most important for anonymous bits: it is important to be able to write through the aliases because the original field is not accessible to end users.

If the virtual field is a series of additions and subtractions on exactly one other field, like let f_plus_10 = f + 10, then a special “write” expression will be generated that inverts the calculation: in this case, f_plus_10 - 10. It was the intention of the Emboss authors to expand this inference over time, but (currently) addition and subtraction are the only fully-invertible operations in the Emboss expression language.

Lastly, if neither of the other options applies, the field is read-only.
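
A sketch of the inversion for the addition/subtraction case, on tiny tuple-based expressions (("field", name), ("const", value), or (operator, left, right)); the real inference works on IR expression nodes:

def invert(expression, value):
    # Rewrites `expression == value` into `field == <inverse>`, where
    # `expression` contains exactly one field plus/minus constant terms.
    if expression[0] == "field":
        return value
    operator, left, right = expression
    if right[0] == "const":
        # x + c  =>  x = value - c;   x - c  =>  x = value + c
        inverse = "-" if operator == "+" else "+"
        return invert(left, (inverse, value, right))
    if operator == "+":
        # c + x  =>  x = value - c
        return invert(right, ("-", value, left))
    # c - x  =>  x = c - value
    return invert(right, ("-", left, value))

# invert(("+", ("field", "f"), ("const", 10)), ("value",))
#     == ("-", ("value",), ("const", 10))     # i.e., f = value - 10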

It was also the intention to open this feature up to users by allowing “write method” attributes to be attached to virtual fields, like:

# An example from an actual device, where `x` is a frequency expressed in
# either Hz or MHz.
let f = x < 1000 ? x * 1000000 : x
    [write_field: x]
    [write_expression: value]

C++ Back End

compiler/back_end/cpp

The C++ back end transforms the fully elaborated IR into a header file.

C++ Back-End Attribute Verification

attributes.py

header_generator.py

The first stage of the C++ back end is a short one: it verifies that the [(cpp) namespace] attribute, if present, is a valid C++ namespace, and that any [(cpp) enum_case] attributes are valid.

C++ Header Generation

header_generator.py

The very last stage walks the IR and produces the text of a C++ .h file.

Unlike most stages, C++ header generation does not use fast_traverse_ir_top_down(): the tree walk is hand-written, generally assembling fragments of C++ code from the bottom up. During most steps of this traversal, the C++ back end builds up three separate sections: a forward declaration section, a class definition section, and a method definition section, which are eventually emitted in that order. This structure lets circular references (e.g., Emboss structure types that refer to each other) work without the back end having to track anything special: by the time a type is referenced, it will always have been declared, and by the time it is actually used, it will always have been defined.
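
A sketch of that three-section structure, with hypothetical render_* helpers standing in for the real code generation:

def generate_header(type_irs):
    # Hypothetical sketch: each type contributes to all three sections,
    # and the sections are concatenated in declare/define/use-safe order.
    forward_declarations, class_definitions, method_definitions = [], [], []
    for type_ir in type_irs:
        forward_declarations.append(f"class {type_ir.name}View;")      # hypothetical naming
        class_definitions.append(render_class_definition(type_ir))     # hypothetical helper
        method_definitions.append(render_method_definitions(type_ir))  # hypothetical helper
    return "\n".join(forward_declarations + class_definitions + method_definitions)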

This stage has the most code of any single stage of the Emboss compiler, so it can be difficult to follow. (It is also some of the oldest code in the compiler, which does not help.)