This document describes the internals of Emboss. End users do not need to read this document.
TODO(bolms): Update this doc to include the newer passes.
The Emboss compiler is divided into separate “front end” and “back end” programs. The front end parses Emboss files (.emb
files) and produces a stable intermediate representation (IR), which is consumed by the back ends. This IR is defined in [public/ir_pb2.py][ir_pb2_py].
The back ends read the IR and emit code to view and manipulate Emboss-defined data structures. Currently, only a C++ back end exists.
TODO(bolms): Split the symbol resolution and validation steps in a separate “middle” component, to allow external code generators to generate undecorated Emboss IR instead of Emboss source text?
Implemented in front_end/...
The front end is responsible for reading in Emboss definitions and producing a normalized intermediate representation (IR). It is divided into several steps: roughly, parsing, import resolution, symbol resolution, and validation.
The front end is orchestrated by glue.py, which runs each front-end component in the proper order to construct an IR suitable for consumption by the back end. The actual driver program is emboss_front_end.py, which just calls glue.ParseEmbossFile and prints the results.
Per-file parsing consumes the text of a single Emboss module, and produces an “undecorated” IR for the module, containing only syntactic-level information from the module.
This “undecorated” IR is (almost) a subset of the final IR: later steps will add information and perform validation, but will rarely remove anything from the IR before it is emitted.
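As a simplified, hypothetical illustration of this subset relationship (the real IR is the structure defined in [public/ir_pb2.py][ir_pb2_py]; the field names and values below are invented for this sketch):

```python
# Hypothetical, simplified illustration of "undecorated" vs. final IR;
# the real IR is the protobuf-style structure in public/ir_pb2.py.

# After per-file parsing: purely syntactic information.
undecorated_field = {
    "name": "payload_length",
    "type": {"reference": "UInt"},  # Just the text of the reference.
}

# After symbol resolution and validation: the same node, decorated with
# the results of later passes.
decorated_field = dict(
    undecorated_field,
    type={"reference": "UInt", "canonical_name": ("prelude.emb", "UInt")},
)

# Later steps add information but rarely remove it: every key of the
# undecorated node survives into the decorated node.
assert all(key in decorated_field for key in undecorated_field)
```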
Implemented in tokenizer.py
The tokenizer is a fairly standard tokenizer, with Indent/Dedent insertion a la Python. It divides source text into parse_types.Symbol objects, suitable for feeding into the parser.
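A minimal sketch of Python-style Indent/Dedent insertion, for readers unfamiliar with the technique (the real implementation is in tokenizer.py and produces parse_types.Symbol objects; this sketch uses plain tuples and omits error handling for inconsistent dedents):

```python
# Sketch of Python-style Indent/Dedent token insertion.  A stack of
# indentation columns tracks open blocks; deeper indentation pushes and
# emits Indent, shallower indentation pops and emits Dedent.

def insert_indent_tokens(lines):
    """Yield (token, text) pairs with synthetic Indent/Dedent tokens."""
    indent_stack = [0]  # Column of each currently open indentation level.
    for line in lines:
        stripped = line.lstrip(" ")
        if not stripped:
            continue  # Blank lines never change indentation.
        column = len(line) - len(stripped)
        if column > indent_stack[-1]:
            indent_stack.append(column)
            yield ("Indent", "")
        while column < indent_stack[-1]:
            indent_stack.pop()
            yield ("Dedent", "")
        yield ("Line", stripped)
    # Close any blocks still open at end of input.
    while len(indent_stack) > 1:
        indent_stack.pop()
        yield ("Dedent", "")
```

This is the same trick CPython's tokenizer uses: the parser then treats Indent and Dedent like ordinary brace tokens, so the grammar itself does not need to know about whitespace.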
Implemented in lr1.py and parser_generator.py, with a façade in structure_parser.py
Emboss uses a pretty standard shift-reduce LR(1) parser. This is implemented in three parts in Emboss: the core LR(1) machinery in lr1.py, the parser generator in parser_generator.py, and a façade in structure_parser.py.
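The core shift-reduce loop can be sketched as follows. The tables below are hand-built for a toy grammar (`S → S '+' n | n`) purely for illustration; in Emboss, the tables are computed by parser_generator.py and the loop lives in lr1.py:

```python
# Toy SLR(1) tables, hand-built for the grammar  S -> S '+' n | n.
PRODUCTIONS = {1: ("S", 3), 2: ("S", 1)}  # production -> (lhs, rhs length)
ACTION = {
    (0, "n"): ("shift", 2), (1, "+"): ("shift", 3), (1, "$"): ("accept",),
    (2, "+"): ("reduce", 2), (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1), (4, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1}

def parse(tokens, action, goto, productions):
    """Table-driven shift-reduce LR(1) parse.

    Returns the productions applied, in order -- a rightmost derivation
    in reverse, from which a parse tree could be built.
    """
    stack = [0]  # Stack of parser states.
    tokens = list(tokens) + ["$"]  # "$" marks end of input.
    pos = 0
    reductions = []
    while True:
        state, lookahead = stack[-1], tokens[pos]
        act = action.get((state, lookahead))
        if act is None:
            raise SyntaxError(f"unexpected {lookahead!r} in state {state}")
        if act[0] == "shift":
            stack.append(act[1])
            pos += 1
        elif act[0] == "reduce":
            lhs, rhs_len = productions[act[1]]
            del stack[len(stack) - rhs_len:]  # Pop the handle's states.
            stack.append(goto[(stack[-1], lhs)])
            reductions.append(act[1])
        else:  # accept
            return reductions
```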
Implemented in module_ir.py
Once a parse tree has been generated, it is fed into a normalizer which recursively turns the raw syntax tree into a “first stage” intermediate representation (IR). The first stage IR serves to isolate later stages from minor changes in the grammar, but only contains information from a single file, and does not perform any semantic checking.
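A hypothetical sketch of such a normalizer (the real pass in module_ir.py targets the actual Emboss IR; the ParseNode type and handler scheme here are invented for illustration):

```python
from collections import namedtuple

# Stand-in for a raw parse-tree node: a grammar symbol plus children.
ParseNode = namedtuple("ParseNode", ["symbol", "children"])

def normalize(node, handlers):
    """Recursively rewrite a raw parse tree into first-stage IR.

    `handlers` maps grammar symbols to functions of the normalized
    children.  Symbols without a handler pass their children through,
    which is what insulates later stages from minor grammar changes:
    new intermediate grammar nodes simply vanish in the output.
    """
    if not isinstance(node, ParseNode):
        return node  # Leaf token: nothing to normalize.
    children = [normalize(child, handlers) for child in node.children]
    handler = handlers.get(node.symbol)
    if handler is None:
        # No semantic content at this node: splice children upward.
        return children[0] if len(children) == 1 else children
    return handler(*children)
```

For example, a handler table like `{"field": lambda name, type_: {"name": name, "type": type_}}` would collapse a `field` subtree into a small IR record while ignoring purely syntactic intermediate nodes.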
TODO(bolms): Implement imports.
After each file is parsed, any new imports it has are added to a work queue. Each file in the work queue is parsed, potentially adding more imports to the queue, until the queue is empty.
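The work-queue scheme described above can be sketched like this (hypothetical code; imports are not yet implemented, per the TODO, and `parse_module`/`read_imports` are stand-ins, not real Emboss functions):

```python
# Sketch of import resolution as a work queue.  Each file is parsed
# exactly once, even when it is imported from several places.

def parse_with_imports(root_path, parse_module, read_imports):
    """Parse `root_path` and, transitively, everything it imports.

    `parse_module(path)` returns a parsed module; `read_imports(module)`
    returns the paths that module imports.
    """
    parsed = {}  # path -> parsed module
    queue = [root_path]
    while queue:
        path = queue.pop()
        if path in parsed:
            continue  # Already parsed via another import chain.
        module = parse_module(path)
        parsed[path] = module
        queue.extend(read_imports(module))  # May add new work.
    return parsed
```

Because already-parsed paths are skipped, the loop terminates even when modules import each other in a cycle.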
Implemented in symbol_resolver.py
Symbol resolution is the process of correlating names in the IR. At the end of symbol resolution, every named entity (type definition, field definition, enum name, etc.) has a CanonicalName, and every reference in the IR has a Reference to the entity to which it refers.
This assignment occurs in two passes. First, the full IR is scanned, generating scoped symbol tables (nested dictionaries of names to CanonicalNames), and assigning identities to each Name in the IR. Then the IR is fully scanned a second time, and each Reference in the IR is resolved: all scopes visible to the reference are scanned for the name, and the corresponding CanonicalName is assigned to the reference.
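A simplified sketch of the two passes (the real implementation in symbol_resolver.py operates on the full Emboss IR; here, plain tuples stand in for CanonicalNames and flat dictionaries stand in for the scoped symbol tables):

```python
# Pass 1: build a symbol table mapping each defined name to a canonical
# identity.  A (module, name) tuple stands in for a CanonicalName.
def build_symbol_table(module_name, definitions):
    return {name: (module_name, name) for name in definitions}

# Pass 2: resolve each reference against the scopes visible to it.
def resolve_references(references, scopes):
    """Look each reference up, innermost scope first (inner shadows outer)."""
    resolved = {}
    for ref in references:
        for scope in reversed(scopes):
            if ref in scope:
                resolved[ref] = scope[ref]
                break
        else:
            raise NameError(f"unresolved reference: {ref}")
    return resolved
```

Splitting the work this way means a reference can legally point at an entity defined later in the file: by the time pass 2 runs, every definition is already in a symbol table.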
TODO(bolms): other validations?
TODO(bolms): describe
TODO(bolms): describe
Implemented in back_end/...
Currently, only a C++ back end is implemented.
A back end takes Emboss IR and produces code in a specific language for manipulating the Emboss-defined data structures.
Implemented in header_generator.py with templates in generated_code_templates, support code in emboss_cpp_util.h, and a driver program in emboss_codegen_cpp.py
The C++ code generator is currently very minimal: header_generator.py essentially inserts values from the IR into text templates.
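The template-insertion approach can be sketched as follows (hypothetical code: the real generator uses the templates in generated_code_templates, not string.Template, and the template text below is invented):

```python
# Sketch of template-based C++ code generation: values from the IR are
# substituted into a text template, one view class per Emboss type.
import string

_VIEW_TEMPLATE = string.Template(
    "class ${name}View {\n"
    " public:\n"
    "  // Generated accessors for ${name} would go here.\n"
    "};\n"
)

def generate_header(ir_type_names):
    """Emit one C++ view class per type name in the (simplified) IR."""
    return "\n".join(
        _VIEW_TEMPLATE.substitute(name=name) for name in ir_type_names
    )
```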
TODO(bolms): add more documentation once the C++ back end has more features.