| # Design of the Emboss Tool |
| |
| This document describes the internals of Emboss. End users do not need to read |
| this document. |
| |
| *TODO(bolms): Update this doc to include the newer passes.* |
| |
| The Emboss compiler is divided into separate "front end" and "back end" |
| programs. The front end parses Emboss files (`.emb` files) and produces a |
| stable intermediate representation (IR), which is consumed by the back ends. |
| This IR is defined in [public/ir_pb2.py][ir_pb2_py]. |
| |
| [ir_pb2_py]: public/ir_pb2.py |
| |
| The back ends read the IR and emit code to view and manipulate Emboss-defined |
| data structures. Currently, only a C++ back end exists. |
| |
| *TODO(bolms): Split the symbol resolution and validation steps into a separate |
| "middle" component, to allow external code generators to generate undecorated |
| Emboss IR instead of Emboss source text?* |
| |
| ## Front End |
| |
| *Implemented in [front_end/...][front_end]* |
| |
| [front_end]: front_end/ |
| |
| The front end is responsible for reading in Emboss definitions and producing a |
| normalized intermediate representation (IR). It is divided into several steps: |
| roughly, parsing, import resolution, symbol resolution, and validation. |
| |
| The front end is orchestrated by [glue.py][glue_py], which runs each front end |
| component in the proper order to construct an IR suitable for consumption by the |
| back end. |
| |
| [glue_py]: front_end/glue.py |
| |
| The actual driver program is [emboss_front_end.py][emboss_front_end_py], which |
| just calls `glue.ParseEmbossFile` and prints the results. |
| |
| [emboss_front_end_py]: front_end/emboss_front_end.py |
| |
| ### File Parsing |
| |
| Per-file parsing consumes the text of a single Emboss module, and produces an |
| "undecorated" IR for the module, containing only syntactic-level information |
| from the module. |
| |
| This "undecorated" IR is (almost) a subset of the final IR: later steps will add |
| information and perform validation, but will rarely remove anything from the IR |
| before it is emitted. |
| |
| #### Tokenization |
| |
| *Implemented in [tokenizer.py][tokenizer_py]* |
| |
| [tokenizer_py]: front_end/tokenizer.py |
| |
| The tokenizer is fairly standard, with Python-style Indent/Dedent token |
| insertion. It divides source text into `parse_types.Symbol` objects, suitable |
| for feeding into the parser. |
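
As a rough illustration of Python-style Indent/Dedent insertion, here is a minimal sketch. It is not the code in `tokenizer.py`: the function name `add_indent_tokens` and the `(kind, text)` tuples are hypothetical stand-ins for the real `parse_types.Symbol` objects, and it treats each non-blank line as a single token.

```python
def add_indent_tokens(lines):
    """Returns (kind, text) pairs with Indent/Dedent markers inserted.

    Hypothetical sketch: each non-blank line becomes one "Line" token, and
    only leading spaces count as indentation.
    """
    stack = [0]  # Indentation widths of the currently open blocks.
    tokens = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue  # Blank lines never open or close a block.
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # Deeper than the enclosing block: open a new block.
            stack.append(width)
            tokens.append(("Indent", ""))
        while width < stack[-1]:
            # Shallower: close blocks until a matching width is found.
            stack.pop()
            tokens.append(("Dedent", ""))
        if width != stack[-1]:
            raise ValueError("inconsistent dedent: %r" % line)
        tokens.append(("Line", stripped))
    # Close any blocks still open at end of input.
    while len(stack) > 1:
        stack.pop()
        tokens.append(("Dedent", ""))
    return tokens
```

The stack of indentation widths is what lets a single dedent close several nested blocks at once, and what makes a dedent to a width that was never opened detectable as an error.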
| |
| #### Syntax Tree Generation |
| |
| *Implemented in [lr1.py][lr1_py] and [parser_generator.py][parser_generator_py], with a façade in [structure_parser.py][structure_parser_py]* |
| |
| [lr1_py]: front_end/lr1.py |
| [parser_generator_py]: front_end/parser_generator.py |
| [structure_parser_py]: front_end/structure_parser.py |
| |
| Emboss uses a fairly standard shift-reduce LR(1) parser, implemented in three |
| parts: |
| |
| * A generic parser generator implementing the table generation algorithms from |
| *[Compilers: Principles, Techniques, & Tools][dragon_book]* and the |
| error-marking algorithm from *[Generating LR Syntax Error Messages from |
| Examples][jeffery_2003]*. |
| * An Emboss-specific parser builder which glues the Emboss tokenizer, grammar, |
| and error examples to the parser generator, producing an Emboss parser. |
| * The Emboss grammar, which is extracted from the file normalizer |
| (*[module_ir.py][module_ir_py]*). |
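
To make the shift-reduce idea concrete, here is a minimal sketch of a table-driven driver loop. Everything in it is an assumption made for illustration: the toy grammar (`S -> ( S ) | x`), the hand-written `ACTION` and `GOTO` tables, and the `parse` function are not taken from `lr1.py`, which generates its tables from the Emboss grammar instead.

```python
# Toy grammar (hypothetical):  rule 1: S -> ( S )    rule 2: S -> x
RULES = {1: ("S", 3), 2: ("S", 1)}  # rule number -> (left side, right side length)

# Hand-written parse tables for the toy grammar; "$" marks end of input.
ACTION = {
    (0, "("): ("shift", 2), (0, "x"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "("): ("shift", 2), (2, "x"): ("shift", 3),
    (3, ")"): ("reduce", 2), (3, "$"): ("reduce", 2),
    (4, ")"): ("shift", 5),
    (5, ")"): ("reduce", 1), (5, "$"): ("reduce", 1),
}
GOTO = {(0, "S"): 1, (2, "S"): 4}


def parse(tokens):
    """Parses a token sequence and returns a (symbol, children) tree."""
    tokens = list(tokens) + ["$"]
    states = [0]  # Stack of parser states.
    trees = []    # Stack of tokens and subtrees, parallel to states[1:].
    pos = 0
    while True:
        action = ACTION.get((states[-1], tokens[pos]))
        if action is None:
            raise SyntaxError("unexpected %r at position %d" % (tokens[pos], pos))
        kind, arg = action
        if kind == "shift":
            # Consume one token and move to the state named by the table.
            states.append(arg)
            trees.append(tokens[pos])
            pos += 1
        elif kind == "reduce":
            # Replace the right side of the rule with a single subtree.
            lhs, length = RULES[arg]
            children = trees[-length:]
            del trees[-length:]
            del states[-length:]
            trees.append((lhs, children))
            states.append(GOTO[(states[-1], lhs)])
        else:  # accept
            return trees[0]
```

For example, `parse("(x)")` returns `("S", ["(", ("S", ["x"]), ")"])`. A missing entry in `ACTION` is where a real parser would report a syntax error; the error-marking algorithm cited above is concerned with attaching useful messages to such states.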
| |
| [dragon_book]: http://www.amazon.com/Compilers-Principles-Techniques-Tools-2nd/dp/0321486811 |
| [jeffery_2003]: http://dl.acm.org/citation.cfm?id=937566 |
| |
| #### Normalization |
| |
| *Implemented in [module_ir.py][module_ir_py]* |
| |
| [module_ir_py]: front_end/module_ir.py |
| |
| Once a parse tree has been generated, it is fed into a normalizer which |
| recursively turns the raw syntax tree into a "first stage" intermediate |
| representation (IR). The first stage IR serves to isolate later stages from |
| minor changes in the grammar, but only contains information from a single file, |
| and does not perform any semantic checking. |
| |
| ### Import Resolution |
| |
| *TODO(bolms): Implement imports.* |
| |
| After each file is parsed, any new imports it has are added to a work queue. |
| Each file in the work queue is parsed, potentially adding more imports to the |
| queue, until the queue is empty. |
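
The work-queue loop described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `resolve_imports`, the `parse_module` callback, and the `"imports"` key are hypothetical names, not part of the Emboss front end.

```python
import collections


def resolve_imports(root_file, parse_module):
    """Parses root_file and everything it transitively imports.

    parse_module is a hypothetical callback that takes a file name and
    returns a dict with an "imports" list of file names.
    """
    modules = {}
    queue = collections.deque([root_file])
    while queue:
        file_name = queue.popleft()
        if file_name in modules:
            continue  # Already parsed via another import path.
        module = parse_module(file_name)
        modules[file_name] = module
        # Only imports that have not been parsed yet are new work.
        queue.extend(name for name in module["imports"] if name not in modules)
    return modules
```

Tracking already-parsed files means each file is parsed once, and the loop terminates even if two files import each other.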
| |
| ### Symbol Resolution |
| |
| *Implemented in [symbol_resolver.py][symbol_resolver_py]* |
| |
| [symbol_resolver_py]: front_end/symbol_resolver.py |
| |
| Symbol resolution is the process of correlating names in the IR. At the end of |
| symbol resolution, every named entity (type definition, field definition, enum |
| name, etc.) has a `CanonicalName`, and every reference in the IR has a |
| `Reference` to the entity to which it refers. |
| |
| This assignment occurs in two passes. First, the full IR is scanned, generating |
| scoped symbol tables (nested dictionaries mapping names to `CanonicalName`s), and |
| assigning identities to each `Name` in the IR. Then the IR is fully scanned a |
| second time, and each `Reference` in the IR is resolved: all scopes visible to |
| the reference are scanned for the name, and the corresponding `CanonicalName` is |
| assigned to the reference. |
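
The two passes can be sketched as follows. This is a simplified model, not the code in `symbol_resolver.py`: entities are plain dicts, canonical names are tuples of path components, and the function names and dict keys are all hypothetical.

```python
def assign_names(entity, parent_path, scope):
    """Pass 1: assigns a canonical name and records it in the enclosing scope."""
    canonical_name = parent_path + (entity["name"],)
    entity["canonical_name"] = canonical_name
    # Each entry carries its own nested scope for the entity's children.
    scope[entity["name"]] = {"canonical_name": canonical_name, "scope": {}}
    for child in entity.get("children", []):
        assign_names(child, canonical_name, scope[entity["name"]]["scope"])


def resolve_references(entity, visible_scopes, scope):
    """Pass 2: resolves each reference by searching scopes, innermost first."""
    scopes = [scope[entity["name"]]["scope"]] + visible_scopes
    entity["resolved"] = {}
    for reference in entity.get("references", []):
        for candidate in scopes:
            if reference in candidate:
                entity["resolved"][reference] = candidate[reference]["canonical_name"]
                break
        else:
            raise KeyError("unresolved reference: %s" % reference)
    for child in entity.get("children", []):
        resolve_references(child, scopes, scopes[0])


# Usage: "data" refers to a sibling field and to a type defined later.
module = {"name": "m", "children": [
    {"name": "Foo", "children": [
        {"name": "length"},
        {"name": "data", "references": ["length", "Bar"]},
    ]},
    {"name": "Bar"},
]}
root_scope = {}
assign_names(module, (), root_scope)
resolve_references(module, [root_scope], root_scope)
```

Because the symbol tables are complete before any reference is looked up, the reference from `data` to `Bar` resolves even though `Bar` is defined after `Foo`; a single pass would have to handle such forward references specially.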
| |
| ### Validation |
| |
| *TODO(bolms): other validations?* |
| |
| #### Size Checking |
| |
| *TODO(bolms): describe* |
| |
| #### Overlap Checking |
| |
| *TODO(bolms): describe* |
| |
| ## Back End |
| |
| *Implemented in [back_end/...][back_end]* |
| |
| [back_end]: back_end/ |
| |
| Currently, only a C++ back end is implemented. |
| |
| A back end takes Emboss IR and produces code in a specific language for |
| manipulating the Emboss-defined data structures. |
| |
| ### C++ |
| |
| *Implemented in [header_generator.py][header_generator_py] with templates in |
| [generated_code_templates][generated_code_templates], support code in |
| [emboss_cpp_util.h][emboss_cpp_util_h], and a driver program in |
| [emboss_codegen_cpp.py][emboss_codegen_cpp_py]* |
| |
| [header_generator_py]: back_end/cpp/header_generator.py |
| [generated_code_templates]: back_end/cpp/generated_code_templates |
| [emboss_cpp_util_h]: back_end/cpp/emboss_cpp_util.h |
| [emboss_codegen_cpp_py]: back_end/cpp/emboss_codegen_cpp.py |
| |
| The C++ code generator is currently very minimal. `header_generator.py` |
| essentially inserts values from the IR into text templates. |
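
As an illustration of inserting IR values into text templates, here is a minimal sketch using Python's standard `string.Template`. The templates, the `generate_struct` function, and the dict-shaped IR are hypothetical; the real templates live in `generated_code_templates` and the real IR is richer than this.

```python
import string

# Hypothetical templates; the generated C++ here is illustrative only.
_STRUCT_TEMPLATE = string.Template(
    "class ${name}View {\n"
    " public:\n"
    "${accessors}"
    "};\n")
_ACCESSOR_TEMPLATE = string.Template("  ${type} ${field}() const;\n")


def generate_struct(struct_ir):
    """Renders one struct from a simplified, dict-shaped IR."""
    accessors = "".join(
        _ACCESSOR_TEMPLATE.substitute(type=field["type"], field=field["name"])
        for field in struct_ir["fields"])
    return _STRUCT_TEMPLATE.substitute(
        name=struct_ir["name"], accessors=accessors)
```

Keeping the output text in templates, rather than building it up inside the generator logic, keeps the shape of the emitted code visible in one place.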
| |
| *TODO(bolms): add more documentation once the C++ back end has more features.* |