# FIDL Compiler Frontend

This document serves as an overview of the internal workings of the FIDL compiler `fidlc`.
`fidlc` is a command line tool that takes in a number of FIDL files and outputs
[FIDL JSON IR][schema].

## Overview

The main inputs to the compiler are the arguments to the `--files` flags, which describe a list of
files grouped by library. The compiler parses each file individually to obtain a parse tree for
each file:

1. Each file is loaded into a [`SourceFile`](#sourcefile), which owns the buffer backing the file.
2. The compiler initializes a [`Lexer`](#lexing), which takes the `SourceFile` as an argument to
   its constructor. This class exposes a `Lex()` method, which returns the next [`Token`](#token)
   in the file; it can be called repeatedly to get the sequence of `Token`s in the file.
3. The compiler uses the `Lexer` to initialize a [`Parser`](#parsing) and then calls the `Parse()`
   method, which constructs a parse tree. The code refers to parse trees as raw ASTs. This function
   returns a `raw::File`, which is the class representing the root node of the raw AST.
4. Once the compiler has parsed each file into a parse tree, it then groups the parse trees into a
   single AST (referred to in the code as a flat AST) for each library. The root of this flat AST
   is a `flat::Library`.
    * For each library, the compiler traverses every `raw::File` parse tree corresponding to that
      library, converting `raw::` nodes to their `flat::` counterparts. For example, a
      `raw::StructDeclaration` becomes a `flat::Struct`, and a `raw::Constant` becomes a
      `flat::Constant`. The converted `flat::` AST nodes are then stored under a single
      `flat::Library`. Initially these flat AST nodes contain the same information as the raw AST
      nodes, but they also contain fields, such as a `value` field for values and a `typeshape`
      field for types, that are set later during the compilation step.
5. Once the AST has been fully initialized, the compiler evaluates constants and determines the
   memory alignment and size information for the declared types.

Note: One of the key distinctions between raw and flat ASTs is that each raw tree represents a
single file, whereas each flat tree represents a single FIDL library that may contain multiple
files.

Finally, we end up with a flat AST that is processed and ready for backend generation, either to C
bindings or to a JSON IR.
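
In code, the per-library flow described above looks roughly like the following sketch. It is
illustrative only: `files_for_this_library`, `ReadWholeFile`, and `error_reporter` are placeholder
names, and the real constructor signatures and driver logic in `fidlc` differ in their details
(error reporting, library dependencies, attribute handling, and so on).

```c++
// Rough, non-compilable sketch of the front-end flow for one library.
fidl::flat::Library library(/* ... real constructor takes more context ... */);
for (const auto& filename : files_for_this_library) {
  std::string contents = ReadWholeFile(filename);            // hypothetical helper
  fidl::SourceFile source_file(filename, std::move(contents));
  fidl::Lexer lexer(source_file, error_reporter);             // Lex() yields Tokens
  fidl::Parser parser(&lexer, error_reporter);
  std::unique_ptr<fidl::raw::File> raw_ast = parser.Parse();  // one raw AST per file
  library.ConsumeFile(std::move(raw_ast));                    // merge into the flat AST
}
library.Compile();  // topological sort + resolution over the whole library
```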

## Lexing {#lexing}

The `Lexer` works primarily by keeping track of two `char*` pointers into the file data. The
`current_` pointer marks the current location that the class is at. The `token_start_` pointer
marks the start of the lexeme that is currently being worked on. Each time the `Lex()` method is
called, the `current_` pointer is advanced until a complete lexeme has been traversed. Then, a
`Token` is constructed using the data in between the two pointers.

The lexer also keeps track of a third `previous_end_` pointer so it can get the data between
lexemes (generally whitespace) when it constructs the `Token`. This example shows how the
pointers change during a call to `Lex()` on the short FIDL snippet `const bool flag = true;`:

Initial state after lexing the `const` keyword:

```
const bool flag = true;
     ^current_
     ^token_start_
     ^previous_end_
```

Whitespace is skipped until the start of the next lexeme:

```
const bool flag = true;
      ^current_
      ^token_start_
     ^previous_end_
```

The `current_` pointer is advanced until the end of the lexeme:

```
const bool flag = true;
          ^current_
      ^token_start_
     ^previous_end_
```

At this point, the next `Token` to be returned is ready to be constructed. A `Token` is created
with its `previous_end_` argument set to the data between `previous_end_` and `token_start_`, and
its `location_` set to the data between `token_start_` and `current_`. The kind is set to
`Identifier`. Before returning, the pointers are reset and end up in a state similar to the initial
state. This process can then be repeated for the next token:

```
const bool flag = true;
          ^current_
          ^token_start_
          ^previous_end_
```

Internally, the `current_` and `token_start_` pointers are manipulated with these main methods:

* `Skip()`. Skips over any unnecessary characters (e.g. whitespace) by
  moving both the `current_` and the `token_start_` pointers forward.
* `Consume()`. Returns the current character and advances `current_`.
* `Reset()`. Returns the data between `token_start_` and `current_`, then sets `token_start_`
  to the value of `current_`.
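
As a rough, self-contained illustration of how these pieces fit together, here is a toy
whitespace-delimited `Lex()` built out of `Skip()`, `Consume()`, and `Reset()`. This is a
simplified model, not the real `Lexer`, which also classifies each lexeme into a kind/subkind and
reports errors:

```c++
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string_view>

// Toy model of the three-pointer scheme described above.
class ToyLexer {
 public:
  explicit ToyLexer(std::string_view data)
      : current_(data.data()),
        token_start_(data.data()),
        previous_end_(data.data()),
        end_(data.data() + data.size()) {}

  // Returns the next whitespace-delimited lexeme, or an empty view at end of input.
  std::string_view Lex() {
    previous_end_ = current_;  // the "gap" before the next lexeme starts here
    Skip();                    // advance current_ and token_start_ past whitespace
    while (current_ < end_ && !std::isspace(static_cast<unsigned char>(*current_))) {
      Consume();               // extend the lexeme one character at a time
    }
    return Reset();            // the data in [token_start_, current_)
  }

 private:
  void Skip() {
    while (current_ < end_ && std::isspace(static_cast<unsigned char>(*current_))) {
      ++current_;
    }
    token_start_ = current_;
  }

  char Consume() { return *current_++; }

  std::string_view Reset() {
    std::string_view data(token_start_, static_cast<size_t>(current_ - token_start_));
    token_start_ = current_;
    return data;
  }

  const char* current_;
  const char* token_start_;
  const char* previous_end_;  // start of the whitespace preceding the current lexeme
  const char* end_;
};

int main() {
  ToyLexer lexer("const bool flag = true;");
  for (std::string_view lexeme = lexer.Lex(); !lexeme.empty(); lexeme = lexer.Lex()) {
    std::cout << lexeme << "\n";  // prints: const, bool, flag, =, true;
  }
}
```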

## Parsing {#parsing}

The `Parser`'s goal is to convert a `Token` stream into a parse tree (the `raw::` AST) with the
`Parse()` method. The `Token` stream is generated by calling `Lex()` repeatedly on a FIDL file. The
parser is implemented with [recursive
descent](https://en.wikipedia.org/wiki/Recursive_descent_parser). Each node of the raw AST has a
corresponding `ParseFoo()` method that consumes `Token`s from the `Lexer` and returns a
`unique_ptr` to an instance of that node. In case of failure, a `nullptr` is returned.

Note: A node of the raw AST is a [start and end token pair](#sourceelement) with pointers to any
children. The start and end tokens of each child are contained within the start and end tokens
of its parent. The root node, which is a `raw::File`, has a start token equal to the
first token of the file and an end token equal to the last token of the file.

The `Parser` keeps track of:

* The current nodes that are being built, using a stack of [`SourceElement`s](#sourceelement)
  stored in `active_ast_scopes_`.
* The current and previous `Token`s being processed, in `last_token_` and `previous_token_`
  respectively.
* Its own state machine inside the `Parser::Lex` method. The current token (`last_token_`) is
  always the next token that is about to be consumed, which effectively gives the parser a one
  token lookahead (i.e. LL(1)).

The `Parser` determines what kind of node the current `Token` belongs to based on the `Token::Kind`
of `last_token_`, using the `Peek()` method. Then, the `Parser` updates its state and constructs the
nodes through the use of the `ASTScope` class as well as the `ConsumeToken` and `MaybeConsumeToken`
helper methods.

This example shows a simple non-recursive case line by line. The parser method looks like this:

```c++
std::unique_ptr<raw::StringLiteral> Parser::ParseStringLiteral() {
  ASTScope scope(this);
  ConsumeToken(OfKind(Token::Kind::kStringLiteral));
  if (!Ok())
    return Fail();

  return std::make_unique<raw::StringLiteral>(scope.GetSourceElement());
}
```

It consumes a single token and returns a leaf node, `raw::StringLiteral`:

```c++
class StringLiteral : public SourceElement {
 public:
  explicit StringLiteral(SourceElement const& element) : SourceElement(element) {}

  virtual ~StringLiteral() {}
};
```

The method starts by creating a new `ASTScope`, which initializes the `SourceElement` that will
later be used to create the returned node by pushing it onto `active_ast_scopes_`. The start of
the `SourceElement` gets set to the first token that is consumed, and the end gets set in the call
to `GetSourceElement()` before returning.

The new `SourceElement` for the `StringLiteral` gets initialized inside the `ASTScope` constructor:

```c++
parser_->active_ast_scopes_.push_back(raw::SourceElement(Token(), Token()));
```

Then, `ConsumeToken` is called to process the next token. This method takes a predicate constructed
with `OfKind()` and calls that predicate on `last_token_`'s kind or subkind. `OfKind()` returns a
function that checks that its input matches the given kind or subkind. If the predicate fails (in
this case, if the current token is not a string literal), the error gets stored in the class and
the method returns. The `Ok()` call catches the error and stops the compiler with a parsing error.
If the predicate succeeds, the start token of any `SourceElement` in the stack that is still
uninitialized is set to the current token:

```c++
for (auto& scope : active_ast_scopes_) {
  if (scope.start_.kind() == Token::Kind::kNotAToken) {
    scope.start_ = token;
  }
}
```

In this example, the start token of the top element of the stack gets set, since that element was
pushed at the start of the method. The `Parser` then advances to the next token by setting
`previous_token_ = last_token_` and `last_token_ = Lex()`.

Finally, the resulting `StringLiteral` node is returned using `scope.GetSourceElement()`. This sets
the end token of the `SourceElement` at the top of the stack to `previous_token_`, and then
returns the `SourceElement`. The final node has the same start and end token because a
`StringLiteral` is only ever a single token long, but other methods may consume many tokens
before calling `GetSourceElement()`. When the method returns, the destructor of `ASTScope` pops the
top element off of `active_ast_scopes_`.
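
Methods that must choose between several productions use the one-token lookahead by switching on
`Peek()`. The following is a hypothetical sketch written in the style of the `ParseFoo()` methods
described above; the exact node types, token kinds, and helper names in `fidlc` may differ:

```c++
std::unique_ptr<raw::Literal> Parser::ParseLiteral() {
  // Dispatch on the kind of last_token_ without consuming it (illustrative only).
  switch (Peek().kind()) {
    case Token::Kind::kStringLiteral:
      return ParseStringLiteral();   // e.g. "hello"
    case Token::Kind::kNumericLiteral:
      return ParseNumericLiteral();  // e.g. 42
    case Token::Kind::kIdentifier:
      // true/false are lexed as identifiers with a keyword subkind.
      return ParseBoolLiteral();
    default:
      return Fail();                 // unexpected token kind
  }
}
```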

## Compilation

At this point, the compiler has successfully constructed a `raw::File` for each input file and
proceeds to compile these raw ASTs into flat ASTs in three steps:

1. **Consumption**: the raw ASTs (whose representation tries to match the FIDL grammar) are
   grouped by FIDL library and de-sugared into flat ASTs (whose representation tries to match the
   FIDL language semantics). This step converts one `raw::File` per file into one `flat::Library`
   per library.
2. **Topological Sorting**: the declarations are sorted to determine the order in which to resolve
   the flat AST nodes (see the next step).
3. **Resolution**: name resolution, type resolution, and type shape and size calculations are
   performed. The resolution process is done node by node, and the resulting information is
   stored in the `flat::` node itself.

Note: Within the code, the term "compilation" may refer to these three steps as a whole (for
example, all three steps are performed within the `Library::Compile` method), or it may refer
specifically to the "resolution" step (for example, the logic for performing resolution is
contained in the `Library::CompileDecl` method). In general, this document uses "resolution" when
referring to the third step, to disambiguate between the two possible meanings.

### Consumption

After each file is parsed into a `raw::File` and an empty AST (`flat::Library`) has been
initialized for each library, the AST needs to be updated with all of the data from the
`raw::File`s that correspond to it. This is done recursively with the `ConsumeFoo()` methods. Each
`ConsumeFoo()` method generally takes the corresponding raw AST node as input, updates the
`Library` class, and then returns a `bool` to indicate success or failure. These methods are
responsible for:

* Validating the placement of attributes; for example, checking that the `Transport` attribute is
  only specified for protocols.
* Checking for any undefined library dependencies (e.g. using `textures.Foo` will error if the
  `textures` library was not imported).
* Converting the raw AST nodes into their flat AST node equivalents, and storing them in the
  `Library`'s `foo_declarations_` attribute. Initially the values of the flat AST nodes are unset;
  they are calculated later during compilation.
* Registering each [declaration](#decl) by adding it to the `declarations_` vector. Const
  declarations and enum/bits members (which declare a value) are also added to the `constants_`
  vector, whereas all other declarations (which declare a type) get their corresponding [type
  template](#typetemplate) added to the library's [typespace](#typespace).

### Topological Sorting

Once all of the declarations for a given `Library` have been added to the `declarations_` vector,
the compiler can proceed to resolve each individual declaration. However, it must do this in the
correct order (so that any dependencies of a declaration are resolved before it); this is
accomplished by first sorting the declarations into a separate `declarations_order_` vector, and
then iterating through it to compile each declaration. The sorting is done in the
`SortDeclarations()` method, and makes use of `DeclDependencies()` to determine the dependencies
for a given declaration.
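
The idea behind the sort is ordinary dependency-ordered (topological) sorting. The sketch below is
illustrative only: it works on a simple name-to-dependencies map rather than on `flat::Decl`s, and
it omits the cycle detection that the real compiler needs in order to report errors for circular
dependencies.

```c++
#include <functional>
#include <map>
#include <set>
#include <string>
#include <vector>

// Emit each declaration only after everything it depends on has been emitted.
std::vector<std::string> SortDeclarations(
    const std::map<std::string, std::set<std::string>>& dependencies) {
  std::vector<std::string> order;
  std::set<std::string> visited;
  std::function<void(const std::string&)> visit = [&](const std::string& name) {
    if (!visited.insert(name).second)
      return;  // already handled
    auto it = dependencies.find(name);
    if (it != dependencies.end()) {
      for (const auto& dep : it->second)
        visit(dep);  // dependencies first
    }
    order.push_back(name);
  };
  for (const auto& entry : dependencies)
    visit(entry.first);
  return order;
}
```

For example, if a struct `Outer` has a member of type `Inner`, then `Inner` is emitted before
`Outer`, so its typeshape is already available when `Outer` is resolved.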

### Resolution

Given the sorted declarations, the compilation happens through the `CompileFoo` methods, which
generally correspond to the AST nodes (e.g. `CompileStruct`, `CompileConst`), with `CompileDecl` as
the entrypoint. The main purposes of `CompileDecl` are for:

* [`TypeDecl`s](#decl) to determine their [`TypeShape`](#typeshape),
* [`Const`s](#decl) to resolve their value, and
* `TypeConstructor`s to set their [`Type`](#type).

Once this step is complete, the `flat::Library` contains all the necessary information for any code
generation. The FIDL compiler can directly generate C bindings, or it can generate a JSON IR that
can be consumed by a separate backend.
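
As a concrete illustration of the kind of calculation that happens during resolution, the sketch
below computes the size, alignment, and member offsets of a struct by laying members out in
declaration order at their natural alignment. The names (`TypeShapeModel`, `FieldShapeModel`,
`ShapeStruct`) are assumptions for the example, not the actual `Shape()` implementation; the real
[`TypeShape`](#typeshape) also tracks depth, maximum handle count, and maximum out-of-line size.

```c++
#include <cstdint>
#include <vector>

struct TypeShapeModel {
  uint32_t size;
  uint32_t alignment;  // assumed to be a power of two
};

struct FieldShapeModel {
  uint32_t offset;
};

// Lay out members in declaration order at their natural alignment, then pad
// the total size up to the struct's own alignment (the largest member alignment).
// Member offsets are recorded as the members' field shapes.
TypeShapeModel ShapeStruct(const std::vector<TypeShapeModel>& members,
                           std::vector<FieldShapeModel>* field_shapes) {
  uint32_t offset = 0;
  uint32_t alignment = 1;
  for (const auto& member : members) {
    offset = (offset + member.alignment - 1) & ~(member.alignment - 1);
    field_shapes->push_back({offset});
    offset += member.size;
    if (member.alignment > alignment)
      alignment = member.alignment;
  }
  uint32_t size = (offset + alignment - 1) & ~(alignment - 1);
  return {size, alignment};
}
```

Under this simplified model, a struct with an `int8` member followed by an `int32` member would get
offsets 0 and 4, a size of 8, and an alignment of 4.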

### Additional Checks

Before marking compilation as successful, the FIDL compiler also performs a few additional checks:
for example, it checks that any constraints on attributes are satisfied, and that all declared
library dependencies were actually used.

## Backend generation

The FIDL compiler emits a [JSON intermediate representation (IR)][schema]. The JSON IR is
consumed by a separate program, named the **back-end**, that generates the language bindings from
the JSON IR.

The officially supported FIDL language back-ends are:

* C++, Rust, and Go: [fidlgen][fidlgen]
* Dart: [fidlgen_dart][fidlgen-dart]

## C Bindings

For historical reasons, C bindings are generated directly by the FIDL compiler. The C bindings do
not support all features, and types that are used with the C bindings need to be annotated with
the [`Layout = "Simple"`][layout-attr] attribute.

## C Family Runtime

`fidlc` is also responsible for generating "coding tables", which are instances of `fidl_type_t`
used to represent FIDL messages at runtime; they are used by all of the bindings for the C family
of languages (C, LLCPP, HLCPP). To do this, the flat AST is converted to an intermediate
representation called the "coded" AST by the `CodedTypesGenerator`, which iterates through the
`flat::Decl`s in a given `flat::Library` and converts each `flat::Decl` node to the corresponding
`coded::Type` node (e.g. a `flat::Struct` becomes a `coded::StructType`).

The coding tables are then generated by the `TablesGenerator`, which generates C code for each
`coded::Type` that constructs the equivalent `fidl_type_t` type. For example, for a
`coded::StructType` called `MyStruct`, the `TablesGenerator` would write out C code that constructs
an equivalent `fidl::FidlCodedStruct`, something like:

```C
const fidl_type_t MyStruct = fidl_type_t(::fidl::FidlCodedStruct(
    my_struct_fields, 1u, 32u, "mylibrary/MyStruct"));
```

Coding tables for FIDL libraries in `//sdk/fidl` are generated into the out directory under
`fidling/gen/sdk/fidl/LIBRARY/LIBRARY.fidl.tables.c`, for example
`out/default/fidling/gen/sdk/fidl/fuchsia.sys/fuchsia.sys.fidl.tables.c`. The [README][coding-readme]
gives additional context. The `fidl_type_t` definitions (such as for `FidlCodedStruct`) can be found
in [internal.h][internal].

## Glossary

### Decl {#decl}

`Decl` is the base of all flat AST nodes, just like `SourceElement` is the base of all parse tree
nodes, and corresponds to all possible declarations that a user can make in a FIDL file. There
are two types of `Decl`s:

* `Const`s, which declare a value and have a `value` attribute that gets resolved during
  compilation, and
* `TypeDecl`s, which declare a message type or interface and have a `typeshape` attribute that
  gets set during compilation.

`TypeDecl`s representing an aggregate type (e.g. structs, tables, and unions) also
have a static `Shape()` method, which contains the logic for determining the `TypeShape`
of that given type.

### FieldShape {#fieldshape}

A `FieldShape` describes the offset and padding information for members of an aggregate type like a
struct or union. Generally these members require both a `TypeShape` and a `FieldShape`.

### Name {#name}

A `Name` represents a scoped variable name, and consists of the library the name belongs to (or
`nullptr` for global names) and the variable name itself, either as a string (for anonymous names)
or as a `SourceLocation`. `Name`s can also optionally include a `member_name_` field when referring
to the member of a declared type (for example `MyEnum.MyField`).

### SourceElement {#sourceelement}

A `SourceElement` represents a block of code inside a FIDL file and is parameterized by a `start_`
and `end_` `Token`. All parse tree ("raw" AST) nodes inherit from this class.

### SourceFile {#sourcefile}

Wrapper around a file, which is responsible for owning the data in that file. Also see [Virtual
SourceFile](#virtualsourcefile).

### SourceLocation {#sourcelocation}

Wrapper around a `StringView` and the `SourceFile` it comes from. It provides methods to get the
surrounding line of the `StringView` as well as its location in the form of a
`"[filename]:[line]:[col]"` string.

### SourceManager {#sourcemanager}

Wrapper around a set of `SourceFile`s that all relate to a single `Library`.

### Token {#token}

A token is essentially a lexeme (in the form of a [`SourceLocation`](#sourcelocation) stored as the
`location_` attribute), enhanced with two other pieces of information that are useful to the parser
during the later stages of compilation:

* `previous_end_`. A `SourceLocation` that starts at the end of the previous token and ends at the
  start of this token. It contains data that is uninteresting to the parser, such as whitespace.
* A kind and subkind that, together, classify the lexeme. The possible kinds are:
    * Special characters, such as `Kind::LeftParen`, `Kind::Dot`, `Kind::Comma`, etc.
    * String and numeric constants.
    * Identifiers. Tokens that are keywords (e.g. `const`, `struct`) are considered identifiers,
      but also have a subkind defined to identify which keyword it is (e.g. `Subkind::Const`,
      `Subkind::Struct`). All other tokens have a subkind of `None`.
    * Uninitialized tokens have a kind of `kNotAToken`.

### Type {#type}

A struct representing an instance of a type. For example, the `vector<int32>:10?` type corresponds
to an instance of `VectorType` with `max_size_ = 10` and `maybe_arg_type` set to the
`Type` corresponding to `int32`. Built-in types all have a static `Shape()` method that
returns the `TypeShape` given the parameters for an instance of that type. User-defined types (e.g.
structs or unions) all have a type of `IdentifierType`; for these, the corresponding
[`TypeDecl`](#typedecl), such as `Struct`, provides the static `Shape()` method instead.

### TypeDecl {#typedecl}

See [`Decl`](#decl).

### TypeShape {#typeshape}

Information about how objects of a type should be laid out in memory, including their size,
alignment, depth, etc.

### Typespace {#typespace}

The typespace is a map from [`Type`](#type) names to a `TypeTemplate` for that `Type`. During
compilation, the typespace is initialized to include all of the built-in types (e.g. `"vector"`
maps to `VectorTypeTemplate`), whereas user-defined types get added during the compilation process.
This also ensures that a single type template instance exists per type, and avoids name
collisions/shadowing of types (e.g. fxbug.dev/7646).

### TypeTemplate {#typetemplate}

Instances of `TypeTemplate` provide a `Create()` method to create a new instance of a specific
`Type`; therefore there is a `TypeTemplate` subclass for each built-in FIDL type (e.g.
`ArrayTypeTemplate`, `PrimitiveTypeTemplate`, etc.) as well as a single class for all user-defined
types (`TypeDeclTypeTemplate`), and one for type aliases (`TypeAliasTypeTemplate`). `Create()`
takes as parameters the possible type parameters: argument type, nullability, and size.

For example, to create an object representing the type `vector<int32>:10?`, the compiler would call
the `Create()` method of the `VectorTypeTemplate` with an argument type of `int32`, a max size of
`10`, and a nullability of `types::Nullability::kNullable`. This call returns an instance of a
`VectorType` with those parameters. Note that not all three of these parameters apply to all of the
types (e.g. `PrimitiveType`s like `int32` have none of the three). The `Create()` method of the
type template for each type automatically checks that only the relevant parameters are passed.

The concrete type for user-defined types is the `IdentifierType`, which gets generated by the
`TypeDeclTypeTemplate`.
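
As a rough self-contained model of the typespace-and-template relationship, the sketch below uses
toy stand-ins (`ToyType`, `ToyTypeTemplate`, `ToyVectorTemplate`) rather than the real `fidlc`
classes; the real `Create()` also reports errors through the library and rejects parameters that do
not apply to the template.

```c++
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>

// Toy stand-in for a resolved Type: just a printable description.
struct ToyType {
  std::string description;
};

// Each template knows how to build a Type from the optional parameters
// (argument type, size, nullability) that make sense for it.
class ToyTypeTemplate {
 public:
  virtual ~ToyTypeTemplate() = default;
  virtual ToyType Create(std::optional<std::string> arg_type,
                         std::optional<uint32_t> size, bool nullable) const = 0;
};

class ToyVectorTemplate : public ToyTypeTemplate {
 public:
  ToyType Create(std::optional<std::string> arg_type, std::optional<uint32_t> size,
                 bool nullable) const override {
    ToyType type{"vector<" + arg_type.value_or("?") + ">"};
    if (size) type.description += ":" + std::to_string(*size);
    if (nullable) type.description += "?";
    return type;
  }
};

int main() {
  // The "typespace": type name -> template for that type.
  std::map<std::string, std::unique_ptr<ToyTypeTemplate>> typespace;
  typespace["vector"] = std::make_unique<ToyVectorTemplate>();

  // Creating vector<int32>:10? goes through the "vector" template.
  ToyType t = typespace["vector"]->Create("int32", 10, /*nullable=*/true);
  // t.description == "vector<int32>:10?"
}
```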

### Virtual SourceFile {#virtualsourcefile}

A subclass of `SourceFile` that has a fake "filename" and is initialized with no backing data. It
exposes an `AddLine()` method to add data to the file, and is used as a backing `SourceFile` for
content that does not appear directly in any of the input `SourceFile`s, such as generated
anonymous [`Name`](#name)s.

<!-- xrefs -->
[internal]: /zircon/system/ulib/fidl/include/lib/fidl/internal.h
[layout-attr]: /docs/reference/fidl/language/attributes.md#layout_layout
[fidlgen]: /garnet/go/src/fidl/compiler/backend
[fidlgen-dart]: /tools/fidl/fidlgen_dart
[schema]: /docs/reference/fidl/language/json-ir.md
[coding-readme]: /src/lib/fidl/c/walker_tests/README.md