blob: 45d7221c7890ef16574930b9b4abc6c77b5a34bb [file] [log] [blame] [view]
# Fidl Compiler Frontend
This document serves as an overview of the internal workings of the FIDL compiler `fidlc`.
`fidlc` is a command line tool that takes in a number of FIDL files, and outputs
[FIDL JSON IR][schema].
## Overview
The main inputs to the compiler are the arguments to the `--files` flags, which describe a list of
files grouped by library. The compiler parses each file individually to obtain a parse tree for
each file:
1. Each file is loaded into a [`SourceFile`](#sourcefile), which owns the buffer backing the file
2. The compiler initializes a [`Lexer`](#lexing), which takes the `SourceFile` as an argument to
its constructor. This class exposes a `Lex()` method, which returns the next [`Token`](#token)
in the file; it can be called repeatedly to get the sequence of `Token`s in the file.
3. The compiler uses the `Lexer` to initialize a [`Parser`](#parsing) and then calls the `Parse()`
method which constructs a parse tree. The code refers to parse trees as a raw AST. This function
returns a `raw::File` which is the class representing the root node of the raw AST.
4. Once the compiler has parsed each file into a parse tree, it then groups the parse trees into a
single AST (referred to in the code as a flat AST) for each library. The root of this flat AST
is a `flat::Library`.
* For each library, the compiler traverses every `raw::File` parse tree corresponding to that
library, converting `raw::` nodes to their `flat::` counterparts. For example, a
`raw::StructDeclaration` becomes a `flat::Struct`, and a `raw::Constant` becomes a
`flat::Constant`. The converted `flat::` AST nodes are then stored under a single
`flat::Library`. Initially these flat AST nodes contain the same information as the raw AST
nodes, but they also contain fields such as a `value` field for values and a `typeshape` field
for types which are later set during the compilation step.
5. Once the AST has been fully initialized, the compiler evaluates constants and determines the
memory alignment and size information for the declared types.
Note: One of the key distinction between raw and flat ASTs is that each raw tree represents a
single file, whereas each flat tree represents a single FIDL library which may contain multiple
files.
Finally, we end up with a flat AST that is processed and ready for backend generation either to C
bindings or to a JSON IR.
## Lexing {#lexing}
The `Lexer` works primarily by keeping track of two `char *` pointer into the file data. The
`current_` pointer marks the current location that the class is at. The`token_start_` pointer marks
the start of the lexeme that is currently being worked on. Each time the `Lex()` method is called,
the `current_` pointer is advanced until a complete lexeme has been traversed. Then, a `Token` is
constructed using the data in between the two pointers.
The lexer also keeps track of a third `previous_end_` pointer so it can get the data between
lexemes (generally whitespace) when it constructs the `Token`. This example shows of how the
pointers change during a call to `Lex()` on the short FIDL snippet `const bool flag = true;`:
Initial state after lexing the `const` keyword:
```
const bool flag = true;
^current_
^token_start_
^previous_end_
```
Whitespaces are skipped until the next lexeme:
```
const bool flag = true;
^current_
^token_start_
^previous_end_
```
The `current_` pointer is updated until the end of the lexeme:
```
const bool flag = true;
^current_
^token_start_
^previous_end_
```
At this point, the next `Token` that gets returned is ready to be constructed. A `Token` is created
with its `previous_end_` argument set to the data between `previous_end_` and `token_start_`. The
`location_` is set to the data between `token_start_` and `current_`. The kind is set to
`Identifier`. Before returning, the pointers are reset and end up in a state similar to the initial
state. This process can then be repeated for the next token:
```
const bool flag = true;
^current_
^token_start_
^previous_end_
```
Internally the two pointers are manipulated with these main methods:
* `Skip()`. Skips over any unnecessary characters (e.g. whitespace) by
moving both the `current_` and the `token_start_` pointers forward.
* `Consume()`. Returns the current character and advances `current_`.
* `Reset()`. Returns the data between `token_start_` and `current_`. Then, sets `token_start_`
to the value of `current_`.
## Parsing {#parsing}
The `Parser`'s goal is to convert a `Token` stream into a parse tree (the `raw::` AST) with the
`Parse()` method. The `Token` stream is generated by calling `Lex` repeatedly on a FIDL file. This
parser is implemented with [recursive
descent](https://en.wikipedia.org/wiki/Recursive_descent_parser). Each node of the raw AST has a
corresponding `ParseFoo()` method that consumes `Token`s from the `Lexer` and returns a
`unique_ptr` to an instance of that node. In case there is a failure, a `nullptr` is returned.
Note: A node of the raw AST is a [start and end token pair](#sourceelement) with pointers to any
children. The start and end tokens of each child are contained within the start and end tokens
of its parent. The root node, which is a `raw::File`, has a start token equal to the
first token of the file and an end token equal to the last token of the file.
The `Parser` keeps track of the:
* Current nodes that are being built using a stack of [`SourceElements`](#sourceelement), which are
stored in `active_ast_scopes_`.
* Current and previous `Token`s that are being processed. `last_token_` and `previous_token_`,
respectively.
* Its own state machine inside of the `Parser::Lex` method. The current token (`last_token_`) is
always the next token that is about to be consumed, which effectively gives the parser a one
token lookahead (i.e. LL(1)).
The `Parser` determines what kind of node the current `Token` belongs to based on the `Token::Kind`
of `last_token_`, using the `Peek()` method. Then, the `Parser` updates its state and constructs the
nodes through the use of the `ASTScope` class as well as the `ConsumeToken` and `MaybeConsumeToken`
helper methods.
This example shows a simple non-recursive case line by line. The parser method looks like this:
```c++
std::unique_ptr<raw::StringLiteral> Parser::ParseStringLiteral() {
ASTScope scope(this);
ConsumeToken(OfKind(Token::Kind::kStringLiteral));
if (!Ok())
return Fail();
return std::make_unique<raw::StringLiteral>(scope.GetSourceElement());
}
```
It consumes a single token and returns a leaf node, `raw::StringLiteral`:
```c++
class StringLiteral : public SourceElement {
public:
explicit StringLiteral(SourceElement const& element) : SourceElement(element) {}
virtual ~StringLiteral() {}
}
```
The method starts by creating a new `ASTScope` which initializes the `SourceElement` that will
later be used to create the returned node by pushing it onto `active_ast_scopes_`. The start of
the `SourceElement` gets set to the first token that is consumed and the end gets set in the call
to `GetSourceElement()` before returning.
The new `SourceElement` for the `StringLiteral` gets initialized inside the `ASTScope` constructor:
```c++
parser_->active_ast_scopes_.push_back(raw::SourceElement(Token(), Token()));
```
Then, `ConsumeToken` is called to process the next token. This method takes a predicate constructed
with `OfKind()` and calls that predicate on `last_token_`'s kind or subkind. `OfKind()` returns a
function that checks that its input matches the given kind or subkind. If the predicate fails (in
this case, if the current token is not a string literal), the error gets stored in the class and
the method returns. The `Ok()` call catches the error and stops the compiler with a parsing error.
If the predicate succeeds, the start token of any `SourceElement`s in the stack that are
uninitialized are set to the current token:
```c++
for (auto& scope : active_ast_scopes_) {
if (scope.start_.kind() == Token::Kind::kNotAToken) {
scope.start_ = token;
}
}
```
In this example, the start token is set to the top element of the stack since it was initialized at
the start of the method. The `Parser` then advances to the next token by setting `previous_token_ =
last_token_` and `last_token_ = Lex()`.
Finally, the resulting `StringLiteral` node is returned using `scope.GetSourceElement()`. This sets
the end token of the `SourceElement` at the top of the stack to the `previous_token_`, and then
returns the `SourceElement`. The final node has the same start and end token because
`StringLiteral`s are only ever a single token long, but other methods may consume many tokens
before calling `GetSourceElement()`. When the method returns, the destructor of `ASTScope` pops the
top element off of `active_ast_scopes_`.
## Compilation
At this point, the compiler has successfully constructed a `raw::File` for each input file and
proceeds to compile these raw ASTs into flat ASTs in three steps:
1. **Consumption**: where the raw ASTs (whose representation tries to match the FIDL grammar) are
grouped by FIDL library and de-sugared into flat ASTs (whose representation tries to match the
FIDL language semantics). This step converts one `raw::File` per file into one `flat::Library`
per library.
2. **Topological Sorting**: to determine the order in which to resolve the flat AST nodes (see next
step).
3. **Resolution**: which performs name resolution, type resolution, and type shape and size
calculations. The resolution process is done node by node, and the resulting information is
stored into the `flat::` node itself.
Note: Within the code, the term "compilation" may refer to these three steps as a whole (for
example, all three steps are performed within the `Library::Compile` method), or it could refer
specifically to the "resolution" step (for example, the logic for performing resolution is
contained in the `Library::CompileDecl` method ). In general this document tries to use
"resolution" when referring to the third step to disambiguate between the two possible meanings.
### Consumption
After each file is parsed into a `raw::File` and an empty AST (`flat::Library`) has been
initialized for each library, the AST needs to be updated with all of data from the `raw::File`s
that correspond to it. This is done recursively with the `ConsumeFoo()` methods. Each
`ConsumeFoo()` method generally takes the corresponding raw AST node as input, updates the
`Library` class, and then return a `bool` to indicate success or failure. These methods are
responsible for:
* Validating the placement of attributes; for example checking that the `Transport` attribute is
only specified for protocols.
* Checking for any undefined library dependencies (e.g. using `textures.Foo` will error if the
`textures` library was not imported)
* Converting the raw AST nodes into their flat AST node equivalents, and storing them in the
`Library`'s `foo_declarations_` attribute. Initially the values of the flat AST nodes are unset
but they are calculated later during compilation.
* Registering each [declaration](#decl) by adding them to the `declarations_` vector. Const
declarations and enum/bit fields (which declare a value) are also added to the `constants_`
vector, whereas all other declarations (which declare a type) get their corresponding [type
template](#typetemplate) added to the library's [typespace](#typespace).
### Topological Sorting
Once all of the declarations for a given `Library` have been added to the `declarations_` vector,
the compiler can proceed to resolve each individual declaration. However, it must do this in the
correct order (so that any dependencies of a declaration are resolved before it); this is
accomplished by first sorting the declarations into a separate `declarations_order_` vector, and
then iterating through it to compile each declaration. The sorting is done in the
`SortDeclarations()` method, and makes use of `DeclDependencies()` to determine the dependencies
for a given declaration.
### Resolution
Given the sorted declarations, the compilation happens through the `CompileFoo` methods, generally
corresponding to the AST nodes (e.g. `CompileStruct`, `CompileConst`), with `CompileDecl` as the
entrypoint. The main purpose of `CompileDecl` is for:
* [`TypeDecl`s](#decl) to determine their [`TypeShape`](#typeshape)
* [`Const`s](#decl) to resolve their value, and
* `TypeConstructor`s to set their [`Type`](#type).
Once this step is complete, the `flat::Library` contains all the necessary information for any code
generation. The FIDL compiler can directly generate C bindings, or can generate a JSON IR that can
be consumed by a separate backend
### Additional Checks
Before marking compilation as successful, the FIDL compiler also does a few additional checks: for
example, it will check that any constraints on attributes are satisfied, and that all declared
library dependencies were actually used.
## Backend generation
The FIDL compiler emits a [JSON intermediate representation (IR)][schema]. The JSON IR is
consumed by a separate program, named the **back-end**, that generates the language bindings from
the JSON IR.
The officially supported FIDL language back-ends are:
* C++, Rust, and Go: [fidlgen][fidlgen]
* Dart: [fidlgen_dart][fidlgen-dart]
## C Bindings
For historical reasons, C bindings are directly generated from the FIDL compiler - the C bindings
do not support all features and types that are used with the C bindings need to be annotated with
the [`Layout = "Simple"`][layout-attr] attribute.
## C Family Runtime
`fidlc` is also responsible for generating "coding tables", which are instances of `fidl_type_t` used
to represent the FIDL messages at runtime and are used by all of the bindings for the C family of
languages (C, LLCPP, HLCPP). To do this, the flat AST is converted to an intermediate
representation called the "coded" AST with the `CodedTypesGenerator`, which iterates through the
`flat::Decl`s in a given `flat::Library` and converts each `raw::Decl` node to the corresponding
`coded::Type` node (e.g. `flat::Struct` becomes a `coded::StructType`).
The coding tables are then generated by the `TablesGenerator`, which will generate C code for each
`coded::Type` that constructs the equivalent `fidl_type_t` type. For example, for a
`coded::StructType` called `MyStruct`, the `TablesGenerator` would write out C code that constructs
an equivalent `fidl::FidlCodedStruct`, something like:
```C
const fidl_type_t MyStruct = fidl_type_t(::fidl::FidlCodedStruct(
my_struct_fields, 1u, 32u, "mylibrary/MyStruct"));
```
Coding tables for FIDL libraries in `//sdk/fidl` are generated into the out directory under
`fidling/gen/sdk/fidl/LIBRARY/LIBRARY.fidl.tables.c`. For example
`out/default/fidling/gen/sdk/fuchsia.sys/fuchsia.sys.fidl.tables.c`. The [README][coding-readme]
gives additional context. The `fidl_type_t` definitions (such as for `FidlCodedStruct`) can be found
in [internal.h][internal].
## Glossary
### Decl {#decl}
The `Decl` is the base of all flat AST nodes, just like `SourceElement` is the base of all parser
tree nodes, and corresponds to all possible declarations that a user can make in a FIDL file. There
are two types of `Decl`s:
* `Const`s, which declare a value, and have a `value` attribute that gets resolved during
compilation, and
* `TypeDecl`s, which declare a message type or interface and have a `typeshape` attribute that
gets set during compilation.
`TypeDecl`s representing an aggregate type (e.g. structs, tables, and unions) also
have a static `Shape()` method which contains the logic for determining the `Typeshape`
of that given type.
### FieldShape {#fieldshape}
A `FieldShape` describes the offset and padding information for members of an aggregate type like a
struct or union. Generally these fields will require both a `Typeshape` as well as a `FieldShape`
### Name {#name}
A `Name` represents a scope variable name, and consists of the library the name belongs to (or
`nullptr` for global names), and the variable name itself as a string (for anonymous names) or a
`SourceLocation`. `Name`s can also optionally include a `member_name_` field when referring to the
field of a declared variable (for example `MyEnum.MyField`).
### SourceElement {#sourceelement}
A `SourceElement` represents a block of code inside a fidl file and is parameterized by a `start_`
and `end_` `Token`. All parser tree ("raw" AST) nodes inherit from this class.
### SourceFile {#sourcefile}
Wrapper around a file which is responsible for owning the data in that file. Also see [Virtual
SourceFile](#virtualsourcefile)
### SourceLocation {#sourcelocation}
Wrapper around a `StringView` and the `SourceFile` it comes from. It provides methods to get the
surrounding line of the `StringView` as well as its location in the form of a
`"[filename]:[line]:[col]"` string
### SourceManager {#sourcemanager}
Wrapper around a set of `SourceFile`s that all relate to a single `Library`.
### Token {#token}
A token is essentially a lexeme (in the form of a [`SourceLocation`](#sourcelocation) stored as the
`location_` attribute), enhanced with two other pieces information that are useful to the parser
during the later stages of compilation:
* `previous_end_`. A `SourceLocation` which starts at the end of the previous token and ends at the
start of this token. It contains data that is uninteresting to the parser such as whitespace.
* A kind and subkind which together classifies the lexeme. The possible kinds are:
* The special characters such as `Kind::LeftParen`, `Kind::Dot`, `Kind::Comma`, etc...
* String and numeric constants
* Identifiers. Tokens that are keywords (e.g. `const`, `struct`) are considered identifiers, but
also have a subkind defined to identify which keyword it is (e.g. `Subkind::Const`,
`Subkind::Struct`). All other tokens have a subkind of `None`.
* Uninitialized tokens have a kind of `kNotAToken`.
### Type {#type}
A struct representing an instance of a type. For example, the `vector<int32>:10?` type corresponds
to an instance of the `VectorType` `TypeDecl` with `max_size_ = 10` and `maybe_arg_type` set to the
`Type` corresponding to `int32`. Built-in types all have a static `Shape()` method which will
return the `Typeshape` given the parameters for an instance of that type. User defined types (e.g.
structs or unions) will all have a type of `IdentifierType` - the corresponding
[`TypeDecl`](#typedecl), like `Struct` provides the static `Shape()` method instead.
### TypeDecl {#typedecl}
See [`Decl`](#decl)
### TypeShape {#typeshape}
Information about how objects of a type should be laid out in memory, including their size,
alignment, depth, etc.
### Typespace {#typespace}
The typespace is a map from [`Type`](#type) names to a `TypeTemplate` for that `Type`. During
compilation, the typespace is initialized to include all of the built-in types (e.g. `"vector"`
maps to `VectorTypeTemplate`), whereas user-defined types get added during the compilation process.
This also ensures that a single type template instance exists per type and avoids name
collisions/shadowing of types (e.g. FIDL-310).
### TypeTemplate {#typetemplate}
Instances of TypeTemplates provide a `Create()` method to create a new instance of a specific
`Type` - therefore there is a TypeTemplate subclass for each built-in FIDL type (e.g.
`ArrayTypeTemplate`, `PrimitiveTypeTemplate`, etc.) as well as a single class for all user defined
types (`TypeDeclTypeTemplate`), and one for type aliases (`TypeAliasTypeTemplate`). `Create()`
takes as parameters the possible type parameters: argument type, nullability, and size.
For example, to create an object representing the type `vector<int32>:10?` the compiler would call
the `Create()` method of the `VectorTypeTemplate` with an argument type of `int32`, max size of
`10`, and a nullability of `types::Nullability::kNullable`. This call returns an instance of a
`VectorType` with those parameters. Note that not all 3 of these parameters apply to all of the
types (e.g. `PrimitiveType`s, like `int32` have none of the 3). The `Create()` method of the type
template for each type automatically checks that only the relevant parameters are passed.
The concrete type for user defined types is the `IdentifierType`, which gets generated by the
`TypeDeclTypeTemplate`.
### Virtual SourceFile {#virtualsourcefile}
A subclass of `SourceFile` that has a fake "filename" and is initialized with no backing data. It
exposes an `AddLine()` method to add data to the file, and is used as a backing `SourceFile` for
content that does not appear directly in any of the input `SourceFile`s, like for generated
anonymous [`Name`](#name)s.
<!-- xrefs -->
[internal]: /zircon/system/ulib/fidl/include/lib/fidl/internal.h
[layout-attr]: /docs/reference/fidl/language/attributes.md#layout_layout
[fidlgen]: /garnet/go/src/fidl/compiler/backend
[fidlgen-dart]: /topaz/bin/fidlgen_dart
[schema]: /docs/reference/fidl/language/json-ir.md
[coding-readme]: /src/lib/fidl/c/walker_tests/README.md