Welcome to lib/Syntax!
This library implements data structures and algorithms for dealing with Swift syntax, striving to be safe, correct, and intuitive to use. The library emphasizes immutable, thread-safe data structures, full-fidelity representation of source, and facilities for structured editing.
What is structured editing? It's an editing strategy that is keenly aware of the structure of source code, not necessarily its representation (i.e. characters or bytes). This can be achieved at different granularities: replacing an identifier, changing a call to a global function into a method call, or indenting and formatting an entire source file based on declarative rules. These kinds of diverse operations are critical to the Swift Migrator, which is the immediate client for this library, now developed in the open. Along with that, the library will also provide infrastructure for a first-class swift-format tool.
Eventually, the goal of this library is to represent Swift syntax to all of the compiler. Currently, lib/AST structures don't make a very clear distinction between syntactic and semantic information. Long term, we hope to achieve the following based on work here:
This library is a work in progress and should be expected to be in a molten state for some time. Don't integrate this into other areas of the compiler or use it for anything serious just now.
You can read more about the status of the library's implementation at the Syntax Status Page. More information about opportunities to get involved to come.
In no particular order, here is a summary of the design and implementation points for this library:
Make APIs are for creating new syntax nodes in a single call. Although you need to provide all of the pieces of syntax to these APIs, you are free to use “missing” placeholders as substructure. Make APIs return freestanding syntax nodes and do not establish parental relationships.
The SyntaxFactory
embodies the Make APIs and is the one-stop shop for creating new syntax nodes and tokens in a single call. There are two main Make APIs exposed for each Syntax node: making the node with all of the pieces, or making a blank node with all of the pieces marked as missing. For example, a StructDeclSyntax
node has a makeStructDeclSyntax
and makeBlankStructDeclSyntax
on SyntaxFactory
for those two cases respectively.
Instead of constructors on each syntax node's class, static creation methods are all supplied here in the SyntaxFactory for better code completion - you don't need to know the exact name of the class. Just type SyntaxFactory::make and let code completion show you what you can make.
Example
// A 'typealias' keyword with one space after
auto TypeAliasKeyword = SyntaxFactory::makeTypeAliasKeyword({}, Trivia::spaces(1));
// The identifier "Element" with one space after
auto ElementID = SyntaxFactory::makeIdentifier("Element", {}, Trivia::spaces(1));
// An equal '=' token with one space after
auto Equal = SyntaxFactory::makeEqualToken({}, Trivia::spaces(1));
// A type identifier for "Int"
auto IntType = SyntaxFactory::makeTypeIdentifier("Int", {}, {});
// Finally, the actual type alias declaration syntax.
// EmptyGenericParams is assumed to be an empty generic parameter clause created beforehand.
auto TypeAlias = SyntaxFactory::makeTypeAliasDecl(TypeAliasKeyword, ElementID, EmptyGenericParams, Equal, IntType);
TypeAlias.print(llvm::outs());
typealias Element = Int
With APIs are essentially setters on Syntax
nodes you already have in hand but, because they are immutable, return new Syntax
nodes with only the specified substructure replaced. Raw backing storage is shared as much as possible.
Example
Say you have a MyStruct
of type StructDeclSyntax
representing:
struct MyStruct {}
Now, let's create a new struct with a different identifier, “YourStruct”. The original struct is unharmed but identical tokens are shared.
auto NewIdentifier = SyntaxFactory::makeIdentifier("YourStruct",
                                                   MyStruct.getIdentifier().getLeadingTrivia(),
                                                   MyStruct.getIdentifier().getTrailingTrivia());
MyStruct.withIdentifier(NewIdentifier).print(llvm::outs());
struct YourStruct {}
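To make the sharing behind the With APIs concrete, here is a minimal, self-contained sketch. It does not use the library: `ToyRaw`, `makeToken`, `makeNode`, `withChild`, and `printToString` are hypothetical names invented for illustration. It only shows the general idea of returning a new node that reuses every child except the one replaced.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an immutable raw node (not the library's RawSyntax).
struct ToyRaw {
  std::string Text;                                  // Set for tokens only.
  std::vector<std::shared_ptr<const ToyRaw>> Layout; // Children; empty for tokens.
};
using ToyRawRef = std::shared_ptr<const ToyRaw>;

inline ToyRawRef makeToken(std::string Text) {
  return std::make_shared<ToyRaw>(ToyRaw{std::move(Text), {}});
}

inline ToyRawRef makeNode(std::vector<ToyRawRef> Layout) {
  return std::make_shared<ToyRaw>(ToyRaw{std::string(), std::move(Layout)});
}

// "With"-style setter: copies only the list of child pointers, swaps one
// slot, and returns a new node. The original node is left untouched.
inline ToyRawRef withChild(const ToyRawRef &Node, std::size_t Index,
                           ToyRawRef NewChild) {
  assert(Index < Node->Layout.size() && "child index out of range");
  std::vector<ToyRawRef> Layout = Node->Layout;
  Layout[Index] = std::move(NewChild);
  return makeNode(std::move(Layout));
}

// Concatenates token text in source order.
inline std::string printToString(const ToyRawRef &Node) {
  std::string Out = Node->Text;
  for (const ToyRawRef &Child : Node->Layout)
    Out += printToString(Child);
  return Out;
}
```

In this sketch, replacing child 1 of a four-token struct node yields a second node whose other three children are the very same objects as in the original, which is the kind of sharing the With APIs rely on.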
Builder APIs are provided for building up syntax incrementally as it appears. At any point in the building process, you can call build() and get a reasonably formed Syntax node (i.e. with no raw nullptrs) using what you've provided to the builder so far. Anything that you haven't supplied is marked as missing.
Example
StructDeclSyntaxBuilder Builder;
// We previously parsed a struct keyword, let's tell the builder to use it.
Builder.useStructKeyword(StructKeyword);
// Hm, we didn't see an identifier, but saw a left brace. Let's keep going.
Builder.useLeftBraceToken(ParsedLeftBrace);
// No members of the struct; we saw a right brace.
Builder.useRightBraceToken(ParsedRightBrace);
Let's see what we have so far.
auto StructWithoutIdentifier = Builder.build();
StructWithoutIdentifier.print(llvm::outs());
struct {}
Whoops! You forgot an identifier. Let's add one here for fun.
auto MyStructID = SyntaxFactory::makeIdentifier("MyStruct", {}, Trivia::spaces(1));
Builder.useIdentifier(MyStructID);
auto StructWithIdentifier = Builder.build();
StructWithIdentifier.print(llvm::outs());
struct MyStruct {}
Much better!
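The builder behaviour walked through above can be approximated with a small standalone sketch. `ToyStructDeclBuilder` and `ToyStructDecl` are hypothetical and work on plain strings rather than real syntax nodes. The point is only that `build()` can be called at any stage and reports unsupplied pieces as missing instead of failing.

```cpp
#include <string>
#include <utility>
#include <vector>

// What the toy builder produces: the printed text plus the names of any
// pieces that were never supplied.
struct ToyStructDecl {
  std::string Text;
  std::vector<std::string> Missing;
};

// Hypothetical builder, loosely modeled on the usage shown above.
class ToyStructDeclBuilder {
  struct Slot {
    bool Present = false;
    std::string Text;
  };
  Slot StructKeyword, Identifier, LeftBrace, RightBrace;

  static void set(Slot &S, std::string Text) {
    S.Present = true;
    S.Text = std::move(Text);
  }

  static void append(ToyStructDecl &Decl, const char *Name, const Slot &S) {
    if (S.Present)
      Decl.Text += S.Text;
    else
      Decl.Missing.push_back(Name); // Missing pieces contribute no text.
  }

public:
  void useStructKeyword(std::string T) { set(StructKeyword, std::move(T)); }
  void useIdentifier(std::string T) { set(Identifier, std::move(T)); }
  void useLeftBraceToken(std::string T) { set(LeftBrace, std::move(T)); }
  void useRightBraceToken(std::string T) { set(RightBrace, std::move(T)); }

  // Safe to call at any time; incomplete input yields missing entries.
  ToyStructDecl build() const {
    ToyStructDecl Decl;
    append(Decl, "struct keyword", StructKeyword);
    append(Decl, "identifier", Identifier);
    append(Decl, "left brace", LeftBrace);
    append(Decl, "right brace", RightBrace);
    return Decl;
  }
};
```

Because `build()` is `const` and returns a fresh value, the builder can keep accepting pieces afterwards, which mirrors the "add the identifier later and build again" flow in the example.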
TODO
.
RawSyntax
are the raw immutable backing store for all syntax. Essentially, they store a kind, whether they were missing in the source, and the layout, which is a list of children and represents the recursive substructure. Although these are tree-like in nature, they maintain no parental relationships because they can be shared among many nodes. Eventually, RawSyntax
bottoms out in tokens, the terminals, which are represented by the TokenSyntax
class.
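As a rough illustration of the three things a raw node stores (a kind, whether it was missing, and a layout of children), here is a standalone sketch. The type and function names are hypothetical and far simpler than the real RawSyntax. Note that the nodes hold no parent pointer, which is what lets one token appear in two trees.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class ToyKind { Token, StructDecl };

// Hypothetical raw node: kind, missing flag, and layout. No parent pointer.
struct ToyRaw {
  ToyKind Kind;
  bool IsMissing;
  std::string Text;                                  // Used by tokens only.
  std::vector<std::shared_ptr<const ToyRaw>> Layout; // Recursive substructure.
};
using ToyRawRef = std::shared_ptr<const ToyRaw>;

inline ToyRawRef makeToken(std::string Text) {
  return std::make_shared<ToyRaw>(
      ToyRaw{ToyKind::Token, false, std::move(Text), {}});
}

inline ToyRawRef makeMissingToken() {
  return std::make_shared<ToyRaw>(
      ToyRaw{ToyKind::Token, true, std::string(), {}});
}

inline ToyRawRef makeNode(ToyKind Kind, std::vector<ToyRawRef> Layout) {
  return std::make_shared<ToyRaw>(
      ToyRaw{Kind, false, std::string(), std::move(Layout)});
}

// Walks the layout recursively; tokens are the terminals, and missing
// nodes print nothing.
inline std::string printToString(const ToyRawRef &Node) {
  if (Node->IsMissing)
    return std::string();
  if (Node->Kind == ToyKind::Token)
    return Node->Text;
  std::string Out;
  for (const ToyRawRef &Child : Node->Layout)
    Out += printToString(Child);
  return Out;
}
```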
RawSyntax are the immutable backing store for all syntax.
RawSyntax are immutable.
RawSyntax establishes the tree structure of syntax.
RawSyntax store no parental relationships and can therefore be shared among syntax nodes if they have identical content.
TokenSyntax are special cases of RawSyntax and represent all terminals in the grammar. Aside from the token kind, these have two very important pieces of information for full-fidelity source: leading and trailing source trivia surrounding the token.
TokenSyntax are RawSyntax and represent the terminals in the Swift grammar.
Like RawSyntax, TokenSyntax are immutable.
TokenSyntax do not have pointer equality, as they can be shared among syntax nodes.
TokenSyntax have leading and trailing trivia, the purely syntactic formatting information like whitespace and comments.
You've already seen some uses of Trivia in the examples above. These are pieces of syntax that aren't really relevant to the semantics of the program, such as whitespace and comments. These are modeled as collections and, with the exception of comments, are sort of “run-length” encoded.
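One way to picture the "run-length" encoding is the sketch below. It is an assumption-laden toy, not the library's Trivia class: `ToyTrivia`, `ToyTriviaPiece`, `run`, and `text()` are invented names, and only whitespace is modeled (comments, which are not run-length encoded, are left out). The `spaces` and `newlines` helpers are named after the calls used in the examples above.

```cpp
#include <string>
#include <vector>

enum class ToyTriviaKind { Space, Tab, Newline };

// One run: a kind of whitespace and how many times it repeats.
struct ToyTriviaPiece {
  ToyTriviaKind Kind;
  unsigned Count;
};

// Hypothetical trivia collection: an ordered list of runs.
struct ToyTrivia {
  std::vector<ToyTriviaPiece> Pieces;

  static ToyTrivia run(ToyTriviaKind Kind, unsigned Count) {
    ToyTrivia T;
    if (Count > 0)
      T.Pieces.push_back(ToyTriviaPiece{Kind, Count});
    return T;
  }
  static ToyTrivia spaces(unsigned N) { return run(ToyTriviaKind::Space, N); }
  static ToyTrivia tabs(unsigned N) { return run(ToyTriviaKind::Tab, N); }
  static ToyTrivia newlines(unsigned N) { return run(ToyTriviaKind::Newline, N); }

  // Expands the runs back into source text.
  std::string text() const {
    std::string Out;
    for (const ToyTriviaPiece &P : Pieces) {
      char C = P.Kind == ToyTriviaKind::Space ? ' '
               : P.Kind == ToyTriviaKind::Tab ? '\t'
                                              : '\n';
      Out.append(P.Count, C);
    }
    return Out;
  }
};

// Concatenation; adjacent runs of the same kind are merged into one run.
inline ToyTrivia operator+(ToyTrivia LHS, const ToyTrivia &RHS) {
  for (const ToyTriviaPiece &P : RHS.Pieces) {
    if (!LHS.Pieces.empty() && LHS.Pieces.back().Kind == P.Kind)
      LHS.Pieces.back().Count += P.Count;
    else
      LHS.Pieces.push_back(P);
  }
  return LHS;
}
```

With this representation, `ToyTrivia::newlines(1) + ToyTrivia::spaces(2)` is two runs rather than three characters, and values compare by content rather than by address.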
Some examples of the “atoms” of Trivia:
// comments
/* ... */ comments
/// comments
/** ... */ comments
There are two Rules of Trivia that you should obey when parsing or constructing new Syntax nodes:
A token owns all of its trailing trivia up to, but not including, the next newline character.
Looking backward in the text, a token owns all of the leading trivia up to and including the first contiguous sequence of newline characters.
Let's take a look at how this shows up in practice with a small snippet of Swift code.
Example
func foo() {
  var x = 2
}
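Breaking this down token by token:
func: no leading trivia. Trailing trivia is the one space after it (first rule). // Equivalent to: Trivia::spaces(1)
foo: no leading trivia, because the previous func ate the space before. No trailing trivia.
(: no leading or trailing trivia.
): no leading trivia. Trailing trivia is the one space after it (first rule).
{: no leading trivia, because the previous ) ate the space before. No trailing trivia, because trailing trivia stops before the newline (first rule).
var: leading trivia is one newline followed by two spaces (second rule). // Equivalent to: Trivia::newlines(1) + Trivia::spaces(2) Trailing trivia is the one space after it.
x: no leading trivia, because the previous var ate the space before. Trailing trivia is the one space after it.
=: no leading trivia, because the previous x ate the space before. Trailing trivia is the one space after it.
2: no leading trivia, because the previous = ate the space before. No trailing trivia.
}: leading trivia is one newline (second rule). No trailing trivia.
EOF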
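The two rules can be checked mechanically with a toy lexer. The sketch below is not the Swift lexer: `ToyToken`, `lexWithTrivia`, and `roundTrip` are hypothetical, trivia is kept as plain strings, comments are not handled, and a token is simply a run of identifier characters or a single other character. It only demonstrates how the rules split whitespace between neighbouring tokens while keeping the source fully reconstructible.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

struct ToyToken {
  std::string Leading;
  std::string Text; // Empty for the end-of-file token.
  std::string Trailing;
};

inline bool isIdentifierChar(char C) {
  return std::isalnum(static_cast<unsigned char>(C)) || C == '_';
}

// Splits Src into tokens, attaching whitespace according to the two rules.
inline std::vector<ToyToken> lexWithTrivia(const std::string &Src) {
  std::vector<ToyToken> Tokens;
  std::size_t I = 0;
  const std::size_t N = Src.size();
  for (;;) {
    ToyToken Tok;
    // Second rule: whatever whitespace the previous token did not claim
    // (it starts at a newline, or is empty) is this token's leading trivia.
    while (I < N && std::isspace(static_cast<unsigned char>(Src[I])))
      Tok.Leading += Src[I++];
    if (I == N) {
      Tokens.push_back(Tok); // End-of-file token; keeps any final whitespace.
      return Tokens;
    }
    if (isIdentifierChar(Src[I])) {
      while (I < N && isIdentifierChar(Src[I]))
        Tok.Text += Src[I++];
    } else {
      Tok.Text += Src[I++];
    }
    // First rule: trailing trivia runs up to, but not including, a newline.
    while (I < N && (Src[I] == ' ' || Src[I] == '\t'))
      Tok.Trailing += Src[I++];
    Tokens.push_back(Tok);
  }
}

// Full fidelity: concatenating every piece reproduces the input exactly.
inline std::string roundTrip(const std::vector<ToyToken> &Tokens) {
  std::string Out;
  for (const ToyToken &Tok : Tokens)
    Out += Tok.Leading + Tok.Text + Tok.Trailing;
  return Out;
}
```

Running `lexWithTrivia` on the three-line `func foo()` snippet gives `var` a leading trivia of one newline plus two spaces and gives `func` a trailing trivia of one space, matching the breakdown above.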
Trivia represent source trivia, the whitespace and comments in a Swift source file.
Trivia are immutable.
Trivia don't have pointer identity - they are primitive values.
SyntaxData nodes wrap RawSyntax nodes with a few important pieces of information: a pointer to a parent, the position in which the node occurs in its parent, and cached children. For example, if we have a StructDeclSyntaxData, wrapping a RawSyntax for a struct declaration, we might ask for the generic parameter clause. At first, this is only represented in the raw syntax. On first ask, we thaw it out by creating a new GenericParameterClauseSyntaxData, cache it as our child, set its parent to this, and send it back to the caller.
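The lazy "thawing" of children can be sketched as follows. This is a simplified, single-threaded model with hypothetical names (`ToyRaw`, `ToyData`, `makeRaw`); it ignores the thread-safety that the real library aims for and exists only to show a wrapper that records parent and position and creates child wrappers on first request.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical raw node, shared freely and unaware of any parent.
struct ToyRaw {
  std::string Text;
  std::vector<std::shared_ptr<const ToyRaw>> Layout;
};
using ToyRawRef = std::shared_ptr<const ToyRaw>;

inline ToyRawRef makeRaw(std::string Text, std::vector<ToyRawRef> Layout = {}) {
  return std::make_shared<ToyRaw>(ToyRaw{std::move(Text), std::move(Layout)});
}

// Hypothetical data node: raw node + parent + position + cached children.
class ToyData {
  ToyRawRef Raw;
  const ToyData *Parent;     // Non-owning; null for the root.
  std::size_t IndexInParent;
  // `mutable` so a logically immutable node can fill its cache lazily.
  mutable std::vector<std::unique_ptr<ToyData>> CachedChildren;

public:
  explicit ToyData(ToyRawRef RawNode, const ToyData *ParentData = nullptr,
                   std::size_t Index = 0)
      : Raw(std::move(RawNode)), Parent(ParentData), IndexInParent(Index),
        CachedChildren(Raw->Layout.size()) {}

  // Creates the child wrapper on first request and reuses it afterwards.
  const ToyData &getChild(std::size_t I) const {
    assert(I < CachedChildren.size() && "child index out of range");
    if (!CachedChildren[I])
      CachedChildren[I] = std::make_unique<ToyData>(Raw->Layout[I], this, I);
    return *CachedChildren[I];
  }

  const ToyData *getParent() const { return Parent; }
  std::size_t getIndexInParent() const { return IndexInParent; }
  const ToyRaw &getRaw() const { return *Raw; }
};
```

If the same raw token is placed at two positions of one layout, the two `ToyData` children wrap the same raw object yet have different addresses, parents, and indices. That is the distinction between shared raw storage and data nodes with identity.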
Beyond this, SyntaxData nodes have no significant public API.
SyntaxData are immutable. However, they may mutate themselves in order to implement lazy instantiation of children and caching. This should be transparent and safe to any internal implementation.
SyntaxData have identity, i.e. they can be compared with “pointer equality”.
SyntaxData are an implementation detail and have no public API.
RawSyntax and SyntaxData are essentially implementation details that maintain all of those nice properties like immutability and information sharing. Now, we get to the main players: the Syntax nodes. These have the interesting public interface: the With APIs, getters, etc. Anyone working with the Syntax library will be touching these nodes.
Internally, they are actually packaged as a strong reference to the root of the tree in which that node resides, and a weak reference to the SyntaxData
representing that node. Why a weak reference to the data? We do this to prevent retain cycles: all strong references point down in the tree, starting at the root.
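The ownership scheme described here can be illustrated with a small sketch. Everything in it is hypothetical (`ToyData`, `ToySyntax`, `makeRoot`), and the non-owning reference is modeled as a plain pointer rather than whatever weak-reference type the library uses. The idea shown is that a handle keeps the whole tree alive through the root, while ownership inside the tree only points downward.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical data node. Strong references point down only.
struct ToyData {
  std::string Name;
  std::vector<std::unique_ptr<ToyData>> Children;
};

// Hypothetical syntax handle: strong reference to the root of the tree,
// non-owning reference to the node it designates.
class ToySyntax {
  std::shared_ptr<const ToyData> Root; // Keeps the entire tree alive.
  const ToyData *Data;                 // Valid for as long as Root is held.

  ToySyntax(std::shared_ptr<const ToyData> R, const ToyData *D)
      : Root(std::move(R)), Data(D) {}

public:
  static ToySyntax makeRoot(std::shared_ptr<const ToyData> R) {
    const ToyData *D = R.get();
    return ToySyntax(std::move(R), D);
  }

  // The child handle shares the same root, so no upward owning link is needed.
  ToySyntax getChild(std::size_t I) const {
    return ToySyntax(Root, Data->Children.at(I).get());
  }

  ToySyntax getRoot() const { return ToySyntax(Root, Root.get()); }
  const std::string &getName() const { return Data->Name; }
};
```

A handle to a child stays usable after every other reference to the tree has been dropped, because the handle itself holds the root. No node owns its parent, so there is no cycle to leak.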
Although it's important for the entire library to be easy to use and maintain, it's especially important that the APIs in Syntax nodes remain intuitive and do what you expect with no weird side effects, necessary contexts to maintain, etc. If you have a handle on a Syntax node, you're safe to query anything about it without other processes pulling the rug out from under you.
Here's a handy checklist when implementing a production in the grammar.
Check that the corresponding lib/AST node has SourceLocs for all terms. If it doesn't, [file a Swift bug][NewSwiftBug] and fix that first. Add the Syntax bug label!
Create the ${KIND}SyntaxData class.
Cached children are stored as RC<${CHILDKIND}SyntaxData>.
Create the ${KIND}Syntax class.
Define the Cursor enum for the syntax node. This specifies all of the terms of the production, including optional terms. For example, a same-type generic requirement is:
same-type-requirement -> type-identifier '==' type
That's three terms in the production, and you can see this reflected in the Cursor enum for that node:
enum Cursor : CursorIndex {
  LeftTypeIdentifier,
  EqualityToken,
  RightType,
};
With APIs for all layout elements (e.g. withLeftTypeIdentifier(...)).
Check that the resulting Syntax node has identical content except for what you changed. print the new node and check the text.
Getters for all layout elements (e.g. getLeftTypeIdentifier()).
Getters cache the child in the ${KIND}SyntaxData class. After getting the child, verify that it prints the expected text.
SyntaxFactory
make${KIND}Syntax(... all elements ...)
RC<TokenSyntax>. Check that the asserts are as expected.
makeBlank${KIND}Syntax()
${KIND}SyntaxBuilder
.use____(...) methods for each layout element - takes a ${KIND}Syntax for that child type.
${KIND}Syntax build() const
Call build() at all stages of building, followed by print().
RUN lines: -round-trip-lex, and -round-trip-parse
Update lib/Syntax/Status.md if applicable.
[NewSwiftBug]: https://bugs.swift.org/secure/CreateIssue!default.jspa