Fuchsia Shell Parser

Provides a basic parser for the shell grammar.

The parser is built on combinators, which means the parser is built out of other parsers.

A parser in the source appears as a function taking a single argument of type ParseResult and returning a value of type ParseResult. The argument is the result of parsing a prefix before the current parse. As an example, consider the grammar:

S <- A B

Assuming we had parser functions for A and B, named A() and B() respectively, we could implement S as follows:

ParseResult S(ParseResult prefix) {
  return B(A(prefix));
}

A ParseResult represents a position in the string. If parsing fails completely, the returned ParseResult is equal to ParseResult::kEnd, which tests false when used as a boolean. Generally, we hope that a failed parse will be recovered, so we will always get a valid parse from our grammar, but it may or may not contain error nodes where we had to recover a failure.

ParseResult contains three pieces of information: the “tail” of the parse, representing the un-parsed data remaining after the parse, various information comprising an “error score”, which indicates how much error recovery was necessary to make the parse succeed, and a stack, which contains the AST nodes we have produced.

To understand the stack, consider our A B example above. After successfully parsing S, the stack might contain two nodes:

Node::A("A") Node::B("B")

The ParseResult's node accessor would yield Node::B, as it yields the top of the stack. We can call Reduce() on a ParseResult to collapse the stack into a single node, so if we called Reduce<Node::S>() on our result, we would end up with one node on the stack:

Node::S(children = { Node::A Node::B })

Reduce() will consume the entire stack by default. This is inconvenient for more complicated parsers. To make S safe, we would insert a “marker node” to indicate the point in the stack where Reduce() should stop consuming. Thus our implementation looks like this:

ParseResult S(ParseResult prefix) {
  return B(A(prefix.Mark())).Reduce<Node::S>();
}

This pattern is somewhat unwieldy, so we implement several combinator functions to produce it automatically. The NT (“NonTerminal”) combinator takes a parser as input and automatically places a Mark() before it and a Reduce() after it, while the Seq combinator runs two parsers in sequence.

ParseResult S(ParseResult prefix) {
  return NT<Node::S>(Seq(A, B));
}

Error Handling in a Nutshell

Every parse result has an error score. The error score determines how much error correction has been necessary to keep the parse going. To increase the error score, there are two special combinators: ErInsert and ErSkip.

ErSkip takes a child parser, which it runs. If that child parser parses without error, the parsed text is discarded and an error is inserted instead. Here's an example:

Alt(Expression, ErSkip("Expected expression before '%MATCH%'", OnePlus(AnyCharBut(";"))));

This parser would attempt to parse an expression, and if it can't find one, it will consume up to the next semicolon and insert an error message. Note the '%MATCH%' in the message string; this will be substituted for the text matched before the semicolon. The error score for the returned parse result will increase by the length of the text matched by the error.

If the parser given to ErSkip doesn't match without error, ErSkip fails completely, so error recovery does not occur.

ErInsert is more straightforward than ErSkip. It injects the given amount of error at the current parse position without moving.

Alt(Expression, Alt(Token(";"), ErInsert("Expected semicolon", 1)));

With this parser, if we do not see an expression, we'll return a ParseResult with the same position as the one we were given and an error score increased by 1.

Conceptually, ErInsert can be thought of as inserting text into the parse stream to allow the parse to continue. As such, it has an overload that allows you to specify the string inserted instead of a length:

Alt(Expression, Alt(Token(";"), ErInsert("Expected semicolon", ";")));

This is for readability purposes only; the only property of the string that affects behavior is its length.

Levenshtein Error Score Combining

The way the error combinators affect error score has an important caveat. Consider the following:

Seq(ErInsert("Expected a thing", 2), ErSkip("Unexpected character", AnyChar));

The ErInsert adds an error score of 2, and the ErSkip parses a single skipped character, so it adds a score of 1. We would thus expect the total error score to be 3. However, it will actually be 2.

The reason is that the error score is a Levenshtein distance between the parse stream and the hypothetical parse stream which contains our modified, recovered parse. This means the error score increases by one for any character that is inserted, deleted, or replaced. We don't have a specific error combinator for replacement, but an insert and a skip consecutively will be merged into replacement if possible.

This may be more intuitive if we consider the algorithm used to calculate the error score. The parse result actually maintains three error scores: an insertion error score, a deletion error score, and an internal error score. They are used as follows:

ErSkip increases the deletion error score
ErInsert increases the insertion error score
If we parse a non-zero number of characters normally, the internal error score increases by the maximum of the insertion and deletion error scores, and the insertion and deletion error scores become zero.
The error score of the parse result at any given moment is the internal score plus the maximum of the insertion and deletion scores.