# Fuchsia Shell Parser
Provides a basic parser for the shell grammar.
The parser is built from combinators, meaning larger parsers are assembled out of smaller ones.
A parser in the source appears as a function taking a single argument of type ParseResultStream and
returning a value of type ParseResultStream. The argument is the result of parsing a prefix before
the current parse. As an example, consider the grammar:

```
S <- A B
```

Assuming we had parser functions for A and B, named A() and B() respectively, we could implement S
as follows:

```cpp
ParseResultStream S(ParseResultStream prefixes) {
  return B(A(prefixes));
}
```

(There are some subtleties here, so this rule wouldn't actually be implemented this way in
practice, but it gets the point across.)
ParseResultStream is a stream of ParseResult objects. If parsing goes normally, the first
ParseResult yielded by the stream is the only one of interest. If parsing has failed, each
successive ParseResult yielded by the stream will contain a different attempt at error recovery.
Successful parses will also yield error recovery attempts if polled further, as a parse that seemed
okay may turn out to be wrong once we parse more of the input. As an example, consider a grammar for
nested parentheses encountering the input '('. We might parse the opening parenthesis successfully,
then try to parse the closing parenthesis. When we discover it's missing, we might decide that the
opening parenthesis should not actually have parsed successfully.
ParseResultStream also contains a flag indicating whether the last parser succeeded. This is useful
in avoiding error recovery in some cases, as polling the stream will always attempt to construct a
recovered parse, whereas we sometimes only want to know the parse failed but don't need a recovery.
ParseResult itself contains three pieces of information:
* The "tail" of the parse, representing the un-parsed data remaining after the parse.
* Various information comprising an "error score", which indicates how much error recovery was
  necessary to make the parse succeed.
* A stack, which contains the AST nodes we have produced.
To understand the stack, consider our A B example above. After successfully parsing S, the stack
might contain two nodes:

```
Node("A") Node("B")
```

The ParseResult's "node" accessor would yield Node("B"), as it yields the top of the stack. We can
call Reduce() on a ParseResult to collapse the stack into a single node, so if we called Reduce("S")
on our result, we would end up with one node on the stack:

```
Node("S", children = { Node("A") Node("B") })
```

Reduce() will consume the entire stack by default. This is inconvenient for more complicated
parsers. To make S safe, we would insert a "marker node" to indicate the point in the stack where
Reduce() should stop consuming.
ParseResultStream has methods to automatically push a marker node to each result it yields, or
reduce every result it yields, so a better implementation of S would be:

```cpp
ParseResultStream S(ParseResultStream prefixes) {
  return B(A(prefixes.Mark())).Reduce("S");
}
```

The only issue with this is that it doesn't necessarily set the "ok" flag on ParseResultStream
correctly. Because the "ok" flag is set to the success of the last parser, this would return "ok" if
B returned "ok". But from a caller's perspective, the parse should only succeed if neither A nor B
required error correction, so we need to adjust our handling of the flag:

```cpp
ParseResultStream S(ParseResultStream prefixes) {
  auto a_result = A(prefixes.Mark());
  bool a_ok = a_result.ok();
  auto b_result = B(std::move(a_result)).Reduce("S");
  if (a_ok) {
    return b_result;
  } else {
    return b_result.Fail();
  }
}
```

This pattern is somewhat unwieldy, so we implement several combinator functions to produce it
automatically. The NT ("NonTerminal") combinator takes a parser as input and automatically places a
Mark() before it and a Reduce() after it, while the Seq combinator runs two parsers in sequence and
handles the ok flag correctly. With these tools, our S implementation becomes:

```cpp
ParseResultStream S(ParseResultStream prefixes) {
  return NT("S", Seq(A, B))(std::move(prefixes));
}
```

## Error handling in a nutshell
Error handling relies on one invariant: ParseResultStream yields results in order of increasing
amount of error correction (measured as Levenshtein distance to the nearest correctly-parsing
string). That means that a correct parse will always come first, and that if there is no correct
parse, the parse requiring the least error correction will come first.
For the simple base parsers which parse single tokens, this is baked into the implementation. The
provided combinators guarantee this property by induction (i.e. if the input parsers have this
property, the combinator will produce a parser with this property).