| # Fuchsia Shell Parser |
| |
| Provides a basic parser for the shell grammar. |
| |
| The parser is built on combinators, which means the parser is built out of other parsers. |
| |
| A parser in the source appears as a function taking a single argument of type ParseResultStream and |
| returning a value of type ParseResultStream. The argument is the result of parsing a prefix before |
| the current parse. As an example, consider the grammar: |
| |
| S <- A B |
| |
| Assuming we had parser functions for A and B, named A() and B() respectively, we could implement S |
| as follows: |
| |
| ParseResultStream S(ParseResultStream prefixes) { |
| return B(A(prefixes)); |
| } |
| |
| (There's some subtleties here, such that this rule wouldn't be implemented this way in practice, but |
| it gets a point across). |
| |
| ParseResultStream is a stream of ParseResult objects. If parsing goes normally, the first |
| ParseResult yielded by the stream is the only one of interest. If parsing has failed, each |
| successive ParseResult yielded by the stream will contain a different attempt at error recovery. |
| Successful parses will also yield error recovery attempts if polled further, as a parse that seemed |
| okay may turn out to be wrong after parsing further. As an example, consider a grammar that parses |
| nested parentheses encountering the string '('. We might parse the opening parenthesis successfully, |
| then try to parse the closing parenthesis. When we discover it's missing, we might decide that the |
| opening parenthesis should not actually have parsed successfully. |
| |
| ParseResultStream also contains a flag indicating whether the last parser succeeded. This is useful |
| in avoiding error recovery in some cases, as polling the stream will always attempt to construct a |
| recovered parse, whereas we sometimes only want to know the parse failed but don't need a recovery. |
| |
| ParseResult itself contains three pieces of information: the "tail" of the parse, representing the |
| un-parsed data remaining after the parse, various information comprising an "error score", which |
| indicates how much error recovery was necessary to make the parse succeed, and a stack, which |
| contains the AST nodes we have produced. |
| |
| To understand the stack, consider our A B example above. After successfully parsing S, the stack |
| might contain two nodes: |
| |
| Node("A") Node("B") |
| |
| The ParseResult's "node" accessor would yield Node("B"), as it yields the top of the stack. We can |
| call Reduce() on a ParseResult to collapse the stack into a single node, so if we called Reduce("S") |
| on our result, we would end up with one node on the stack: |
| |
| Node("S", children = { Node(A) Node(B) }) |
| |
| Reduce() will consume the entire stack by default. This is inconvenient for more complicated |
| parsers. To make S safe, we would insert a "marker node" to indicate the point in the stack where |
| Reduce() should stop consuming. |
| |
| ParseResultStream has methods to automatically push a marker node to each result it yields, or |
| reduce every result it yields, so a better implementation of S would be: |
| |
| ParseResultStream S(ParseResultStream prefixes) { |
| return B(A(prefixes.Mark())).Reduce("S"); |
| } |
| |
| The only issue with this is that it doesn't necessarily set the "ok" flag on ParseResultStream |
| correctly. Because the "ok" flag is set to the success of the last parser, this would return "ok" if |
| B returned "ok". But from a caller's perspective, the parse should only succeed if neither A nor B |
| required error correction, so we need to adjust our handling of the flag: |
| |
| ParseResultStream S(ParseResultStream prefixes) { |
| auto a_result = A(prefixes.Mark()); |
| bool a_ok = a_result.ok(); |
| auto b_result = B(std::move(a_result)).Reduce("S"); |
| |
| if (ok) { |
| return b_result; |
| } else { |
| return b_result.Fail(); |
| } |
| } |
| |
| This pattern is somewhat unwieldy, so we implement several combinator functions to produce it |
| automatically. The NT ("NonTerminal") combinator takes a parser as input and automatically places a |
| Mark() before it and a Reduce after it, while the Seq combinator runs two parsers in sequence and |
| handles the ok flag correctly. With these tools, our S implementation becomes: |
| |
| ParseResultStream S(ParseResultStream prefixes) { |
| return NT("S", Seq(A, B)); |
| } |
| |
| ## Error handling in a nutshell |
| |
| Error handling relies on one invariant: ParseResultStream yields results in order of increasing |
| amount of error correction (measured as Levenshtein distance to the nearest correctly-parsing |
| string). That means that a correct parse will always come first, and that if there is no correct |
| parse, the parse requiring the least error correction will come first. |
| |
| For the simple base parsers which parse single tokens, this is baked into the implementation. The |
| provided combinators guarantee this property by induction (i.e. if the input parsers have this |
| property, the combinator will produce a parser with this property). |