blob: 5087716a59bf24fcec80d4323e4e5c1edf2bbf87 [file] [log] [blame] [view]
# Fuchsia Shell Parser
Provides a basic parser for the shell grammar.
The parser is built on combinators, which means the parser is built out of other parsers.
A parser in the source appears as a function taking a single argument of type `ParseResult` and
returning a value of type `ParseResult`. The argument is the result of parsing a prefix before the
current parse. As an example, consider the grammar:
```
S <- A B
```
Assuming we had parser functions for `A` and `B`, named `A()` and `B()` respectively, we could
implement `S` as follows:
```
ParseResult S(ParseResult prefix) {
return B(A(prefix));
}
```
A `ParseResult` represents a position in the string. If parsing fails completely, the returned
`ParseResult` is equal to `ParseResult::kEnd`, which tests `false` when used as a boolean.
Generally, we hope that a failed parse will be recovered, so we will always get a valid parse from
our grammar, but it may or may not contain error nodes where we had to recover a failure.
`ParseResult` contains three pieces of information: the "tail" of the parse, representing the
un-parsed data remaining after the parse, various information comprising an "error score", which
indicates how much error recovery was necessary to make the parse succeed, and a stack, which
contains the AST nodes we have produced.
To understand the stack, consider our `A B` example above. After successfully parsing `S`, the stack
might contain two nodes:
```
Node::A("A") Node::B("B")
```
The `ParseResult`'s `node` accessor would yield `Node::B`, as it yields the top of the stack. We can
call `Reduce()` on a `ParseResult` to collapse the stack into a single node, so if we called
`Reduce<Node::S>()` on our result, we would end up with one node on the stack:
```
Node::S(children = { Node::A Node::B })
```
`Reduce()` will consume the entire stack by default. This is inconvenient for more complicated
parsers. To make `S` safe, we would insert a "marker node" to indicate the point in the stack where
`Reduce()` should stop consuming. Thus our implementation looks like this:
```
ParseResult S(ParseResult prefix) {
return B(A(prefix.Mark())).Reduce<Node::S>();
}
```
This pattern is somewhat unwieldy, so we implement several combinator functions to produce it
automatically. The `NT` ("NonTerminal") combinator takes a parser as input and automatically places
a `Mark()` before it and a `Reduce()` after it, while the `Seq` combinator runs two parsers in
sequence.
```
ParseResult S(ParseResult prefix) {
return NT<Node::S>(Seq(A, B));
}
```
## Error Handling in a Nutshell
Every parse result has an error score. The error score determines how much error correction has been
necessary to keep the parse going. To increase the error score, there are two special combinators:
`ErInsert` and `ErSkip`.
`ErSkip` takes a child parser, which it runs. If that child parser parses without error, the parsed
text is discarded and an error is inserted instead. Here's an example:
```
Alt(Expression, ErSkip("Expected expression before '%MATCH%'", OnePlus(AnyCharBut(";"))));
```
This parser would attempt to parse an expression, and if it can't find one, it will consume up to
the next semicolon and insert an error message. Note the `'%MATCH%'` in the message string; this
will be substituted for the text matched before the semicolon. The error score for the returned
parse result will increase by the length of the text matched by the error.
If the parser given to `ErSkip` doesn't match without error, `ErSkip` fails completely, so error
recovery does not occur.
`ErInsert` is more straightforward than `ErSkip`. It injects the given amount of error at the
current parse position without moving.
```
Alt(Expression, Alt(Token(";"), ErInsert("Expected semicolon", 1)));
```
With this parser, if we do not see an expression, we'll return a `ParseResult` with the same
position as the one we were given and an error score increased by 1.
Conceptually, `ErInsert` can be thought of as inserting text into the parse stream to allow the
parse to continue. As such, it has an overload that allows you to specify the string inserted
instead of a length:
```
Alt(Expression, Alt(Token(";"), ErInsert("Expected semicolon", ";")));
```
This is for readability purposes only; the only property of the string that affects behavior is
its length.
### Levenshtein Error Score Combining
The way the error combinators affect error score has an important caveat. Consider the following:
```
Seq(ErInsert("Expected a thing", 2), ErSkip("Unexpected character", AnyChar));
```
The `ErInsert` adds an error score of 2, and the `ErSkip` parses a single skipped character, so it
adds a score of 1. We would thus expect the total error score to be 3. However, it will actually be
2.
The reason is that the error score is a Levenshtein distance between the parse stream and the
hypothetical parse stream which contains our modified, recovered parse. This means the error score
increases by one for any character that is inserted, deleted, *or replaced*. We don't have a
specific error combinator for replacement, but an insert and a skip consecutively will be merged
into replacement if possible.
This may be more intuitive if we consider the algorithm used to calculate the error score. The parse
result actually maintains three error scores: an insertion error score, a deletion error score, and
an internal error score. They are used as follows:
* `ErSkip` increases the deletion error score
* `ErInsert` increases the insertion error score
* If we parse a non-zero number of characters normally, the internal error score increases by the
maximum of the insertion and deletion error scores, and the insertion and deletion error scores
become zero.
* The error score of the parse result at any given moment is the internal score plus the maximum
of the insertion and deletion scores.