# scanlex - a simple lexical scanner.
## The Problem of Input
Writing data out is easier than reading it in, because far more
can go wrong on input: the read may fail, the text may not be
valid UTF-8, and a number may be malformed or simply out of range.
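The failure modes listed above can be made concrete with a small standard-library sketch. It does not use scanlex; `InputError` and `read_i32` are hypothetical names chosen for illustration, and the function simply reads a whole source and tries to interpret it as one `i32`.

```rust
use std::io::Read;
use std::num::IntErrorKind;

/// One variant per failure mode (hypothetical type, not part of scanlex).
#[derive(Debug)]
enum InputError {
    Io(std::io::Error), // the read itself failed
    NotUtf8,            // the bytes are not valid UTF-8
    Malformed,          // the text is not an integer
    OutOfRange,         // an integer, but it does not fit in an i32
}

/// Read everything from `source` and parse it as a single `i32`.
fn read_i32<R: Read>(mut source: R) -> Result<i32, InputError> {
    let mut bytes = Vec::new();
    source.read_to_end(&mut bytes).map_err(InputError::Io)?;
    let text = String::from_utf8(bytes).map_err(|_| InputError::NotUtf8)?;
    text.trim().parse::<i32>().map_err(|e| match e.kind() {
        IntErrorKind::PosOverflow | IntErrorKind::NegOverflow => InputError::OutOfRange,
        _ => InputError::Malformed,
    })
}
```

A byte slice implements `Read`, so `read_i32(&b"42\n"[..])` is a quick way to try it; each of the four error variants corresponds to one of the problems named above.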
## Lexical Scanners
Lexical scanners split a stream of characters into _tokens_.
Tokens are returned by repeatedly calling the `get` method of `Scanner`
(which returns `Token::End` when no tokens are left)
or by iterating over the scanner. They represent numbers, characters, identifiers,
or single/double quoted strings. There is also `Token::Error` to
indicate a badly formed token.
This lexical scanner makes some assumptions; for example, a number
may not be directly followed by a letter.
No attempt is made in this version to decode C-style
escape codes in strings. All whitespace is ignored. It's intended
for processing generic structured data, rather than code.
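To illustrate these rules, here is a toy scanner written against the standard library only. It is a sketch of the idea, not scanlex's implementation: `Tok` and `toy_scan` are hypothetical names, floats are left out entirely, and the error messages are made up.

```rust
/// Toy token kinds, loosely modelled on the description above.
#[derive(Debug, PartialEq)]
enum Tok {
    Iden(String),
    Str(String),
    Int(i64),
    Char(char),
    Error(String),
}

/// Split `text` into tokens, skipping all whitespace.
fn toy_scan(text: &str) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut chars = text.chars().peekable();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            chars.next();
        } else if c.is_ascii_digit() {
            // Collect the digits together with any letters glued onto them.
            let mut word = String::new();
            while let Some(&d) = chars.peek() {
                if !d.is_alphanumeric() { break; }
                word.push(d);
                chars.next();
            }
            // A number directly followed by a letter (e.g. "2d") fails to parse.
            out.push(match word.parse::<i64>() {
                Ok(n) => Tok::Int(n),
                Err(_) => Tok::Error(format!("bad number '{}'", word)),
            });
        } else if c.is_alphabetic() || c == '_' {
            let mut word = String::new();
            while let Some(&d) = chars.peek() {
                if !(d.is_alphanumeric() || d == '_') { break; }
                word.push(d);
                chars.next();
            }
            out.push(Tok::Iden(word));
        } else if c == '\'' || c == '"' {
            chars.next(); // consume the opening quote
            let mut s = String::new();
            let mut closed = false;
            // Characters are copied verbatim: escape codes are not decoded.
            for d in chars.by_ref() {
                if d == c { closed = true; break; }
                s.push(d);
            }
            out.push(if closed { Tok::Str(s) } else { Tok::Error("unterminated string".into()) });
        } else {
            // Anything else is a single-character token.
            chars.next();
            out.push(Tok::Char(c));
        }
    }
    out
}
```

With this sketch, `toy_scan("x = 10")` yields `Iden("x")`, `Char('=')` and `Int(10)`, while `toy_scan("2d")` yields a single `Error` token.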
For example, the string "hello 'dolly' * 42" will be broken into four tokens:
- an _identifier_ 'hello'
- a quoted string 'dolly'
- a character '*'
- and a number 42
```rust
extern crate scanlex;
use scanlex::{Scanner,Token};
let mut scan = Scanner::new("hello 'dolly' * 42");
assert_eq!(scan.get(),Token::Iden("hello".into()));
assert_eq!(scan.get(),Token::Str("dolly".into()));
assert_eq!(scan.get(),Token::Char('*'));
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);
```
To extract the values, use code like this:
```rust
let greeting = scan.get_iden()?;
let person = scan.get_string()?;
let op = scan.get_char()?;
let answer = scan.get_integer()?; // i64
```
`Scanner` implements `Iterator`. If you just wanted to extract the words from
a string, then filtering with `as_iden` will do the trick, since it returns
`Option<String>`.
```rust
let s = Scanner::new("bonzo 42 dog (cat)");
let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect();
assert_eq!(v,&["bonzo","dog","cat"]);
```
Using `as_number` instead, you can apply the same strategy to extract all the numbers from a
document, ignoring all other structure. The `scan.rs` example shows the tokens
that would be generated by parsing a string given on the command line.
This iterator only stops at `Token::End` - you can handle `Token::Error` yourself.
Usually it's important _not_ to ignore structure. Say we have input strings that
look like this "(WORD) = NUMBER":
```rust
scan.skip_chars("(")?;
let word = scan.get_iden()?;
scan.skip_chars(")=")?;
let num = scan.get_number()?;
```
_Any_ of these calls may fail!
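To see where those failures come from, here is a hand-rolled, standard-library-only equivalent of the same "(WORD) = NUMBER" parse. It is a sketch for comparison and does not use scanlex; `parse_pair` is a hypothetical helper, and its error messages are invented. Each step returns early with `?`, mirroring the four calls above.

```rust
/// Parse a line of the form "(WORD) = NUMBER" by hand.
/// Every step can fail, so every step propagates its error with `?`.
fn parse_pair(line: &str) -> Result<(String, f64), String> {
    // Step 1: the opening parenthesis.
    let rest = line.trim().strip_prefix('(').ok_or("expected '('")?;
    // Step 2: the word, up to the closing parenthesis.
    let (word, rest) = rest.split_once(')').ok_or("expected ')'")?;
    let word = word.trim();
    if word.is_empty() || !word.chars().all(|c| c.is_alphanumeric() || c == '_') {
        return Err(format!("bad word '{}'", word));
    }
    // Step 3: the equals sign.
    let rest = rest.trim_start().strip_prefix('=').ok_or("expected '='")?;
    // Step 4: the number.
    let num = rest
        .trim()
        .parse::<f64>()
        .map_err(|e| format!("bad number: {}", e))?;
    Ok((word.to_string(), num))
}
```

Dropping any one piece of the input (a parenthesis, the `=`, or the number) makes the corresponding step return `Err`, which is why the calling function needs to return a `Result` itself.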
It is a common pattern to create a scanner for each line of text read from a readable
source. The `scanline.rs` example shows how to use `ScanLines` to accomplish this.
```rust
use std::fs::File;
use scanlex::ScanLines;

let f = File::open("scanline.rs").expect("cannot open scanline.rs");
let mut iter = ScanLines::new(&f);
while let Some(s) = iter.next() {
    let mut s = s.expect("cannot read line");
    // show the first token of each line
    println!("{:?}",s.get());
}
```
A more serious example (taken from the tests) is parsing JSON:
```rust
use std::collections::HashMap;
use scanlex::{Scanner,Token,ScanError};

type JsonArray = Vec<Box<Value>>;
type JsonObject = HashMap<String,Box<Value>>;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
    Arr(JsonArray),
    Obj(JsonObject),
    Null
}

fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    use Value::*;
    match scan.get() {
        Token::Str(s) => Ok(Str(s)),
        Token::Num(x) => Ok(Num(x)),
        Token::Int(n) => Ok(Num(n as f64)),
        Token::End => Err(scan.scan_error("unexpected end of input",None)),
        Token::Error(e) => Err(e),
        Token::Iden(s) =>
            if s == "null" {Ok(Null)}
            else if s == "true" {Ok(Bool(true))}
            else if s == "false" {Ok(Bool(false))}
            else {Err(scan.scan_error(&format!("unknown identifier '{}'",s),None))},
        Token::Char(c) =>
            if c == '[' {
                // note: this simple loop expects at least one element, so `[]` is rejected
                let mut ja = Vec::new();
                let mut ch = c;
                while ch != ']' {
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',',']'])?;
                    ja.push(Box::new(o));
                }
                Ok(Arr(ja))
            } else
            if c == '{' {
                // each entry is "key": value, separated by ',' and closed by '}'
                let mut jo = HashMap::new();
                let mut ch = c;
                while ch != '}' {
                    let key = scan.get_string()?;
                    scan.get_ch_matching(&[':'])?;
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',','}'])?;
                    jo.insert(key,Box::new(o));
                }
                Ok(Obj(jo))
            } else {
                Err(scan.scan_error(&format!("bad char '{}'",c),None))
            }
    }
}
```
(This is of course an Illustrative Example. JSON is a solved problem.)
## Options
With `no_float` you get a barebones parser that does not recognize floats,
just integers, strings, chars and identifiers. This is useful if the
existing rules are too strict - e.g. "2d" is fine in `no_float` mode, but
an error in the default mode. [chrono-english](https://github.com/stevedonovan/chrono-english)
uses this mode to parse date expressions.
With `line_comment` you provide a character; after this character, the rest of the current line
will be ignored.
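The effect of a line comment character can be pictured with a standard-library sketch. This only illustrates the described behaviour and is not how scanlex implements it: `strip_line_comment` is a hypothetical helper, and it makes the simplifying assumption that the character never appears inside a quoted string (how scanlex treats that case is not covered here).

```rust
/// Return the part of `line` before the first `marker`, or the whole
/// line if the marker does not occur. Quoted strings are NOT considered.
fn strip_line_comment(line: &str, marker: char) -> &str {
    match line.find(marker) {
        Some(pos) => &line[..pos],
        None => line,
    }
}
```

For example, with `'#'` as the marker, the line `x = 10 # ten` is cut down to `x = 10 ` before any tokens are read from it.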