It is easier to write things out than to read them in, since more things can go wrong: the read may fail, the text may not be valid UTF-8, a number may be malformed or simply out of range.
Lexical scanners split a stream of characters into tokens. Tokens are returned by repeatedly calling the `get` method of `Scanner` (which returns `Token::End` if no tokens are left), or by iterating over the scanner. Tokens represent numbers, characters, identifiers, or single- or double-quoted strings. There is also `Token::Error` to indicate a badly formed token.
This lexical scanner makes some simplifying assumptions, such as that a number may not be directly followed by a letter. No attempt is made in this version to decode C-style escape codes in strings, and all whitespace is ignored. It is intended for processing generic structured data rather than code.
For example, the string `"hello 'dolly' * 42"` will be broken into four tokens, followed by `Token::End`:

```rust
extern crate scanlex;
use scanlex::{Scanner,Token};

let mut scan = Scanner::new("hello 'dolly' * 42");
assert_eq!(scan.get(),Token::Iden("hello".into()));
assert_eq!(scan.get(),Token::Str("dolly".into()));
assert_eq!(scan.get(),Token::Char('*'));
assert_eq!(scan.get(),Token::Int(42));
assert_eq!(scan.get(),Token::End);
```
To extract the values, use code like this:

```rust
let greeting = scan.get_iden()?;
let person = scan.get_string()?;
let op = scan.get_char()?;
let answer = scan.get_integer()?; // i64
```
`Scanner` implements `Iterator`. If you just want to extract the words from a string, then filtering with `as_iden` will do the trick, since it returns `Option<String>`:

```rust
let s = Scanner::new("bonzo 42 dog (cat)");
let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect();
assert_eq!(v,&["bonzo","dog","cat"]);
```
Using `as_number` instead, the same strategy will extract all the numbers out of a document, ignoring all other structure. The `scan.rs` example shows you the tokens that would be generated by parsing the given string on the command-line.

This iterator only stops at `Token::End`; you can handle `Token::Error` yourself.
Usually it's important not to ignore structure. Say we have input strings that look like `"(WORD) = NUMBER"`:

```rust
scan.skip_chars("(")?;
let word = scan.get_iden()?;
scan.skip_chars(")=")?;
let num = scan.get_number()?;
```

Any of these calls may fail!
It is a common pattern to create a scanner for each line of text read from some readable source. The `scanline.rs` example shows how to use `ScanLines` to accomplish this:

```rust
let f = File::open("scanline.rs").expect("cannot open scanline.rs");
let mut iter = ScanLines::new(&f);
while let Some(s) = iter.next() {
    let mut s = s.expect("cannot read line");
    // show the first token of each line
    println!("{:?}",s.get());
}
```
A more serious example (taken from the tests) is parsing JSON:

```rust
type JsonArray = Vec<Box<Value>>;
type JsonObject = HashMap<String,Box<Value>>;

#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Str(String),
    Num(f64),
    Bool(bool),
    Arr(JsonArray),
    Obj(JsonObject),
    Null
}

fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> {
    use Value::*;
    match scan.get() {
        Token::Str(s) => Ok(Str(s)),
        Token::Num(x) => Ok(Num(x)),
        Token::Int(n) => Ok(Num(n as f64)),
        Token::End => Err(scan.scan_error("unexpected end of input",None)),
        Token::Error(e) => Err(e),
        Token::Iden(s) =>
            if s == "null" {Ok(Null)}
            else if s == "true" {Ok(Bool(true))}
            else if s == "false" {Ok(Bool(false))}
            else {Err(scan.scan_error(&format!("unknown identifier '{}'",s),None))},
        Token::Char(c) =>
            if c == '[' {
                let mut ja = Vec::new();
                let mut ch = c;
                while ch != ']' {
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',',']'])?;
                    ja.push(Box::new(o));
                }
                Ok(Arr(ja))
            } else if c == '{' {
                let mut jo = HashMap::new();
                let mut ch = c;
                while ch != '}' {
                    let key = scan.get_string()?;
                    scan.get_ch_matching(&[':'])?;
                    let o = scan_json(scan)?;
                    ch = scan.get_ch_matching(&[',','}'])?;
                    jo.insert(key,Box::new(o));
                }
                Ok(Obj(jo))
            } else {
                Err(scan.scan_error(&format!("bad char '{}'",c),None))
            }
    }
}
```
(This is of course an Illustrative Example. JSON is a solved problem.)
With `no_float` you get a bare-bones parser that does not recognize floats, just integers, strings, chars, and identifiers. This is useful if the default rules are too strict; for example, `"2d"` is fine in `no_float` mode but an error in the default mode. chrono-english uses this mode to parse date expressions.
With `line_comment` you provide a character; after this character, the rest of the current line will be ignored.