| # scanlex - a simple lexical scanner. |
| |
| ## The Problem of Input |
| |
| Reading things in is harder than writing them out, because more |
| can go wrong on the way in. The read may fail, the text may not be |
| valid UTF-8, the number may be malformed or simply out of range. |
| |
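Each of these failure modes can be reproduced with the standard library alone. What follows is a minimal sketch that does not use scanlex at all; `InputError`, `read_i32` and `Broken` are hypothetical names invented for this illustration.

```rust
use std::io::{self, Read};
use std::num::IntErrorKind;

/// One variant per failure mode named above (hypothetical type).
#[derive(Debug, PartialEq)]
enum InputError {
    Read(String),       // the underlying read failed
    NotUtf8,            // the bytes are not valid UTF-8
    Malformed(String),  // the text is not an integer at all
    OutOfRange(String), // an integer, but it does not fit in i32
}

/// Read everything from `src` and parse it as a single i32.
fn read_i32<R: Read>(mut src: R) -> Result<i32, InputError> {
    let mut bytes = Vec::new();
    src.read_to_end(&mut bytes)
        .map_err(|e| InputError::Read(e.to_string()))?;
    let text = String::from_utf8(bytes).map_err(|_| InputError::NotUtf8)?;
    let text = text.trim();
    text.parse::<i32>().map_err(|e| match e.kind() {
        IntErrorKind::PosOverflow | IntErrorKind::NegOverflow => {
            InputError::OutOfRange(text.to_string())
        }
        _ => InputError::Malformed(text.to_string()),
    })
}

/// A reader that always fails, to simulate a broken source.
struct Broken;

impl Read for Broken {
    fn read(&mut self, _buf: &mut [u8]) -> io::Result<usize> {
        Err(io::Error::new(io::ErrorKind::Other, "simulated read failure"))
    }
}

fn main() {
    println!("{:?}", read_i32(&b"42\n"[..]));       // well-formed input
    println!("{:?}", read_i32(Broken));             // the read fails
    println!("{:?}", read_i32(&b"\xff\xfe"[..]));   // not valid UTF-8
    println!("{:?}", read_i32(&b"12x"[..]));        // malformed number
    println!("{:?}", read_i32(&b"99999999999"[..])); // out of range for i32
}
```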
| ## Lexical Scanners |
| |
| Lexical scanners split a stream of characters into _tokens_. |
| Tokens are returned by repeatedly calling the `get` method of `Scanner` |
| (which returns `Token::End` when no tokens are left), |
| or by iterating over the scanner. They represent numbers, characters, identifiers, |
| or single- or double-quoted strings. There is also `Token::Error` to |
| indicate a badly formed token. |
| |
| scanlex makes some assumptions: for instance, a number may not be |
| directly followed by a letter. This version makes no attempt to decode C-style |
| escape codes in strings. All whitespace is ignored. It is intended |
| for processing generic structured data, rather than code. |
| |
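To make these assumptions concrete, here is a toy tokenizer written with only the standard library. It is a sketch for illustration and is not how scanlex is implemented: `Tok` and `tokenize` are hypothetical names, and only integers are handled. It skips whitespace, leaves escape codes in strings untouched, and reports a number directly followed by a letter (such as `2d`) as an error.

```rust
/// Token kinds for the toy tokenizer (hypothetical, not scanlex's `Token`).
#[derive(Debug, PartialEq)]
enum Tok {
    Iden(String),
    Str(String),
    Int(i64),
    Char(char),
    Error(String),
}

fn tokenize(text: &str) -> Vec<Tok> {
    let mut chars = text.chars().peekable();
    let mut out = Vec::new();
    while let Some(&c) = chars.peek() {
        if c.is_whitespace() {
            // whitespace only separates tokens and is otherwise ignored
            chars.next();
        } else if c.is_ascii_digit() {
            // gather digits *and* any letters glued onto them
            let mut s = String::new();
            while let Some(&d) = chars.peek() {
                if !d.is_alphanumeric() {
                    break;
                }
                s.push(d);
                chars.next();
            }
            // "2d" does not parse, so a number followed by a letter is an error
            out.push(match s.parse::<i64>() {
                Ok(n) => Tok::Int(n),
                Err(_) => Tok::Error(format!("bad number '{}'", s)),
            });
        } else if c.is_alphabetic() || c == '_' {
            let mut s = String::new();
            while let Some(&d) = chars.peek() {
                if !(d.is_alphanumeric() || d == '_') {
                    break;
                }
                s.push(d);
                chars.next();
            }
            out.push(Tok::Iden(s));
        } else if c == '\'' || c == '"' {
            chars.next(); // consume the opening quote
            // copy everything up to the matching quote; escapes are not decoded
            let mut s = String::new();
            let mut closed = false;
            for d in chars.by_ref() {
                if d == c {
                    closed = true;
                    break;
                }
                s.push(d);
            }
            out.push(if closed {
                Tok::Str(s)
            } else {
                Tok::Error("unterminated string".to_string())
            });
        } else {
            // anything else is a single-character token
            chars.next();
            out.push(Tok::Char(c));
        }
    }
    out
}

fn main() {
    println!("{:?}", tokenize("hello 'dolly' * 42"));
    println!("{:?}", tokenize("2d"));
}
```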
| For example, scanlex breaks the string "hello 'dolly' * 42" into four tokens: |
| |
| - an _identifier_ 'hello' |
| - a quoted string 'dolly' |
| - a character '*' |
| - and a number 42 |
| |
| |
| ```rust |
| extern crate scanlex; |
| use scanlex::{Scanner,Token}; |
| |
| let mut scan = Scanner::new("hello 'dolly' * 42"); |
| assert_eq!(scan.get(),Token::Iden("hello".into())); |
| assert_eq!(scan.get(),Token::Str("dolly".into())); |
| assert_eq!(scan.get(),Token::Char('*')); |
| assert_eq!(scan.get(),Token::Int(42)); |
| assert_eq!(scan.get(),Token::End); |
| ``` |
| To extract the values, use code like this: |
| |
| ```rust |
| let greeting = scan.get_iden()?; |
| let person = scan.get_string()?; |
| let op = scan.get_char()?; |
| let answer = scan.get_integer()?; // i64 |
| ``` |
| |
| |
| `Scanner` implements `Iterator`. If you just wanted to extract the words from |
| a string, then filtering with `as_iden` will do the trick, since it returns |
| `Option<String>`. |
| |
| ```rust |
| let s = Scanner::new("bonzo 42 dog (cat)"); |
| let v: Vec<_> = s.filter_map(|t| t.as_iden()).collect(); |
| assert_eq!(v,&["bonzo","dog","cat"]); |
| ``` |
| |
| The same strategy with `as_number` extracts all the numbers from a |
| document, ignoring all other structure. The `scan.rs` example shows the tokens |
| generated by scanning the string given on the command line. |
| |
| The iterator stops only at `Token::End`. It yields `Token::Error` like any other token, so you must handle errors yourself. |
| |
| Usually it's important _not_ to ignore structure. Say we have input strings that |
| look like this "(WORD) = NUMBER": |
| |
| ```rust |
| scan.skip_chars("(")?; |
| let word = scan.get_iden()?; |
| scan.skip_chars(")=")?; |
| let num = scan.get_number()?; |
| ``` |
| |
| _Any_ of these calls may fail! |
| |
| It is a common pattern to create a scanner for each line of text read from a readable |
| source. The `scanline.rs` example shows how to use `ScanLines` to accomplish this. |
| |
| ```rust |
| use std::fs::File; |
| use scanlex::ScanLines; |
| |
| let f = File::open("scanline.rs").expect("cannot open scanline.rs"); |
| let mut iter = ScanLines::new(&f); |
| while let Some(s) = iter.next() { |
|     let mut s = s.expect("cannot read line"); |
|     // show the first token of each line |
|     println!("{:?}",s.get()); |
| } |
| ``` |
| |
| A more serious example (taken from the tests) is parsing JSON: |
| |
| ```rust |
| use std::collections::HashMap; |
| use scanlex::{Scanner,Token,ScanError}; |
| |
| type JsonArray = Vec<Box<Value>>; |
| type JsonObject = HashMap<String,Box<Value>>; |
| |
| #[derive(Debug, Clone, PartialEq)] |
| pub enum Value { |
|     Str(String), |
|     Num(f64), |
|     Bool(bool), |
|     Arr(JsonArray), |
|     Obj(JsonObject), |
|     Null |
| } |
| |
| fn scan_json(scan: &mut Scanner) -> Result<Value,ScanError> { |
|     use Value::*; |
|     match scan.get() { |
|         Token::Str(s) => Ok(Str(s)), |
|         Token::Num(x) => Ok(Num(x)), |
|         Token::Int(n) => Ok(Num(n as f64)), |
|         Token::End => Err(scan.scan_error("unexpected end of input",None)), |
|         Token::Error(e) => Err(e), |
|         Token::Iden(s) => |
|             if s == "null" {Ok(Null)} |
|             else if s == "true" {Ok(Bool(true))} |
|             else if s == "false" {Ok(Bool(false))} |
|             else {Err(scan.scan_error(&format!("unknown identifier '{}'",s),None))}, |
|         Token::Char(c) => |
|             if c == '[' { |
|                 let mut ja = Vec::new(); |
|                 let mut ch = c; |
|                 while ch != ']' { |
|                     let o = scan_json(scan)?; |
|                     ch = scan.get_ch_matching(&[',',']'])?; |
|                     ja.push(Box::new(o)); |
|                 } |
|                 Ok(Arr(ja)) |
|             } else if c == '{' { |
|                 let mut jo = HashMap::new(); |
|                 let mut ch = c; |
|                 while ch != '}' { |
|                     let key = scan.get_string()?; |
|                     scan.get_ch_matching(&[':'])?; |
|                     let o = scan_json(scan)?; |
|                     ch = scan.get_ch_matching(&[',','}'])?; |
|                     jo.insert(key,Box::new(o)); |
|                 } |
|                 Ok(Obj(jo)) |
|             } else { |
|                 Err(scan.scan_error(&format!("bad char '{}'",c),None)) |
|             } |
|     } |
| } |
| ``` |
| |
| (This is of course an Illustrative Example. JSON is a solved problem.) |
| |
| ## Options |
| |
| With `no_float` you get a barebones parser that does not recognize floats, |
| just integers, strings, chars and identifiers. This is useful if the |
| existing rules are too strict: for example, "2d" is fine in `no_float` mode, but |
| an error in the default mode. [chrono-english](https://github.com/stevedonovan/chrono-english) |
| uses this mode to parse date expressions. |
| |
| With `line_comment` you provide a character; after this character, the rest of the current line |
| will be ignored. |
| |