regex-automata
==============
A low level regular expression library that uses deterministic finite automata.
It supports a rich syntax with Unicode support, has extensive options for
configuring the best space vs time trade off for your use case and provides
support for cheap deserialization of automata for use in `no_std` environments.

[Build status](https://github.com/BurntSushi/regex-automata/actions)
[crates.io](https://crates.io/crates/regex-automata)

Dual-licensed under MIT or the [UNLICENSE](http://unlicense.org).


### Documentation

https://docs.rs/regex-automata


### Usage

Add this to your `Cargo.toml`:

```toml
[dependencies]
regex-automata = "0.1"
```

and this to your crate root (if you're using Rust 2015):

```rust
extern crate regex_automata;
```

### Example: basic regex searching

This example shows how to compile a regex using the default configuration
and then use it to find matches in a byte string:

```rust
use regex_automata::Regex;

let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let text = b"2018-12-24 2016-10-08";
let matches: Vec<(usize, usize)> = re.find_iter(text).collect();
assert_eq!(matches, vec![(0, 10), (11, 21)]);
```

For more examples and information about the various knobs that can be turned,
please see the [docs](https://docs.rs/regex-automata).


### Support for `no_std`

This crate comes with a `std` feature that is enabled by default. When the
`std` feature is enabled, the API of this crate will include the facilities
necessary for compiling, serializing, deserializing and searching with regular
expressions. When the `std` feature is disabled, the API of this crate will
shrink such that it only includes the facilities necessary for deserializing
and searching with regular expressions.

The intended workflow for `no_std` environments is thus as follows:

* Write a program with the `std` feature that compiles and serializes a
regular expression. Serialization should only happen after first converting
the DFAs to use a fixed size state identifier instead of the default `usize`.
You may also need to serialize both little and big endian versions of each
DFA. (So that's 4 DFAs in total for each regex.)
* In your `no_std` environment, deserialize your previously serialized DFAs
into regexes (a rough sketch of both steps is shown below; the
[docs](https://docs.rs/regex-automata) have complete examples). You can then
search with them as you would any regex.

Deserialization can happen anywhere. For example, with bytes embedded into a
binary or with a file memory mapped at runtime.
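
Here is a minimal sketch of both halves of that workflow. It assumes the
`DenseDFA` serialization methods from the 0.1 docs (`to_u16`,
`to_bytes_little_endian` and the unsafe `from_bytes`); treat it as an
approximation and consult the docs for the authoritative examples:

```rust
use regex_automata::{DenseDFA, DFA};

// Step 1 (requires the `std` feature): compile a DFA, shrink its state
// identifiers from `usize` to a fixed size and serialize it to raw bytes.
let dfa = DenseDFA::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
let bytes = dfa.to_u16().unwrap().to_bytes_little_endian().unwrap();

// Step 2 (works without `std`): deserialize the raw bytes and search.
// Deserialization is cheap (no copying or allocation), but `from_bytes` is
// unsafe because the caller must guarantee the bytes encode a valid DFA.
let dfa: DenseDFA<&[u16], u16> = unsafe { DenseDFA::from_bytes(&bytes) };
assert_eq!(Some(10), dfa.find(b"2018-12-24"));
```

Keep in mind that a full `Regex` is built from a forward and a reverse DFA, so
the serialization step above is repeated for each DFA (and for each byte order
you need to support).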

Note that the
[`ucd-generate`](https://github.com/BurntSushi/ucd-generate)
tool will do the first step for you with its `dfa` or `regex` sub-commands.


### Cargo features

* `std` - **Enabled** by default. This enables the ability to compile finite
automata. This requires the `regex-syntax` dependency. Without this feature
enabled, finite automata can only be used for searching (using the approach
described above). A `Cargo.toml` sketch for both setups follows this list.
* `transducer` - **Disabled** by default. This provides implementations of the
`Automaton` trait found in the `fst` crate. This permits using finite
automata generated by this crate to search finite state transducers. This
requires the `fst` dependency.
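
For example (feature names as listed above; the rest is plain Cargo syntax),
a compile-and-serialize tool and a `no_std` search-only target might declare
the dependency like this:

```toml
# Tool that compiles and serializes DFAs: default features, which include `std`.
[dependencies]
regex-automata = "0.1"

# no_std target that only deserializes and searches:
# regex-automata = { version = "0.1", default-features = false }

# Optional integration with the `fst` crate:
# regex-automata = { version = "0.1", features = ["transducer"] }
```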


### Differences with the regex crate

The main goal of the [`regex`](https://docs.rs/regex) crate is to serve as a
general purpose regular expression engine. It aims to automatically balance low
compile times, fast search times and low memory usage, while also providing
a convenient API for users. In contrast, this crate provides a lower level
regular expression interface that is a bit less convenient while providing more
explicit control over memory usage and search times.

Here are some specific negative differences:

* **Compilation can take an exponential amount of time and space** in the size
of the regex pattern. While most patterns do not exhibit worst case
exponential time, such patterns do exist. For example, `[01]*1[01]{N}` will
build a DFA with `2^(N+1)` states. For this reason, untrusted patterns should
not be compiled with this library. (In the future, the API may expose an
option to return an error if the DFA gets too big.)
* This crate does not support sub-match extraction, which can be achieved with
the regex crate's "captures" API. This may be added in the future, but is
unlikely.
* While the regex crate doesn't necessarily sport fast compilation times, the
regexes in this crate are almost universally slow to compile, especially when
they contain large Unicode character classes. For example, on my system,
compiling `\w{3}` with byte classes enabled takes just over 1 second and
almost 5MB of memory! (Compiling a sparse regex takes about the same time
but only uses about 500KB of memory.) Conversely, compiling the same regex
without Unicode support, e.g., `(?-u)\w{3}`, takes under 1 millisecond and
less than 5KB of memory. For this reason, you should only use Unicode
character classes if you absolutely need them!
* This crate does not support regex sets.
* This crate does not support zero-width assertions such as `^`, `$`, `\b` or
`\B`.
* As a lower level crate, this library does not do literal optimizations. In
exchange, you get predictable performance regardless of input. The
philosophy here is that literal optimizations should be applied at a higher
level, although there is no easy support for this in the ecosystem yet.
* There is no `&str` API like in the regex crate. In this crate, all APIs
operate on `&[u8]`. By default, match indices are guaranteed to fall on
UTF-8 boundaries, unless `RegexBuilder::allow_invalid_utf8` is enabled (see
the sketch after this list).

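To make the last two points concrete, here is a small sketch of a byte
oriented search. It leans on the `RegexBuilder::allow_invalid_utf8` knob named
above; the exact builder calls are my reading of the 0.1 API, so double-check
them against the docs:

```rust
use regex_automata::RegexBuilder;

// `(?-u)` turns off Unicode mode, which keeps the character classes (and the
// resulting DFA) small and lets the pattern match arbitrary bytes. Allowing
// invalid UTF-8 is required for patterns that can match non-UTF-8 bytes.
let re = RegexBuilder::new()
    .allow_invalid_utf8(true)
    .build(r"(?-u)\xFF[a-z]+\xFF")
    .unwrap();

// The haystack is a `&[u8]` and is not valid UTF-8.
let haystack = b"\x00\xFFabc\xFF\x00";
assert_eq!(Some((1, 6)), re.find(haystack));
```
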
With some of the downsides out of the way, here are some positive differences:

* Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply
deserialized. Deserialization always takes constant time since searching can
be performed directly on the raw serialized bytes of a DFA.
* This crate was specifically designed so that the searching phase of a DFA has
minimal runtime requirements, and can therefore be used in `no_std`
environments. While `no_std` environments cannot compile regexes, they can
deserialize pre-compiled regexes.
* Since this crate builds DFAs ahead of time, it will generally out-perform
the `regex` crate on equivalent tasks. The performance difference is likely
not large. However, because of a complex set of optimizations in the regex
crate (like literal optimizations), an accurate performance comparison may be
difficult to do.
* Sparse DFAs provide a way to build a DFA ahead of time that sacrifices search
performance a small amount, but uses much less storage space, potentially
even less than what the regex crate uses.
* This crate exposes DFAs directly, such as `DenseDFA` and `SparseDFA`,
which enables one to do less work in some cases. For example, if you only
need the end of a match and not the start of a match, then you can use a DFA
directly without building a `Regex`, which always requires a second DFA to
find the start of a match. (See the sketch after this list.)
* Aside from choosing between dense and sparse DFAs, there are several options
for configuring the space usage vs search time trade off. These include
choosing a smaller state identifier representation, premultiplying state
identifiers and splitting a DFA's alphabet into equivalence classes. Finally,
DFA minimization is also provided, but can increase compilation times
dramatically.
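
As a sketch of those last two points, here is what using the DFAs directly
might look like. The `find` and `to_sparse` calls are my understanding of the
0.1 `DenseDFA`/`SparseDFA` API, so verify them against the docs:

```rust
use regex_automata::{DenseDFA, DFA};

// A single forward DFA is enough when only the end of a match matters, so
// there is no need to build a `Regex` (which adds a reverse DFA).
let dense = DenseDFA::new(r"[a-z]+[0-9]+").unwrap();
assert_eq!(Some(6), dense.find(b"abc123 xyz"));

// The same search with a sparse DFA: a bit slower, but much smaller.
let sparse = dense.to_sparse().unwrap();
assert_eq!(Some(6), sparse.find(b"abc123 xyz"));
```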


### Future work

* Look into being smarter about generating NFA states for large Unicode
character classes. These can create a lot of additional work for both the
determinizer and the minimizer, and I suspect this is the key thing we'll
want to improve if we want to make DFA compile times faster. I *believe*
it's possible to build minimal or nearly minimal NFAs for the special case
of Unicode character classes by leveraging Daciuk's algorithms for building
minimal automata in linear time for sets of strings. See
https://blog.burntsushi.net/transducers/#construction for more details. The
key adaptation I think we need to make is to modify the algorithm to operate
on byte ranges instead of enumerating every codepoint in the set. Otherwise,
it might not be worth doing.
* Add support for regex sets. It should be possible to do this by "simply"
introducing more match states. I think we can also report the positions at
each match, similar to how Aho-Corasick works. I think the long pole in the
tent here is probably the API design work and arranging it so that we don't
introduce extra overhead into the non-regex-set case without duplicating a
lot of code. It seems doable.
* Stretch goal: support capturing groups by implementing "tagged" DFAs
(transducers). Laurikari's paper is the usual reference here, but Trofimovich
has a much more thorough treatment:
http://re2c.org/2017_trofimovich_tagged_deterministic_finite_automata_with_lookahead.pdf
I've only read the paper once. I suspect it will require at least a few more
read-throughs before I understand it.
See also: http://re2c.org/
* Possibly less ambitious goal: can we select a portion of Trofimovich's work
to make small fixed length look-around work? It would be really nice to
support `^`, `$` and `\b`, especially the Unicode variant of `\b` and a
CRLF aware `$`.
* Experiment with generating Rust code. There is an early experiment in
src/codegen.rs that is thoroughly bit-rotted. At the time, I was
experimenting with whether or not codegen would significantly decrease the
size of a DFA, since if you squint hard enough, it's kind of like a sparse
representation. However, it didn't shrink as much as I thought it would, so
I gave up. The other problem is that Rust doesn't support gotos, so I don't
even know whether the "match on each state" in a loop thing will be fast
enough. Either way, it's probably a good option to have. For one thing, it
would be endian independent, whereas the serialization format of the DFAs in
this crate is endian dependent (so you need two versions of every DFA, but
you only need to use one of them for any given arch).
* Experiment with unrolling the match loops and fill out the benchmarks.
* Add some kind of streaming API. I believe users of the library can already
implement something for this outside of the crate, but it would be good to
provide an official API. The key thing here is figuring out the API. I
suspect we might want to support several variants.
* Make a decision on whether or not there is room for literal optimizations
in this crate. My original intent was to not let this crate sink down into
that very very very deep rabbit hole. But instead, we might want to provide
some way for literal optimizations to hook into the match routines. The right
path forward here is to probably build something outside of the crate and
then see about integrating it. After all, users can implement their own
match routines just as efficiently as what the crate provides.
* A key downside of DFAs is that they can take up a lot of memory and can be
quite costly to build. Their worst case compilation time is `O(2^n)`, where
`n` is the number of NFA states. A paper by Yang and Prasanna (2011) actually
seems to provide a way to characterize state blowup such that it is
detectable. If we could know whether a regex will exhibit state explosion or
not, then we could make an intelligent decision about whether to ahead-of-time
compile a DFA.