Netstack3 Improvements

This document describes various improvements that we might want to make to the stack's structure or algorithms in the future, but that we don't yet have concrete plans to implement. It's essentially a bucket of ideas that we don't want to forget.

Performance Issues

This section documents design and implementation details that we speculate may cause performance issues.

Single-threaded execution

Contact: joshlf@google.com

Our single-threaded execution model is great for simplicity, but will likely run into problems in practice depending on our performance goals. These include:

  • Large state changes (add/remove device, update forwarding table, etc.) take much longer to compute than a normal event like the receipt of a packet, and stall the entire event loop until they complete. This may introduce unacceptably large latency spikes.
  • If packet flow is high enough (e.g., in the case of a router forwarding traffic for a network), the single execution thread may not be able to keep up with throughput.

Possible solutions

  • Use read-copy-update (RCU) or another read-optimized architecture. It is likely that any performance goals we have, regardless of application domain, will heavily favor read throughput over write throughput. “Reading” here refers to reading shared data structures like device state, connection state, forwarding tables, etc. In a read-optimized architecture, most threads could spend most of their time handling common events like packet receipts or application client messages, staying on the optimized, read-only hot path. Write operations such as device insertion/removal or forwarding table updates would likely perform much worse than in the current model, but that's probably an acceptable tradeoff.

    Note that RCU and similar schemes can incur significant complexity. However, I believe that Rust should make it much easier to encapsulate this complexity such that most of the stack can remain agnostic to these details.

    Another thing to keep in mind with this approach is that the read/write distinction is not a perfect one. For example, handling a segment in a TCP connection involves read-only operations on global data structures such as the forwarding table, but write operations on the TCP connection state. It's likely that, even in a read-optimized world, coarser-grained synchronization around less global state like TCP connection state would be acceptable. A minimal sketch of the read-optimized pattern follows.
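
    As a minimal sketch of the read-optimized idea - copy-on-write snapshots rather than RCU proper, with all names here hypothetical - readers briefly lock only to clone an Arc and then work on an immutable snapshot, while writers build a new copy and publish it:

use std::sync::{Arc, RwLock};

// Hypothetical stand-in for a shared, read-mostly structure.
struct ForwardingTable {
    routes: Vec<(u32, u32)>, // simplified (subnet, device) pairs
}

struct SharedTable {
    current: RwLock<Arc<ForwardingTable>>,
}

impl SharedTable {
    /// Hot path: one brief lock acquisition to clone the Arc, after which the
    /// reader works on an immutable snapshot without blocking anyone.
    fn snapshot(&self) -> Arc<ForwardingTable> {
        self.current.read().unwrap().clone()
    }

    /// Cold path: copy the table, apply the update, and publish the new copy.
    /// In-flight readers keep using their old snapshots until they drop them.
    /// (A real implementation would also serialize concurrent writers, e.g.,
    /// with a separate mutex.)
    fn update(&self, f: impl FnOnce(&mut ForwardingTable)) {
        let mut new_table = ForwardingTable { routes: self.snapshot().routes.clone() };
        f(&mut new_table);
        *self.current.write().unwrap() = Arc::new(new_table);
    }
}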

debug_err! macro

Contact: joshlf@google.com

The debug_err! macro is used to emit a debug log and then return an error value. The debug_err_fn! macro produces a closure which calls debug_err!. They are both used extensively in the wire module during packet parsing. As of this writing, debug_err! is implemented as follows:

/// Emit a debug message and return an error.
///
/// Invoke the `debug!` macro on all but the first argument. A call to
/// `debug_err!(err, ...)` is an expression whose value is the expression `err`.
macro_rules! debug_err {
    ($err:expr, $($arg:tt)*) => (
        {
            use ::log::debug;
            debug!($($arg)*);
            $err
        }
    )
}

I speculate that the unconditional use of the debug! macro may cause performance problems: even when debug logging is disabled, every invocation still performs a runtime check of whether it is enabled. One way to alleviate this issue would be some kind of conditional compilation. E.g., we could add a compile-time flag to enable or disable the debug! invocation entirely, as sketched below. We could also use the log crate's release_max_level_xxx features to compile out debug logging globally in release builds.
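
For example, a sketch of the compile-time flag approach, assuming a hypothetical debug-logging Cargo feature (the release_max_level_xxx route requires no macro changes at all):

/// Emit a debug message (if compiled in) and return an error.
///
/// When the hypothetical `debug-logging` feature is disabled, the `debug!`
/// invocation - including its runtime "is debug logging enabled?" check - is
/// compiled out entirely, and the macro reduces to just `$err`. Note that the
/// format arguments are then never evaluated, so they must not have side
/// effects.
macro_rules! debug_err {
    ($err:expr, $($arg:tt)*) => (
        {
            #[cfg(feature = "debug-logging")]
            {
                use ::log::debug;
                debug!($($arg)*);
            }
            $err
        }
    )
}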

Pervasive use of HashMaps

Contact: joshlf@google.com

In order to play nicely with Rust's ownership model, we generally pass HashMap keys around where implementations in other languages might pass pointers. This allows most functions to take a mutable reference to the entire state, but also requires that we do fairly frequent hash map lookups even in the hot path.
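
For illustration, the shape of this pattern (all types and names here are hypothetical, not actual Netstack3 definitions) - the caller holds a key rather than a pointer, and every call re-does the map lookup:

use std::collections::HashMap;

struct TcpConnState { /* per-connection state */ }

struct StackState {
    tcp_conns: HashMap<u64, TcpConnState>,
}

// The caller passes the whole state plus a key; implementations in other
// languages might instead cache a pointer to the connection state, avoiding
// the repeated lookup.
fn handle_tcp_segment(state: &mut StackState, conn_id: u64 /*, segment: &[u8] */) {
    if let Some(conn) = state.tcp_conns.get_mut(&conn_id) {
        // Update per-connection state here.
        let _ = conn;
    }
}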

In certain cases, it may make sense to replace hash maps with vectors (e.g., see “Devices in HashMaps” section below). However, vectors require either linear scanning or keys which can be used to index into the vector. Neither of these is viable in the general case due to either large sets of data (too large to linearly scan) or large, unstructured keys (unsuitable for indexing directly into a vector).

Possible solutions

  • Use Rc<RefCell<T>> for hash map entries. This allows us to cache reference-counted pointers which point directly to the hash map entries. It requires a runtime borrow check in order to access the RefCell, although this is undoubtedly much faster than a hash map lookup. The big disadvantage is that it requires heap-allocating the hash map entries rather than storing them inline, which could significantly worsen both cache locality and cache utilization.
  • Use BTreeMaps. BTreeMaps are not quite as general as HashMaps since they require their keys to be ordered, but in practice, most of our keys are. For smaller maps, BTreeMaps are superior since they don't have to compute a hash. They are still pretty good for cache locality and utilization, although not as good as HashMaps, and once they get large, they lag behind HashMaps. We would need to figure out where the cutoff point is, and how large we expect our working sets to be for various map instances. DoS resistance is also worth considering: a HashMap is vulnerable to hash-collision attacks unless it uses a DoS-resistant hash function (Rust's default hasher is one), whereas a B-tree rebalances on insertion, so maliciously-chosen keys cannot degrade it into the pathological linked-list shape that an unbalanced binary search tree can take.
  • Use more special-purpose data structures in particular cases. For example, maps keyed by IPv4 addresses could be stored in a trie where each level is keyed off of a byte of the IPv4 address, leading to a maximum trie depth of 4 (see the sketch after this list).
  • For any of these solutions, dynamically choose based on working set size or other runtime characteristics. For example, we could provide a map type which uses a BTreeMap below a certain size, and switches to a HashMap at larger sizes.
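
To illustrate the special-purpose option, here is a hypothetical fixed-depth trie keyed by the four bytes of an IPv4 address. All names are illustrative, and a production version would likely want path compression and a denser node representation; the point is that lookups are at most four indexing steps, with no hashing:

// Hypothetical fixed-depth trie keyed by the four bytes of an IPv4 address.
enum Node<T> {
    // Interior node: branches on the next byte of the address.
    Branch(Box<[Option<Node<T>>; 256]>),
    // Depth-4 node: holds the value for a complete address.
    Leaf(T),
}

impl<T> Node<T> {
    fn new_branch() -> Self {
        Node::Branch(Box::new(std::array::from_fn(|_| None)))
    }
}

struct Ipv4Trie<T> {
    root: Node<T>,
}

impl<T> Ipv4Trie<T> {
    fn new() -> Self {
        Ipv4Trie { root: Node::new_branch() }
    }

    fn insert(&mut self, addr: [u8; 4], value: T) {
        let mut node = &mut self.root;
        // Walk (creating branches as needed) the first three bytes.
        for &byte in &addr[..3] {
            let Node::Branch(children) = node else { unreachable!() };
            node = children[byte as usize].get_or_insert_with(Node::new_branch);
        }
        // The final byte maps directly to the leaf.
        let Node::Branch(children) = node else { unreachable!() };
        children[addr[3] as usize] = Some(Node::Leaf(value));
    }

    fn get(&self, addr: [u8; 4]) -> Option<&T> {
        let mut node = &self.root;
        for &byte in &addr {
            match node {
                Node::Branch(children) => node = children[byte as usize].as_ref()?,
                // Can't occur above depth 4 given the insert above.
                Node::Leaf(_) => return None,
            }
        }
        match node {
            Node::Leaf(value) => Some(value),
            Node::Branch(_) => None,
        }
    }
}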

Devices in HashMaps

Contact: joshlf@google.com

Currently, device::DeviceId has the following definition:

pub struct DeviceId {
    id: u64,
    protocol: DeviceProtocol,
}

enum DeviceProtocol {
    Ethernet,
    ...
}

In order to maintain static typing, we store the state associated with a device in a HashMap which is specific to that protocol type. In other words, Ethernet devices are stored in a HashMap<u64, EthernetState>, Token Ring devices would (if they existed) be stored in a HashMap<u64, TokenRingState>, etc. Device-layer functions use the protocol field of DeviceId to determine which HashMap to look in, and then use the id field to look up the particular state.
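
For illustration, using the DeviceId and DeviceProtocol definitions above, the dispatch might look roughly like this (DeviceLayerState and its fields are hypothetical stand-ins):

use std::collections::HashMap;

struct EthernetState { /* per-device Ethernet state */ }

// Hypothetical container holding one statically-typed map per protocol.
struct DeviceLayerState {
    ethernet: HashMap<u64, EthernetState>,
    // ... one field per DeviceProtocol variant
}

// The protocol field selects the map; the id field selects the entry.
fn get_ethernet_state(
    state: &mut DeviceLayerState,
    device: &DeviceId,
) -> Option<&mut EthernetState> {
    match device.protocol {
        DeviceProtocol::Ethernet => state.ethernet.get_mut(&device.id),
        // ... one arm per DeviceProtocol variant
    }
}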

However, HashMaps can be slow. If performance becomes a problem, we may want to switch to using Vec instead. For example, Ethernet state would be stored in a Vec<Option<EthernetState>> (the Option for deleted devices - more on that in a bit). Lookup would then consist of a single index operation.

This would have some drawbacks. For one, the netstack guarantees that device IDs never repeat, even if a device is removed. Thus, over time, the Vec would tend to get sparse. However, it's likely that this would happen very slowly, as devices are not removed and added with any frequency. Another problem is that, because all device IDs share the same namespace regardless of device type, every Vec would have to be long enough to accommodate every device ID, even device IDs of other types. This could, in turn, be alleviated by allocating device IDs in pools (e.g., using the top 32 bits to identify the type of device, and the bottom 32 bits to identify the device itself), as sketched below.
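
A sketch of the pooling scheme (the 32/32 split and the function names are illustrative):

// Pack the device type into the top 32 bits and a per-type index into the
// bottom 32 bits. Each per-type Vec then only needs to grow with the number
// of devices of that type, not with the global device count.
fn make_device_id(device_type: u32, index: u32) -> u64 {
    ((device_type as u64) << 32) | (index as u64)
}

fn split_device_id(id: u64) -> (u32, u32) {
    ((id >> 32) as u32, id as u32) // (device type, per-type index)
}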

Another concern is security - is it important that device IDs be unpredictable? Are we concerned that sequential device IDs might leak information that we don't want to leak? My feeling is that the answer is no, but it's worth considering.

One way to solve these problems might be to have the device collection be an enum of a HashMap and a Vec, with a cutoff point: once the Vec grows sufficiently large, a HashMap is used instead. A sketch of this idea follows.
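
As a sketch of that idea (names and the cutoff value are hypothetical), the collection could start as a Vec and migrate its entries to a HashMap the first time an ID exceeds the cutoff:

use std::collections::HashMap;

// Illustrative cutoff; the right value would need to be determined empirically.
const CUTOFF: u64 = 1024;

enum DeviceMap<T> {
    // Dense IDs index directly into the Vec; None marks removed devices.
    Dense(Vec<Option<T>>),
    // Once IDs exceed CUTOFF, fall back to a HashMap.
    Sparse(HashMap<u64, T>),
}

impl<T> DeviceMap<T> {
    fn get(&self, id: u64) -> Option<&T> {
        match self {
            DeviceMap::Dense(vec) => vec.get(id as usize)?.as_ref(),
            DeviceMap::Sparse(map) => map.get(&id),
        }
    }

    fn insert(&mut self, id: u64, state: T) {
        if let DeviceMap::Dense(vec) = self {
            if id < CUTOFF {
                if vec.len() <= id as usize {
                    vec.resize_with(id as usize + 1, || None);
                }
                vec[id as usize] = Some(state);
                return;
            }
            // This ID is past the cutoff; migrate existing entries.
            let map: HashMap<u64, T> = vec
                .drain(..)
                .enumerate()
                .filter_map(|(i, slot)| slot.map(|s| (i as u64, s)))
                .collect();
            *self = DeviceMap::Sparse(map);
        }
        match self {
            DeviceMap::Sparse(map) => {
                map.insert(id, state);
            }
            // Unreachable: the Dense case either returned or migrated above.
            DeviceMap::Dense(_) => unreachable!(),
        }
    }
}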

Another option would be to retain randomly-generated IDs, and simply keep devices in a Vec which is scanned linearly on lookup. For small numbers of devices, this might still be faster than a HashMap.

NOTE: It seems likely that Rust's HashMap will see significant speed improvements in the near future, which may make the tradeoffs here lean somewhat more in favor of using HashMaps.