# Fuchsia RCU Design
This document describes how the Fuchsia RCU crate uses atomics and the C++20
memory model to synchronize access to data.
## Usage Model
You can use `RcuCell` to create a cell that can be used concurrently from
multiple threads. Many threads can read from the cell concurrently and do not
block on writers. When the cell is written, reads may continue to see the old
value of the cell for some period of time.
```rust
struct MyStruct {
    a: usize,
    b: usize,
}

struct SharedStruct {
    foo: RcuCell<MyStruct>,
}
```
### Read
To read the value of the cell, use the `read()` method:
```rust
fn my_reader(x: Arc<SharedStruct>) {
    let guard = x.foo.read();
    println!("foo.a: {}", guard.a);
    println!("foo.b: {}", guard.b);
}
```
The values of `a` and `b` will be consistent with each other as long as they
are read from the same `RcuReadGuard` returned from `read()`. However, another
thread running concurrently might see different values for `a` and `b` when
reading this cell.
### Write
To write the value of the cell, use the `set()` method:
```rust
fn my_writer(x: Arc<SharedStruct>) {
    x.foo.set(MyStruct { a: 42, b: 24 });
}
```
After `set` returns, future calls to `read()` will observe the new values for
`a` and `b`, but concurrent readers can still observe the old values. The
storage for the old values will not be reclaimed until all the concurrent
readers have dropped their read guards and the RCU state machine has made
sufficient progress.
### Progress requirements
In order for the RCU state machine to make progress, the program using
`fuchsia-rcu` must periodically call `rcu_synchronize`. Otherwise, memory
allocated during `RcuCell::set` will never be freed.
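As a sketch of this requirement, a program might dedicate a maintenance thread
to calling `rcu_synchronize`. The `rcu_synchronize` below is a stub standing in
for the real `fuchsia-rcu` function, and `spawn_rcu_maintenance` is a
hypothetical helper, not part of the crate:

```rust
use std::thread;
use std::time::Duration;

// Stub standing in for the real fuchsia-rcu function, which blocks until the
// callbacks scheduled before the call have run.
fn rcu_synchronize() { /* grace-period logic elided */ }

// Hypothetical maintenance thread; a real program would keep this loop
// running (and tune the interval) for as long as it uses RCU.
fn spawn_rcu_maintenance() -> thread::JoinHandle<()> {
    thread::spawn(|| {
        for _ in 0..3 { // bounded here only so the sketch terminates
            rcu_synchronize();
            thread::sleep(Duration::from_millis(1));
        }
    })
}
```

Without some loop of this shape, `rcu_call` callbacks are queued but never run.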
## Low-level interface
`RcuCell` and similar high-level data structures are built on top of a low-level
interface to the RCU state machine. This interface synchronizes access to
objects referenced through `std::sync::atomic::AtomicPtr`.
The RCU library also provides `RcuPtr`, which is a thin wrapper around an
`AtomicPtr` and the low-level interface described below. Clients that require
low-level access to the RCU state machine should use `RcuPtr` rather than
calling the low-level interface directly.
### Read
The interface for reading objects managed by RCU is as follows:
```rust
fn rcu_read_lock()
fn rcu_read_unlock()
fn rcu_read_pointer<T>(ptr: &AtomicPtr<T>) -> *const T
```
To read from an object, a client must first call `rcu_read_lock()` and then call
`rcu_read_pointer()`. The returned pointer remains valid until the balancing
call to `rcu_read_unlock()`, after which time the client must not dereference
the pointer.
A given thread can acquire nested RCU read locks. Threads running concurrently
might receive pointers to different objects when calling `rcu_read_pointer()`
with the same `AtomicPtr`.
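The shape of the read-side operations can be sketched with standard atomics.
The function names and the synchronization-point labels (`[A]`, `[B]`, `[D]`)
follow this document, but the single global counter is an assumption made to
keep the sketch small; the real crate tracks counters per generation:

```rust
use std::sync::atomic::{AtomicPtr, AtomicUsize, Ordering};

// Simplified global read counter; the real crate keeps one counter per
// generation (and per CPU with the RSEQ backend).
static READ_COUNTER: AtomicUsize = AtomicUsize::new(0);

pub fn rcu_read_lock() {
    // SeqCst increment, corresponding to synchronization point [A].
    READ_COUNTER.fetch_add(1, Ordering::SeqCst);
}

pub fn rcu_read_unlock() {
    // Release decrement, corresponding to synchronization point [B].
    READ_COUNTER.fetch_sub(1, Ordering::Release);
}

pub fn rcu_read_pointer<T>(ptr: &AtomicPtr<T>) -> *const T {
    // Acquire load, corresponding to synchronization point [D]; it pairs
    // with the Release store performed by the writer.
    ptr.load(Ordering::Acquire)
}
```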
### Write
The interface for writing objects managed by RCU is as follows:
```rust
fn rcu_assign_pointer<T>(ptr: &AtomicPtr<T>, new_ptr: *mut T)
fn rcu_replace_pointer<T>(ptr: &AtomicPtr<T>, new_ptr: *mut T) -> *mut T
fn rcu_call(callback: impl FnOnce() + Send + Sync + 'static)
fn rcu_drop<T: Send + Sync + 'static>(value: T)
```
To write an object, a client first creates a new instance of the object and then
uses either `rcu_assign_pointer()` or `rcu_replace_pointer()` to store a pointer
to that object in an `AtomicPtr`. The `rcu_assign_pointer()` operation discards
whatever value was previously stored in this pointer, whereas the
`rcu_replace_pointer()` operation returns the previous value of the `AtomicPtr`.
Typically, the caller wishes to clean up whatever object was previously stored
in the `AtomicPtr`. However, the caller cannot clean up that object immediately
because there might be other threads that are currently reading from that
object. Instead, the caller uses `rcu_call()` to schedule a callback that will
run after all the currently in-flight readers have completed.
The RCU library provides `rcu_drop()`, which is a convenience function that
schedules the given object to be dropped once all the currently in-flight
readers have completed.
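The write-side pointer operations can likewise be sketched with standard
atomics. The names and the synchronization-point labels (`[E]`, `[F]`) follow
this document, but the bodies are illustrative, not the crate's
implementation:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

pub fn rcu_assign_pointer<T>(ptr: &AtomicPtr<T>, new_ptr: *mut T) {
    // Release store, corresponding to synchronization point [E]; the
    // previous value is discarded.
    ptr.store(new_ptr, Ordering::Release);
}

pub fn rcu_replace_pointer<T>(ptr: &AtomicPtr<T>, new_ptr: *mut T) -> *mut T {
    // Release swap, corresponding to synchronization point [F]; it returns
    // the previous value so the caller can hand it to rcu_call (or rcu_drop)
    // for deferred reclamation.
    ptr.swap(new_ptr, Ordering::Release)
}
```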
### Progress
The interface for progressing the RCU state machine is as follows:
```rust
fn rcu_synchronize()
```
Clients must call this function periodically to ensure that callbacks
scheduled with `rcu_call()` eventually run.
The `rcu_synchronize()` function blocks until the RCU state machine has advanced
sufficiently to call all the callbacks that were scheduled prior to calling
`rcu_synchronize()`.
## State Machine
To show that the RCU state machine provides the necessary synchronization
properties, we need to show two properties about a given object `o` protected by
the RCU mechanism:
1. All the writes to `o` happen before all the reads from `o`.
2. All the reads from `o` happen before we run the RCU callback (i.e., the
callback scheduled with `rcu_call()`) associated with `o`.
First, we will state the forms of the relevant operations and then we will show
that they have the desired synchronization properties.
### States
#### Idle
The RCU state machine starts in the *Idle* state. In this state, readers can
begin read operations, which are counted using the `read_counters`. The state
machine remains in this state until the next call to `rcu_synchronize`.
There are no preconditions for leaving the *Idle* state. The postcondition for
leaving the *Idle* state is that the `callback_chain` has been moved to the
`waiting_callbacks` queue, the `generation` counter has been increased, and the
state machine is in the *Waiting* state.
#### Waiting
In the *Waiting* state, existing readers complete and decrement their
`read_counter`. New readers can begin read operations, incrementing the other
generation's `read_counter`. The state machine remains in this state until the
precondition for leaving the *Waiting* state has been satisfied.
The precondition for leaving the *Waiting* state is that the `read_counter` for
the previous generation has reached zero. This condition indicates that all the
read operations that were in flight when the state machine entered the *Waiting*
state have completed.
The postcondition for leaving the *Waiting* state is that the front entry in the
`waiting_callbacks` queue has been removed (advancing all the callbacks in the
queue) and the state machine is in the *Idle* state.
After leaving the *Waiting* state, the callbacks removed from the
`waiting_callbacks` queue run, potentially on a different thread.
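The state transitions above can be captured as a toy, single-threaded model.
Field and state names follow this document, but the structure is a sketch of
the described behavior, not the crate's `rcu_control_block`, and the real
transitions run concurrently with readers under a mutex:

```rust
#[derive(Debug, PartialEq)]
enum State {
    Idle,
    Waiting,
}

struct Model {
    state: State,
    generation: u64,
    read_counters: [u64; 2], // indexed by generation parity
    callback_chain: Vec<&'static str>,
    waiting_callbacks: Vec<Vec<&'static str>>,
}

impl Model {
    fn new() -> Self {
        Model {
            state: State::Idle,
            generation: 0,
            read_counters: [0, 0],
            callback_chain: Vec::new(),
            waiting_callbacks: Vec::new(),
        }
    }

    // Idle -> Waiting: no precondition. Move the callback_chain onto the
    // waiting_callbacks queue and increase the generation counter.
    fn leave_idle(&mut self) {
        assert_eq!(self.state, State::Idle);
        let callbacks = std::mem::take(&mut self.callback_chain);
        self.waiting_callbacks.push(callbacks);
        self.generation += 1;
        self.state = State::Waiting;
    }

    // Waiting -> Idle: allowed only once the previous generation's read
    // counter has reached zero. Returns the front entry of the
    // waiting_callbacks queue, whose callbacks are now safe to run.
    fn try_leave_waiting(&mut self) -> Option<Vec<&'static str>> {
        assert_eq!(self.state, State::Waiting);
        let previous = ((self.generation + 1) % 2) as usize;
        if self.read_counters[previous] != 0 {
            return None; // precondition not met: keep waiting
        }
        self.state = State::Idle;
        Some(self.waiting_callbacks.remove(0))
    }
}
```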
### Operations
#### Writes
All the writes to `o` have the following form:
```rust
[create and initialize o]
let p = [address of o];
rcu_assign_pointer(&ptr, p); // or rcu_replace_pointer()
```
#### Reads
All the reads from `o` have the following form:
```rust
rcu_read_lock();
let p = rcu_read_pointer(&ptr);
[read from o via p]
rcu_read_unlock();
```
#### Callbacks
All the callbacks associated with `o` have the following form:
```rust
let another_pointer = [address of another object or null];
let p = rcu_replace_pointer(&ptr, another_pointer);
rcu_call(|| {
    [operate on p, e.g., freeing o]
});
```
### Writes Happen Before Reads
Readers call `rcu_read_pointer()` before reading from `o`, which contains an
`Ordering::Acquire` load of the `AtomicPtr` to `o` at synchronization point
`[D]`. This load _synchronizes-with_ the `rcu_assign_pointer()` operation
performed by the writer because `rcu_assign_pointer()` contains an
`Ordering::Release` store of the same `AtomicPtr` at synchronization point
`[E]`. For this reason, the
write to `o` _happens-before_ the read from `o`.
### Reads Happen Before Callbacks Run
Assume, without loss of generality, that the RCU state machine starts in the
*Idle* state and that the `generation` is even. Consider a given reader thread
reading from `o` and another thread scheduling a callback associated with `o`.
Before we run that callback, the state machine will need to advance through the
*Idle* and *Waiting* states twice.
In this sequence, there are three atomic operations with `Ordering::SeqCst`:
1. Synchronization point `[A]` in `rcu_read_lock()`.
2. Synchronization point `[C1]` the first time through the *Waiting* state in
`rcu_synchronize()`.
3. Synchronization point `[C2]` the second time through the *Waiting* state in
`rcu_synchronize()`.
The memory model guarantees that these three operations happen in a single total
order. The mutex that protects the `rcu_control_block` ensures that `[C1]` is
always before `[C2]` in the single total order. The argument for correctness is
different depending on whether `[A]` is before or after `[C1]` in the single
total order.
#### `[A]` is before `[C1]`
When the reader loads the `generation` count, the reader will either load an
even or an odd generation number. If the `generation` count is even, then the
load at `[C1]` will _synchronize-with_ the decrement in `rcu_read_unlock()`
(synchronization point `[B]`) because both operate on `read_counter[0]` (recall
that we assumed the `generation` count was originally even). If the loaded
`generation` count is odd, then the load at `[C2]` will _synchronize-with_ the
decrement in `rcu_read_unlock()` because both operate on `read_counter[1]`.
Either way, the read from `o` _happens-before_ the state machine exits
the *Waiting* state the second time.
#### `[C1]` is before `[A]`
We will show the _happens-before_ relation for the following synchronization
points, in order: `[F]` in `rcu_replace_pointer()`, `[G]` in `rcu_call()`, `[H]`
in the *Idle* state of `rcu_synchronize()`, `[C1]` in the *Waiting* state
of `rcu_synchronize()`, `[A]` in `rcu_read_lock()`, and `[D]` in
`rcu_read_pointer()`:
* `[F]` _happens-before_ `[G]` because `[F]` is _sequenced-before_ `[G]`.
* `[G]` _happens-before_ `[H]` because `[G]` _synchronizes-with_ `[H]`.
* `[H]` _happens-before_ `[C1]` because of the mutex that protects the
`rcu_control_block`.
* `[C1]` _happens-before_ `[A]` because `[C1]` precedes `[A]` in the single
total order.
* `[A]` _happens-before_ `[D]` because `[A]` is _sequenced-before_ `[D]`.
Therefore, the pointer to `o` stored in the `AtomicPtr` is replaced with another
value before being loaded by the reader, which means the reader does not
actually read from `o`. The callback runs after all the reads from `o` because
there are no reads from `o`.
*Note:* The logic described above suggests that the sequential consistency of
`[C1]` and `[A]` combined with the synchronization between `[G]` and `[H]` is
sufficient. However, this reasoning relies on transitivity across three
different threads (Writer -> Synchronizer -> Reader), which is not automatically
guaranteed by the memory model. A model checker (see
https://fxbug.dev/484397559) has shown that this sequence does not strictly
guarantee that the reader observes the new pointer. As a result, `rcu_call` now
includes explicit synchronization with `rcu_read_lock` to cover this case.
## RSEQ Backend
When the `rseq_backend` feature is enabled, the RCU implementation uses
Restartable Sequences (RSEQ) and membarriers to optimize the read-side critical
path. This approach avoids atomic read-modify-write operations on shared cache
lines, which significantly improves scalability on many-core systems.
### RSEQ Primitives
RSEQ allows us to perform per-CPU operations atomically with respect to other
threads on the same CPU. We use this to maintain per-CPU read counters.
The per-CPU state consists of two pairs of counters (one pair for each
generation):
- `begin`: Incremented when entering a read-side critical section.
- `end`: Incremented when exiting a read-side critical section.
A given read-side critical section might begin on one CPU and end on a different
CPU. In that case, the `begin` counter will be incremented on the first CPU, and
the `end` counter will be incremented on the second CPU. For this reason, the
absolute value of the `begin` and `end` counters is not meaningful, only the
difference between the `begin` and `end` counters for a given generation is
meaningful.
To determine if there are any active readers for a given generation, we sum the
negated `end` counter and the `begin` counter over all CPUs. If the sum is
zero, then there are no active readers.
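This check can be sketched over a snapshot of per-CPU `(begin, end)` counter
pairs for one generation. The function name follows `has_active_readers` from
the text, but the snapshot-based signature is an assumption; the real backend
reads live per-CPU state with a system barrier between the two sums:

```rust
fn has_active_readers(per_cpu: &[(i64, i64)]) -> bool {
    // Sum the negated end counters first...
    let negated_end: i64 = per_cpu.iter().map(|&(_, end)| -end).sum();
    // ...the zx_system_barrier call would be issued here...
    // ...then sum the begin counters.
    let begin: i64 = per_cpu.iter().map(|&(begin, _)| begin).sum();
    begin + negated_end != 0
}
```

Note that a reader that began on one CPU and ended on another leaves both
per-CPU pairs unbalanced, yet still contributes zero to the overall sum.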
### Barrier Pairing
1. **Read-Side**: `rcu_read_lock()` and `rcu_read_unlock()` issue a
`compiler_fence(Ordering::SeqCst)`. This barrier prevents the compiler from
reordering the critical section outside the counter increments.
2. **Writer-Side**: `rcu_synchronize()` (specifically `has_active_readers`)
issues a system barrier (`zx_system_barrier`).
The pairing works as follows:
- `rcu_read_lock()`:
1. Increment `begin` (RSEQ).
2. `CompilerBarrier`.
3. Critical Section.
- `rcu_read_unlock()`:
1. Critical Section.
2. `CompilerBarrier`.
3. Increment `end` (RSEQ).
- `rcu_synchronize()` (checking for quiescence):
1. Sum all `end` counters (negated).
2. System barrier (`zx_system_barrier`).
3. Sum all `begin` counters (positive).
If `Sum(begin) - Sum(end) == 0`, then all readers that started before the RSEQ
barrier have completed.
The critical ordering is reading `end` *before* the barrier and `begin` *after*
the barrier. There are four cases to consider:
1. The advancer observes the increment to `begin` but not to `end`. This
observation implies that there is an active reader and the advancer will
continue to wait.
2. The advancer observes the increment to both `begin` and `end`. This
observation implies the read critical section is complete and the advancer
can proceed.
3. The advancer observes the increment to neither `begin` nor `end`. This
observation implies that the read critical section belongs to the next
generation and the advancer can proceed.
4. The advancer observes the increment to `end` but not to `begin`. This
observation would be problematic because the advancer would incorrectly
think there were fewer active readers than there actually were.
The correctness of this barrier pairing relies on the total order `S` of memory
operations guaranteed by the memory model. Specifically, the
`compiler_fence(Ordering::SeqCst)` in `rcu_read_lock()` ensures that the
increment to `begin` is _sequenced-before_ the critical section, and the
critical section is _sequenced-before_ the increment to `end`.
The `zx_system_barrier(ZX_SYSTEM_BARRIER_DATA_MEMORY)` in `has_active_readers` ensures that if the advancer
observes the increment to `end` (at step 1), it must also observe the
increment to `begin` (at step 3). This is because the store to `begin`
_happens-before_ the store to `end` in the reader thread (due to program order
and the compiler fence). When `zx_system_barrier` acts as a memory barrier, it
ensures that all stores sequenced-before the barrier interruption point in the
reader are visible to the advancer after the barrier. Since `begin` is
sequenced-before `end`, if `end` is visible, `begin` must also be visible.
Therefore, Case 4 is impossible: the advancer will never underestimate the
number of active readers.
### Update Visibility Barrier
In addition to the barrier in `has_active_readers`, the RSEQ backend requires a
barrier to ensure that pointer updates made by writers are visible to readers
that start *after* the grace period begins.
Without this barrier, a race could occur:
1. A writer updates a pointer and queues a callback.
2. The state machine advances the generation (e.g., from 0 to 1).
3. A reader starts on another CPU, reads the new generation (1), and registers
in the new generation's counters. However, because it doesn't use memory
barriers, it might still see the *old* pointer value if the store hasn't
propagated.
4. The state machine checks for active readers for generation 0. Since the
reader registered in generation 1, it is ignored.
5. The grace period for generation 0 ends, and the callback runs, freeing the
old data while the reader is still accessing it.
To prevent this race, `rcu_grace_period()` issues a
`zx_system_barrier(ZX_SYSTEM_BARRIER_DATA_MEMORY)` before advancing the
generation counter. This forces all prior stores (including the pointer
updates) to be visible to all CPUs. Consequently, any reader that observes the
new generation is guaranteed to see the new pointer value.