| # Fuchsia RCU Design |
| |
| This document describes how the Fuchsia RCU crate uses atomics and the C++20 |
| memory model to synchronize access to data. |
| |
| ## Usage Model |
| |
| You can use `RcuCell` to create a cell that can be used concurrently from |
| multiple threads. Many threads can read from the cell concurrently and do not |
| block on writers. When the cell is written, reads may continue to see the old |
| value of the cell for some period of time. |
| |
| ```rust |
| struct MyStruct { |
| a: usize, |
| b: usize, |
| } |
| |
| struct SharedStruct { |
    foo: RcuCell<MyStruct>,
| } |
| ``` |
| |
| ### Read |
| |
| To read the value of the cell, use the `read()` method: |
| |
| ```rust |
| fn my_reader(x: Arc<SharedStruct>) { |
| let guard = x.foo.read(); |
| println!("foo.a: {}", guard.a); |
| println!("foo.b: {}", guard.b); |
| } |
| ``` |
| |
The values of `a` and `b` will be consistent with each other as long as they
are read from the same `RcuReadGuard` returned from `read()`. However, another
| thread running concurrently might see different values for `a` and `b` when |
| reading this cell. |
| |
| ### Write |
| |
| To write the value of the cell, use the `set()` method: |
| |
| ```rust |
| fn my_writer(x: Arc<SharedStruct>) { |
| x.foo.set(MyStruct { a: 42, b: 24 }); |
| } |
| ``` |
| |
| After `set` returns, future calls to `read()` will observe the new values for |
| `a` and `b`, but concurrent readers can still observe the old values. The |
| storage for the old values will not be reclaimed until all the concurrent |
| readers have dropped their read guards and the RCU state machine has made |
| sufficient progress. |
| |
| ### Progress requirements |
| |
| In order for the RCU state machine to make progress, the program using |
| `fuchsia-rcu` must periodically call `rcu_synchronize`. Otherwise, memory |
| allocated during `RcuCell::set` will never be freed. |
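
One way to satisfy this requirement is a dedicated background thread that
drives the state machine. The sketch below is illustrative, not part of the
crate: the `synchronize` parameter stands in for `rcu_synchronize`, and the
polling interval is arbitrary.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

// Hypothetical helper: spawn a thread that periodically drives the RCU
// state machine so that memory queued during `RcuCell::set` is
// eventually freed. `synchronize` stands in for `rcu_synchronize`.
fn spawn_reclaimer(
    synchronize: impl Fn() + Send + 'static,
    stop: Arc<AtomicBool>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while !stop.load(Ordering::Relaxed) {
            synchronize();
            thread::sleep(Duration::from_millis(10));
        }
    })
}
```

A program would typically start one such thread at initialization and signal
`stop` during shutdown.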
| |
| ## Low-level interface |
| |
| `RcuCell` and similar high-level data structures are built on top of a low-level |
| interface to the RCU state machine. This interface synchronizes access to |
| objects referenced through `std::sync::atomic::AtomicPtr`. |
| |
| The RCU library also provides `RcuPtr`, which is a thin wrapper around an |
| `AtomicPtr` and the low-level interface described below. Clients that require |
| low-level access to the RCU state machine should use `RcuPtr` rather than |
| calling the low-level interface directly. |
| |
| ### Read |
| |
| The interface for reading objects managed by RCU is as follows: |
| |
| ```rust |
| fn rcu_read_lock() |
| fn rcu_read_unlock() |
fn rcu_read_pointer<T>(ptr: &AtomicPtr<T>) -> *const T
| ``` |
| |
| To read from an object, a client must first call `rcu_read_lock()` and then call |
| `rcu_read_pointer()`. The returned pointer remains valid until the balancing |
| call to `rcu_read_unlock()`, after which time the client must not dereference |
| the pointer. |
| |
| A given thread can acquire nested RCU read locks. Threads running concurrently |
| might receive pointers to different objects when calling `rcu_read_pointer()` |
| with the same `AtomicPtr`. |
| |
### Write
| |
| The interface for writing objects managed by RCU is as follows: |
| |
| ```rust |
fn rcu_assign_pointer<T>(ptr: &AtomicPtr<T>, new_ptr: *mut T)
fn rcu_replace_pointer<T>(ptr: &AtomicPtr<T>, new_ptr: *mut T) -> *mut T
| fn rcu_call(callback: impl FnOnce() + Send + Sync + 'static) |
| fn rcu_drop<T: Send + Sync + 'static>(value: T) |
| ``` |
| |
| To write an object, a client first creates a new instance of the object and then |
| uses either `rcu_assign_pointer()` or `rcu_replace_pointer()` to store a pointer |
| to that object in an `AtomicPtr`. The `rcu_assign_pointer()` operation discards |
| whatever value was previously stored in this pointer, whereas the |
| `rcu_replace_pointer()` operation returns the previous value of the `AtomicPtr`. |
| |
| Typically, the caller wishes to clean up whatever object was previously stored |
| in the `AtomicPtr`. However, the caller cannot clean up that object immediately |
| because there might be other threads that are currently reading from that |
| object. Instead, the caller uses `rcu_call()` to schedule a callback that will |
| run after all the currently in-flight readers have completed. |
| |
| The RCU library provides `rcu_drop()`, which is a convenience function that |
| schedules the given object to be dropped once all the currently in-flight |
| readers have completed. |
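
The replace-then-defer pattern can be modeled with standard-library
primitives. In this sketch, `AtomicPtr::swap` stands in for
`rcu_replace_pointer` and the `defer` parameter stands in for `rcu_call`; the
real functions additionally provide the RCU-specific ordering described in the
State Machine section below.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Model of the write-side pattern: publish a new object, capture the
// old pointer, and defer cleanup of the old object. `defer` stands in
// for `rcu_call`; in the real crate the deferred closure would run only
// after all in-flight readers have completed.
fn replace_and_defer<T: 'static>(
    ptr: &AtomicPtr<T>,
    new_value: T,
    defer: impl FnOnce(Box<dyn FnOnce()>),
) {
    let new_ptr = Box::into_raw(Box::new(new_value));
    // Stand-in for rcu_replace_pointer(ptr, new_ptr).
    let old_ptr = ptr.swap(new_ptr, Ordering::AcqRel);
    if !old_ptr.is_null() {
        // Reclaim the old allocation once it is safe to do so.
        defer(Box::new(move || unsafe { drop(Box::from_raw(old_ptr)) }));
    }
}
```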
| |
| ### Progress |
| |
| The interface for progressing the RCU state machine is as follows: |
| |
| ```rust |
| fn rcu_synchronize() |
| ``` |
| |
Clients must call this function periodically to ensure that callbacks
scheduled with `rcu_call()` eventually run.
| |
| The `rcu_synchronize()` function blocks until the RCU state machine has advanced |
| sufficiently to call all the callbacks that were scheduled prior to calling |
| `rcu_synchronize()`. |
| |
| ## State Machine |
| |
| To show that the RCU state machine provides the necessary synchronization |
| properties, we need to show two properties about a given object `o` protected by |
| the RCU mechanism: |
| |
| 1. All the writes to `o` happen before all the reads from `o`. |
| 2. All the reads from `o` happen before we run the RCU callback (i.e., the |
| callback scheduled with `rcu_call()`) associated with `o`. |
| |
| First, we will state the forms of the relevant operations and then we will show |
that they have the desired synchronization properties.
| |
| ### States |
| |
| #### Idle |
| |
| The RCU state machine starts in the *Idle* state. In this state, readers can |
| begin read operations, which are counted using the `read_counters`. The state |
| machine remains in this state until the next call to `rcu_synchronize`. |
| |
There are no preconditions for leaving the *Idle* state. The postcondition for
| leaving the *Idle* state is that the `callback_chain` has been moved to the |
| `waiting_callbacks` queue, the `generation` counter has been increased, and the |
| state machine is in the *Waiting* state. |
| |
| #### Waiting |
| |
| In the *Waiting* state, existing readers complete and decrement their |
| `read_counter`. New readers can begin read operations, and increment a different |
`read_counter`. The state machine remains in this state until the precondition
for leaving the *Waiting* state is satisfied.
| |
| The precondition for leaving the *Waiting* state is that the `read_counter` for |
| the previous generation has reached zero. This condition indicates that all the |
| read operations that were in flight when the state machine entered the *Waiting* |
| state have completed. |
| |
| The postcondition for leaving the *Waiting* state is that the front entry in the |
| `waiting_callbacks` queue has been removed (advancing all the callbacks in the |
| queue) and the state machine is in the *Idle* state. |
| |
After leaving the *Waiting* state, the callbacks removed from the
`waiting_callbacks` queue run, potentially on a different thread.
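
The two states and their transitions can be modeled as follows. The field and
type names (`generation`, `read_counters`, `callback_chain`,
`waiting_callbacks`) echo the description above but are illustrative, not the
crate's actual definitions, and all synchronization is omitted:

```rust
// Simplified single-threaded model of the RCU state machine.
enum RcuState {
    Idle,
    Waiting,
}

struct RcuControlBlock {
    state: RcuState,
    generation: u64,
    // One counter for the current generation, one for the previous.
    read_counters: [u64; 2],
    callback_chain: Vec<Box<dyn FnOnce()>>,
    waiting_callbacks: Vec<Vec<Box<dyn FnOnce()>>>,
}

impl RcuControlBlock {
    // Idle -> Waiting: no precondition. Move the callback chain to the
    // waiting queue and increase the generation counter.
    fn enter_waiting(&mut self) {
        let chain = std::mem::take(&mut self.callback_chain);
        self.waiting_callbacks.push(chain);
        self.generation += 1;
        self.state = RcuState::Waiting;
    }

    // Waiting -> Idle: the precondition is that the previous
    // generation's read counter has drained to zero. Returns the front
    // entry of the waiting queue, whose callbacks are now safe to run.
    fn try_enter_idle(&mut self) -> Option<Vec<Box<dyn FnOnce()>>> {
        let prev = ((self.generation + 1) % 2) as usize;
        if self.read_counters[prev] != 0 {
            return None;
        }
        self.state = RcuState::Idle;
        Some(self.waiting_callbacks.remove(0))
    }
}
```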
| |
| ### Operations |
| |
| #### Writes |
| |
| All the writes to `o` have the following form: |
| |
| ```rust |
| [create and initialize o] |
| let p = [address of o]; |
rcu_assign_pointer(&ptr, p); // or rcu_replace_pointer()
| ``` |
| |
| #### Reads |
| |
| All the reads from `o` have the following form: |
| |
| ```rust |
| rcu_read_lock(); |
| let p = rcu_read_pointer(&ptr); |
| [read from o via p] |
| rcu_read_unlock(); |
| ``` |
| |
| #### Callbacks |
| |
| All the callbacks associated with `o` have the following form: |
| |
| ```rust |
| let another_pointer = [address of another object or null]; |
| let p = rcu_replace_pointer(&ptr, another_pointer); |
| rcu_call(|| { |
| [operate on p, e.g., freeing o] |
| }); |
| ``` |
| |
| ### Writes Happen Before Reads |
| |
| Readers call `rcu_read_pointer()` before reading from `o`, which contains an |
| `Ordering::Acquire` load of the `AtomicPtr` to `o` at synchronization point |
`[D]`. This load _synchronizes-with_ the `rcu_assign_pointer()` operation
performed by the writer because `rcu_assign_pointer()` contains an
`Ordering::Release` store
| of the same `AtomicPtr` at synchronization point `[E]`. For this reason, the |
| write to `o` _happens-before_ the read from `o`. |
| |
| ### Reads Happen Before Callbacks Run |
| |
| Assume, without loss of generality, that the RCU state machine starts in the |
| *Idle* state and that the `generation` is even. Consider a given reader thread |
| reading from `o` and another thread scheduling a callback associated with `o`. |
| Before we run that callback, the state machine will need to advance through the |
| *Idle* and *Waiting* states twice. |
| |
| In this sequence, there are three atomic operations with `Ordering::SeqCst`: |
| |
| 1. Synchronization point `[A]` in `rcu_read_lock()`. |
2. Synchronization point `[C1]`, reached the first time through the *Waiting*
   state in `rcu_synchronize()`.
3. Synchronization point `[C2]`, reached the second time through the *Waiting*
   state in `rcu_synchronize()`.
| |
| The memory model guarantees that these three operations happen in a single total |
| order. The mutex that protects the `rcu_control_block` ensures that `[C1]` is |
| always before `[C2]` in the single total order. The argument for correctness is |
| different depending on whether `[A]` is before or after `[C1]` in the single |
| total order. |
| |
| #### `[A]` is before `[C1]` |
| |
When the reader loads the `generation` count, it will load either an even or
an odd generation number. If the `generation` count is even, then the
| load at `[C1]` will _synchronize-with_ the decrement in `rcu_read_unlock()` |
| (synchronization point `[B]`) because both operate on `read_counter[0]` (recall |
| that we assumed the `generation` count was originally even). If the loaded |
| `generation` count is odd, then the load at `[C2]` will _synchronize-with_ the |
| decrement in `rcu_read_unlock()` because both operate on `read_counter[1]`. |
| Either way, the read from `o` _happens-before_ the state machine exits |
| the *Waiting* state the second time. |
| |
| #### `[C1]` is before `[A]` |
| |
| We will show the _happens-before_ relation for the following synchronization |
| points, in order: `[F]` in `rcu_replace_pointer()`, `[G]` in `rcu_call()`, `[H]` |
| in the *Idle* state of `rcu_synchronize()`, `[C1]` in the *Waiting* state |
| of `rcu_synchronize()`, `[A]` in `rcu_read_lock()`, and `[D]` in |
| `rcu_read_pointer()`: |
| |
| * `[F]` _happens-before_ `[G]` because `[F]` is _sequenced-before_ `[G]`. |
| * `[G]` _happens-before_ `[H]` because `[G]` _synchronizes-with_ `[H]`. |
| * `[H]` _happens-before_ `[C1]` because of the mutex that protects the |
| `rcu_control_block`. |
| * `[C1]` _happens-before_ `[A]` because `[C1]` precedes `[A]` in the single |
| total order. |
| * `[A]` _happens-before_ `[D]` because `[A]` is _sequenced-before_ `[D]`. |
| |
| Therefore, the pointer to `o` stored in the `AtomicPtr` is replaced with another |
| value before being loaded by the reader, which means the reader does not |
| actually read from `o`. The callback runs after all the reads from `o` because |
| there are no reads from `o`. |
| |
| *Note:* The logic described above suggests that the sequential consistency of |
| `[C1]` and `[A]` combined with the synchronization between `[G]` and `[H]` is |
| sufficient. However, this reasoning relies on transitivity across three |
| different threads (Writer -> Synchronizer -> Reader), which is not automatically |
| guaranteed by the memory model. A model checker (see |
| https://fxbug.dev/484397559) has shown that this sequence does not strictly |
| guarantee that the reader observes the new pointer. As a result, `rcu_call` now |
| includes explicit synchronization with `rcu_read_lock` to cover this case. |
| |
| ## RSEQ Backend |
| |
| When the `rseq_backend` feature is enabled, the RCU implementation uses |
| Restartable Sequences (RSEQ) and membarriers to optimize the read-side critical |
| path. This approach avoids atomic read-modify-write operations on shared cache |
| lines, which significantly improves scalability on many-core systems. |
| |
| ### RSEQ Primitives |
| |
| RSEQ allows us to perform per-CPU operations atomically with respect to other |
| threads on the same CPU. We use this to maintain per-CPU read counters. |
| |
| The per-CPU state consists of two pairs of counters (one pair for each |
| generation): |
| |
| - `begin`: Incremented when entering a read-side critical section. |
| - `end`: Incremented when exiting a read-side critical section. |
| |
| A given read-side critical section might begin on one CPU and end on a different |
| CPU. In that case, the `begin` counter will be incremented on the first CPU, and |
the `end` counter will be incremented on the second CPU. For this reason, the
absolute value of the `begin` and `end` counters is not meaningful; only the
difference between the `begin` and `end` counters for a given generation is
meaningful.
| |
| To determine if there are any active readers for a given generation, we sum the |
| negated `end` counter and the `begin` counter over all CPUs. If the sum is |
| zero, then there are no active readers. |
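
A sketch of that check, using a hypothetical `PerCpuCounters` snapshot type;
the first test case below models a reader that migrated between CPUs
mid-critical-section:

```rust
// Hypothetical snapshot of one CPU's counter pair for one generation.
struct PerCpuCounters {
    begin: u64,
    end: u64,
}

// Returns true if any reader in this generation has not yet finished.
// Individual counters may be unbalanced on a given CPU (a reader can
// migrate between lock and unlock), so only the sum across all CPUs is
// meaningful. Wrapping arithmetic tolerates per-CPU "underflow" on a
// CPU where only the end half of a critical section was recorded.
fn has_active_readers(cpus: &[PerCpuCounters]) -> bool {
    let mut sum: u64 = 0;
    for c in cpus {
        sum = sum.wrapping_add(c.begin).wrapping_sub(c.end);
    }
    sum != 0
}
```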
| |
| ### Barrier Pairing |
| |
| 1. **Read-Side**: `rcu_read_lock()` and `rcu_read_unlock()` issue a |
| `compiler_fence(Ordering::SeqCst)`. This barrier prevents the compiler from |
| reordering the critical section outside the counter increments. |
| |
| 2. **Writer-Side**: `rcu_synchronize()` (specifically `has_active_readers`) |
| issues a system barrier (`zx_system_barrier`). |
| |
| The pairing works as follows: |
| |
| - `rcu_read_lock()`: |
| 1. Increment `begin` (RSEQ). |
| 2. `CompilerBarrier`. |
| 3. Critical Section. |
| |
| - `rcu_read_unlock()`: |
| 1. Critical Section. |
| 2. `CompilerBarrier`. |
| 3. Increment `end` (RSEQ). |
| |
| - `rcu_synchronize()` (checking for quiescence): |
| 1. Sum all `end` counters (negated). |
| 2. System barrier (`zx_system_barrier`). |
| 3. Sum all `begin` counters (positive). |
| |
If `Sum(begin) - Sum(end) == 0`, then all readers that started before the
system barrier have completed.
| |
| The critical ordering is reading `end` *before* the barrier and `begin` *after* |
| the barrier. There are four cases to consider: |
| |
| 1. The advancer observes the increment to `begin` but not to `end`. This |
| observation implies that there is an active reader and the advancer will |
| continue to wait. |
| 2. The advancer observes the increment to both `begin` and `end`. This |
| observation implies the read critical section is complete and the advancer |
| can proceed. |
| 3. The advancer observes the increment to neither `begin` nor `end`. This |
| observation implies that the read critical section belongs to the next |
| generation and the advancer can proceed. |
| 4. The advancer observes the increment to `end` but not to `begin`. This |
| observation would be problematic because the advancer would incorrectly |
| think there were fewer active readers than there actually were. |
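
The check can be sketched as follows. The `CpuSlot` type is hypothetical, and
a `SeqCst` fence stands in for
`zx_system_barrier(ZX_SYSTEM_BARRIER_DATA_MEMORY)`; a thread-local fence does
not provide the cross-CPU guarantee of the real system barrier, so this sketch
only illustrates the ordering of the two summation passes:

```rust
use std::sync::atomic::{fence, AtomicU64, Ordering};

// Hypothetical per-CPU counter pair for one generation; the real
// implementation keeps these in RSEQ-managed per-CPU memory.
struct CpuSlot {
    begin: AtomicU64,
    end: AtomicU64,
}

// Quiescence check: sum all `end` counters (negated), issue the
// barrier, then sum all `begin` counters. Reading `end` before the
// barrier and `begin` after it rules out Case 4 above (observing an
// `end` increment without the matching `begin` increment).
fn quiescent(cpus: &[CpuSlot]) -> bool {
    let mut sum: u64 = 0;
    for c in cpus {
        sum = sum.wrapping_sub(c.end.load(Ordering::Relaxed));
    }
    // Stand-in for zx_system_barrier(ZX_SYSTEM_BARRIER_DATA_MEMORY).
    fence(Ordering::SeqCst);
    for c in cpus {
        sum = sum.wrapping_add(c.begin.load(Ordering::Relaxed));
    }
    sum == 0
}
```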
| |
The correctness of this barrier pairing relies on the total order `S` of memory
operations guaranteed by the memory model. Specifically, the
| `compiler_fence(Ordering::SeqCst)` in `rcu_read_lock()` ensures that the |
| increment to `begin` is _sequenced-before_ the critical section, and the |
| critical section is _sequenced-before_ the increment to `end`. |
| |
The `zx_system_barrier(ZX_SYSTEM_BARRIER_DATA_MEMORY)` in `has_active_readers`
ensures that if the advancer observes the increment to `end` (at step 1), it
must also observe the increment to `begin` (at step 3). This is because the
store to `begin`
| _happens-before_ the store to `end` in the reader thread (due to program order |
| and the compiler fence). When `zx_system_barrier` acts as a memory barrier, it |
| ensures that all stores sequenced-before the barrier interruption point in the |
| reader are visible to the advancer after the barrier. Since `begin` is |
| sequenced-before `end`, if `end` is visible, `begin` must also be visible. |
| Therefore, Case 4 is impossible: the advancer will never underestimate the |
| number of active readers. |
| |
| ### Update Visibility Barrier |
| |
| In addition to the barrier in `has_active_readers`, the RSEQ backend requires a |
| barrier to ensure that pointer updates made by writers are visible to readers |
| that start *after* the grace period begins. |
| |
| Without this barrier, a race could occur: |
| 1. A writer updates a pointer and queues a callback. |
| 2. The state machine advances the generation (e.g., from 0 to 1). |
| 3. A reader starts on another CPU, reads the new generation (1), and registers |
| in the new generation's counters. However, because it doesn't use memory |
| barriers, it might still see the *old* pointer value if the store hasn't |
| propagated. |
| 4. The state machine checks for active readers for generation 0. Since the |
| reader registered in generation 1, it is ignored. |
| 5. The grace period for generation 0 ends, and the callback runs, freeing the |
| old data while the reader is still accessing it. |
| |
| To prevent this race, `rcu_grace_period()` issues a |
| `zx_system_barrier(ZX_SYSTEM_BARRIER_DATA_MEMORY)` before advancing the |
| generation counter. This forces all prior stores (including the pointer |
| updates) to be visible to all CPUs. Consequently, any reader that observes the |
| new generation is guaranteed to see the new pointer value. |
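
The barrier placement can be sketched as follows, again with a `SeqCst` fence
as a local stand-in for the system barrier (the real call synchronizes across
all CPUs, which a thread-local fence alone does not):

```rust
use std::sync::atomic::{fence, AtomicU64, Ordering};

// Sketch of the generation advance in rcu_grace_period(). The fence is
// a stand-in for zx_system_barrier(ZX_SYSTEM_BARRIER_DATA_MEMORY).
fn advance_generation(generation: &AtomicU64) {
    // Force all prior stores (including pointer updates made by
    // writers) to be visible before any reader can observe the new
    // generation.
    fence(Ordering::SeqCst);
    generation.fetch_add(1, Ordering::Relaxed);
}
```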
| |