atomics: update documentation

Some of the constraints on operand sizes have been relaxed, so adjust the
documentation.

Deprecate atomic_mb_read and atomic_mb_set; it is not really possible to
use them correctly because they do not interoperate with sequentially-consistent
RMW operations.

Finally, extend the memory barrier pairing section to cover acquire and
release semantics in general, roughly based on the KVM Forum 2016 talk,
"<atomic.h> weapons".

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
diff --git a/docs/devel/atomics.rst b/docs/devel/atomics.rst
index 83ed3d6..445c3b3 100644
--- a/docs/devel/atomics.rst
+++ b/docs/devel/atomics.rst
@@ -11,10 +11,15 @@
 The most basic tool is locking.  Mutexes, condition variables and
 semaphores are used in QEMU, and should be the default approach to
 synchronization.  Anything else is considerably harder, but it's
-also justified more often than one would like.  The two tools that
-are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.
+also justified more often than one would like;
+the most performance-critical parts of QEMU in particular require
+a very low level approach to concurrency, involving memory barriers
+and atomic operations.  The semantics of concurrent memory accesses are governed
+by the C11 memory model.
 
-Macros defined by ``qemu/atomic.h`` fall in three camps:
+QEMU provides a header, ``qemu/atomic.h``, which wraps C11 atomics to
+provide better portability and a less verbose syntax.  ``qemu/atomic.h``
+provides macros that fall in three camps:
 
 - compiler barriers: ``barrier()``;
 
@@ -24,13 +29,21 @@
 
 - sequentially consistent atomic access: everything else.
 
+In general, use of ``qemu/atomic.h`` should be wrapped with more easily
+used data structures (e.g. the lock-free singly-linked list operations
+``QSLIST_INSERT_HEAD_ATOMIC`` and ``QSLIST_MOVE_ATOMIC``) or synchronization
+primitives (such as RCU, ``QemuEvent`` or ``QemuLockCnt``).  Bare use of
+atomic operations and memory barriers should be limited to inter-thread
+checking of flags and documented thoroughly.
+
+
 
 Compiler memory barrier
 =======================
 
-``barrier()`` prevents the compiler from moving the memory accesses either
-side of it to the other side.  The compiler barrier has no direct effect
-on the CPU, which may then reorder things however it wishes.
+``barrier()`` prevents the compiler from moving the memory accesses on
+either side of it to the other side.  The compiler barrier has no direct
+effect on the CPU, which may then reorder things however it wishes.
 
 ``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some
 architectures, CPU guarantees are strong enough that blocking compiler
@@ -73,7 +86,8 @@
     typeof(*ptr) atomic_cmpxchg(ptr, old, new)
 
 all of which return the old value of ``*ptr``.  These operations are
-polymorphic; they operate on any type that is as wide as a pointer.
+polymorphic; they operate on any type that is as wide as a pointer or
+smaller.
 
 Similar operations return the new value of ``*ptr``::
 
@@ -85,36 +99,28 @@
     typeof(*ptr) atomic_or_fetch(ptr, val)
     typeof(*ptr) atomic_xor_fetch(ptr, val)
 
-Sequentially consistent loads and stores can be done using::
-
-    atomic_fetch_add(ptr, 0) for loads
-    atomic_xchg(ptr, val) for stores
-
-However, they are quite expensive on some platforms, notably POWER and
-Arm.  Therefore, qemu/atomic.h provides two primitives with slightly
-weaker constraints::
+``qemu/atomic.h`` also provides loads and stores that cannot be reordered
+with each other::
 
     typeof(*ptr) atomic_mb_read(ptr)
     void         atomic_mb_set(ptr, val)
 
-The semantics of these primitives map to Java volatile variables,
-and are strongly related to memory barriers as used in the Linux
-kernel (see below).
+However these do not provide sequential consistency and, in particular,
+they do not participate in the total ordering enforced by
+sequentially-consistent operations.  For this reason they are deprecated.
+They should instead be replaced with any of the following (ordered from
+easiest to hardest):
 
-As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
-be reordered with each other, and it is also not possible to reorder
-"normal" accesses around them.
+- accesses inside a mutex or spinlock
 
-However, and this is the important difference between
-atomic_mb_read/atomic_mb_set and sequential consistency, it is important
-for both threads to access the same volatile variable.  It is not the
-case that everything visible to thread A when it writes volatile field f
-becomes visible to thread B after it reads volatile field g. The store
-and load have to "match" (i.e., be performed on the same volatile
-field) to achieve the right semantics.
+- lightweight synchronization primitives such as ``QemuEvent``
 
+- RCU operations (``atomic_rcu_read``, ``atomic_rcu_set``) when publishing
+  or accessing a new version of a data structure
 
-These operations operate on any type that is as wide as an int or smaller.
+- other atomic accesses: ``atomic_read`` and ``atomic_load_acquire`` for
+  loads, ``atomic_set`` and ``atomic_store_release`` for stores, ``smp_mb``
+  to forbid reordering subsequent loads before a store.
 
 
 Weak atomic access and manual memory barriers
@@ -122,9 +128,24 @@
 
 Compared to sequentially consistent atomic access, programming with
 weaker consistency models can be considerably more complicated.
-In general, if the algorithm you are writing includes both writes
-and reads on the same side, it is generally simpler to use sequentially
-consistent primitives.
+The only guarantees that you can rely upon in this case are:
+
+- atomic accesses will not cause data races (and hence undefined behavior);
+  ordinary accesses instead cause data races if they are concurrent with
+  other accesses of which at least one is a write.  In order to ensure this,
+  the compiler will not optimize accesses out of existence, create unsolicited
+  accesses, or perform other similar optimzations.
+
+- acquire operations will appear to happen, with respect to the other
+  components of the system, before all the LOAD or STORE operations
+  specified afterwards.
+
+- release operations will appear to happen, with respect to the other
+  components of the system, after all the LOAD or STORE operations
+  specified before.
+
+- release operations will *synchronize with* acquire operations;
+  see :ref:`acqrel` for a detailed explanation.
 
 When using this model, variables are accessed with:
 
@@ -142,9 +163,9 @@
 
 - ``atomic_store_release()``, which guarantees the STORE to appear to
   happen, with respect to the other components of the system,
-  after all the LOAD or STORE operations specified afterwards.
+  after all the LOAD or STORE operations specified before.
   Operations coming after ``atomic_store_release()`` can still be
-  reordered after it.
+  reordered before it.
 
 Restrictions to the ordering of accesses can also be specified
 using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
@@ -208,168 +229,188 @@
   dependency and a full read barrier or better is required.
 
 
-This is the set of barriers that is required *between* two ``atomic_read()``
-and ``atomic_set()`` operations to achieve sequential consistency:
+Memory barriers and ``atomic_load_acquire``/``atomic_store_release`` are
+mostly used when a data structure has one thread that is always a writer
+and one thread that is always a reader:
 
-   +----------------+-------------------------------------------------------+
-   |                |                 2nd operation                         |
-   |                +------------------+-----------------+------------------+
-   | 1st operation  | (after last)     | atomic_read     | atomic_set       |
-   +----------------+------------------+-----------------+------------------+
-   | (before first) | ..               | none            | smp_mb_release() |
-   +----------------+------------------+-----------------+------------------+
-   | atomic_read    | smp_mb_acquire() | smp_rmb() [1]_  | [2]_             |
-   +----------------+------------------+-----------------+------------------+
-   | atomic_set     | none             | smp_mb() [3]_   | smp_wmb()        |
-   +----------------+------------------+-----------------+------------------+
+    +----------------------------------+----------------------------------+
+    | thread 1                         | thread 2                         |
+    +==================================+==================================+
+    | ::                               | ::                               |
+    |                                  |                                  |
+    |   atomic_store_release(&a, x);   |   y = atomic_load_acquire(&b);   |
+    |   atomic_store_release(&b, y);   |   x = atomic_load_acquire(&a);   |
+    +----------------------------------+----------------------------------+
 
-   .. [1] Or smp_read_barrier_depends().
+In this case, correctness is easy to check for using the "pairing"
+trick that is explained below.
 
-   .. [2] This requires a load-store barrier.  This is achieved by
-          either smp_mb_acquire() or smp_mb_release().
+Sometimes, a thread is accessing many variables that are otherwise
+unrelated to each other (for example because, apart from the current
+thread, exactly one other thread will read or write each of these
+variables).  In this case, it is possible to "hoist" the barriers
+outside a loop.  For example:
 
-   .. [3] This requires a store-load barrier.  On most machines, the only
-          way to achieve this is a full barrier.
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |   n = 0;                                 |   n = 0;                         |
+    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
+    |     n += atomic_load_acquire(&a[i]);     |     n += atomic_read(&a[i]);     |
+    |                                          |   smp_mb_acquire();              |
+    +------------------------------------------+----------------------------------+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |                                          |   smp_mb_release();              |
+    |   for (i = 0; i < 10; i++)               |   for (i = 0; i < 10; i++)       |
+    |     atomic_store_release(&a[i], false);  |     atomic_set(&a[i], false);    |
+    +------------------------------------------+----------------------------------+
+
+Splitting a loop can also be useful to reduce the number of barriers:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |   n = 0;                                 |     smp_mb_release();            |
+    |   for (i = 0; i < 10; i++) {             |     for (i = 0; i < 10; i++)     |
+    |     atomic_store_release(&a[i], false);  |       atomic_set(&a[i], false);  |
+    |     smp_mb();                            |     smb_mb();                    |
+    |     n += atomic_read(&b[i]);             |     n = 0;                       |
+    |   }                                      |     for (i = 0; i < 10; i++)     |
+    |                                          |       n += atomic_read(&b[i]);   |
+    +------------------------------------------+----------------------------------+
+
+In this case, a ``smp_mb_release()`` is also replaced with a (possibly cheaper, and clearer
+as well) ``smp_wmb()``:
+
+    +------------------------------------------+----------------------------------+
+    | before                                   | after                            |
+    +==========================================+==================================+
+    | ::                                       | ::                               |
+    |                                          |                                  |
+    |                                          |     smp_mb_release();            |
+    |   for (i = 0; i < 10; i++) {             |     for (i = 0; i < 10; i++)     |
+    |     atomic_store_release(&a[i], false);  |       atomic_set(&a[i], false);  |
+    |     atomic_store_release(&b[i], false);  |     smb_wmb();                   |
+    |   }                                      |     for (i = 0; i < 10; i++)     |
+    |                                          |       atomic_set(&b[i], false);  |
+    +------------------------------------------+----------------------------------+
 
 
-You can see that the two possible definitions of ``atomic_mb_read()``
-and ``atomic_mb_set()`` are the following:
+.. _acqrel:
 
-  1) | atomic_mb_read(p)   = atomic_read(p); smp_mb_acquire()
-     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
+Acquire/release pairing and the *synchronizes-with* relation
+------------------------------------------------------------
 
-  2) | atomic_mb_read(p)   = smp_mb() atomic_read(p); smp_mb_acquire()
-     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
+Atomic operations other than ``atomic_set()`` and ``atomic_read()`` have
+either *acquire* or *release* semantics [#rmw]_.  This has two effects:
 
-Usually the former is used, because ``smp_mb()`` is expensive and a program
-normally has more reads than writes.  Therefore it makes more sense to
-make ``atomic_mb_set()`` the more expensive operation.
+.. [#rmw] Read-modify-write operations can have both---acquire applies to the
+          read part, and release to the write.
 
-There are two common cases in which atomic_mb_read and atomic_mb_set
-generate too many memory barriers, and thus it can be useful to manually
-place barriers, or use atomic_load_acquire/atomic_store_release instead:
+- within a thread, they are ordered either before subsequent operations
+  (for acquire) or after previous operations (for release).
 
-- when a data structure has one thread that is always a writer
-  and one thread that is always a reader, manual placement of
-  memory barriers makes the write side faster.  Furthermore,
-  correctness is easy to check for in this case using the "pairing"
-  trick that is explained below:
+- if a release operation in one thread *synchronizes with* an acquire operation
+  in another thread, the ordering constraints propagates from the first to the
+  second thread.  That is, everything before the release operation in the
+  first thread is guaranteed to *happen before* everything after the
+  acquire operation in the second thread.
 
-    +----------------------------------------------------------------------+
-    | thread 1                                                             |
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   (other writes)                  |                                  |
-    |   atomic_mb_set(&a, x)            |   atomic_store_release(&a, x)    |
-    |   atomic_mb_set(&b, y)            |   atomic_store_release(&b, y)    |
-    +-----------------------------------+----------------------------------+
+The concept of acquire and release semantics is not exclusive to atomic
+operations; almost all higher-level synchronization primitives also have
+acquire or release semantics.  For example:
 
-    +----------------------------------------------------------------------+
-    | thread 2                                                             |
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   y = atomic_mb_read(&b)          |   y = atomic_load_acquire(&b)    |
-    |   x = atomic_mb_read(&a)          |   x = atomic_load_acquire(&a)    |
-    |   (other reads)                   |                                  |
-    +-----------------------------------+----------------------------------+
+- ``pthread_mutex_lock`` has acquire semantics, ``pthread_mutex_unlock`` has
+  release semantics and synchronizes with a ``pthread_mutex_lock`` for the
+  same mutex.
 
-  Note that the barrier between the stores in thread 1, and between
-  the loads in thread 2, has been optimized here to a write or a
-  read memory barrier respectively.  On some architectures, notably
-  ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
-  smp_mb, but smp_rmb and/or smp_wmb are more efficient.
+- ``pthread_cond_signal`` and ``pthread_cond_broadcast`` have release semantics;
+  ``pthread_cond_wait`` has both release semantics (synchronizing with
+  ``pthread_mutex_lock``) and acquire semantics (synchronizing with
+  ``pthread_mutex_unlock`` and signaling of the condition variable).
 
-- sometimes, a thread is accessing many variables that are otherwise
-  unrelated to each other (for example because, apart from the current
-  thread, exactly one other thread will read or write each of these
-  variables).  In this case, it is possible to "hoist" the implicit
-  barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
-  a loop.  For example, the above definition ``atomic_mb_read()`` gives
-  the following transformation:
+- ``pthread_create`` has release semantics and synchronizes with the start
+  of the new thread; ``pthread_join`` has acquire semantics and synchronizes
+  with the exiting of the thread.
 
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |   n = 0;                          |   n = 0;                         |
-    |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
-    |     n += atomic_mb_read(&a[i]);   |     n += atomic_read(&a[i]);     |
-    |                                   |   smp_mb_acquire();              |
-    +-----------------------------------+----------------------------------+
+- ``qemu_event_set`` has release semantics, ``qemu_event_wait`` has
+  acquire semantics.
 
-  Similarly, atomic_mb_set() can be transformed as follows:
+For example, in the following example there are no atomic accesses, but still
+thread 2 is relying on the *synchronizes-with* relation between ``pthread_exit``
+(release) and ``pthread_join`` (acquire):
 
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |                                   |   smp_mb_release();              |
-    |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
-    |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
-    |                                   |   smp_mb();                      |
-    +-----------------------------------+----------------------------------+
+      +----------------------+-------------------------------+
+      | thread 1             | thread 2                      |
+      +======================+===============================+
+      | ::                   | ::                            |
+      |                      |                               |
+      |   *a = 1;            |                               |
+      |   pthread_exit(a);   |   pthread_join(thread1, &a);  |
+      |                      |   x = *a;                     |
+      +----------------------+-------------------------------+
 
+Synchronization between threads basically descends from this pairing of
+a release operation and an acquire operation.  Therefore, atomic operations
+other than ``atomic_set()`` and ``atomic_read()`` will almost always be
+paired with another operation of the opposite kind: an acquire operation
+will pair with a release operation and vice versa.  This rule of thumb is
+extremely useful; in the case of QEMU, however, note that the other
+operation may actually be in a driver that runs in the guest!
 
-  The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.
+``smp_read_barrier_depends()``, ``smp_rmb()``, ``smp_mb_acquire()``,
+``atomic_load_acquire()`` and ``atomic_rcu_read()`` all count
+as acquire operations.  ``smp_wmb()``, ``smp_mb_release()``,
+``atomic_store_release()`` and ``atomic_rcu_set()`` all count as release
+operations.  ``smp_mb()`` counts as both acquire and release, therefore
+it can pair with any other atomic operation.  Here is an example:
 
-The two tricks can be combined.  In this case, splitting a loop in
-two lets you hoist the barriers out of the loops _and_ eliminate the
-expensive ``smp_mb()``:
+      +----------------------+------------------------------+
+      | thread 1             | thread 2                     |
+      +======================+==============================+
+      | ::                   | ::                           |
+      |                      |                              |
+      |   atomic_set(&a, 1); |                              |
+      |   smp_wmb();         |                              |
+      |   atomic_set(&b, 2); |   x = atomic_read(&b);       |
+      |                      |   smp_rmb();                 |
+      |                      |   y = atomic_read(&a);       |
+      +----------------------+------------------------------+
 
-    +-----------------------------------+----------------------------------+
-    | before                            | after                            |
-    +===================================+==================================+
-    | ::                                | ::                               |
-    |                                   |                                  |
-    |                                   |     smp_mb_release();            |
-    |   for (i = 0; i < 10; i++) {      |     for (i = 0; i < 10; i++)     |
-    |     atomic_mb_set(&a[i], false);  |       atomic_set(&a[i], false);  |
-    |     atomic_mb_set(&b[i], false);  |     smb_wmb();                   |
-    |   }                               |     for (i = 0; i < 10; i++)     |
-    |                                   |       atomic_set(&a[i], false);  |
-    |                                   |     smp_mb();                    |
-    +-----------------------------------+----------------------------------+
+Note that a load-store pair only counts if the two operations access the
+same variable: that is, a store-release on a variable ``x`` *synchronizes
+with* a load-acquire on a variable ``x``, while a release barrier
+synchronizes with any acquire operation.  The following example shows
+correct synchronization:
 
+      +--------------------------------+--------------------------------+
+      | thread 1                       | thread 2                       |
+      +================================+================================+
+      | ::                             | ::                             |
+      |                                |                                |
+      |   atomic_set(&a, 1);           |                                |
+      |   atomic_store_release(&b, 2); |   x = atomic_load_acquire(&b); |
+      |                                |   y = atomic_read(&a);         |
+      +--------------------------------+--------------------------------+
 
-Memory barrier pairing
-----------------------
-
-A useful rule of thumb is that memory barriers should always, or almost
-always, be paired with another barrier.  In the case of QEMU, however,
-note that the other barrier may actually be in a driver that runs in
-the guest!
-
-For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
-both count as read barriers.  A read barrier shall pair with a write
-barrier or a full barrier; a write barrier shall pair with a read
-barrier or a full barrier.  A full barrier can pair with anything.
-For example:
-
-      +--------------------+------------------------------+
-      | thread 1           | thread 2                     |
-      +====================+==============================+
-      | ::                 | ::                           |
-      |                    |                              |
-      |   a = 1;           |                              |
-      |   smp_wmb();       |                              |
-      |   b = 2;           |   x = b;                     |
-      |                    |   smp_rmb();                 |
-      |                    |   y = a;                     |
-      +--------------------+------------------------------+
+Acquire and release semantics of higher-level primitives can also be
+relied upon for the purpose of establishing the *synchronizes with*
+relation.
 
 Note that the "writing" thread is accessing the variables in the
 opposite order as the "reading" thread.  This is expected: stores
-before the write barrier will normally match the loads after the
-read barrier, and vice versa.  The same is true for more than 2
-access and for data dependency barriers:
+before a release operation will normally match the loads after
+the acquire operation, and vice versa.  In fact, this happened already
+in the ``pthread_exit``/``pthread_join`` example above.
+
+Finally, this more complex example has more than two accesses and data
+dependency barriers.  It also does not use atomic accesses whenever there
+cannot be a data race:
 
       +----------------------+------------------------------+
       | thread 1             | thread 2                     |
@@ -380,19 +421,15 @@
       |   smp_wmb();         |                              |
       |   x->i = 2;          |                              |
       |   smp_wmb();         |                              |
-      |   a = x;             |  x = a;                      |
+      |   atomic_set(&a, x); |  x = atomic_read(&a);        |
       |                      |  smp_read_barrier_depends(); |
       |                      |  y = x->i;                   |
       |                      |  smp_read_barrier_depends(); |
       |                      |  z = b[y];                   |
       +----------------------+------------------------------+
 
-``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``.
-and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.
-
-
-Comparison with Linux kernel memory barriers
-============================================
+Comparison with Linux kernel primitives
+=======================================
 
 Here is a list of differences between Linux kernel atomic operations
 and memory barriers, and the equivalents in QEMU:
@@ -426,19 +463,43 @@
          ``atomic_cmpxchg``    returns the old value of the variable
          ===================== =========================================
 
-  In QEMU, the second kind does not exist.  Currently Linux has
-  atomic_fetch_or only.  QEMU provides and, or, inc, dec, add, sub.
+  In QEMU, the second kind is named ``atomic_OP_fetch``.
 
 - different atomic read-modify-write operations in Linux imply
   a different set of memory barriers; in QEMU, all of them enforce
-  sequential consistency, which means they imply full memory barriers
-  before and after the operation.
+  sequential consistency.
 
-- Linux does not have an equivalent of ``atomic_mb_set()``.  In particular,
-  note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
-  ``atomic_mb_read()`` compiles to the same instructions as Linux's
-  ``smp_load_acquire()``, but this should be treated as an implementation
-  detail.
+- in QEMU, ``atomic_read()`` and ``atomic_set()`` do not participate in
+  the total ordering enforced by sequentially-consistent operations.
+  This is because QEMU uses the C11 memory model.  The following example
+  is correct in Linux but not in QEMU:
+
+      +----------------------------------+--------------------------------+
+      | Linux (correct)                  | QEMU (incorrect)               |
+      +==================================+================================+
+      | ::                               | ::                             |
+      |                                  |                                |
+      |   a = atomic_fetch_add(&x, 2);   |   a = atomic_fetch_add(&x, 2); |
+      |   b = READ_ONCE(&y);             |   b = atomic_read(&y);         |
+      +----------------------------------+--------------------------------+
+
+  because the read of ``y`` can be moved (by either the processor or the
+  compiler) before the write of ``x``.
+
+  Fixing this requires an ``smp_mb()`` memory barrier between the write
+  of ``x`` and the read of ``y``.  In the common case where only one thread
+  writes ``x``, it is also possible to write it like this:
+
+      +--------------------------------+
+      | QEMU (correct)                 |
+      +================================+
+      | ::                             |
+      |                                |
+      |   a = atomic_read(&x);         |
+      |   atomic_set(&x, a + 2);       |
+      |   smp_mb();                    |
+      |   b = atomic_read(&y);         |
+      +--------------------------------+
 
 Sources
 =======