| ========================= |
| Atomic operations in QEMU |
| ========================= |
| |
| CPUs perform independent memory operations effectively in random order. |
| but this can be a problem for CPU-CPU interaction (including interactions |
| between QEMU and the guest). Multi-threaded programs use various tools |
| to instruct the compiler and the CPU to restrict the order to something |
| that is consistent with the expectations of the programmer. |
| |
| The most basic tool is locking. Mutexes, condition variables and |
| semaphores are used in QEMU, and should be the default approach to |
| synchronization. Anything else is considerably harder, but it's |
| also justified more often than one would like. The two tools that |
| are provided by ``qemu/atomic.h`` are memory barriers and atomic operations. |
| |
| Macros defined by ``qemu/atomic.h`` fall in three camps: |
| |
| - compiler barriers: ``barrier()``; |
| |
| - weak atomic access and manual memory barriers: ``atomic_read()``, |
| ``atomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, ``smp_mb_acquire()``, |
| ``smp_mb_release()``, ``smp_read_barrier_depends()``; |
| |
| - sequentially consistent atomic access: everything else. |
| |
| |
| Compiler memory barrier |
| ======================= |
| |
| ``barrier()`` prevents the compiler from moving the memory accesses either |
| side of it to the other side. The compiler barrier has no direct effect |
| on the CPU, which may then reorder things however it wishes. |
| |
| ``barrier()`` is mostly used within ``qemu/atomic.h`` itself. On some |
| architectures, CPU guarantees are strong enough that blocking compiler |
| optimizations already ensures the correct order of execution. In this |
| case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple |
| compiler barriers. |
| |
| Still, ``barrier()`` can be useful when writing code that can be interrupted |
| by signal handlers. |
| |
| |
| Sequentially consistent atomic access |
| ===================================== |
| |
| Most of the operations in the ``qemu/atomic.h`` header ensure *sequential |
| consistency*, where "the result of any execution is the same as if the |
| operations of all the processors were executed in some sequential order, |
| and the operations of each individual processor appear in this sequence |
| in the order specified by its program". |
| |
| ``qemu/atomic.h`` provides the following set of atomic read-modify-write |
| operations:: |
| |
| void atomic_inc(ptr) |
| void atomic_dec(ptr) |
| void atomic_add(ptr, val) |
| void atomic_sub(ptr, val) |
| void atomic_and(ptr, val) |
| void atomic_or(ptr, val) |
| |
| typeof(*ptr) atomic_fetch_inc(ptr) |
| typeof(*ptr) atomic_fetch_dec(ptr) |
| typeof(*ptr) atomic_fetch_add(ptr, val) |
| typeof(*ptr) atomic_fetch_sub(ptr, val) |
| typeof(*ptr) atomic_fetch_and(ptr, val) |
| typeof(*ptr) atomic_fetch_or(ptr, val) |
| typeof(*ptr) atomic_fetch_xor(ptr, val) |
| typeof(*ptr) atomic_fetch_inc_nonzero(ptr) |
| typeof(*ptr) atomic_xchg(ptr, val) |
| typeof(*ptr) atomic_cmpxchg(ptr, old, new) |
| |
| all of which return the old value of ``*ptr``. These operations are |
| polymorphic; they operate on any type that is as wide as a pointer. |
| |
| Similar operations return the new value of ``*ptr``:: |
| |
| typeof(*ptr) atomic_inc_fetch(ptr) |
| typeof(*ptr) atomic_dec_fetch(ptr) |
| typeof(*ptr) atomic_add_fetch(ptr, val) |
| typeof(*ptr) atomic_sub_fetch(ptr, val) |
| typeof(*ptr) atomic_and_fetch(ptr, val) |
| typeof(*ptr) atomic_or_fetch(ptr, val) |
| typeof(*ptr) atomic_xor_fetch(ptr, val) |
| |
| Sequentially consistent loads and stores can be done using:: |
| |
| atomic_fetch_add(ptr, 0) for loads |
| atomic_xchg(ptr, val) for stores |
| |
| However, they are quite expensive on some platforms, notably POWER and |
| Arm. Therefore, qemu/atomic.h provides two primitives with slightly |
| weaker constraints:: |
| |
| typeof(*ptr) atomic_mb_read(ptr) |
| void atomic_mb_set(ptr, val) |
| |
| The semantics of these primitives map to Java volatile variables, |
| and are strongly related to memory barriers as used in the Linux |
| kernel (see below). |
| |
| As long as you use atomic_mb_read and atomic_mb_set, accesses cannot |
| be reordered with each other, and it is also not possible to reorder |
| "normal" accesses around them. |
| |
| However, and this is the important difference between |
| atomic_mb_read/atomic_mb_set and sequential consistency, it is important |
| for both threads to access the same volatile variable. It is not the |
| case that everything visible to thread A when it writes volatile field f |
| becomes visible to thread B after it reads volatile field g. The store |
| and load have to "match" (i.e., be performed on the same volatile |
| field) to achieve the right semantics. |
| |
| |
| These operations operate on any type that is as wide as an int or smaller. |
| |
| |
| Weak atomic access and manual memory barriers |
| ============================================= |
| |
| Compared to sequentially consistent atomic access, programming with |
| weaker consistency models can be considerably more complicated. |
| In general, if the algorithm you are writing includes both writes |
| and reads on the same side, it is generally simpler to use sequentially |
| consistent primitives. |
| |
| When using this model, variables are accessed with: |
| |
| - ``atomic_read()`` and ``atomic_set()``; these prevent the compiler from |
| optimizing accesses out of existence and creating unsolicited |
| accesses, but do not otherwise impose any ordering on loads and |
| stores: both the compiler and the processor are free to reorder |
| them. |
| |
| - ``atomic_load_acquire()``, which guarantees the LOAD to appear to |
| happen, with respect to the other components of the system, |
| before all the LOAD or STORE operations specified afterwards. |
| Operations coming before ``atomic_load_acquire()`` can still be |
| reordered after it. |
| |
| - ``atomic_store_release()``, which guarantees the STORE to appear to |
| happen, with respect to the other components of the system, |
| after all the LOAD or STORE operations specified afterwards. |
| Operations coming after ``atomic_store_release()`` can still be |
| reordered after it. |
| |
| Restrictions to the ordering of accesses can also be specified |
| using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, |
| ``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``. |
| |
| Memory barriers control the order of references to shared memory. |
| They come in six kinds: |
| |
| - ``smp_rmb()`` guarantees that all the LOAD operations specified before |
| the barrier will appear to happen before all the LOAD operations |
| specified after the barrier with respect to the other components of |
| the system. |
| |
| In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not |
| required to have any effect on stores. |
| |
| - ``smp_wmb()`` guarantees that all the STORE operations specified before |
| the barrier will appear to happen before all the STORE operations |
| specified after the barrier with respect to the other components of |
| the system. |
| |
| In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not |
| required to have any effect on loads. |
| |
| - ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before |
| the barrier will appear to happen before all the LOAD or STORE operations |
| specified after the barrier with respect to the other components of |
| the system. |
| |
| - ``smp_mb_release()`` guarantees that all the STORE operations specified *after* |
| the barrier will appear to happen after all the LOAD or STORE operations |
| specified *before* the barrier with respect to the other components of |
| the system. |
| |
| - ``smp_mb()`` guarantees that all the LOAD and STORE operations specified |
| before the barrier will appear to happen before all the LOAD and |
| STORE operations specified after the barrier with respect to the other |
| components of the system. |
| |
| ``smp_mb()`` puts a partial ordering on both loads and stores. It is |
| stronger than both a read and a write memory barrier; it implies both |
| ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs |
| coming before the barrier from overtaking LOADs coming after the |
| barrier and vice versa. |
| |
| - ``smp_read_barrier_depends()`` is a weaker kind of read barrier. On |
| most processors, whenever two loads are performed such that the |
| second depends on the result of the first (e.g., the first load |
| retrieves the address to which the second load will be directed), |
| the processor will guarantee that the first LOAD will appear to happen |
| before the second with respect to the other components of the system. |
| However, this is not always true---for example, it was not true on |
| Alpha processors. Whenever this kind of access happens to shared |
| memory (that is not protected by a lock), a read barrier is needed, |
| and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``. |
| |
| Note that the first load really has to have a _data_ dependency and not |
| a control dependency. If the address for the second load is dependent |
| on the first load, but the dependency is through a conditional rather |
| than actually loading the address itself, then it's a _control_ |
| dependency and a full read barrier or better is required. |
| |
| |
| This is the set of barriers that is required *between* two ``atomic_read()`` |
| and ``atomic_set()`` operations to achieve sequential consistency: |
| |
| +----------------+-------------------------------------------------------+ |
| | | 2nd operation | |
| | +------------------+-----------------+------------------+ |
| | 1st operation | (after last) | atomic_read | atomic_set | |
| +----------------+------------------+-----------------+------------------+ |
| | (before first) | .. | none | smp_mb_release() | |
| +----------------+------------------+-----------------+------------------+ |
| | atomic_read | smp_mb_acquire() | smp_rmb() [1]_ | [2]_ | |
| +----------------+------------------+-----------------+------------------+ |
| | atomic_set | none | smp_mb() [3]_ | smp_wmb() | |
| +----------------+------------------+-----------------+------------------+ |
| |
| .. [1] Or smp_read_barrier_depends(). |
| |
| .. [2] This requires a load-store barrier. This is achieved by |
| either smp_mb_acquire() or smp_mb_release(). |
| |
| .. [3] This requires a store-load barrier. On most machines, the only |
| way to achieve this is a full barrier. |
| |
| |
| You can see that the two possible definitions of ``atomic_mb_read()`` |
| and ``atomic_mb_set()`` are the following: |
| |
| 1) | atomic_mb_read(p) = atomic_read(p); smp_mb_acquire() |
| | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb() |
| |
| 2) | atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire() |
| | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); |
| |
| Usually the former is used, because ``smp_mb()`` is expensive and a program |
| normally has more reads than writes. Therefore it makes more sense to |
| make ``atomic_mb_set()`` the more expensive operation. |
| |
| There are two common cases in which atomic_mb_read and atomic_mb_set |
| generate too many memory barriers, and thus it can be useful to manually |
| place barriers, or use atomic_load_acquire/atomic_store_release instead: |
| |
| - when a data structure has one thread that is always a writer |
| and one thread that is always a reader, manual placement of |
| memory barriers makes the write side faster. Furthermore, |
| correctness is easy to check for in this case using the "pairing" |
| trick that is explained below: |
| |
| +----------------------------------------------------------------------+ |
| | thread 1 | |
| +-----------------------------------+----------------------------------+ |
| | before | after | |
| +===================================+==================================+ |
| | :: | :: | |
| | | | |
| | (other writes) | | |
| | atomic_mb_set(&a, x) | atomic_store_release(&a, x) | |
| | atomic_mb_set(&b, y) | atomic_store_release(&b, y) | |
| +-----------------------------------+----------------------------------+ |
| |
| +----------------------------------------------------------------------+ |
| | thread 2 | |
| +-----------------------------------+----------------------------------+ |
| | before | after | |
| +===================================+==================================+ |
| | :: | :: | |
| | | | |
| | y = atomic_mb_read(&b) | y = atomic_load_acquire(&b) | |
| | x = atomic_mb_read(&a) | x = atomic_load_acquire(&a) | |
| | (other reads) | | |
| +-----------------------------------+----------------------------------+ |
| |
| Note that the barrier between the stores in thread 1, and between |
| the loads in thread 2, has been optimized here to a write or a |
| read memory barrier respectively. On some architectures, notably |
| ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as |
| smp_mb, but smp_rmb and/or smp_wmb are more efficient. |
| |
| - sometimes, a thread is accessing many variables that are otherwise |
| unrelated to each other (for example because, apart from the current |
| thread, exactly one other thread will read or write each of these |
| variables). In this case, it is possible to "hoist" the implicit |
| barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside |
| a loop. For example, the above definition ``atomic_mb_read()`` gives |
| the following transformation: |
| |
| +-----------------------------------+----------------------------------+ |
| | before | after | |
| +===================================+==================================+ |
| | :: | :: | |
| | | | |
| | n = 0; | n = 0; | |
| | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) | |
| | n += atomic_mb_read(&a[i]); | n += atomic_read(&a[i]); | |
| | | smp_mb_acquire(); | |
| +-----------------------------------+----------------------------------+ |
| |
| Similarly, atomic_mb_set() can be transformed as follows: |
| |
| +-----------------------------------+----------------------------------+ |
| | before | after | |
| +===================================+==================================+ |
| | :: | :: | |
| | | | |
| | | smp_mb_release(); | |
| | for (i = 0; i < 10; i++) | for (i = 0; i < 10; i++) | |
| | atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); | |
| | | smp_mb(); | |
| +-----------------------------------+----------------------------------+ |
| |
| |
| The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``. |
| |
| The two tricks can be combined. In this case, splitting a loop in |
| two lets you hoist the barriers out of the loops _and_ eliminate the |
| expensive ``smp_mb()``: |
| |
| +-----------------------------------+----------------------------------+ |
| | before | after | |
| +===================================+==================================+ |
| | :: | :: | |
| | | | |
| | | smp_mb_release(); | |
| | for (i = 0; i < 10; i++) { | for (i = 0; i < 10; i++) | |
| | atomic_mb_set(&a[i], false); | atomic_set(&a[i], false); | |
| | atomic_mb_set(&b[i], false); | smb_wmb(); | |
| | } | for (i = 0; i < 10; i++) | |
| | | atomic_set(&a[i], false); | |
| | | smp_mb(); | |
| +-----------------------------------+----------------------------------+ |
| |
| |
| Memory barrier pairing |
| ---------------------- |
| |
| A useful rule of thumb is that memory barriers should always, or almost |
| always, be paired with another barrier. In the case of QEMU, however, |
| note that the other barrier may actually be in a driver that runs in |
| the guest! |
| |
| For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()`` |
| both count as read barriers. A read barrier shall pair with a write |
| barrier or a full barrier; a write barrier shall pair with a read |
| barrier or a full barrier. A full barrier can pair with anything. |
| For example: |
| |
| +--------------------+------------------------------+ |
| | thread 1 | thread 2 | |
| +====================+==============================+ |
| | :: | :: | |
| | | | |
| | a = 1; | | |
| | smp_wmb(); | | |
| | b = 2; | x = b; | |
| | | smp_rmb(); | |
| | | y = a; | |
| +--------------------+------------------------------+ |
| |
| Note that the "writing" thread is accessing the variables in the |
| opposite order as the "reading" thread. This is expected: stores |
| before the write barrier will normally match the loads after the |
| read barrier, and vice versa. The same is true for more than 2 |
| access and for data dependency barriers: |
| |
| +----------------------+------------------------------+ |
| | thread 1 | thread 2 | |
| +======================+==============================+ |
| | :: | :: | |
| | | | |
| | b[2] = 1; | | |
| | smp_wmb(); | | |
| | x->i = 2; | | |
| | smp_wmb(); | | |
| | a = x; | x = a; | |
| | | smp_read_barrier_depends(); | |
| | | y = x->i; | |
| | | smp_read_barrier_depends(); | |
| | | z = b[y]; | |
| +----------------------+------------------------------+ |
| |
| ``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``. |
| and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``. |
| |
| |
| Comparison with Linux kernel memory barriers |
| ============================================ |
| |
| Here is a list of differences between Linux kernel atomic operations |
| and memory barriers, and the equivalents in QEMU: |
| |
| - atomic operations in Linux are always on a 32-bit int type and |
| use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic |
| and use normal C types. |
| |
| - Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee |
| at all. Linux 4.1 updated them to implement volatile |
| semantics via ``ACCESS_ONCE`` (or the more recent ``READ``/``WRITE_ONCE``). |
| |
| QEMU's ``atomic_read`` and ``atomic_set`` implement C11 atomic relaxed |
| semantics if the compiler supports it, and volatile semantics otherwise. |
| Both semantics prevent the compiler from doing certain transformations; |
| the difference is that atomic accesses are guaranteed to be atomic, |
| while volatile accesses aren't. Thus, in the volatile case we just cross |
| our fingers hoping that the compiler will generate atomic accesses, |
| since we assume the variables passed are machine-word sized and |
| properly aligned. |
| |
| No barriers are implied by ``atomic_read`` and ``atomic_set`` in either Linux |
| or QEMU. |
| |
| - atomic read-modify-write operations in Linux are of three kinds: |
| |
| ===================== ========================================= |
| ``atomic_OP`` returns void |
| ``atomic_OP_return`` returns new value of the variable |
| ``atomic_fetch_OP`` returns the old value of the variable |
| ``atomic_cmpxchg`` returns the old value of the variable |
| ===================== ========================================= |
| |
| In QEMU, the second kind does not exist. Currently Linux has |
| atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub. |
| |
| - different atomic read-modify-write operations in Linux imply |
| a different set of memory barriers; in QEMU, all of them enforce |
| sequential consistency, which means they imply full memory barriers |
| before and after the operation. |
| |
| - Linux does not have an equivalent of ``atomic_mb_set()``. In particular, |
| note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``. |
| ``atomic_mb_read()`` compiles to the same instructions as Linux's |
| ``smp_load_acquire()``, but this should be treated as an implementation |
| detail. |
| |
| Sources |
| ======= |
| |
| - ``Documentation/memory-barriers.txt`` from the Linux kernel |