This doc describes how we implement a complete memory and speculation barrier on various architectures. These barriers serialize the thread of execution by preventing any later instructions from starting until all previous instructions and all memory operations have entirely finished.
The memory operations we want to wait for include not only reads and writes but also cache flushes.
We need a barrier like this in a few places:
... (AC) bit in EFLAGS on x86.
MFENCE; LFENCE
The load fence (LFENCE) instruction serializes the instruction stream; that is, it waits for all prior instructions to complete before allowing any later instructions to start.
However, on x86 a store is considered “complete” when it starts (enters the store buffer), not when it's visible. And, as the name suggests, LFENCE doesn't take special care to wait for stores to reach memory. In order to serialize all memory operations we also need a memory fence (MFENCE), which waits for all prior reads, writes, and CLFLUSH and CLFLUSHOPT instructions to be entirely finished before completing.
The LFENCE has to come after the MFENCE because only LFENCE serializes the instruction stream.
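As a minimal sketch of the sequence described above, assuming a GCC- or Clang-compatible compiler (for the extended inline-assembly syntax), the x86 barrier could look roughly like this. The function name `X86MemoryAndSpeculationBarrier` is hypothetical and not taken from this doc. On non-x86 targets the sketch degrades to a compiler-only barrier purely so that it still compiles.

```cpp
// Sketch only: assumes GCC/Clang extended inline assembly.
// The function name is hypothetical.
inline void X86MemoryAndSpeculationBarrier() {
#if defined(__x86_64__) || defined(__i386__)
  // MFENCE waits for prior reads, writes and cache-line flushes to finish.
  // LFENCE comes second because it is the instruction that serializes the
  // instruction stream. The "memory" clobber additionally stops the compiler
  // from moving memory accesses across the barrier.
  asm volatile("mfence\n\tlfence" ::: "memory");
#else
  // Not x86: compiler-only barrier so the sketch still builds.
  // This is NOT a hardware barrier.
  asm volatile("" ::: "memory");
#endif
}
```

Both instructions are emitted from a single `asm` statement so the compiler cannot schedule anything between them.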
LFENCE on AMD
Technically, LFENCE is not documented as serializing the instruction stream on AMD processors. In early 2018 AMD described a model-specific register (MSR) that system software can set to make LFENCE serializing on all existing and future AMD processors. (Software techniques for managing speculation on AMD processors, page 3, mitigation G-2.)
This MSR has been set in Linux since kernel version 4.15.
Another option for an unprivileged, fully serializing instruction is CPUID. One reason to avoid CPUID is that it can cause an exit to the hypervisor when running in a virtual machine. Hosts use this to control which CPU features are discoverable in the guest VM, sometimes to present a homogeneous level of functionality when VMs can be migrated across different hosts. The net result is that CPUID can be very slow with high variance.
For the specific case of timing an instruction sequence, we could use RDTSCP. Before reading from the timestamp counter, RDTSCP first waits for all previous instructions and loads from memory to finish. It does not stop later instructions from starting, so it can't be used to build a full barrier.
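To illustrate the timing use case, here is a hedged sketch that reads the timestamp counter with RDTSCP. Following it with LFENCE, so that the timed code cannot start before the counter has been read, is an assumption of this sketch rather than something this doc prescribes. The names `ReadTimestamp` and `TimeOnce` are hypothetical, GCC/Clang inline assembly is assumed, and non-x86 targets fall back to `std::chrono::steady_clock` only so the sketch compiles.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: returns a timestamp for coarse cycle-level timing.
inline std::uint64_t ReadTimestamp() {
#if defined(__x86_64__) || defined(__i386__)
  std::uint32_t lo, hi, aux;
  // RDTSCP waits for earlier instructions and loads, then writes the counter
  // to EDX:EAX and IA32_TSC_AUX to ECX. The trailing LFENCE is this sketch's
  // addition: it keeps later instructions from starting before the read.
  asm volatile("rdtscp\n\tlfence"
               : "=a"(lo), "=d"(hi), "=c"(aux)
               :
               : "memory");
  (void)aux;  // IA32_TSC_AUX is not needed here.
  return (static_cast<std::uint64_t>(hi) << 32) | lo;
#else
  // Not x86: fall back to a monotonic clock (ticks, not cycles).
  return static_cast<std::uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Hypothetical helper: times a single invocation of `fn`.
template <typename Fn>
std::uint64_t TimeOnce(Fn&& fn) {
  const std::uint64_t start = ReadTimestamp();
  fn();
  const std::uint64_t end = ReadTimestamp();
  return end - start;
}
```

Because RDTSCP does not wait for earlier stores to become visible, a measurement that depends on stores or cache flushes having finished would still need the full barrier.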
DSB SY; ISB
The data synchronization barrier (DSB) instruction waits for all memory accesses and “cache maintenance instructions” to finish before completing, and prevents instructions later in program order from beginning almost any work until the DSB completes.
The two exceptions are:
These might seem innocuous, but empirically we've seen the second item extends to registers we want to wait to read, e.g. the timestamp counter.
To build a complete barrier, we add the instruction synchronization barrier (ISB) instruction. ISB ensures that all later instructions are fetched from memory and decoded after the ISB completes and that “context-changing operations” executed before the ISB are visible to instructions after.
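A minimal sketch of this pair, assuming an AArch64 target and a GCC- or Clang-compatible compiler, might look as follows. The function name `Arm64MemoryAndSpeculationBarrier` is hypothetical. On other targets the sketch degrades to a compiler-only barrier so that it still compiles.

```cpp
// Sketch only: assumes GCC/Clang extended inline assembly.
// The function name is hypothetical.
inline void Arm64MemoryAndSpeculationBarrier() {
#if defined(__aarch64__)
  // DSB SY waits for outstanding memory accesses and cache maintenance
  // instructions across the full system. ISB then makes sure later
  // instructions are fetched and decoded only after it completes.
  asm volatile("dsb sy\n\tisb" ::: "memory");
#else
  // Not AArch64: compiler-only barrier so the sketch still builds.
  // This is NOT a hardware barrier.
  asm volatile("" ::: "memory");
#endif
}
```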
ISYNC; SYNC
The synchronize (SYNC) instruction waits for all preceding instructions to complete before any subsequent instructions are initiated. It also waits until almost all preceding memory operations have completed, with the exception of those initiated by “instruction cache block invalidate” (ICBI), i.e. instruction cache flush.
To wait for these last accesses, we also issue the instruction synchronize (ISYNC) instruction. ISYNC has the same serializing effect on the instruction stream as SYNC, but doesn't enforce order of any memory accesses except those caused by a preceding ICBI.
There's no obvious reason the order of these two instructions should matter. Linux uses ISYNC; SYNC.
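As a rough sketch of this sequence in the ISYNC; SYNC order, assuming a PowerPC target and a GCC- or Clang-compatible compiler, the barrier might be written like this. The function name `PowerPcMemoryAndSpeculationBarrier` is hypothetical. On other targets the sketch degrades to a compiler-only barrier so that it still compiles.

```cpp
// Sketch only: assumes GCC/Clang extended inline assembly.
// The function name is hypothetical.
inline void PowerPcMemoryAndSpeculationBarrier() {
#if defined(__powerpc__) || defined(__powerpc64__)
  // ISYNC covers accesses caused by a preceding ICBI; SYNC waits for the
  // remaining memory operations and serializes the instruction stream.
  asm volatile("isync\n\tsync" ::: "memory");
#else
  // Not PowerPC: compiler-only barrier so the sketch still builds.
  // This is NOT a hardware barrier.
  asm volatile("" ::: "memory");
#endif
}
```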
The barrier must never be implemented as an indirect function call (e.g. vtable lookup or shared library export), since it's possible for the call itself to be mispredicted and for speculative execution to continue in an unintended direction.
It's safest for implementations to always be inlined into the caller.
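One way to follow this advice is to force inlining with a compiler attribute. The sketch below assumes a GCC- or Clang-compatible compiler. The macro `SKETCH_FORCE_INLINE` and the functions `MemoryAndSpeculationBarrier` and `LoadAfterBarrier` are hypothetical names chosen for illustration. It picks the instruction sequence per architecture and fails the build on anything else rather than silently providing no barrier.

```cpp
#if defined(__GNUC__) || defined(__clang__)
// always_inline asks the compiler to expand the body at every call site, so
// there is no call (direct or indirect) that could be mispredicted.
#define SKETCH_FORCE_INLINE inline __attribute__((always_inline))
#else
#error "This sketch assumes a GCC- or Clang-compatible compiler."
#endif

SKETCH_FORCE_INLINE void MemoryAndSpeculationBarrier() {
#if defined(__x86_64__) || defined(__i386__)
  asm volatile("mfence\n\tlfence" ::: "memory");
#elif defined(__aarch64__)
  asm volatile("dsb sy\n\tisb" ::: "memory");
#elif defined(__powerpc__) || defined(__powerpc64__)
  asm volatile("isync\n\tsync" ::: "memory");
#else
#error "No barrier sequence is known for this architecture."
#endif
}

// Hypothetical caller: the barrier is expanded inline ahead of the load.
inline int LoadAfterBarrier(const volatile int* p) {
  MemoryAndSpeculationBarrier();
  return *p;
}
```

Keeping the definition in a header, rather than exporting it from a shared library or placing it behind a virtual method, is what allows the compiler to inline it at every call site.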