docs/devel/multi-thread-tcg.txt - third_party/qemu - Git at Google

 Copyright (c) 2015-2016 Linaro Ltd.

 This work is licensed under the terms of the GNU GPL, version 2 or
 later. See the COPYING file in the top-level directory.

 Introduction
 ============

 This document outlines the design for multi-threaded TCG system-mode
 emulation. The current user-mode emulation mirrors the thread
 structure of the translated executable. Some of the work will be
 applicable to both system and linux-user emulation.

 The original system-mode TCG implementation was single threaded and
 dealt with multiple CPUs with simple round-robin scheduling. This
 simplified a lot of things but became increasingly limited as systems
 being emulated gained additional cores and per-core performance gains
 for host systems started to level off.

 vCPU Scheduling
 ===============

 We introduce a new running mode where each vCPU will run on its own
 user-space thread. This will be enabled by default for all FE/BE
 combinations that have had the required work done to support this
 safely.

 In the general case of running translated code there should be no
 inter-vCPU dependencies and all vCPUs should be able to run at full
 speed. Synchronisation will only be required while accessing internal
 shared data structures or when the emulated architecture requires a
 coherent representation of the emulated machine state.

 Shared Data Structures
 ======================

 Main Run Loop
 -------------

 Even when there is no code being generated there are a number of
 structures associated with the hot-path through the main run-loop.
 These are associated with looking up the next translation block to
 execute. These include:

     tb_jmp_cache (per-vCPU, cache of recent jumps)
     tb_ctx.htable (global hash table, phys address->tb lookup)

 As TB linking only occurs when blocks are in the same page this code
 is critical to performance as looking up the next TB to execute is the
 most common reason to exit the generated code.

 DESIGN REQUIREMENT: Make access to lookup structures safe with
 multiple reader/writer threads. Minimise any lock contention to do it.

 The hot-path avoids using locks where possible. The tb_jmp_cache is
 updated with atomic accesses to ensure consistent results. The fall
 back QHT based hash table is also designed for lockless lookups. Locks
 are only taken when code generation is required or TranslationBlocks
 have their block-to-block jumps patched.

 Global TCG State
 ----------------

 ### User-mode emulation
 We need to protect the entire code generation cycle including any post
 generation patching of the translated code. This also implies a shared
 translation buffer which contains code running on all cores. Any
 execution path that comes to the main run loop will need to hold a
 mutex for code generation. This also includes times when we need flush
 code or entries from any shared lookups/caches. Structures held on a
 per-vCPU basis won't need locking unless other vCPUs will need to
 modify them.

 DESIGN REQUIREMENT: Add locking around all code generation and TB
 patching.

 (Current solution)

 Code generation is serialised with mmap_lock().

 ### !User-mode emulation
 Each vCPU has its own TCG context and associated TCG region, thereby
 requiring no locking.

 Translation Blocks
 ------------------

 Currently the whole system shares a single code generation buffer
 which when full will force a flush of all translations and start from
 scratch again. Some operations also force a full flush of translations
 including:

   - debugging operations (breakpoint insertion/removal)
   - some CPU helper functions

 This is done with the async_safe_run_on_cpu() mechanism to ensure all
 vCPUs are quiescent when changes are being made to shared global
 structures.

 More granular translation invalidation events are typically due
 to a change of the state of a physical page:

   - code modification (self modify code, patching code)
   - page changes (new page mapping in linux-user mode)

 While setting the invalid flag in a TranslationBlock will stop it
 being used when looked up in the hot-path there are a number of other
 book-keeping structures that need to be safely cleared.

 Any TranslationBlocks which have been patched to jump directly to the
 now invalid blocks need the jump patches reversing so they will return
 to the C code.

 There are a number of look-up caches that need to be properly updated
 including the:

   - jump lookup cache
   - the physical-to-tb lookup hash table
   - the global page table

 The global page table (l1_map) which provides a multi-level look-up
 for PageDesc structures which contain pointers to the start of a
 linked list of all Translation Blocks in that page (see page_next).

 Both the jump patching and the page cache involve linked lists that
 the invalidated TranslationBlock needs to be removed from.

 DESIGN REQUIREMENT: Safely handle invalidation of TBs
                       - safely patch/revert direct jumps
                       - remove central PageDesc lookup entries
                       - ensure lookup caches/hashes are safely updated

 (Current solution)

 The direct jump themselves are updated atomically by the TCG
 tb_set_jmp_target() code. Modification to the linked lists that allow
 searching for linked pages are done under the protection of tb->jmp_lock,
 where tb is the destination block of a jump. Each origin block keeps a
 pointer to its destinations so that the appropriate lock can be acquired before
 iterating over a jump list.

 The global page table is a lockless radix tree; cmpxchg is used
 to atomically insert new elements.

 The lookup caches are updated atomically and the lookup hash uses QHT
 which is designed for concurrent safe lookup.

 Parallel code generation is supported. QHT is used at insertion time
 as the synchronization point across threads, thereby ensuring that we only
 keep track of a single TranslationBlock for each guest code block.

 Memory maps and TLBs
 --------------------

 The memory handling code is fairly critical to the speed of memory
 access in the emulated system. The SoftMMU code is designed so the
 hot-path can be handled entirely within translated code. This is
 handled with a per-vCPU TLB structure which once populated will allow
 a series of accesses to the page to occur without exiting the
 translated code. It is possible to set flags in the TLB address which
 will ensure the slow-path is taken for each access. This can be done
 to support:

   - Memory regions (dividing up access to PIO, MMIO and RAM)
   - Dirty page tracking (for code gen, SMC detection, migration and display)
   - Virtual TLB (for translating guest address->real address)

 When the TLB tables are updated by a vCPU thread other than their own
 we need to ensure it is done in a safe way so no inconsistent state is
 seen by the vCPU thread.

 Some operations require updating a number of vCPUs TLBs at the same
 time in a synchronised manner.

 DESIGN REQUIREMENTS:

   - TLB Flush All/Page
     - can be across-vCPUs
     - cross vCPU TLB flush may need other vCPU brought to halt
     - change may need to be visible to the calling vCPU immediately
   - TLB Flag Update
     - usually cross-vCPU
     - want change to be visible as soon as possible
   - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
     - This is a per-vCPU table - by definition can't race
     - updated by its own thread when the slow-path is forced

 (Current solution)

 We have updated cputlb.c to defer operations when a cross-vCPU
 operation with async_run_on_cpu() which ensures each vCPU sees a
 coherent state when it next runs its work (in a few instructions
 time).

 A new set up operations (tlb_flush_*_all_cpus) take an additional flag
 which when set will force synchronisation by setting the source vCPUs
 work as "safe work" and exiting the cpu run loop. This ensure by the
 time execution restarts all flush operations have completed.

 TLB flag updates are all done atomically and are also protected by the
 corresponding page lock.

 (Known limitation)

 Not really a limitation but the wait mechanism is overly strict for
 some architectures which only need flushes completed by a barrier
 instruction. This could be a future optimisation.

 Emulated hardware state
 -----------------------

 Currently thanks to KVM work any access to IO memory is automatically
 protected by the global iothread mutex, also known as the BQL (Big
 Qemu Lock). Any IO region that doesn't use global mutex is expected to
 do its own locking.

 However IO memory isn't the only way emulated hardware state can be
 modified. Some architectures have model specific registers that
 trigger hardware emulation features. Generally any translation helper
 that needs to update more than a single vCPUs of state should take the
 BQL.

 As the BQL, or global iothread mutex is shared across the system we
 push the use of the lock as far down into the TCG code as possible to
 minimise contention.

 (Current solution)

 MMIO access automatically serialises hardware emulation by way of the
 BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
 and also defer the reset/startup of vCPUs to the vCPU context by way
 of async_run_on_cpu().

 Updates to interrupt state are also protected by the BQL as they can
 often be cross vCPU.

 Memory Consistency
 ==================

 Between emulated guests and host systems there are a range of memory
 consistency models. Even emulating weakly ordered systems on strongly
 ordered hosts needs to ensure things like store-after-load re-ordering
 can be prevented when the guest wants to.

 Memory Barriers
 ---------------

 Barriers (sometimes known as fences) provide a mechanism for software
 to enforce a particular ordering of memory operations from the point
 of view of external observers (e.g. another processor core). They can
 apply to any memory operations as well as just loads or stores.

 The Linux kernel has an excellent write-up on the various forms of
 memory barrier and the guarantees they can provide [1].

 Barriers are often wrapped around synchronisation primitives to
 provide explicit memory ordering semantics. However they can be used
 by themselves to provide safe lockless access by ensuring for example
 a change to a signal flag will only be visible once the changes to
 payload are.

 DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

 This would enforce a strong load/store ordering so all loads/stores
 complete at the memory barrier. On single-core non-SMP strongly
 ordered backends this could become a NOP.

 Aside from explicit standalone memory barrier instructions there are
 also implicit memory ordering semantics which comes with each guest
 memory access instruction. For example all x86 load/stores come with
 fairly strong guarantees of sequential consistency whereas Arm has
 special variants of load/store instructions that imply acquire/release
 semantics.

 In the case of a strongly ordered guest architecture being emulated on
 a weakly ordered host the scope for a heavy performance impact is
 quite high.

 DESIGN REQUIREMENTS: Be efficient with use of memory barriers
        - host systems with stronger implied guarantees can skip some barriers
        - merge consecutive barriers to the strongest one

 (Current solution)

 The system currently has a tcg_gen_mb() which will add memory barrier
 operations if code generation is being done in a parallel context. The
 tcg_optimize() function attempts to merge barriers up to their
 strongest form before any load/store operations. The solution was
 originally developed and tested for linux-user based systems. All
 backends have been converted to emit fences when required. So far the
 following front-ends have been updated to emit fences when required:

     - target-i386
     - target-arm
     - target-aarch64
     - target-alpha
     - target-mips

 Memory Control and Maintenance
 ------------------------------

 This includes a class of instructions for controlling system cache
 behaviour. While QEMU doesn't model cache behaviour these instructions
 are often seen when code modification has taken place to ensure the
 changes take effect.

 Synchronisation Primitives
 --------------------------

 There are two broad types of synchronisation primitives found in
 modern ISAs: atomic instructions and exclusive regions.

 The first type offer a simple atomic instruction which will guarantee
 some sort of test and conditional store will be truly atomic w.r.t.
 other cores sharing access to the memory. The classic example is the
 x86 cmpxchg instruction.

 The second type offer a pair of load/store instructions which offer a
 guarantee that a region of memory has not been touched between the
 load and store instructions. An example of this is Arm's ldrex/strex
 pair where the strex instruction will return a flag indicating a
 successful store only if no other CPU has accessed the memory region
 since the ldrex.

 Traditionally TCG has generated a series of operations that work
 because they are within the context of a single translation block so
 will have completed before another CPU is scheduled. However with
 the ability to have multiple threads running to emulate multiple CPUs
 we will need to explicitly expose these semantics.

 DESIGN REQUIREMENTS:
   - Support classic atomic instructions
   - Support load/store exclusive (or load link/store conditional) pairs
   - Generic enough infrastructure to support all guest architectures
 CURRENT OPEN QUESTIONS:
   - How problematic is the ABA problem in general?

 (Current solution)

 The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
 can be used directly or combined to emulate other instructions like
 Arm's ldrex/strex instructions. While they are susceptible to the ABA
 problem so far common guests have not implemented patterns where
 this may be a problem - typically presenting a locking ABI which
 assumes cmpxchg like semantics.

 The code also includes a fall-back for cases where multi-threaded TCG
 ops can't work (e.g. guest atomic width > host atomic width). In this
 case an EXCP_ATOMIC exit occurs and the instruction is emulated with
 an exclusive lock which ensures all emulation is serialised.

 While the atomic helpers look good enough for now there may be a need
 to look at solutions that can more closely model the guest
 architectures semantics.

 ==========

 [1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
	Copyright (c) 2015-2016 Linaro Ltd.

	This work is licensed under the terms of the GNU GPL, version 2 or
	later. See the COPYING file in the top-level directory.

	Introduction
	============

	This document outlines the design for multi-threaded TCG system-mode
	emulation. The current user-mode emulation mirrors the thread
	structure of the translated executable. Some of the work will be
	applicable to both system and linux-user emulation.

	The original system-mode TCG implementation was single threaded and
	dealt with multiple CPUs with simple round-robin scheduling. This
	simplified a lot of things but became increasingly limited as systems
	being emulated gained additional cores and per-core performance gains
	for host systems started to level off.

	vCPU Scheduling
	===============

	We introduce a new running mode where each vCPU will run on its own
	user-space thread. This will be enabled by default for all FE/BE
	combinations that have had the required work done to support this
	safely.

	In the general case of running translated code there should be no
	inter-vCPU dependencies and all vCPUs should be able to run at full
	speed. Synchronisation will only be required while accessing internal
	shared data structures or when the emulated architecture requires a
	coherent representation of the emulated machine state.

	Shared Data Structures
	======================

	Main Run Loop
	-------------

	Even when there is no code being generated there are a number of
	structures associated with the hot-path through the main run-loop.
	These are associated with looking up the next translation block to
	execute. These include:

	tb_jmp_cache (per-vCPU, cache of recent jumps)
	tb_ctx.htable (global hash table, phys address->tb lookup)

	As TB linking only occurs when blocks are in the same page this code
	is critical to performance as looking up the next TB to execute is the
	most common reason to exit the generated code.

	DESIGN REQUIREMENT: Make access to lookup structures safe with
	multiple reader/writer threads. Minimise any lock contention to do it.

	The hot-path avoids using locks where possible. The tb_jmp_cache is
	updated with atomic accesses to ensure consistent results. The fall
	back QHT based hash table is also designed for lockless lookups. Locks
	are only taken when code generation is required or TranslationBlocks
	have their block-to-block jumps patched.

	Global TCG State
	----------------

	### User-mode emulation
	We need to protect the entire code generation cycle including any post
	generation patching of the translated code. This also implies a shared
	translation buffer which contains code running on all cores. Any
	execution path that comes to the main run loop will need to hold a
	mutex for code generation. This also includes times when we need flush
	code or entries from any shared lookups/caches. Structures held on a
	per-vCPU basis won't need locking unless other vCPUs will need to
	modify them.

	DESIGN REQUIREMENT: Add locking around all code generation and TB
	patching.

	(Current solution)

	Code generation is serialised with mmap_lock().

	### !User-mode emulation
	Each vCPU has its own TCG context and associated TCG region, thereby
	requiring no locking.

	Translation Blocks
	------------------

	Currently the whole system shares a single code generation buffer
	which when full will force a flush of all translations and start from
	scratch again. Some operations also force a full flush of translations
	including:

	- debugging operations (breakpoint insertion/removal)
	- some CPU helper functions

	This is done with the async_safe_run_on_cpu() mechanism to ensure all
	vCPUs are quiescent when changes are being made to shared global
	structures.

	More granular translation invalidation events are typically due
	to a change of the state of a physical page:

	- code modification (self modify code, patching code)
	- page changes (new page mapping in linux-user mode)

	While setting the invalid flag in a TranslationBlock will stop it
	being used when looked up in the hot-path there are a number of other
	book-keeping structures that need to be safely cleared.

	Any TranslationBlocks which have been patched to jump directly to the
	now invalid blocks need the jump patches reversing so they will return
	to the C code.

	There are a number of look-up caches that need to be properly updated
	including the:

	- jump lookup cache
	- the physical-to-tb lookup hash table
	- the global page table

	The global page table (l1_map) which provides a multi-level look-up
	for PageDesc structures which contain pointers to the start of a
	linked list of all Translation Blocks in that page (see page_next).

	Both the jump patching and the page cache involve linked lists that
	the invalidated TranslationBlock needs to be removed from.

	DESIGN REQUIREMENT: Safely handle invalidation of TBs
	- safely patch/revert direct jumps
	- remove central PageDesc lookup entries
	- ensure lookup caches/hashes are safely updated

	(Current solution)

	The direct jump themselves are updated atomically by the TCG
	tb_set_jmp_target() code. Modification to the linked lists that allow
	searching for linked pages are done under the protection of tb->jmp_lock,
	where tb is the destination block of a jump. Each origin block keeps a
	pointer to its destinations so that the appropriate lock can be acquired before
	iterating over a jump list.

	The global page table is a lockless radix tree; cmpxchg is used
	to atomically insert new elements.

	The lookup caches are updated atomically and the lookup hash uses QHT
	which is designed for concurrent safe lookup.

	Parallel code generation is supported. QHT is used at insertion time
	as the synchronization point across threads, thereby ensuring that we only
	keep track of a single TranslationBlock for each guest code block.

	Memory maps and TLBs
	--------------------

	The memory handling code is fairly critical to the speed of memory
	access in the emulated system. The SoftMMU code is designed so the
	hot-path can be handled entirely within translated code. This is
	handled with a per-vCPU TLB structure which once populated will allow
	a series of accesses to the page to occur without exiting the
	translated code. It is possible to set flags in the TLB address which
	will ensure the slow-path is taken for each access. This can be done
	to support:

	- Memory regions (dividing up access to PIO, MMIO and RAM)
	- Dirty page tracking (for code gen, SMC detection, migration and display)
	- Virtual TLB (for translating guest address->real address)

	When the TLB tables are updated by a vCPU thread other than their own
	we need to ensure it is done in a safe way so no inconsistent state is
	seen by the vCPU thread.

	Some operations require updating a number of vCPUs TLBs at the same
	time in a synchronised manner.

	DESIGN REQUIREMENTS:

	- TLB Flush All/Page
	- can be across-vCPUs
	- cross vCPU TLB flush may need other vCPU brought to halt
	- change may need to be visible to the calling vCPU immediately
	- TLB Flag Update
	- usually cross-vCPU
	- want change to be visible as soon as possible
	- TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
	- This is a per-vCPU table - by definition can't race
	- updated by its own thread when the slow-path is forced

	(Current solution)

	We have updated cputlb.c to defer operations when a cross-vCPU
	operation with async_run_on_cpu() which ensures each vCPU sees a
	coherent state when it next runs its work (in a few instructions
	time).

	A new set up operations (tlb_flush_*_all_cpus) take an additional flag
	which when set will force synchronisation by setting the source vCPUs
	work as "safe work" and exiting the cpu run loop. This ensure by the
	time execution restarts all flush operations have completed.

	TLB flag updates are all done atomically and are also protected by the
	corresponding page lock.

	(Known limitation)

	Not really a limitation but the wait mechanism is overly strict for
	some architectures which only need flushes completed by a barrier
	instruction. This could be a future optimisation.

	Emulated hardware state
	-----------------------

	Currently thanks to KVM work any access to IO memory is automatically
	protected by the global iothread mutex, also known as the BQL (Big
	Qemu Lock). Any IO region that doesn't use global mutex is expected to
	do its own locking.

	However IO memory isn't the only way emulated hardware state can be
	modified. Some architectures have model specific registers that
	trigger hardware emulation features. Generally any translation helper
	that needs to update more than a single vCPUs of state should take the
	BQL.

	As the BQL, or global iothread mutex is shared across the system we
	push the use of the lock as far down into the TCG code as possible to
	minimise contention.

	(Current solution)

	MMIO access automatically serialises hardware emulation by way of the
	BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
	and also defer the reset/startup of vCPUs to the vCPU context by way
	of async_run_on_cpu().

	Updates to interrupt state are also protected by the BQL as they can
	often be cross vCPU.

	Memory Consistency
	==================

	Between emulated guests and host systems there are a range of memory
	consistency models. Even emulating weakly ordered systems on strongly
	ordered hosts needs to ensure things like store-after-load re-ordering
	can be prevented when the guest wants to.

	Memory Barriers
	---------------

	Barriers (sometimes known as fences) provide a mechanism for software
	to enforce a particular ordering of memory operations from the point
	of view of external observers (e.g. another processor core). They can
	apply to any memory operations as well as just loads or stores.

	The Linux kernel has an excellent write-up on the various forms of
	memory barrier and the guarantees they can provide [1].

	Barriers are often wrapped around synchronisation primitives to
	provide explicit memory ordering semantics. However they can be used
	by themselves to provide safe lockless access by ensuring for example
	a change to a signal flag will only be visible once the changes to
	payload are.

	DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

	This would enforce a strong load/store ordering so all loads/stores
	complete at the memory barrier. On single-core non-SMP strongly
	ordered backends this could become a NOP.

	Aside from explicit standalone memory barrier instructions there are
	also implicit memory ordering semantics which comes with each guest
	memory access instruction. For example all x86 load/stores come with
	fairly strong guarantees of sequential consistency whereas Arm has
	special variants of load/store instructions that imply acquire/release
	semantics.

	In the case of a strongly ordered guest architecture being emulated on
	a weakly ordered host the scope for a heavy performance impact is
	quite high.

	DESIGN REQUIREMENTS: Be efficient with use of memory barriers
	- host systems with stronger implied guarantees can skip some barriers
	- merge consecutive barriers to the strongest one

	(Current solution)

	The system currently has a tcg_gen_mb() which will add memory barrier
	operations if code generation is being done in a parallel context. The
	tcg_optimize() function attempts to merge barriers up to their
	strongest form before any load/store operations. The solution was
	originally developed and tested for linux-user based systems. All
	backends have been converted to emit fences when required. So far the
	following front-ends have been updated to emit fences when required:

	- target-i386
	- target-arm
	- target-aarch64
	- target-alpha
	- target-mips

	Memory Control and Maintenance
	------------------------------

	This includes a class of instructions for controlling system cache
	behaviour. While QEMU doesn't model cache behaviour these instructions
	are often seen when code modification has taken place to ensure the
	changes take effect.

	Synchronisation Primitives
	--------------------------

	There are two broad types of synchronisation primitives found in
	modern ISAs: atomic instructions and exclusive regions.

	The first type offer a simple atomic instruction which will guarantee
	some sort of test and conditional store will be truly atomic w.r.t.
	other cores sharing access to the memory. The classic example is the
	x86 cmpxchg instruction.

	The second type offer a pair of load/store instructions which offer a
	guarantee that a region of memory has not been touched between the
	load and store instructions. An example of this is Arm's ldrex/strex
	pair where the strex instruction will return a flag indicating a
	successful store only if no other CPU has accessed the memory region
	since the ldrex.

	Traditionally TCG has generated a series of operations that work
	because they are within the context of a single translation block so
	will have completed before another CPU is scheduled. However with
	the ability to have multiple threads running to emulate multiple CPUs
	we will need to explicitly expose these semantics.

	DESIGN REQUIREMENTS:
	- Support classic atomic instructions
	- Support load/store exclusive (or load link/store conditional) pairs
	- Generic enough infrastructure to support all guest architectures
	CURRENT OPEN QUESTIONS:
	- How problematic is the ABA problem in general?

	(Current solution)

	The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
	can be used directly or combined to emulate other instructions like
	Arm's ldrex/strex instructions. While they are susceptible to the ABA
	problem so far common guests have not implemented patterns where
	this may be a problem - typically presenting a locking ABI which
	assumes cmpxchg like semantics.

	The code also includes a fall-back for cases where multi-threaded TCG
	ops can't work (e.g. guest atomic width > host atomic width). In this
	case an EXCP_ATOMIC exit occurs and the instruction is emulated with
	an exclusive lock which ensures all emulation is serialised.

	While the atomic helpers look good enough for now there may be a need
	to look at solutions that can more closely model the guest
	architectures semantics.

	==========

	[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt