Zircon Scheduling
=================

Background
==========

The primary responsibility of any scheduler is to share the limited
resource of processor time between all threads that wish to use it. In a
general purpose operating system, it tries to do so in a fair way,
ensuring that all threads are allowed to make some progress.

Our scheduler is an evolution of LK’s scheduler. As such it
started as a minimal scheduler implementation and was extended to meet
our needs as the project grew.

Design
======

#### Overview

In essence there is a scheduler running on each logical CPU in the machine.
These schedulers run independently and use IPIs (Inter-Processor
Interrupts) to coordinate. However, each CPU is responsible for
scheduling the threads that are running on it. See [*CPU
Assignment*](#cpu-assignment-and-migration) below for how we decide
which CPU a thread is on, and how/when it migrates.

Each CPU has its own set of priority queues, one for each priority level
in the system (currently 32). Note that these are FIFO queues, not the data
structure known as a priority queue. Each queue holds an ordered list of
runnable threads awaiting execution. When it is time for a new thread to run,
the scheduler simply looks at the highest numbered queue that contains a thread,
pops the head off of that queue, and runs that thread. See
[*Priority Management*](#priority-management) below for more details
about how it decides which thread should be in which queue. If there are no
threads in the queues to run, it will instead run the idle thread; see [*Realtime
and Idle Threads*](#realtime-and-idle-threads) below for more details.
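
As a minimal sketch of that selection step, this is roughly what scanning the
per-CPU queues looks like. The types and names below are illustrative only and
do not match the kernel's actual data structures:

```cpp
#include <array>
#include <list>

// Illustrative stand-ins; the real kernel types differ.
struct Thread {
  int effective_priority = 0;  // 0 (lowest) .. 31 (highest)
};

constexpr int kNumPriorities = 32;

struct RunQueues {
  // One FIFO queue of runnable threads per priority level.
  std::array<std::list<Thread*>, kNumPriorities> queues;

  // Scan from the highest numbered queue down and pop the head of the first
  // non-empty queue. Returns nullptr when every queue is empty, which is when
  // the per-CPU idle thread would run instead.
  Thread* pick_next_thread() {
    for (int prio = kNumPriorities - 1; prio >= 0; --prio) {
      if (!queues[prio].empty()) {
        Thread* t = queues[prio].front();
        queues[prio].pop_front();
        return t;
      }
    }
    return nullptr;  // no runnable threads: run the idle thread
  }
};
```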

Each thread is assigned the same timeslice size (THREAD_INITIAL_TIME_SLICE)
when it is picked to start running. If it uses its whole timeslice it will be
reinserted at the end of the appropriate priority queue. However, if it has
some of its timeslice remaining from a previous run it will be inserted at the
head of the priority queue so it will be able to resume as quickly as possible.
When it is picked back up again it will only run for the remainder of its
previous timeslice.

When the scheduler selects a new thread from the priority queue it sets
the CPU's preemption timer for either a full timeslice, or the remainder of the
previous timeslice. When that timer fires the scheduler will stop execution on
that thread, add it to the appropriate queue, select another thread, and start
over again.
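
A small sketch of that timeslice bookkeeping follows. The constant value and
the names are placeholders (the real value of THREAD_INITIAL_TIME_SLICE is not
shown here):

```cpp
#include <cstdint>
#include <deque>

// Placeholder value standing in for THREAD_INITIAL_TIME_SLICE.
constexpr int64_t kInitialTimesliceNs = 10'000'000;

struct Thread {
  int64_t remaining_timeslice_ns = 0;
};

struct PriorityQueue {
  std::deque<Thread*> fifo;  // one of the per-CPU FIFO priority queues
};

// Called when the scheduler picks |t| to run: the CPU's preemption timer is
// armed for a full timeslice, or for the remainder left from a previous run.
int64_t arm_preemption_timer(Thread& t) {
  if (t.remaining_timeslice_ns <= 0) {
    t.remaining_timeslice_ns = kInitialTimesliceNs;
  }
  return t.remaining_timeslice_ns;  // relative timer deadline
}

// Called when |t| stops running but is still runnable, after running for
// |ran_for_ns| nanoseconds.
void reinsert(PriorityQueue& q, Thread* t, int64_t ran_for_ns) {
  t->remaining_timeslice_ns -= ran_for_ns;
  if (t->remaining_timeslice_ns <= 0) {
    t->remaining_timeslice_ns = 0;  // gets a fresh slice next time
    q.fifo.push_back(t);            // used its whole slice: end of the queue
  } else {
    q.fifo.push_front(t);           // slice left over: head, to resume quickly
  }
}
```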

If a thread blocks waiting for a shared resource, it is taken out of
its priority queue and placed in a wait queue for the shared resource.
When it is unblocked it will be reinserted in the appropriate priority
queue of an eligible CPU ([*CPU
Assignment*](#cpu-assignment-and-migration)), and if it had timeslice
remaining it will be added to the front of the queue for expedited handling.
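
A hedged sketch of that block/unblock path, again with illustrative types; in
practice the wait queue belongs to the shared resource, and the choice of
eligible CPU is the selection logic described later in this document:

```cpp
#include <cstdint>
#include <deque>

// Illustrative types only.
struct Thread {
  int64_t remaining_timeslice_ns = 0;
};
struct WaitQueue { std::deque<Thread*> waiters; };
struct RunQueue  { std::deque<Thread*> fifo; };

// Blocking: the running thread leaves the scheduler's priority queues and
// parks on the wait queue associated with the shared resource.
void block_on(WaitQueue& wq, Thread* current) {
  wq.waiters.push_back(current);
}

// Unblocking: the thread is reinserted into a priority queue on an eligible
// CPU; leftover timeslice earns it the front of the queue for expedited
// handling.
void unblock_one(WaitQueue& wq, RunQueue& eligible_cpu_queue) {
  if (wq.waiters.empty()) {
    return;
  }
  Thread* t = wq.waiters.front();
  wq.waiters.pop_front();
  if (t->remaining_timeslice_ns > 0) {
    eligible_cpu_queue.fifo.push_front(t);
  } else {
    eligible_cpu_queue.fifo.push_back(t);
  }
}
```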

#### Priority Management

Three different factors are used to determine the effective priority of a
thread; the effective priority determines which priority queue the thread
will be placed in.

The first factor is the base priority, which is simply the thread’s
requested priority. There are currently 32 levels with 0 being the
lowest and 31 being the highest.

The second factor is the priority boost. This is a value bounded between
\[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ\] used to offset the base priority.
It is modified in the following cases:

-   When a thread is unblocked, after waiting on a shared resource or
    sleeping, it is given a one-point boost.

-   When a thread yields (volunteers to give up control), or volunteers
    to reschedule, its boost is decremented by one but is capped at 0
    (won’t go negative).

-   When a thread is preempted and has used up its entire timeslice, its
    boost is decremented by one but is able to go negative.
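
These three boost adjustments can be summarized in a small sketch. The clamp
value used for MAX_PRIORITY_ADJ below is a stand-in, and the function names
are illustrative:

```cpp
#include <algorithm>

// Stand-in for MAX_PRIORITY_ADJ; the real value is defined in the kernel.
constexpr int kMaxPriorityAdj = 4;

struct Thread {
  int boost = 0;  // always kept within [-kMaxPriorityAdj, kMaxPriorityAdj]
};

// Unblocked after waiting on a shared resource or sleeping: one point boost.
void boost_on_unblock(Thread& t) {
  t.boost = std::min(t.boost + 1, kMaxPriorityAdj);
}

// Yielded or voluntarily rescheduled: lose a point, but never go negative.
void boost_on_yield(Thread& t) {
  t.boost = std::max(t.boost - 1, 0);
}

// Preempted after using the entire timeslice: lose a point, may go negative.
void boost_on_preempt(Thread& t) {
  t.boost = std::max(t.boost - 1, -kMaxPriorityAdj);
}
```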

The third factor is its inherited priority. If the thread is in control
of a shared resource and it is blocking another thread of a higher
priority, then it is given a temporary boost up to that thread’s priority
so that it can finish quickly and allow the higher priority thread to
resume.

The effective priority of the thread is either the inherited priority,
if it has one, or the base priority plus its boost. When this priority
changes, due to any of the factors changing, the scheduler will move the
thread to a new priority queue and reschedule the CPU, allowing the thread
to take control if it is now the highest priority task, or to relinquish
control if it is no longer the highest.
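
Putting the three factors together, the effective priority computation can be
sketched as follows. The field names are illustrative, and the clamping to the
valid priority range is an assumption implied by the bounds above:

```cpp
#include <algorithm>
#include <optional>

constexpr int kNumPriorities = 32;  // priority levels 0 (lowest) .. 31 (highest)

// Illustrative thread state; field names are not the kernel's.
struct Thread {
  int base_priority = 16;                 // the thread's requested priority
  int boost = 0;                          // see the boost rules above
  std::optional<int> inherited_priority;  // set while blocking a higher priority thread
};

// The effective priority decides which of the 32 FIFO queues the thread
// occupies: the inherited priority if there is one, otherwise the base
// priority offset by the boost.
int effective_priority(const Thread& t) {
  if (t.inherited_priority.has_value()) {
    return *t.inherited_priority;
  }
  return std::clamp(t.base_priority + t.boost, 0, kNumPriorities - 1);
}
```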

The intent of this system is to ensure that interactive threads are
serviced quickly. These are usually the threads that interact directly
with the user and cause user-perceivable latency. These threads usually
do little work and spend most of their time blocked awaiting another
user event. So they get the priority boost from unblocking, while
background threads that do most of the processing receive the priority
penalty for using their entire timeslice.

#### CPU Assignment and Migration

Threads are able to request which CPUs they wish to run on using a
CPU affinity mask, a 32-bit mask where 0b001 is CPU 1, 0b100 is CPU 3,
and 0b101 is either CPU 1 or CPU 3. This mask is usually respected, but
if the CPUs it requests are all inactive it will be assigned to another
CPU. Also notable, if a thread is “pinned” to a CPU (that is, its mask
contains only one CPU) and that CPU becomes inactive, the thread will sit
unserviced until that CPU becomes active again. See [*CPU
Activation*](#cpu-activation) below for details.
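
A brief sketch of how such a mask can be interpreted; the alias and helper
names here are hypothetical, not the kernel's API:

```cpp
#include <cstdint>

using cpu_mask_t = uint32_t;  // hypothetical alias for the 32-bit affinity mask

// Using the numbering from the text above: bit 0 selects CPU 1 and bit 2
// selects CPU 3.
constexpr cpu_mask_t kCpu1    = 0b001;
constexpr cpu_mask_t kCpu3    = 0b100;
constexpr cpu_mask_t kCpu1or3 = kCpu1 | kCpu3;  // 0b101

// True if the mask allows the CPU represented by |cpu_bit|.
bool allowed(cpu_mask_t mask, int cpu_bit) {
  return ((mask >> cpu_bit) & 1u) != 0;
}

// A thread is "pinned" when exactly one bit of its mask is set.
bool is_pinned(cpu_mask_t mask) {
  return mask != 0 && (mask & (mask - 1)) == 0;
}
```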

When selecting a CPU for a thread the scheduler will choose, in order
(a sketch of this selection follows the list):

1.  The CPU doing the selection, if it is **idle** and in the affinity mask.

2.  The CPU the thread last ran on, if it is **idle** and in the affinity mask.

3.  Any **idle** CPU in the affinity mask.

4.  The CPU the thread last ran on, if it is active and in the affinity mask.

5.  The CPU doing the selection, if it is the only one in the affinity mask or
    none of the CPUs in the mask are active.

6.  Any active CPU in the affinity mask.
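
The sketch below walks that order directly. It is an illustration under
assumed types (a CpuState array indexed by CPU bit position), not the kernel's
implementation:

```cpp
#include <cstdint>
#include <vector>

using cpu_mask_t = uint32_t;

// Illustrative per-CPU state; not the kernel's representation.
struct CpuState {
  bool active = true;
  bool idle = false;
};

constexpr cpu_mask_t bit(int cpu) { return cpu_mask_t{1} << cpu; }

// |curr_cpu| is the CPU running the scheduler; |last_cpu| is where the thread
// last ran, or -1 if it has not run yet. Returns the chosen CPU index.
int select_cpu(const std::vector<CpuState>& cpus, cpu_mask_t affinity,
               int curr_cpu, int last_cpu) {
  const int n = static_cast<int>(cpus.size());
  auto in_mask = [&](int cpu) { return (affinity & bit(cpu)) != 0; };
  auto is_idle = [&](int cpu) { return cpus[cpu].active && cpus[cpu].idle; };
  auto is_active = [&](int cpu) { return cpus[cpu].active; };

  // 1. The selecting CPU, if idle and in the mask.
  if (in_mask(curr_cpu) && is_idle(curr_cpu)) return curr_cpu;
  // 2. The CPU the thread last ran on, if idle and in the mask.
  if (last_cpu >= 0 && in_mask(last_cpu) && is_idle(last_cpu)) return last_cpu;
  // 3. Any idle CPU in the mask.
  for (int cpu = 0; cpu < n; ++cpu)
    if (in_mask(cpu) && is_idle(cpu)) return cpu;
  // 4. The CPU the thread last ran on, if active and in the mask.
  if (last_cpu >= 0 && in_mask(last_cpu) && is_active(last_cpu)) return last_cpu;
  // 5. The selecting CPU, if it is the only one in the mask or no CPU in the
  //    mask is active.
  bool any_active_in_mask = false;
  for (int cpu = 0; cpu < n; ++cpu)
    if (in_mask(cpu) && is_active(cpu)) { any_active_in_mask = true; break; }
  if (affinity == bit(curr_cpu) || !any_active_in_mask) return curr_cpu;
  // 6. Any active CPU in the mask.
  for (int cpu = 0; cpu < n; ++cpu)
    if (in_mask(cpu) && is_active(cpu)) return cpu;
  return curr_cpu;  // unreachable given the checks above
}
```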

If the thread is running on a CPU not in its affinity mask (due to case
5 above) the scheduler will try to rectify this every time the thread is
preempted, yields, or voluntarily reschedules. Also, if the thread
changes its affinity mask the scheduler may migrate it.

Every time a thread comes back from waiting on a shared resource or
sleeping and needs to be assigned a priority queue, the scheduler will
re-evaluate its CPU choice for the thread, using the above logic, and
may move it.

#### CPU Activation

When a CPU is being deactivated, that is shut down and removed from the
system, the scheduler will transition all running threads onto other
CPUs. The only exception is threads that are “pinned”, that is they have
only the deactivating CPU in their affinity mask; these threads are put
back into the run queue, where they will sit unserviced until the CPU is
reactivated.
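
One way to picture that deactivation step, using simplified stand-in
structures (the run queue here is flattened to a single list, and the handoff
target would really come from the CPU selection logic above):

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Simplified stand-ins; the kernel keeps 32 queues per CPU, not one.
struct Thread {
  uint32_t affinity_mask = 0;
};
struct Cpu {
  uint32_t mask_bit = 0;  // this CPU's single bit in affinity masks
  bool active = true;
  std::deque<Thread*> run_queue;
};

// On deactivation, move every thread that can run elsewhere onto another
// active CPU; pinned threads stay queued, unserviced, until reactivation.
void deactivate(Cpu& cpu, std::vector<Cpu*>& others) {
  cpu.active = false;
  std::deque<Thread*> pinned;
  for (Thread* t : cpu.run_queue) {
    if (t->affinity_mask == cpu.mask_bit) {
      pinned.push_back(t);  // pinned here: waits until this CPU comes back
      continue;
    }
    Cpu* dest = nullptr;  // real code would use the CPU selection logic above
    for (Cpu* other : others) {
      if (other->active) { dest = other; break; }
    }
    if (dest != nullptr) {
      dest->run_queue.push_back(t);
    } else {
      pinned.push_back(t);  // nowhere to go; keep it queued here
    }
  }
  cpu.run_queue = std::move(pinned);
}
```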

When a CPU is reactivated it will service the waiting pinned threads, and
threads that are running on CPUs outside their affinity masks should be
migrated back fairly quickly by their CPUs' schedulers due to the above
rules. There is no active rebalancing of threads to the newly awakened
CPU, but since it should be idle more often, it should see some migration
due to the logic laid out above in [*CPU Assignment and
Migration*](#cpu-assignment-and-migration).

#### Realtime and Idle Threads

These are special threads that are treated a little differently.

The idle thread runs when no other threads are runnable. There is one on
each CPU and it lives outside of the priority queues, but effectively in
a priority queue of -1. It is used to track idle time and can be used by
platform implementations for a low power wait mode.

Realtime threads (marked with THREAD_FLAG_REAL_TIME) are allowed to run without
preemption and will run until they block, yield, or manually reschedule.
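
As a hedged sketch of what that means on the preemption-timer path (the flag's
bit value below is a placeholder, not the kernel's definition):

```cpp
#include <cstdint>

// Placeholder bit value standing in for THREAD_FLAG_REAL_TIME.
constexpr uint32_t kThreadFlagRealTime = 1u << 0;

struct Thread {
  uint32_t flags = 0;
};

// A realtime thread is not preempted by the timer; it runs until it blocks,
// yields, or reschedules on its own.
bool preempt_on_timer(const Thread& t) {
  return (t.flags & kThreadFlagRealTime) == 0;
}
```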