blob: ac937a3cdf7276d13ccbfe060bfa2d7cb8070552 [file] [log] [blame] [view]
# Zircon Fair Scheduler
## Introduction
As part of the overall scheduler development effort, Zircon is moving to a new
fair scheduler as the primary scheduler for the system. This document discusses
the properties of the scheduler and how to enable it for testing prior to
## Enabling the Fair Scheduler
The fair scheduler is disabled by default. The new scheduler is enabled at
compile time by setting the GN build argument `enable_fair_scheduler` to true.
You can set this variable in your GN invocation like this:
gn gen build-zircon --args='enable_fair_scheduler=true'
## Detailed Scheduler Tracing
The new scheduler includes detailed tracing instrumentation to analyze the
behavior of the scheduler and its interaction with and impact on the competing
workloads in the system. Detailed tracing is enabled at compile time by setting
the GN build argument `detailed_scheduler_tracing` to true.
You can set this variable in your GN invocation like this:
gn gen build-zircon --args='enable_fair_scheduler=true detailed_scheduler_tracing=true'
Use the `kernel:sched` trace category to include the detailed scheduler
information in your trace session. It's a good idea to also include the
`kernel:irq` category because interrupts can cause scheduler activity that might
otherwise appear unconnected to other events.
fx traceutil record -categories kernel:sched,kernel:irq,<other categories> -stream -duration 4s -buffer-size 64
### Summary of Scheduler Events
The detailed scheduler events are primarily duration and flow events. The
events appear in Chromium Trace Viewer in the timelines labeled `cpu-0`
through `cpu-N`, where `N` is the number of CPUs in the system. These timelines
represent per-CPU activity in the kernel, which includes interrupts and
thread scheduling.
The fair scheduler emits duration events including the following:
* **sched_block**: The active thread blocks on a Zircon object, futex,
kernel-internal lock.
* **sched_unblock**: The active thread unblocks another thread due to
interacting with a Zircon object, futex, or kernel-internal lock.
* **sched_unblock_list**: A variation of **sched_block** when the action of the
active thread may wake one or more threads at once (e.g.
* **sched_yield**: The active thread called `zx_thread_yield`.
* **sched_preempt**: An interrupt requests re-evaluation of the run queue. This
is due to either the time slice timer expiring, another CPU requesting a
reschedule after modifying the CPU's run queue, or a hardware interrupt
handler waking up a thread on the CPU.
* **sched_reschedule**: A kernel operation changed the run queue of the current
CPU and a different thread _might_ need to run.
The fair scheduler emits flow events including the following:
* **sched_latency**: A flow that connects the point in time right after a thread
enters the run queue to the point in time right before the (potentially
different) target CPU context switches to the thread. This flow event is
useful for visualizing cross-CPU scheduling activity and observing the
runnable-to-running scheduler latency of a thread at any point in time.
A NOTE ABOUT FLOW EVENTS: Sometimes displaying flow events is not enabled by
default in Chromium Trace Viewer. Use the `View Options` menu in the upper right
corner of the page and make sure the `Flow events` checkbox is checked.
You can also disable flow events display if there are too many and the rendering
becomes too slow. Try zooming into a smaller region before enabling flow event
display for better performance in very large traces.
## Fair Scheduling Overview
Fair scheduling is a discipline that divides CPU bandwidth between competing
threads, such that each receives a weighted proportion of the CPU over time.
In this discipline, each thread is assigned a weight, which is somewhat similar
to a priority in other scheduling disciplines. Threads receive CPU time in
proportion to their weight, relative to the weights of other competing threads.
This proportional bandwidth distribution has useful properties that make fair
scheduling a good choice as the primary scheduling discipline in a general
purpose operating system.
Briefly, these properties are:
* **Intuitive bandwidth allocation mechanism**: A thread with twice the weight
of another thread will receive approximately twice the CPU time, relative to
the other thread over time. Whereas, a thread with the same weight as another
will receive approximately the same CPU time, relative to the other thread
over time.
* **Starvation free for all threads**: Proportional bandwidth division ensures
that all competing threads receive CPU time in a timely manner, regardless of
how low the thread weight is relative to other threads. Notably, this property
prevents unbounded priority inversion.
* **Fair response to system overload**: When the system is overloaded, all
threads share proportionally in the slowdown. Solving overload conditions is
often simpler than managing complex priority interactions required in other
scheduling disciplines.
* **Stability under evolving demands**: Adapts well to a wide range of workloads
with minimal intervention compared to other scheduling disciplines.
A NOTE ABOUT DEADLINES: While fair scheduling is appropriate for the vast
majority of workloads, there are some tasks that require very specific timing
and/or do not adapt well to overload conditions. For example, these workloads
include low-latency audio / graphics, high-frequency sensors, and high-rate /
low-latency networking. These specialized tasks are better served with a
deadline scheduler, which is planned for later in the Zircon scheduler
development cycle.
## Fair Scheduling in Zircon
The Zircon fair scheduler is based primarily on the Weighted Fair Queuing (WFQ)
discipline, with insights from other similar queuing and scheduling disciplines.
Adopting aspects of the Worst-Case Fair Weighted Fair Queuing (WF2Q) discipline,
a modification of WFQ, is planned to improve control over tuning of latency
versus throughput.
The following subsections outline the algorithm as implemented in Zircon. From
here on, "fair scheduler" and "Zircon fair scheduler" are used interchangeably.
### Ordering Thread Execution
One of the primary jobs of the scheduler is to decide which order to execute
competing threads on the CPU. The fair scheduler makes these decisions
separately on each CPU. Essentially, each CPU runs a separate instance of the
scheduler and manages its own run queue.
In this approach, a thread may compete only on one CPU at a time. A thread can
be in one of three states: _ready_, _running_ or _blocked_ (other states are not
relevant to this discussion.) For each CPU, at most one thread is in the
_running_ state at any time: this thread executes on the CPU, all other
competing threads await execution in the _ready_ state, while blocked threads
are not in competition. The threads in the _ready_ state are enqueued in the
CPU's run queue; the order of threads in the run queue determines which thread
runs next.
The fair scheduler, unlike **O(1)** scheduling disciplines such as priority
round-robin (RR), uses an ordering criteria to compare and order threads in the
run queue. This is implemented using a balanced binary tree, and means that
scheduling decisions generally cost **O(log n)** to perform. While this is more
expensive than an **O(1)** scheduler, the result is a near-optimal worst case
delay bound (queuing time) for all competing threads.
### Ordering Criteria
Two concepts are used to order threads in the run queue: _virtual timeline_ and
per-thread _normalized rate_. The _virtual timeline_ tracks when each thread in
the run queue would finish a _normalized time slice_ if it ran to completion.
A _normalized time slice_ is proportional to the thread's _normalized rate_,
which in turn is inversely proportional to the thread's weight. Threads are
ordered in the run queue by ascending _finish time_ in the _virtual timeline_.
The inverse proportional relationship to weight causes higher weighed threads
to be inserted closer to the front of the run queue than lower weighted threads
with similar arrival times. However, this is bounded over time: the longer a
thread waits in the run queue, the less likely a newly arriving thread, however
highly weighted, will be inserted before it. This property is key to the
fairness of the scheduler.
The following sections define the scheduler in more precise terms.
### Per-Thread Scheduling State
For each thread **P[i]** we define the following state:
* Weight **w[i]**: Real number representing the relative weight of the thread.
* Start Time **s[i]**: The start time of the thread in the CPU's virtual
* Finish Time **f[i]**: The finish time of the thread in the CPU's virtual
* Time Slice **t[i]**: The size of the time slice for the current period.
### Per-CPU Scheduling State
For each CPU **C[j]** we define the following state:
* Number of Threads **n[j]**: The number of runnable threads competing on the
* Scheduling Period **p[j]**: The period in which all competing threads on the
CPU execute approximately once.
* Total Weight **W[j]**: The sum of the weights of the threads competing on the
When a thread enters competition for a CPU, its weight is added to the CPU's
total weight. Likewise, when a thread blocks or is migrated to another CPU the
thread's weight is subtracted from the CPU's total weight. The total includes
the weights of the _ready_ threads and the current _running_ thread.
### Tunable State
We define the following tunable state, which may either be global or per-CPU:
* Minimum Granularity **M**: The smallest time slice allocated to any thread.
* Target Latency **L**: The target scheduling period for the CPU unless there
are too many threads to give each thread as least one minimum granularity time
### Definitions
We define the following relationships for the key scheduler variables:
#### Scheduling Period
The scheduling period controls the size of time slices. When there are few
threads competing, the scheduling period defaults to the _target latency_. This
results in larger time slices and fewer preemptions, improving throughput and
potentially power consumption. When the number of threads is too large the
scheduling period stretches such that each task receives at least the _minimum
granularity_ time slice.
Let **N** be the maximum number of competing threads before period stretching.
**N** = floor(**L** / **M**)
**p[j]** = **n[j]** > **N** --> **M** * **n[j]**, **L**
#### Virtual Timeline
When a thread enters the run queue, either by newly joining the competition for
the CPU or completing a time slice, the thread's _virtual_ start and finish time
are computed. As the current fair scheduler is based on WFQ, the finish time is
used to select the position for the thread in the run queue relative to other
threads. Later, when WF2Q is implemented, both the start and finish time are
Some WFQ implementations use the thread's actual time slice to calculate the
_normalized time slice_ for the timeline. However, the actual time slice depends
on the total weight of the CPU (see below), a value that changes as threads enter
competition. The Zircon fair scheduler instead uses the scheduling period as an
idealized uniform time slice for the _virtual timeline_, because its value
changes less dramatically. Using a uniform value for all threads avoids skewing
the _virtual timeline_ unfairly in favor threads that join early.
Let **T** be the system time of CPU **C[j]** when thread **P[i]** enters the run
**s[i]** = **T**
**f[i]** = **s[i]** + **p[j]** / **w[i]**
### Time Slice
When a thread is selected to run, its time slice is calculated based on its
relative rate and the scheduling period.
Let **g** be the integer number of _minimum granularity_ units **M** in the
current _scheduling period_ **p[j]** of CPU **C[j]**.
Let **R** be the relative rate of thread **P[i]**.
**g** = floor(**p[j]** / **M**)
**R** = **w[i]** / **W[j]**
**t[i]** = ceil(**g** * **R**) * **M**
This definition ensures that **t[i]** is an integer multiple of the _minimum
granularity_ **M**, while remaining approximately proportional to the relative
rate of the thread.
### Yield
Yielding immediately expires the thread's time slice and returns it to the run
queue. This behavior is similar to yielding in **O(1)** scheduling: the yielding
thread is guaranteed to queue behind threads of the same or greater weight.
However, the yielding thread may or may not skip ahead of lower weight threads,
depending on how long other threads have been waiting to run.