As part of the overall scheduler development effort, Zircon is moving to a new fair scheduler as the primary scheduler for the system. This document discusses the properties of the scheduler and how to enable it for testing prior to roll-out.
The fair scheduler is disabled by default. The new scheduler is enabled at compile time by setting the GN build argument enable_fair_scheduler to true.
You can set this variable in your GN invocation like this:
gn gen build-zircon --args='enable_fair_scheduler=true'
The new scheduler includes detailed tracing instrumentation to analyze the behavior of the scheduler and its interaction with and impact on the competing workloads in the system. Detailed tracing is enabled at compile time by setting the GN build argument detailed_scheduler_tracing to true.
You can set this variable in your GN invocation like this:
gn gen build-zircon --args='enable_fair_scheduler=true detailed_scheduler_tracing=true'
Use the kernel:sched trace category to include the detailed scheduler information in your trace session. It's a good idea to also include the kernel:irq category, because interrupts can cause scheduler activity that might otherwise appear unconnected to other events.
ffx trace start --categories kernel:sched,kernel:irq,<other categories> --duration 4 --buffer-size 64
The detailed scheduler events are primarily duration and flow events. The events appear in Chromium Trace Viewer in the timelines labeled cpu-0 through cpu-N, where N is the number of CPUs in the system. These timelines represent per-CPU activity in the kernel, which includes interrupts and thread scheduling.
The fair scheduler emits duration events for scheduler operations, including waking threads from a wait queue (for example, wait_queue_wake_all) and thread yields via zx_thread_yield.
The fair scheduler also emits flow events that connect related scheduler activity, which may span different CPU timelines.
A NOTE ABOUT FLOW EVENTS: Sometimes displaying flow events is not enabled by default in Chromium Trace Viewer. Use the View Options menu in the upper right corner of the page and make sure the Flow events checkbox is checked.
You can also disable flow event display if there are too many flow events and rendering becomes too slow. Try zooming into a smaller region before enabling flow event display for better performance in very large traces.
Fair scheduling is a discipline that divides CPU bandwidth between competing threads, such that each receives a weighted proportion of the CPU over time. In this discipline, each thread is assigned a weight, which is somewhat similar to a priority in other scheduling disciplines. Threads receive CPU time in proportion to their weight, relative to the weights of other competing threads. This proportional bandwidth distribution has useful properties that make fair scheduling a good choice as the primary scheduling discipline in a general purpose operating system.
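For example, suppose three threads with weights 1, 2, and 3 compete on the same CPU (the weights here are illustrative, not defaults). Over a sufficiently long interval, each thread receives CPU time in proportion to its weight relative to the total:

total weight = 1 + 2 + 3 = 6
thread A (weight 1) receives 1/6 of the CPU
thread B (weight 2) receives 2/6 = 1/3 of the CPU
thread C (weight 3) receives 3/6 = 1/2 of the CPU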
Briefly, the properties of this proportional distribution include predictable sharing that is simple to reason about, freedom from starvation for every competing thread, and graceful degradation when the CPU is oversubscribed.
A NOTE ABOUT DEADLINES: While fair scheduling is appropriate for the vast majority of workloads, there are some tasks that require very specific timing and/or do not adapt well to overload conditions. For example, these workloads include low-latency audio / graphics, high-frequency sensors, and high-rate / low-latency networking. These specialized tasks are better served with a deadline scheduler, which is planned for later in the Zircon scheduler development cycle.
The Zircon fair scheduler is based primarily on the Weighted Fair Queuing (WFQ) discipline, with insights from other similar queuing and scheduling disciplines. Adopting aspects of the Worst-Case Fair Weighted Fair Queuing (WF2Q) discipline, a modification of WFQ, is planned to improve control over the trade-off between latency and throughput.
The following subsections outline the algorithm as implemented in Zircon. From here on, “fair scheduler” and “Zircon fair scheduler” are used interchangeably.
One of the primary jobs of the scheduler is to decide the order in which to execute competing threads on the CPU. The fair scheduler makes these decisions separately on each CPU. Essentially, each CPU runs a separate instance of the scheduler and manages its own run queue.
In this approach, a thread may compete on only one CPU at a time. A thread can be in one of three states: ready, running, or blocked (other states are not relevant to this discussion). For each CPU, at most one thread is in the running state at any time: this thread executes on the CPU while all other competing threads await execution in the ready state; blocked threads are not in competition. The threads in the ready state are enqueued in the CPU's run queue; the order of threads in the run queue determines which thread runs next.
The fair scheduler, unlike O(1) scheduling disciplines such as priority round-robin (RR), uses an ordering criterion to compare and order threads in the run queue. The run queue is implemented using a balanced binary tree, which means that scheduling decisions generally cost O(log n) to perform. While this is more expensive than an O(1) scheduler, the result is a near-optimal worst-case delay bound (queuing time) for all competing threads.
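As a rough illustration, the following C++ sketch (not the Zircon implementation; the names and types are made up for this example) orders ready threads by virtual finish time in a self-balancing tree, giving O(log n) insertion and removal:

#include <cstdint>
#include <set>

// A ready thread, keyed by its virtual finish time f[i].
struct ReadyThread {
  uint64_t virtual_finish_ns;  // f[i]: position in the virtual timeline
  uint64_t weight;             // w[i]
};

// Order the run queue by ascending virtual finish time.
struct ByVirtualFinish {
  bool operator()(const ReadyThread& a, const ReadyThread& b) const {
    return a.virtual_finish_ns < b.virtual_finish_ns;
  }
};

// std::multiset is typically a red-black tree: insert and erase are O(log n),
// and the next thread to run is the minimum element (begin()).
using RunQueue = std::multiset<ReadyThread, ByVirtualFinish>;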
Two concepts are used to order threads in the run queue: virtual timeline and per-thread normalized rate. The virtual timeline tracks when each thread in the run queue would finish a normalized time slice if it ran to completion. A normalized time slice is proportional to the thread's normalized rate, which in turn is inversely proportional to the thread's weight. Threads are ordered in the run queue by ascending finish time in the virtual timeline.
The inverse proportional relationship to weight causes higher weighted threads to be inserted closer to the front of the run queue than lower weighted threads with similar arrival times. However, this is bounded over time: the longer a thread waits in the run queue, the less likely it is that a newly arriving thread, however highly weighted, will be inserted ahead of it. This property is key to the fairness of the scheduler.
The following sections define the scheduler in more precise terms.
For each thread P[i] we define the following state:
w[i] : the weight of thread P[i]
s[i] : the virtual start time of thread P[i]
f[i] : the virtual finish time of thread P[i]
t[i] : the time slice of thread P[i]
For each CPU C[j] we define the following state:
W[j] : the total weight of the threads competing on CPU C[j]
n[j] : the number of threads competing on CPU C[j]
p[j] : the scheduling period of CPU C[j]
When a thread enters competition for a CPU, its weight is added to the CPU's total weight. Likewise, when a thread blocks or is migrated to another CPU, the thread's weight is subtracted from the CPU's total weight. The total includes the weights of the ready threads and the current running thread.
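For example, with illustrative weights: if threads with weights 1, 2, and 3 are competing on CPU C[j] (one running, two ready), then W[j] = 1 + 2 + 3 = 6. If the weight-2 thread blocks, W[j] drops to 4; when it unblocks on the same CPU, W[j] returns to 6.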
We define the following tunable state, which may either be global or per-CPU:
L : the target latency, the default scheduling period when the CPU is not oversubscribed
M : the minimum granularity, the smallest time slice a competing thread receives
The following subsections define the relationships between the key scheduler variables.
The scheduling period controls the size of time slices. When few threads are competing, the scheduling period defaults to the target latency. This results in larger time slices and fewer preemptions, improving throughput and potentially reducing power consumption. When the number of competing threads grows too large, the scheduling period stretches so that each thread still receives at least a minimum granularity time slice.
Let N be the maximum number of competing threads before period stretching.
N = floor(L / M)
p[j] = M * n[j]  if n[j] > N
p[j] = L         otherwise
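For example, with illustrative tunables L = 16ms and M = 1ms (not necessarily the actual defaults):

N = floor(16ms / 1ms) = 16
n[j] = 4  : p[j] = L = 16ms (no stretching; larger time slices)
n[j] = 20 : n[j] > N, so p[j] = 1ms * 20 = 20ms (the period stretches to preserve the minimum granularity)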
When a thread enters the run queue, either by newly joining the competition for the CPU or by completing a time slice, the thread's virtual start and finish times are computed. As the current fair scheduler is based on WFQ, the finish time is used to select the position for the thread in the run queue relative to other threads. Later, when WF2Q is implemented, both the start and finish times will be considered.
Some WFQ implementations use the thread's actual time slice to calculate the normalized time slice for the timeline. However, the actual time slice depends on the total weight of the CPU (see below), a value that changes as threads enter competition. The Zircon fair scheduler instead uses the scheduling period as an idealized uniform time slice for the virtual timeline, because its value changes less dramatically. Using a uniform value for all threads avoids skewing the virtual timeline unfairly in favor of threads that join early.
Let T be the system time of CPU C[j] when thread P[i] enters the run queue.
s[i] = T
f[i] = s[i] + p[j] / w[i]
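For example, suppose a thread P[i] with weight w[i] = 2 enters the run queue at system time T = 100ms while the scheduling period is p[j] = 16ms (illustrative values):

s[i] = 100ms
f[i] = 100ms + 16ms / 2 = 108ms

A thread with weight 1 arriving at the same time would receive a virtual finish time of 116ms, so the weight-2 thread is ordered ahead of it in the run queue.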
When a thread is selected to run, its time slice is calculated based on its relative rate and the scheduling period.
Let g be the integer number of minimum granularity units M in the current scheduling period p[j] of CPU C[j].
Let R be the relative rate of thread P[i].
g = floor(p[j] / M)
R = w[i] / W[j]
t[i] = ceil(g * R) * M
This definition ensures that t[i] is an integer multiple of the minimum granularity M, while remaining approximately proportional to the relative rate of the thread.
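For example, with p[j] = 20ms, M = 1ms, w[i] = 3, and W[j] = 9 (illustrative values):

g = floor(20ms / 1ms) = 20
R = 3 / 9 = 1/3
t[i] = ceil(20 * 1/3) * 1ms = ceil(6.67) * 1ms = 7ms

Rounding up to a whole number of granularity units keeps t[i] an integer multiple of M while staying close to the thread's proportional share of the period.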
Yielding immediately expires the thread's time slice and returns it to the run queue. This behavior is similar to yielding in O(1) scheduling: the yielding thread is guaranteed to queue behind threads of the same or greater weight. However, the yielding thread may or may not skip ahead of lower weight threads, depending on how long other threads have been waiting to run.