Zircon Scheduling
=================

Background
==========

The primary responsibility of any scheduler is to share the limited
resource of processor time between all threads that wish to use it. In a
general purpose operating system, it tries to do so in a fair way,
ensuring that all threads are allowed to make some progress.

Our scheduler is an evolution of LK’s scheduler. As such it
started as a minimal scheduler implementation and was extended to meet
our needs as the project grew.

Design
======

#### Overview

In essence there is a scheduler running on each logical CPU in the machine.
These schedulers run independently and use IPIs (Inter-Processor
Interrupts) to coordinate. However, each CPU is responsible for
scheduling the threads that are running on it. See [*CPU
Assignment*](#cpu-assignment-and-migration) below for how we decide
which CPU a thread is on, and how/when it migrates.

Each CPU has its own set of priority queues, one for each priority level
in the system (currently 32). Note that these are FIFO queues, not the data
structure known as a priority queue. Each queue holds an ordered list of
runnable threads awaiting execution. When it is time for a new thread to run,
the scheduler simply looks at the highest numbered queue that contains a thread,
pops the head off of that queue, and runs that thread. See
[*Priority Management*](#priority-management) below for more details
about how it decides which thread should be in which queue. If there are no
threads in the queues to run, it will instead run the idle thread; see [Realtime
and idle threads](#realtime-and-idle-threads) below for more details.
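
As a minimal sketch of this selection step (the names `CpuScheduler`,
`run_queues`, and `PickNextThread` are illustrative placeholders, not the
kernel's actual identifiers):

```cpp
// Sketch only: hypothetical types and names, not the kernel's actual code.
#include <array>
#include <list>

struct Thread { /* thread state elided */ };

constexpr int kNumPriorities = 32;  // priority levels 0 (lowest) through 31 (highest)

struct CpuScheduler {
  // One FIFO queue per priority level, owned by a single CPU.
  std::array<std::list<Thread*>, kNumPriorities> run_queues;
  Thread* idle_thread = nullptr;  // runs when every queue is empty

  // Pick the next thread: pop the head of the highest-numbered non-empty queue.
  Thread* PickNextThread() {
    for (int prio = kNumPriorities - 1; prio >= 0; --prio) {
      if (!run_queues[prio].empty()) {
        Thread* next = run_queues[prio].front();
        run_queues[prio].pop_front();
        return next;
      }
    }
    return idle_thread;  // no runnable threads: fall back to the idle thread
  }
};
```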

Each thread is assigned the same timeslice size (THREAD_INITIAL_TIME_SLICE)
when it is picked to start running. If it uses its whole timeslice it will be
reinserted at the end of the appropriate priority queue. However, if it has
some of its timeslice remaining from a previous run it will be inserted at the
head of the priority queue so it will be able to resume as quickly as possible.
When it is picked back up again it will only run for the remainder of its
previous timeslice.

When the scheduler selects a new thread from the priority queue it sets
the CPU's preemption timer for either a full timeslice or the remainder of the
previous timeslice. When that timer fires the scheduler will stop executing
that thread, add it to the appropriate queue, select another thread, and start
over again.
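
The requeue and timer handling described above might look roughly like the
following sketch. The names and values (`kInitialTimeSlice` standing in for
THREAD_INITIAL_TIME_SLICE, `SetPreemptionTimer`, the 10 ms figure) are
assumptions for illustration only:

```cpp
// Sketch only: illustrative names and values, not the kernel's actual implementation.
#include <cstdint>
#include <list>

using Duration = int64_t;                           // nanoseconds
constexpr Duration kInitialTimeSlice = 10'000'000;  // stand-in for THREAD_INITIAL_TIME_SLICE

struct Thread {
  Duration remaining_time_slice = 0;
};

// Stand-in for arming this CPU's preemption timer.
void SetPreemptionTimer(Duration time_from_now);

// Reinserting a thread into its priority queue after it stops running.
void Requeue(std::list<Thread*>& priority_queue, Thread* t) {
  if (t->remaining_time_slice > 0) {
    priority_queue.push_front(t);  // leftover timeslice: resume as quickly as possible
  } else {
    priority_queue.push_back(t);   // timeslice fully used: go to the back of the line
  }
}

// Starting (or resuming) a thread on this CPU.
void StartRunning(Thread* t) {
  if (t->remaining_time_slice <= 0) {
    t->remaining_time_slice = kInitialTimeSlice;  // fresh, full timeslice
  }
  // Preempt after either a full timeslice or whatever was left over.
  SetPreemptionTimer(t->remaining_time_slice);
}
```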

If a thread blocks waiting for a shared resource, it is taken out of
its priority queue and placed in a wait queue for that shared resource.
When it is unblocked it will be reinserted in the appropriate priority
queue of an eligible CPU ([*CPU
Assignment*](#cpu-assignment-and-migration)), and if it had remaining timeslice
to run it will be added to the front of the queue for expedited handling.
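
A sketch of that block/unblock movement, again with hypothetical containers
and names rather than the kernel's actual wait-queue types:

```cpp
// Sketch only: hypothetical wait-queue handling, not the kernel's actual types.
#include <cstdint>
#include <list>

struct Thread {
  int64_t remaining_time_slice = 0;  // nanoseconds left over from the previous run
};

// Blocking: the thread leaves its priority queue and parks on the resource's wait queue.
void Block(std::list<Thread*>& priority_queue, std::list<Thread*>& wait_queue, Thread* t) {
  priority_queue.remove(t);  // no longer runnable
  wait_queue.push_back(t);   // waits here until the shared resource is released
}

// Unblocking: the thread is reinserted into a priority queue on an eligible CPU.
void Unblock(std::list<Thread*>& wait_queue, std::list<Thread*>& priority_queue, Thread* t) {
  wait_queue.remove(t);
  if (t->remaining_time_slice > 0) {
    priority_queue.push_front(t);  // leftover timeslice: expedited handling
  } else {
    priority_queue.push_back(t);
  }
}
```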

#### Priority management {#priority-management}

There are three different factors used to determine the effective
priority of a thread, the effective priority being what is used to
determine which queue it will be in.

The first factor is the base priority, which is simply the thread’s
requested priority. There are currently 32 levels, with 0 being the
lowest and 31 being the highest.

The second factor is the priority boost. This is a value bounded between
\[-MAX_PRIORITY_ADJ, MAX_PRIORITY_ADJ\] used to offset the base priority. It is
modified in the following cases:

- When a thread is unblocked, after waiting on a shared resource or
    sleeping, it is given a one point boost.

- When a thread yields (volunteers to give up control), or volunteers
    to reschedule, its boost is decremented by one but is capped at 0
    (won’t go negative).

- When a thread is preempted and has used up its entire timeslice, its
    boost is decremented by one but is able to go negative.

The third factor is its inherited priority. If the thread is in control
of a shared resource and it is blocking another thread of a higher
priority, then it is given a temporary boost up to that thread’s priority
to allow it to finish quickly and allow the higher priority thread to
resume.

The effective priority of the thread is either the inherited priority,
if it has one, or the base priority plus its boost. When this priority
changes, due to any of the factors changing, the scheduler will move the
thread to a new priority queue and reschedule the CPU, allowing the thread
to take control if it is now the highest priority task, or to relinquish
control if it is no longer the highest.
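
Putting the three factors together, the bookkeeping might look like the sketch
below. Everything here other than MAX_PRIORITY_ADJ (represented by the assumed
`kMaxPriorityAdj` value) is an illustrative placeholder, including the use of
-1 as a "no inherited priority" sentinel:

```cpp
// Sketch only: illustrative effective-priority computation.
#include <algorithm>

constexpr int kNumPriorities = 32;
constexpr int kMaxPriorityAdj = 4;  // assumed value, standing in for MAX_PRIORITY_ADJ

struct Thread {
  int base_priority = 16;       // requested priority, 0 (lowest) through 31 (highest)
  int priority_boost = 0;       // kept within [-kMaxPriorityAdj, kMaxPriorityAdj]
  int inherited_priority = -1;  // -1 used here to mean "no inherited priority"
};

// Unblocked after waiting on a resource or sleeping: one point of boost.
void OnUnblock(Thread& t) {
  t.priority_boost = std::min(t.priority_boost + 1, kMaxPriorityAdj);
}

// Yielded or voluntarily rescheduled: lose a point, but never go negative.
void OnYield(Thread& t) {
  t.priority_boost = std::max(t.priority_boost - 1, 0);
}

// Preempted after using the entire timeslice: lose a point, may go negative.
void OnPreempt(Thread& t) {
  t.priority_boost = std::max(t.priority_boost - 1, -kMaxPriorityAdj);
}

// The inherited priority wins when present; otherwise the boosted base
// priority, clamped to a valid queue index, decides the queue.
int EffectivePriority(const Thread& t) {
  if (t.inherited_priority >= 0) {
    return t.inherited_priority;
  }
  return std::clamp(t.base_priority + t.priority_boost, 0, kNumPriorities - 1);
}
```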

The intent in this system is to ensure that interactive threads are
serviced quickly. These are usually the threads that interact directly
with the user and cause user-perceivable latency. These threads usually
do little work and spend most of their time blocked awaiting another
user event. So they get the priority boost from unblocking while
background threads that do most of the processing receive the priority
penalty for using their entire timeslice.

#### CPU assignment and migration {#cpu-assignment-and-migration}

Threads are able to request the CPUs on which they wish to run using a
CPU affinity mask, a 32-bit mask where 0b001 is CPU 1, 0b100 is CPU 3,
and 0b101 is either CPU 1 or CPU 3. This mask is usually respected, but
if the CPUs it requests are all inactive the thread will be assigned to another
CPU. Also notable: if a thread is “pinned” to a CPU, that is, its mask contains
only one CPU, and that CPU becomes inactive, the thread will sit
unserviced until that CPU becomes active again. See [CPU
activation](#cpu-activation) below for details.
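
As a small illustration of the mask, using the 1-indexed CPU numbering from the
examples above (so CPU n corresponds to bit n-1); the helper names are
hypothetical:

```cpp
// Sketch only: interpreting the affinity mask, with CPU n mapped to bit (n - 1).
#include <cstdint>

// True if the mask allows the given CPU.
bool CpuInAffinityMask(uint32_t affinity_mask, int cpu_number) {
  return (affinity_mask & (1u << (cpu_number - 1))) != 0;
}

// A thread is "pinned" when exactly one bit of its mask is set.
bool IsPinned(uint32_t affinity_mask) {
  return affinity_mask != 0 && (affinity_mask & (affinity_mask - 1)) == 0;
}
```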

When selecting a CPU for a thread the scheduler will choose, in order
(a sketch follows the list):

1. The CPU doing the selection, if it is **idle** and in the affinity mask.

2. The CPU the thread last ran on, if it is **idle** and in the affinity mask.

3. Any **idle** CPU in the affinity mask.

4. The CPU the thread last ran on, if it is active and in the affinity mask.

5. The CPU doing the selection, if it is the only one in the affinity mask or
    all CPUs in the mask are inactive.

6. Any active CPU in the affinity mask.
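
The ordered preference above could be expressed roughly as follows. The
structure, helper names, and the choice of the lowest-numbered CPU in steps 3
and 6 are illustrative assumptions, not the kernel's actual selection code:

```cpp
// Sketch only: the ordered CPU preference, with CPUs represented as single-bit masks.
#include <cstdint>

using cpu_mask_t = uint32_t;

struct CpuChoice {
  cpu_mask_t affinity;     // CPUs the thread is willing to run on
  cpu_mask_t idle_cpus;    // CPUs currently running their idle thread
  cpu_mask_t active_cpus;  // CPUs that are online and schedulable
  cpu_mask_t current_cpu;  // single-bit mask: the CPU doing the selection
  cpu_mask_t last_cpu;     // single-bit mask: the CPU the thread last ran on
};

// Returns a single-bit mask naming the chosen CPU.
cpu_mask_t SelectCpu(const CpuChoice& c) {
  const cpu_mask_t idle = c.affinity & c.idle_cpus;
  const cpu_mask_t active = c.affinity & c.active_cpus;

  if (idle & c.current_cpu) return c.current_cpu;  // 1. this CPU, if idle and allowed
  if (idle & c.last_cpu) return c.last_cpu;        // 2. last CPU, if idle and allowed
  if (idle != 0) return idle & ~(idle - 1);        // 3. any idle CPU in the mask
  if (active & c.last_cpu) return c.last_cpu;      // 4. last CPU, if active and allowed
  if (c.affinity == c.current_cpu || active == 0) {
    return c.current_cpu;                          // 5. only option, or nothing in the mask is active
  }
  return active & ~(active - 1);                   // 6. any active CPU in the mask
}
```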

If the thread is running on a CPU not in its affinity mask (due to case
5 above) the scheduler will try to rectify this every time the thread is
preempted, yields, or voluntarily reschedules. Also, if the thread
changes its affinity mask the scheduler may migrate it.

Every time a thread comes back from waiting on a shared resource or
sleeping and needs to be assigned a priority queue, the scheduler will
re-evaluate its CPU choice for the thread, using the above logic, and
may move it.

#### CPU activation {#cpu-activation}

When a CPU is being deactivated, that is, shut down and removed from the
system, the scheduler will transition all running threads onto other
CPUs. The only exception is threads that are “pinned”, that is, they have
only the deactivating CPU in their affinity mask; these threads are put
back into the run queue where they will sit unserviced until the CPU is
reactivated.

When a CPU is reactivated it will service the waiting pinned threads, and
threads that are running on non-affinity CPUs should be migrated back
fairly quickly by their CPU's scheduler due to the above rules. There is
no active rebalancing of threads to the newly awakened CPU, but as it
should be idle more often, it should see some migration due to the logic
laid out above in [CPU assignment and
migration](#cpu-assignment-and-migration).

#### Realtime and idle threads {#realtime-and-idle-threads}

These are special threads that are treated a little differently.

The idle thread runs when no other threads are runnable. There is one on
each CPU and it lives outside of the priority queues, but effectively in
a priority queue of -1. It is used to track idle time and can be used by
platform implementations for a low power wait mode.

Realtime threads (marked with `THREAD_FLAG_REAL_TIME`) are allowed to run without
preemption and will run until they block, yield, or manually reschedule.
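
As a final sketch, the realtime rule could show up where the preemption timer
is armed. `THREAD_FLAG_REAL_TIME` is the flag named above, but its bit value
and every other identifier here are assumptions for illustration:

```cpp
// Sketch only: honoring THREAD_FLAG_REAL_TIME when arming the preemption timer.
#include <cstdint>

constexpr uint32_t THREAD_FLAG_REAL_TIME = 1u << 0;  // bit position assumed for illustration

struct Thread {
  uint32_t flags = 0;
  int64_t remaining_time_slice = 0;  // nanoseconds
};

// Stand-in for arming this CPU's preemption timer.
void SetPreemptionTimer(int64_t time_from_now);

void StartRunning(const Thread& t) {
  if (t.flags & THREAD_FLAG_REAL_TIME) {
    // Realtime threads run without timer preemption; they keep the CPU until
    // they block, yield, or manually reschedule.
    return;
  }
  SetPreemptionTimer(t.remaining_time_slice);
}
```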