| <!-- mdformat off(templates not supported) --> |
| {% set rfcid = "RFC-0123" %} |
| {% include "docs/contribute/governance/rfcs/_common/_rfc_header.md" %} |
| # {{ rfc.name }}: {{ rfc.title }} |
| <!-- SET the `rfcid` VAR ABOVE. DO NOT EDIT ANYTHING ELSE ABOVE THIS LINE. --> |
| |
| <!-- mdformat on --> |
| <!-- This should begin with an H2 element (for example, ## Summary).--> |
| |
| ## Summary |
| |
| This RFC proposes a mechanism by which a userspace agent may interact with the |
| kernel regarding CPU performance, both to update the performance scales used by |
| the kernel scheduler and to query the scheduler's current and default scales. |
| |
| ## Motivation |
| |
| In order to schedule work effectively across CPUs in heterogeneous |
| architectures such as big.LITTLE, the Zircon kernel scheduler models the |
| relative performances of CPUs. At time of writing, the [performance |
| scales](#performance-scale) that describe these relative performances are |
| static, provided by data in the ZBI. |
| |
| When performing thermal CPU throttling of a big.LITTLE system, the frequencies |
| of big and little cores are typically not scaled by identical factors, so their |
| relative performances change dynamically. Unlike most other operating systems, |
| in Fuchsia, modifications to core frequencies are performed in userspace, and |
| the scheduler must be notified across the kernel boundary of changes to relative |
| CPU performances. That communication necessitates new syscalls. |
| |
| ## Design |
| |
| ### Performance scale {#performance-scale} |
| |
| #### Concept {#performance-scale-concept} |
| |
| Before considering the proposed syscalls, it is useful to understand the concept |
| of performance scale, which already exists within the kernel scheduler. |
| Performance scale describes the ratio of the performance of a CPU operating at |
| its current speed to a system-dependent reference performance, where performance |
| can be measured using any suitable metric, such as |
| [DMIPS](https://en.wikipedia.org/wiki/Dhrystone). At time of writing — but |
| not necessarily in the future — the reference performance is that of the |
| most powerful CPU operating at its maximum speed, so 1.0 is the maximum |
| performance scale value. Typically, a vendor provides a performance value for |
| each CPU operating at a nominal speed, and performance is assumed to vary |
| linearly with CPU frequency. |
| |
| For example, on a big.LITTLE system, a vendor might provide performance data |
| indicating that a big core at its maximum speed delivers twice the DMIPS of a |
| little core operating at its own maximum speed. If the reference performance |
| corresponds to a big core running at its maximum speed, then that operating |
| condition corresponds to performance scale 1.0, while a little core at its |
| maximum speed would have performance scale 0.5. Reducing a big core's speed |
| by 25% gives it a new performance scale of 0.75, while reducing the little |
| core's speed by 25% changes its performance scale to 0.375. |
| |
| More precisely, if f<sub>ref</sub> is a reference frequency with known |
| performance scale s<sub>ref</sub>, then frequency f<sub>new</sub> has |
| performance scale |
| s<sub>new</sub> = s<sub>ref</sub> × f<sub>new</sub> / f<sub>ref</sub>. In general, one |
| reference frequency is required for each distinct CPU architecture in the |
| system. |
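
As a minimal sketch of the formula above, the following helper computes a new performance scale from a reference point. The name `ScaleForFrequency` is hypothetical and not part of the proposed API, and the sketch assumes that performance varies linearly with frequency, as described above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of the proposed API: computes
// s_new = s_ref * f_new / f_ref under the linear-scaling assumption.
double ScaleForFrequency(double ref_scale, uint64_t ref_hz, uint64_t new_hz) {
  assert(ref_hz > 0);
  return ref_scale * static_cast<double>(new_hz) / static_cast<double>(ref_hz);
}
```

With illustrative frequencies, a big core with reference scale 1.0 at 2.0 GHz that is slowed to 1.5 GHz yields 0.75, and a little core with reference scale 0.5 at 1.8 GHz that is slowed to 1.35 GHz yields 0.375, matching the example above. Only the frequency ratios matter; the absolute values are made up.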
| |
| Typically, only a fixed number of frequency combinations are supported by a |
| given system. For example, it is typical that CPUs in the same cluster must have |
| the same frequency, and that each cluster only supports a relatively small |
| number of distinct frequencies. However, it is beyond the scope of the kernel to |
| track which performance scales are valid. As such, the kernel trusts userspace |
| to provide realistic values, and it will use values provided via the proposed |
| API to the best of its ability. |
| |
| #### Fixed point representation {#performance-scale-representation} |
| |
| To avoid using floating point numbers, performance scales are represented using |
| fixed point numbers, specified by a struct |
| |
| ```c |
| typedef struct zx_cpu_performance_scale { |
|   uint32_t integral_part; |
|   uint32_t fractional_part;  // Increments of 2**-32 |
| } zx_cpu_performance_scale_t; |
| ``` |
| |
| `integral_part` and `fractional_part` describe the integer and fractional parts, |
| respectively, with `fractional_part` specifying increments of 2<sup>-32</sup>. |
| Conversion between real and fixed point representations should be done according |
| to the following functions: |
| |
| ```c++ |
| #include <cmath> |
| #include <cstdint> |
| |
| zx_status_t ToFixedPoint(double real, zx_cpu_performance_scale_t* scale) { |
|   // Performance scales are never negative, and a negative or non-finite |
|   // double has no meaningful uint32_t conversion, so reject those first. |
|   if (!std::isfinite(real) || real < 0.0) { |
|     return ZX_ERR_INVALID_ARGS; |
|   } |
| |
|   double integer; |
|   double fraction = std::modf(real, &integer); |
| |
|   // Converting from double to fixed point should fail if the input's integer |
|   // part is too large. |
|   if (integer > static_cast<double>(UINT32_MAX)) { |
|     return ZX_ERR_INVALID_ARGS; |
|   } |
| |
|   scale->integral_part = static_cast<uint32_t>(integer); |
| |
|   // Rounding down the fractional part is suggested but should not matter |
|   // much in practice. A difference of 1 in the output is a difference of only |
|   // 2**-32 in the corresponding real value. |
|   scale->fractional_part = static_cast<uint32_t>(std::ldexp(fraction, 32)); |
| |
|   return ZX_OK; |
| } |
| |
| double FromFixedPoint(zx_cpu_performance_scale_t scale) { |
|   return static_cast<double>(scale.integral_part) |
|          + std::ldexp(scale.fractional_part, -32); |
| } |
| ``` |
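
To make the representation concrete, the sketch below packs the two fields into a single 64-bit Q32.32 value, which is a convenient way to compare scales. `Scale` is a local stand-in for `zx_cpu_performance_scale_t`, and `ToRaw` is a hypothetical helper rather than part of the proposed API.

```cpp
#include <cstdint>

// Local stand-in mirroring the fields of zx_cpu_performance_scale_t.
struct Scale {
  uint32_t integral_part;
  uint32_t fractional_part;  // Increments of 2**-32
};

// Hypothetical helper: the scale as a Q32.32 number, i.e. value * 2**32.
constexpr uint64_t ToRaw(Scale s) {
  return (static_cast<uint64_t>(s.integral_part) << 32) | s.fractional_part;
}

// 0.75 = 3 * 2**-2, so the fractional part is 3 << 30 = 0xC0000000.
constexpr Scale kThreeQuarters = {0, 0xC0000000u};
// 0.375 = 3 * 2**-3, so the fractional part is 3 << 29 = 0x60000000.
constexpr Scale kThreeEighths = {0, 0x60000000u};
// 1.0 has an integral part of 1 and no fractional part.
constexpr Scale kOne = {1, 0};

static_assert(ToRaw(kThreeEighths) < ToRaw(kThreeQuarters), "0.375 < 0.75");
static_assert(ToRaw(kThreeQuarters) < ToRaw(kOne), "0.75 < 1.0");
```

Because the integral part occupies the high bits, ordinary integer comparison of the packed values orders scales the same way as comparing the real numbers they represent.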
| |
| ### Syscall 1: `zx_system_set_performance_info` |
| |
| The first syscall allows a userspace agent to set performance scales used by the |
| kernel scheduler: |
| |
| ```c |
| zx_status_t zx_system_set_performance_info( |
|     zx_handle_t resource, |
|     uint32_t topic, |
|     const void* new_info, |
|     size_t info_count |
| ); |
| ``` |
| |
| Its arguments are: |
| |
| - `resource`: A resource that grants permission to this call. Must be |
| `ZX_RSRC_SYSTEM_CPU_BASE`, a new resource introduced specifically for this |
| API, or the call will fail. |
| |
| - `topic`: The type of performance information referenced by this call. Must be |
| `ZX_CPU_PERF_SCALE`, which will be defined upon proposal implementation. |
| |
| - `new_info`: A valid `zx_cpu_performance_info_t[]`, whose elements are |
| specified by |
| |
| ```c |
| typedef struct zx_cpu_performance_info { |
|   uint32_t logical_cpu_number; |
|   zx_cpu_performance_scale_t performance_scale; |
| } zx_cpu_performance_info_t; |
| ``` |
| |
| where `zx_cpu_performance_scale_t` is [defined |
| above](#performance-scale-representation). |
| |
| `logical_cpu_number` specifies the CPU whose info is described by the struct, |
| using the same numbering scheme utilized by the kernel. Each |
| `logical_cpu_number` must be a valid CPU identifier. Elements of `new_info` |
| must be sorted in order of strictly increasing `logical_cpu_number` (and |
| consequently, each `logical_cpu_number` may appear only once). |
| |
| `performance_scale` represents the new performance scale for the indicated |
| CPU, and it should correspond to the CPU's new frequency as [described |
| previously](#performance-scale-concept). However, the kernel does not validate |
| inputs against supported CPU frequencies; any positive value is allowed as an |
| input. |
| |
| An input scale of `{.integral_part = 0, .fractional_part = 0}` is invalid so |
| as not to be confused with a request to offline a core, a procedure with a |
| distinct mechanism that is expected to have a different API in the future. |
| |
| The kernel may internally override a valid input with the nearest value that |
| the scheduler can utilize. For example, at time of writing, the maximum |
| supported performance scale is 1.0. Therefore, if `performance_scale` |
| represents a value larger than 1.0, then the kernel will internally clamp it |
| to `{.integral_part = 1, .fractional_part = 0}`. |
| |
| If the call to `zx_system_set_performance_info` fails, then the kernel takes |
| no action, and `new_info` has no effect. |
| |
| If the call succeeds, then the kernel scheduler will utilize modified |
| performance scales corresponding to `new_info` beginning with the next |
| reschedule operation, which in general occurs sometime after the call returns. |
| The kernel will not modify its performance scales for CPUs not referenced in |
| `new_info`. |
| |
| Changes made by this call will persist until reboot or until they |
| are overridden by further use of this API. |
| |
| - `info_count`: The number of elements in `new_info`. Must be positive and no |
| greater than the number of CPUs in the system. |
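
The clamping behavior described for `new_info` can be illustrated with a short sketch. `Scale` is a local stand-in for `zx_cpu_performance_scale_t` and `ClampToOne` is a hypothetical name; the proposal does not specify how the kernel implements this override, so this shows only the observable effect for a maximum of 1.0.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in mirroring the fields of zx_cpu_performance_scale_t.
struct Scale {
  uint32_t integral_part;
  uint32_t fractional_part;
};

// Hypothetical illustration: clamps a validated input to a maximum of 1.0.
Scale ClampToOne(Scale input) {
  // A zero scale is rejected during validation, before any clamping.
  assert(input.integral_part != 0 || input.fractional_part != 0);
  // Any value with a nonzero integral part is at least 1.0.
  if (input.integral_part >= 1) {
    return Scale{1, 0};
  }
  return input;
}
```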
| |
| #### Error conditions |
| |
| `ZX_ERR_BAD_HANDLE` |
| |
| - `resource` is not a valid handle. |
| |
| `ZX_ERR_WRONG_TYPE` |
| |
| - `resource` is not a valid resource handle or is not of kind |
| `ZX_RSRC_KIND_SYSTEM`. |
| |
| `ZX_ERR_INVALID_ARGS` |
| |
| - `topic` is not `ZX_CPU_PERF_SCALE`. |
| - `new_info` is an invalid pointer. |
| - `new_info` is not sorted by strictly increasing `logical_cpu_number`. |
| |
| `ZX_ERR_OUT_OF_RANGE` |
| |
| - `resource` is of kind `ZX_RSRC_KIND_SYSTEM` but is not equal to |
| `ZX_RSRC_SYSTEM_CPU_BASE`. |
| - `info_count` is `0` or exceeds the number of CPUs. |
| - A `logical_cpu_number` was invalid. |
| - An input `performance_scale` was `{.integral_part = 0, .fractional_part = 0}`. |
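
The argument checks listed above can be sketched as a single validation pass. The types and the `Status` enum are local stand-ins for the Zircon definitions, `ValidateNewInfo` is a hypothetical name, and the sketch makes two assumptions that the proposal does not state: valid logical CPU numbers are `0` through `num_cpus - 1`, and the checks run in the order shown. Handle and topic checks are omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Local stand-ins for the types and status codes named in this proposal.
struct Scale {
  uint32_t integral_part;
  uint32_t fractional_part;
};
struct Info {
  uint32_t logical_cpu_number;
  Scale performance_scale;
};
enum class Status { kOk, kInvalidArgs, kOutOfRange };

// Hypothetical sketch of the checks on `new_info` and `info_count`.
Status ValidateNewInfo(const Info* new_info, size_t info_count,
                       uint32_t num_cpus) {
  assert(num_cpus > 0);
  if (new_info == nullptr) {
    return Status::kInvalidArgs;  // Stands in for "invalid pointer".
  }
  if (info_count == 0 || info_count > num_cpus) {
    return Status::kOutOfRange;
  }
  for (size_t i = 0; i < info_count; ++i) {
    const Info& entry = new_info[i];
    if (entry.logical_cpu_number >= num_cpus) {
      return Status::kOutOfRange;  // Not a valid CPU (assumed numbering).
    }
    const Scale& scale = entry.performance_scale;
    if (scale.integral_part == 0 && scale.fractional_part == 0) {
      return Status::kOutOfRange;  // A scale of 0.0 is not allowed.
    }
    if (i > 0 &&
        entry.logical_cpu_number <= new_info[i - 1].logical_cpu_number) {
      return Status::kInvalidArgs;  // Not strictly increasing.
    }
  }
  return Status::kOk;
}
```

Because nothing is applied until every entry has passed, a failure leaves the existing scales untouched, which matches the requirement that a failed call has no effect.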
| |
| #### Intended usage {#intended-usage-set} |
| |
| `zx_system_set_performance_info` should be used to notify the kernel of |
| changes in CPU performance whenever CPU frequency is changed. The API supports |
| specification of performance scales for only a subset of CPUs because different |
| CPUs may be controlled by different entities. |
| |
| If a CPU's frequency is to be decreased, it is recommended that |
| `zx_system_set_performance_info` be called before the frequency change has |
| occurred. Doing so gives the kernel scheduler the opportunity to reduce load on |
| that CPU before its capacity is decreased. (The scheduler is expected to respond |
| quickly enough that no further coordination is needed; this expectation will be |
| confirmed once support is implemented.) |
| |
| Conversely, if a CPU's frequency is to be increased, it is recommended that |
| `zx_system_set_performance_info` be called after the frequency change has |
| occurred, notifying the scheduler of new capacity only once it is available. |
| |
| In either case, should an update to CPU frequency fail, the caller must update |
| the kernel scheduler based on the resulting CPU state. The caller should attempt |
| to determine the post-failure CPU frequency and use that to inform a separate |
| call to `zx_system_set_performance_info`. If the frequency cannot be determined |
| (e.g. if an associated driver has failed outright), the caller should make a |
| pessimistic (low) guess as to the resulting CPU speed. This recommendation may |
| evolve as it is given further consideration; see for example |
| [fxbug.dev/84685](https://fxbug.dev/84685). |
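
The ordering recommendations above can be sketched as follows. Everything here is hypothetical: `CpuHooks`, its members, and `ChangeFrequency` are illustrative names, and the conversion from frequency to performance scale is assumed to happen inside `notify_kernel`. Failure of the notification itself is not handled in this sketch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Hypothetical hooks supplied by the caller; none of these names come from
// the proposal.
struct CpuHooks {
  // Reports the performance scale that corresponds to `hz` to the kernel,
  // for example by wrapping zx_system_set_performance_info.
  std::function<void(uint64_t hz)> notify_kernel;
  // Asks the CPU driver to switch to `hz`; returns false on failure.
  std::function<bool(uint64_t hz)> set_frequency;
  // Returns the frequency now in effect, or 0 if it cannot be determined.
  std::function<uint64_t()> read_frequency;
};

// Notifies the kernel before a decrease, after an increase, and again after a
// failed change, using a pessimistic guess if the frequency is unknown.
bool ChangeFrequency(const CpuHooks& hooks, uint64_t current_hz,
                     uint64_t target_hz, uint64_t pessimistic_hz) {
  assert(pessimistic_hz > 0);
  const bool decreasing = target_hz < current_hz;
  if (decreasing) {
    hooks.notify_kernel(target_hz);  // Let the scheduler shed load first.
  }
  if (!hooks.set_frequency(target_hz)) {
    const uint64_t actual_hz = hooks.read_frequency();
    hooks.notify_kernel(actual_hz != 0 ? actual_hz : pessimistic_hz);
    return false;
  }
  if (!decreasing) {
    hooks.notify_kernel(target_hz);  // Capacity is available only now.
  }
  return true;
}
```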
| |
| The new API will ultimately be utilized by a to-be-developed "CPU Manager" |
| component that will be responsible for userspace administration of CPUs. Rather |
| than interacting directly with CPU drivers, agents that wish to modify CPU |
| frequency will register requests with CPU Manager, which will coordinate |
| frequency changes with updates to the kernel as described in this proposal. |
| |
| CPU Manager will also take over responsibility for thermal throttling of CPUs, |
| the motivating use case for this proposal, from Power Manager. |
| |
| ### Syscall 2: `zx_system_get_performance_info` |
| |
| The second syscall retrieves performance information for all CPUs: |
| |
| ```c |
| zx_status_t zx_system_get_performance_info( |
|     zx_handle_t resource, |
|     uint32_t topic, |
|     void* info, |
|     size_t info_count, |
|     size_t* output_count |
| ); |
| ``` |
| |
| Its arguments are: |
| |
| - `resource`: A resource that grants permission to this call. Must be |
| `ZX_RSRC_SYSTEM_CPU_BASE`. |
| |
| - `topic`: Either `ZX_CPU_PERF_SCALE` or `ZX_CPU_DEFAULT_PERF_SCALE`, which will |
| be defined upon proposal implementation. The topic determines the content |
| written to `info`, described below. |
| |
| - `info`: A valid `zx_cpu_performance_info_t[]` with length equal to the |
| number of CPUs in the system. |
| |
| If the call fails, `info` is unmodified. |
| |
| If the call succeeds, then upon return `info` contains one element for each |
| CPU, ordered by increasing `logical_cpu_number`. Each element's |
| `performance_scale` is populated based on `topic`: |
| |
| - `ZX_CPU_PERF_SCALE`: `performance_scale` stores the kernel's current |
| performance scale for the indicated CPU. The value provided reflects the |
| most recent call to `zx_system_set_performance_info` even if the next |
| reschedule operation has not yet taken place. |
| |
| - `ZX_CPU_DEFAULT_PERF_SCALE`: `performance_scale` stores the default |
| performance scale used by the kernel on boot for the indicated CPU. |
| |
| - `info_count`: Length of the `info` array; must equal the number of CPUs in the |
| system. |
| |
| - `output_count`: If the call succeeds, this will contain the number of elements |
| written to `info`. If the call fails, its value is unspecified. |
| |
| #### Error conditions |
| |
| `ZX_ERR_BAD_HANDLE` |
| |
| - `resource` is not a valid handle. |
| |
| `ZX_ERR_WRONG_TYPE` |
| |
| - `resource` is not a valid resource handle or is not of kind |
| `ZX_RSRC_KIND_SYSTEM`. |
| |
| `ZX_ERR_INVALID_ARGS` |
| |
| - `topic` is not `ZX_CPU_PERF_SCALE` or `ZX_CPU_DEFAULT_PERF_SCALE`. |
| - `info` is an invalid pointer. |
| |
| `ZX_ERR_OUT_OF_RANGE` |
| |
| - `resource` is of kind `ZX_RSRC_KIND_SYSTEM` but is not equal to |
| `ZX_RSRC_SYSTEM_CPU_BASE`. |
| - `info_count` does not equal the total number of CPUs in the system. |
| |
| #### Intended usage |
| |
| The behavior under `ZX_CPU_PERF_SCALE` allows a userspace agent to query |
| performance scales for diagnostic purposes. This may be useful, for example, |
| when an agent assesses system state at startup, or as a signal included in a |
| crash report. |
| |
| The behavior under `ZX_CPU_DEFAULT_PERF_SCALE` allows an agent to |
| confirm that the performance scales with which it is configured agree with those |
| in use by the kernel. |
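
As a sketch of the `ZX_CPU_DEFAULT_PERF_SCALE` use case, an agent might compare its own configuration against the values it read back from the kernel. The types are local stand-ins for the Zircon definitions and `DefaultsMatch` is a hypothetical helper. It assumes the agent's configuration is also stored in order of increasing `logical_cpu_number`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Local stand-ins for the types named in this proposal.
struct Scale {
  uint32_t integral_part;
  uint32_t fractional_part;
};
struct Info {
  uint32_t logical_cpu_number;
  Scale performance_scale;
};

// Hypothetical check that an agent's configured default scales agree with
// the defaults reported by the kernel.
bool DefaultsMatch(const std::vector<Info>& configured,
                   const std::vector<Info>& from_kernel) {
  if (configured.size() != from_kernel.size()) {
    return false;
  }
  for (size_t i = 0; i < from_kernel.size(); ++i) {
    // A successful call returns entries by increasing logical_cpu_number.
    assert(i == 0 || from_kernel[i - 1].logical_cpu_number <
                         from_kernel[i].logical_cpu_number);
    const Scale& a = configured[i].performance_scale;
    const Scale& b = from_kernel[i].performance_scale;
    if (configured[i].logical_cpu_number != from_kernel[i].logical_cpu_number ||
        a.integral_part != b.integral_part ||
        a.fractional_part != b.fractional_part) {
      return false;
    }
  }
  return true;
}
```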
| |
| ## Implementation |
| |
| ### Kernel |
| |
| - The new syscalls must be implemented, gated by a new resource |
| `ZX_RSRC_SYSTEM_CPU_BASE`. |
| |
| - The kernel scheduler must be modified to support dynamic performance scales, |
| updating them to use the most recent values provided by |
| `zx_system_set_performance_info`, and additionally exposing its currently-used |
| and default performance scales to `zx_system_get_performance_info`. |
| |
| ### Component manager |
| |
| A new protocol `CpuResource` must be defined and must be implemented by |
| Component Manager to provide the `ZX_RSRC_SYSTEM_CPU_BASE` resource. This |
| follows a pre-existing pattern for resources that gate syscalls. |
| |
| ## Performance |
| |
| The new syscalls themselves will take a negligible amount of time to execute, as |
| they simply touch a small amount of data proportional to the number of CPUs. |
| |
| Use of `zx_system_set_performance_info` will cause the scheduler to distribute |
| work differently, shifting work towards cores whose performance scales increase |
| relative to the sum of all performance scales, and away from those whose |
| performance scales similarly decrease. The rescheduling process itself will not |
| place a significant amount of load on the scheduler. |
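
To illustrate what "relative to the sum of all performance scales" means, the sketch below computes each CPU's scale as a fraction of the total. This is a deliberate simplification for intuition, not the scheduler's actual algorithm, and `RelativeShares` is a hypothetical name.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Illustrative only: each CPU's performance scale as a fraction of the sum
// of all scales. The real scheduler's load balancing is more involved.
std::vector<double> RelativeShares(const std::vector<double>& scales) {
  const double total = std::accumulate(scales.begin(), scales.end(), 0.0);
  assert(total > 0.0);
  std::vector<double> shares;
  shares.reserve(scales.size());
  for (double scale : scales) {
    shares.push_back(scale / total);
  }
  return shares;
}
```

For scales `{1.0, 1.0, 0.5, 0.5}`, each little core holds 1/6 of the total. If only the big cores are throttled to 0.75, each little core's fraction rises to 1/5 even though its own scale is unchanged, so work shifts towards the little cores.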
| |
| Rescheduling will lead to expected changes in system performance. Testing of |
| these changes is equivalent to testing the scheduler for functional correctness |
| and is addressed in [Testing](#testing). |
| |
| ## Security considerations |
| |
| Both new syscalls are gated by the new resource handle |
| `ZX_RSRC_SYSTEM_CPU_BASE`. For `zx_system_set_performance_info`, this protection |
| addresses the clear concern of malicious interference with the scheduler. For |
| `zx_system_get_performance_info`, there is the subtler concern of data leakage; |
| an untrusted entity should not be able to learn the kernel's performance |
| scales, which will typically reveal information about the system's supported |
| P-states. |
| |
| ## Privacy considerations |
| |
| This proposal has no meaningful impact on privacy. |
| |
| ## Testing {#testing} |
| |
| - Core tests will be added to exercise basic success and failure criteria. |
| - Unit tests will be added to validate the scheduler's handling of updated |
| performance scales. They will verify that if a deadline thread is pinned to a |
| CPU, and that CPU's performance scale is modified by factor α, then the |
| actual time allotted to the thread is multiplied by 1/α. |
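
The expectation in the unit test above amounts to a one-line calculation, sketched here with a hypothetical helper name. It assumes the pinned thread's allotted time is known before the scale change and that α is positive.

```cpp
#include <cassert>

// Hypothetical helper for a test expectation: if a CPU's performance scale
// is multiplied by `alpha`, the time allotted to a deadline thread pinned to
// that CPU is expected to be multiplied by 1 / alpha.
double ExpectedAllottedNs(double allotted_ns, double alpha) {
  assert(alpha > 0.0);
  return allotted_ns / alpha;
}
```

For example, halving a CPU's performance scale (α = 0.5) should double a pinned thread's allotted time, from 2 ms to 4 ms.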
| |
| ## Documentation |
| |
| The Zircon syscall documentation will be updated to include the new API. |
| |
| ## Drawbacks, alternatives, and unknowns |
| |
| ### Generality |
| |
| A more general interface was considered, such as a `zx_set_cpu_properties` |
| syscall that could eventually handle additional interactions between the kernel |
| and CPUs, like offlining. Ultimately, we opted for a narrow interface because |
| very few clients of this interface are expected, keeping the cost of future |
| changes to the proposed interface relatively small. Requirements placed on a |
| more general interface would be largely guesswork at this point. |
| |
| ### Alternative call structure |
| |
| As an alternative to the set-only operation of `zx_system_set_performance_info`, |
| a combined get/set operation was considered that returns the prior performance |
| scales for CPUs whose scales were modified. This was intended as a means of |
| ensuring that the caller is capable of reverting performance scale changes |
| should lower-level execution of the associated frequency change fail. |
| |
| However, further consideration revealed that a simple reversion of changes would |
| not be sufficient. This resulted in a more complex set of [failure-handling |
| recommendations](#intended-usage-set) and led back to the simpler set-only |
| operation. |
| |
| Finally, `zx_system_get_performance_info` is still needed: it supports hermetic |
| testing, in which case direct reversion of changes *is* appropriate, and it |
| supports diagnostic use cases. |
| |
| ### Alternative CPU indexing |
| |
| We considered using an alternative scheme for indexing CPUs, such as referring |
| to them by physical CPU number. However, since the kernel has no other need for |
| such a scheme, it is most consistent with Zircon's limited scope to have the API |
| use the kernel's existing logical CPU numbers. These numbers are consistent on a |
| given system, and a client could either maintain a static per-board |
| configuration to refer to them or potentially access their configuration data |
| from the ZBI. |
| |
| ### Alternative to performance scale |
| |
| We considered that, rather than referring to performance scale directly, the new |
| API might utilize a "speed factor" that the scheduler would apply to the base |
| performance scale for a given CPU. Doing so would reduce the amount of |
| context-specific information a client would need to know; rather than |
| understanding the relative performances between CPUs, it would only need to know |
| the ratio between a CPU's new frequency and its nominal frequency. |
| |
| We opted against this approach because performance scale is intended to be used |
| in a fundamental way for CPU thermal throttling on a heterogeneous system, so |
| the one anticipated client of this API would receive no meaningful benefit from |
| using speed factors instead. Meanwhile, we would incur the cost both of defining |
| the new concept and modifying the scheduler to utilize it. |
| |
| ### Maximum performance scale |
| |
| This proposal originally represented performance scale using a `uint32_t` that |
| represented real values in \[0.0, 1.0\]. In particular, this allowed |
| representation of a maximum value of 1.0. |
| |
| While 1.0 is the maximum performance scale supported by the Zircon scheduler at |
| time of writing, we decided to allow inputs that represent values greater than |
| 1.0 to support future use cases, such as a turbo mode. Additionally, the |
| previous representation was not fixed point, so it led to values |
| that could not be directly used by the scheduler. |
| |
| #### Representation of `performance_scale` |
| |
| `performance_scale` was originally a `uint64_t`, with the upper 32 bits holding |
| the integer part and the lower 32 bits holding the fractional part. This would |
| have produced 32 bits of padding between fields in `zx_cpu_performance_info_t`, |
| which introduced a potential leakage vector. The new representation avoids that |
| pitfall. |
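
The padding difference can be checked with `sizeof`. The sketch below reconstructs the earlier layout from the description above (`OldInfo` is an assumed reconstruction, not a definition taken from the proposal) and compares it with a stand-in for the current layout.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed reconstruction of the earlier layout, for illustration only.
struct OldInfo {
  uint32_t logical_cpu_number;
  uint64_t performance_scale;  // Upper 32 bits integer, lower 32 fractional.
};

// Stand-ins for the current layout.
struct Scale {
  uint32_t integral_part;
  uint32_t fractional_part;
};
struct NewInfo {
  uint32_t logical_cpu_number;
  Scale performance_scale;
};

// Bytes in each struct that belong to no field. On ABIs where uint64_t is
// 8-byte aligned, kOldPadding is 4: the compiler inserts 4 bytes after
// logical_cpu_number so that performance_scale starts at offset 8.
constexpr size_t kOldPadding =
    sizeof(OldInfo) - (sizeof(uint32_t) + sizeof(uint64_t));
constexpr size_t kNewPadding = sizeof(NewInfo) - 3 * sizeof(uint32_t);

static_assert(kNewPadding == 0, "three uint32_t fields pack without padding");
```

Padding bytes are not written by ordinary field assignment, so copying such a struct across the kernel boundary could expose whatever the bytes previously held. That is the leakage vector the current layout avoids.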
| |
| ### Allowed values for `performance_scale` |
| |
| Careful consideration was given to what values `zx_system_set_performance_info` |
| should allow as inputs for `performance_scale`. A value representing 0.0 was |
| judged too easily confused with an instruction to offline a CPU, an action |
| that Zircon does not currently support but is expected to support in the |
| future using a different API. As such, a value representing 0.0 is rejected |
| as an error. |
| |
| Very small values warranted special attention as well. For example, an input of |
| `{.integral_part = 0, .fractional_part = 1}` would represent 2<sup>-32</sup>, |
| which could reasonably be treated as 0.0, effectively rendering the |
| corresponding core offline. While this would be possible to address by enforcing |
| a minimum allowed value, any such threshold would currently be arbitrary and |
| would further complicate the contract between the kernel and userspace. We felt |
| it most straightforward to treat the new API as a hinting mechanism and leave |
| the kernel with the freedom to override inputs if it needs to do so without |
| exposing internal details related to such a choice. |
| |
| ### Future work |
| |
| #### Configuration management |
| |
| Ideally, userspace agents would use the ZBI to share the exact same CPU |
| configuration data utilized by the kernel scheduler. It is unclear whether doing |
| so is currently practical. |
| |
| Additionally, care must be taken to ensure that both the kernel and userspace |
| agents associate default performance scales with the same nominal frequencies. |
| |
| #### Lower bounds on performance scales |
| |
| In principle, the scheduler can determine minimum performance scales that the |
| system should maintain based on current deadline threads and CPU load. Dynamic |
| versions of these bounds would be an important input to a userspace agent that |
| attempts to utilize lower CPU frequencies for energy efficiency. An additional |
| option to `zx_system_get_performance_info` would provide a natural means to |
| expose them. |
| |
| #### CPU attribution |
| |
| Some means should be established to associate a thread's attributed CPU time |
| with the performance of the CPU on which it was scheduled. Such association is |
| already relevant to the establishment of performance metrics that are robust to |
| scheduling on big cores versus little cores, and it becomes even more relevant |
| as we develop the machinery surrounding frequency modifications, as with this |
| proposal. |
| |
| #### Guaranteed execution of throttling agent |
| |
| Reduction of CPU frequencies when performing thermal throttling may lead to CPU |
| starvation, which in turn may make the throttling agent's process less likely to |
| be scheduled in a timely fashion. Execution of the throttling agent should be |
| prioritized in an appropriate manner. |
| |
| ## Prior art and references |
| |
| Delegation of responsibility for CPU frequency control to userspace is unusual |
| for operating systems, making prior art on this topic unavailable. |