
 <!-- mdformat off(templates not supported) -->
 {% set rfcid = "RFC-0123" %}
 {% include "docs/contribute/governance/rfcs/_common/_rfc_header.md" %}
 # {{ rfc.name }}: {{ rfc.title }}
 <!-- SET the `rfcid` VAR ABOVE. DO NOT EDIT ANYTHING ELSE ABOVE THIS LINE. -->

 <!-- mdformat on -->
 <!-- This should begin with an H2 element (for example, ## Summary).-->

 ## Summary

 This RFC proposes a mechanism by which a userspace agent may interact with the
 kernel regarding CPU performance, both to update the performance scales used by
 the kernel scheduler and to query its state.

 ## Motivation

 In order to schedule work effectively across CPUs in heterogeneous
 architectures such as big.LITTLE, the Zircon kernel scheduler models the
 relative performances of CPUs. At time of writing, the [performance
 scales](#performance-scale) that describe these relative performances are
 static, provided by data in the ZBI.

 When performing thermal CPU throttling of a big.LITTLE system, the frequencies
 of big and little cores are typically not scaled by identical factors, so their
 relative performances change dynamically. Unlike most other operating systems,
 in Fuchsia, modifications to core frequencies are performed in userspace, and
 the scheduler must be notified across the kernel boundary of changes to relative
 CPU performances. That communication necessitates new syscalls.

 ## Design

 ### Performance scale {#performance-scale}

 #### Concept {#performance-scale-concept}

 Before considering the proposed syscalls, it is useful to understand the concept
 of performance scale, which already exists within the kernel scheduler.
 Performance scale describes the ratio of the performance of a CPU operating at
 its current speed to a system-dependent reference performance, where performance
 can be measured using any suitable metric, such as
 [DMIPS](https://en.wikipedia.org/wiki/Dhrystone). At time of writing &mdash; but
 not necessarily in the future &mdash; the reference performance is that of the
 most powerful CPU operating at its maximum speed, so 1.0 is the maximum
 performance scale value. Typically, a vendor provides a performance value for
 each CPU operating at a nominal speed, and performance is assumed to vary
 linearly with CPU frequency.

 For example, on a big.LITTLE system, a vendor might provide performance data
 indicating that a big core at its maximum speed delivers twice the DMIPS of a
 little core operating at its own maximum speed. If the reference performance
 corresponds to a big core running at its maximum speed, then that operating
 condition corresponds to performance scale 1.0, while a little core at its
 maximum speed would have performance scale 0.5. Reducing a big core's speed
 by 25% gives it a new performance scale of 0.75, while reducing the little
 core's speed by 25% changes its performance scale to 0.375.

 More precisely, if f<sub>ref</sub> is a reference frequency with known
 performance scale s<sub>ref</sub>, then frequency f<sub>new</sub> has
 performance scale
 s<sub>new</sub>=s<sub>ref</sub>f<sub>new</sub>/f<sub>ref</sub>. In general, one
 reference frequency is required for each distinct CPU architecture in the
 system.
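
As a minimal sketch of the linear relation above, the following hypothetical helper (it is not part of the proposed API) derives a new performance scale from a reference point. The frequencies mentioned in the usage note are illustrative assumptions, not vendor data.

```cpp
#include <cstdint>

// Hypothetical helper, not part of the proposed API: applies
// s_new = s_ref * f_new / f_ref, assuming performance varies linearly with
// frequency as described above.
double ScaleForFrequency(double ref_scale, uint64_t ref_hz, uint64_t new_hz) {
  return ref_scale * static_cast<double>(new_hz) / static_cast<double>(ref_hz);
}
```

With an assumed big core reference of 2.0 GHz at scale 1.0, `ScaleForFrequency(1.0, 2000000000, 1500000000)` gives 0.75. A little core with reference scale 0.5 at an assumed 1.8 GHz gives 0.375 at 1.35 GHz, matching the 25% reduction example above.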

 Typically, only a fixed number of frequency combinations are supported by a
 given system. For example, it is typical that CPUs in the same cluster must have
 the same frequency, and that each cluster only supports a relatively small
 number of distinct frequencies. However, it is beyond the scope of the kernel to
 track which performance scales are valid. As such, the kernel trusts userspace
 to provide realistic values, and it will use values provided via the proposed
 API to the best of its ability.

 #### Fixed point representation {#performance-scale-representation}

 To avoid using floating point numbers, performance scales are represented using
 fixed point numbers, specified by a struct

 ```c
   typedef struct zx_cpu_performance_scale {
     uint32_t integral_part;
     uint32_t fractional_part;  // Increments of 2**-32
   } zx_cpu_performance_scale_t;
 ```

 `integral_part` and `fractional_part` describe the integer and fractional parts,
 respectively, with `fractional_part` specifying increments of 2<sup>-32</sup>.
 Conversion between real and fixed point representations should be done according
 to the following functions:

 ```c++
 #include <cmath>
 #include <cstdint>

 zx_status_t ToFixedPoint(double real, zx_cpu_performance_scale_t* scale) {
   double integer;
   double fraction = std::modf(real, &integer);

   // Converting from double to fixed point should fail if the input is negative
   // or NaN, or if its integer part is too large.
   if (!(real >= 0.0) || integer > static_cast<double>(UINT32_MAX)) {
     return ZX_ERR_INVALID_ARGS;
   }

   scale->integral_part = static_cast<uint32_t>(integer);

   // Rounding down the fractional part is suggested but should not matter
   // much in practice. A difference of 1 in the output is a difference of only
   // 2**-32 in the corresponding real value.
   scale->fractional_part = static_cast<uint32_t>(std::ldexp(fraction, 32));

   return ZX_OK;
 }

 double FromFixedPoint(zx_cpu_performance_scale_t scale) {
   return static_cast<double>(scale.integral_part)
     + std::ldexp(scale.fractional_part, -32);
 }
 ```
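
A caller that would also like to avoid floating point might build the fixed point value directly from an integer ratio, such as a frequency ratio expressed in MHz. The sketch below is one possible approach under that assumption; `PerfScale` is a local stand-in for `zx_cpu_performance_scale_t`, and `RatioToFixedPoint` is a hypothetical helper, not part of this proposal.

```cpp
#include <cstdint>
#include <optional>

// Local stand-in mirroring zx_cpu_performance_scale_t so the sketch compiles
// on its own.
struct PerfScale {
  uint32_t integral_part;
  uint32_t fractional_part;  // Increments of 2**-32
};

// Hypothetical helper: converts numerator/denominator to fixed point using
// only integer arithmetic, rounding the fractional part down. Returns
// std::nullopt if the denominator is zero.
std::optional<PerfScale> RatioToFixedPoint(uint32_t numerator,
                                           uint32_t denominator) {
  if (denominator == 0) {
    return std::nullopt;
  }
  const uint32_t integral = numerator / denominator;
  // remainder < denominator <= UINT32_MAX, so (remainder << 32) fits in 64
  // bits and the quotient below is always less than 2**32.
  const uint64_t remainder = numerator % denominator;
  const uint32_t fractional =
      static_cast<uint32_t>((remainder << 32) / denominator);
  return PerfScale{integral, fractional};
}
```

For instance, `RatioToFixedPoint(3, 4)` produces `{.integral_part = 0, .fractional_part = 0xC0000000}`, the fixed point encoding of 0.75.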

 ### Syscall 1: `zx_system_set_performance_info`

 The first syscall allows a userspace agent to set performance scales used by the
 kernel scheduler:

 ```c
 zx_status_t zx_system_set_performance_info(
     zx_handle_t resource,
     uint32_t topic,
     const void* new_info,
     size_t info_count
 );
 ```

 Its arguments are:

 - `resource`: A resource that grants permission to this call. Must be
   `ZX_RSRC_SYSTEM_CPU_BASE`, a new resource introduced specifically for this
   API, or the call will fail.

 - `topic`: The type of performance referenced by this call. Must be
   `ZX_CPU_PERF_SCALE`, which will be defined upon proposal implementation.

 - `new_info`: A valid `zx_cpu_performance_info_t[]`, whose elements are
   specified by

   ```c
   typedef struct zx_cpu_performance_info {
       uint32_t logical_cpu_number;
       zx_cpu_performance_scale_t performance_scale;
   } zx_cpu_performance_info_t;
   ```

   where `zx_cpu_performance_scale_t` is [defined
   above](#performance-scale-representation).

   `logical_cpu_number` specifies the CPU whose info is described by the struct,
   using the same numbering scheme utilized by the kernel. Each
   `logical_cpu_number` must be a valid CPU identifier. Elements of `new_info`
   must be sorted in order of strictly increasing `logical_cpu_number` (and
   consequently, each `logical_cpu_number` may appear only once).

   `performance_scale` represents the new performance scale for the indicated
   CPU, and it should correspond to the CPU's new frequency as [described
   previously](#performance-scale-concept). However, the kernel does not validate
   inputs against supported CPU frequencies; any positive value is allowed as an
   input.

   An input scale of `{.integral_part = 0, .fractional_part = 0}` is invalid so
   as not to be confused with a request to offline a core, a procedure with a
   distinct mechanism that is expected to have a different API in the future.

   The kernel may internally override a valid input with the nearest value that
   the scheduler can utilize. For example, at time of writing, the maximum
   supported performance scale is 1.0. Therefore, if `performance_scale`
   represents a value larger than 1.0, then the kernel will internally clamp it
   to `{.integral_part = 1, .fractional_part = 0}`.

   If the call to `zx_system_set_performance_info` fails, then the kernel takes
   no action, and `new_info` has no effect.

   If the call succeeds, then the kernel scheduler will utilize modified
   performance scales corresponding to `new_info` beginning with the next
   reschedule operation, which in general occurs sometime after the call returns.
   The kernel will not modify its performance scales for CPUs not referenced in
   `new_info`.

   Changes made by this call will persist until reboot or until they
   are overridden by further use of this API.

 - `info_count`: The number of elements in `new_info`. Must be positive and no
   greater than the number of CPUs in the system.

 #### Error conditions

 `ZX_ERR_BAD_HANDLE`

 - `resource` is not a valid handle.

 `ZX_ERR_WRONG_TYPE`

 - `resource` is not a valid resource handle or is not of kind
   `ZX_RSRC_KIND_SYSTEM`.

 `ZX_ERR_INVALID_ARGS`

 - `topic` is not `ZX_CPU_PERF_SCALE`.
 - `new_info` is an invalid pointer.
 - `new_info` is not sorted by strictly increasing `logical_cpu_number`.

 `ZX_ERR_OUT_OF_RANGE`

 - `resource` is of kind `ZX_RSRC_KIND_SYSTEM` but is not equal to
   `ZX_RSRC_SYSTEM_CPU_BASE`.
 - `info_count` is `0` or exceeds the number of CPUs.
 - A `logical_cpu_number` was invalid.
 - An input `performance_scale` was `{.integral_part = 0, .fractional_part = 0}`.
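
The argument checks listed above can be mirrored in a userspace pre-check before issuing the call. The following is a sketch only: the types are local stand-ins for the proposed structs, `ValidateNewInfo` and `Check` are hypothetical names, the order of the checks is arbitrary, and it assumes that valid logical CPU numbers run from 0 to `num_cpus - 1`, which this RFC does not state.

```cpp
#include <cstddef>
#include <cstdint>

// Local stand-ins for the proposed structs so the sketch compiles on its own.
struct PerfScale {
  uint32_t integral_part;
  uint32_t fractional_part;
};
struct PerfInfo {
  uint32_t logical_cpu_number;
  PerfScale performance_scale;
};

// Hypothetical result type loosely mirroring the status codes listed above.
enum class Check { kOk, kInvalidArgs, kOutOfRange };

// Hypothetical pre-check. Assumes valid CPU numbers are 0..num_cpus-1.
Check ValidateNewInfo(const PerfInfo* new_info, size_t info_count,
                      size_t num_cpus) {
  if (new_info == nullptr) {
    return Check::kInvalidArgs;
  }
  if (info_count == 0 || info_count > num_cpus) {
    return Check::kOutOfRange;
  }
  for (size_t i = 0; i < info_count; ++i) {
    const PerfInfo& entry = new_info[i];
    // Entries must be sorted by strictly increasing logical_cpu_number.
    if (i > 0 &&
        entry.logical_cpu_number <= new_info[i - 1].logical_cpu_number) {
      return Check::kInvalidArgs;
    }
    if (entry.logical_cpu_number >= num_cpus) {
      return Check::kOutOfRange;
    }
    // A scale of exactly zero is rejected.
    if (entry.performance_scale.integral_part == 0 &&
        entry.performance_scale.fractional_part == 0) {
      return Check::kOutOfRange;
    }
  }
  return Check::kOk;
}
```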

 #### Intended usage {#intended-usage-set}

 `zx_system_set_performance_info` should be used to notify the kernel of
 changes in CPU performance whenever CPU frequency is changed. The API supports
 specification of performance scales for only a subset of CPUs because different
 CPUs may be controlled by different entities.

 If a CPU's frequency is to be decreased, it is recommended that
 `zx_system_set_performance_info` be called before the frequency change has
 occurred. Doing so gives the kernel scheduler the opportunity to reduce load on
 that CPU before its capacity is decreased. (The scheduler is expected to respond
 quickly enough that no further coordination is needed; this expectation will be
 confirmed once support is implemented.)

 Conversely, if a CPU's frequency is to be increased, it is recommended that
 `zx_system_set_performance_info` be called after the frequency change has
 occurred, notifying the scheduler of new capacity only once it is available.

 In either case, should an update to CPU frequency fail, the caller must update
 the kernel scheduler based on the resulting CPU state. The caller should attempt
 to determine the post-failure CPU frequency and use that to inform a separate
 call to `zx_system_set_performance_info`. If the frequency cannot be determined
 (e.g. if an associated driver has failed outright), the caller should make a
 pessimistic (low) guess as to the resulting CPU speed. This recommendation may
 evolve as it is given further consideration; see for example
 [fxbug.dev/84685](https://fxbug.dev/84685).
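
The ordering recommendations above can be sketched as follows. Everything here is hypothetical: `CpuHooks` and `ChangeFrequency` are illustrative names, the hooks stand in for whatever driver interface and syscall wrapper a real agent would use, and `pessimistic_hz` is the caller's low guess for the case where the resulting frequency cannot be determined.

```cpp
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical hooks supplied by the caller.
struct CpuHooks {
  // Requests a frequency change; returns true on success.
  std::function<bool(uint64_t)> set_frequency;
  // Reads back the actual frequency, or std::nullopt if it cannot be
  // determined (for example, if the driver has failed outright).
  std::function<std::optional<uint64_t>()> read_frequency;
  // Informs the kernel of the performance scale implied by a frequency; a
  // real agent would wrap zx_system_set_performance_info here.
  std::function<void(uint64_t)> notify_kernel;
};

void ChangeFrequency(const CpuHooks& cpu, uint64_t current_hz,
                     uint64_t target_hz, uint64_t pessimistic_hz) {
  const bool decreasing = target_hz < current_hz;
  if (decreasing) {
    // Notify first so the scheduler can shed load before capacity drops.
    cpu.notify_kernel(target_hz);
  }
  if (cpu.set_frequency(target_hz)) {
    if (!decreasing) {
      // Notify only once the additional capacity is actually available.
      cpu.notify_kernel(target_hz);
    }
    return;
  }
  // The change failed: issue a separate update based on the resulting state,
  // falling back to a pessimistic guess if the frequency is unknown.
  cpu.notify_kernel(cpu.read_frequency().value_or(pessimistic_hz));
}
```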

 The new API will ultimately be utilized by a to-be-developed "CPU Manager"
 component that will be responsible for userspace administration of CPUs. Rather
 than interacting directly with CPU drivers, agents that wish to modify CPU
 frequency will register requests with CPU Manager, which will coordinate
 frequency changes with updates to the kernel as described in this proposal.

 CPU Manager will also take over responsibility for thermal throttling of CPU
 &mdash; the motivating use case for this proposal &mdash; from Power Manager.

 ### Syscall 2: `zx_system_get_performance_info`

 The second syscall retrieves performance information for all CPUs:

 ```c
 zx_status_t zx_system_get_performance_info(
     zx_handle_t resource,
     uint32_t topic,
     void* info,
     size_t info_count,
     size_t* output_count
 );
 ```

 Its arguments are:

 - `resource`: A resource that grants permission to this call. Must be
   `ZX_RSRC_SYSTEM_CPU_BASE`.

 - `topic`: Either `ZX_CPU_PERF_SCALE` or `ZX_CPU_DEFAULT_PERF_SCALE`, which will
   be defined upon proposal implementation. The topic determines the content
   written to `info`, described below.

 - `info`: A valid `zx_cpu_performance_info_t[]` with length equal to the
   number of CPUs in the system.

   If the call fails, `info` is unmodified.

   If the call succeeds, then upon return `info` contains one element for each
   CPU, ordered by increasing `logical_cpu_number`. Each element's
   `performance_scale` is populated based on `topic`:

      - `ZX_CPU_PERF_SCALE`: `performance_scale` stores the kernel's current
         performance scale for the indicated CPU. The value provided reflects the
         most recent call to `zx_system_set_performance_info` even if the next
         reschedule operation has not yet taken place.

      - `ZX_CPU_DEFAULT_PERF_SCALE`: `performance_scale` stores the default
         performance scale used by the kernel on boot for the indicated CPU.

 - `info_count`: Length of the `info` array; must equal the number of CPUs in the
   system.

 - `output_count`: If the call succeeds, this will contain the number of elements
   written to `info`. If the call fails, its value is unspecified.

 #### Error conditions

 `ZX_ERR_BAD_HANDLE`

 - `resource` is not a valid handle.

 `ZX_ERR_WRONG_TYPE`

 - `resource` is not a valid resource handle or is not of kind
   `ZX_RSRC_KIND_SYSTEM`.

 `ZX_ERR_INVALID_ARGS`

 - `topic` is not `ZX_CPU_PERF_SCALE` or `ZX_CPU_DEFAULT_PERF_SCALE`.
 - `info` is an invalid pointer.

 `ZX_ERR_OUT_OF_RANGE`

 - `resource` is of kind `ZX_RSRC_KIND_SYSTEM` but is not equal to
   `ZX_RSRC_SYSTEM_CPU_BASE`.
 - `info_count` does not equal the total number of CPUs in the system.

 #### Intended usage

 The behavior under `ZX_CPU_PERF_SCALE` allows a userspace agent to query
 performance scales for diagnostic purposes. This may be useful, for example, for
 an agent to assess system state when it first starts or as a signal to a crash
 report.

 The behavior under `ZX_CPU_DEFAULT_PERF_SCALE` allows an agent to
 confirm that the performance scales with which it is configured agree with those
 in use by the kernel.
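
One way an agent might perform that confirmation is sketched below. It assumes the agent has already retrieved the kernel's defaults under `ZX_CPU_DEFAULT_PERF_SCALE` and holds its own configuration in the same layout and order; the types are local stand-ins for the proposed structs, and `FirstMismatch` is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>

// Local stand-ins for the proposed structs so the sketch compiles on its own.
struct PerfScale {
  uint32_t integral_part;
  uint32_t fractional_part;
};
struct PerfInfo {
  uint32_t logical_cpu_number;
  PerfScale performance_scale;
};

// Hypothetical helper: returns the index of the first entry that differs
// between the kernel's defaults and the agent's configuration, or
// std::nullopt if all `count` entries agree. Both arrays are assumed to be
// ordered by increasing logical_cpu_number.
std::optional<size_t> FirstMismatch(const PerfInfo* kernel_defaults,
                                    const PerfInfo* configured, size_t count) {
  for (size_t i = 0; i < count; ++i) {
    const PerfInfo& k = kernel_defaults[i];
    const PerfInfo& c = configured[i];
    if (k.logical_cpu_number != c.logical_cpu_number ||
        k.performance_scale.integral_part !=
            c.performance_scale.integral_part ||
        k.performance_scale.fractional_part !=
            c.performance_scale.fractional_part) {
      return i;
    }
  }
  return std::nullopt;
}
```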

 ## Implementation

 ### Kernel

 - The new syscalls must be implemented, gated by a new resource
   `ZX_RSRC_SYSTEM_CPU_BASE`.

 - The kernel scheduler must be modified to support dynamic performance scales,
   updating them to use the most recent values provided by
   `zx_system_set_performance_info`, and additionally exposing its currently-used
   and default performance scales to `zx_system_get_performance_info`.

 ### Component manager

 A new protocol `CpuResource` must be defined and must be implemented by
 Component Manager to provide the `ZX_RSRC_SYSTEM_CPU_BASE` resource. This
 follows a pre-existing pattern for resources that gate syscalls.

 ## Performance

 The new syscalls themselves will take a negligible amount of time to execute, as
 they simply touch a small amount of data proportional to the number of CPUs.

 Use of `zx_system_set_performance_info` will cause the scheduler to distribute work
 differently, shifting work towards cores whose performance scales increase
 relative to the sum of all performance scales, and away from those whose
 performance scales similarly decrease. The rescheduling process itself will not
 place a significant amount of load on the scheduler.

 Rescheduling will lead to expected changes in system performance. Testing of
 these changes is equivalent to testing the scheduler for functional correctness
 and is addressed in [Testing](#testing).

 ## Security considerations

 Both new syscalls are gated by the new resource handle
 `ZX_RSRC_SYSTEM_CPU_BASE`. For `zx_system_set_performance_info`, this protection
 addresses the clear concern of malicious interference with the scheduler. For
 `zx_system_get_performance_info`, there is the subtler concern of data leakage;
 an untrusted entity should not be able to learn the kernel's performance
 scales, which will typically provide information about the system's supported
 P-states.

 ## Privacy considerations

 This proposal has no meaningful impact on privacy.

 ## Testing {#testing}

 - Core tests will be added to exercise basic success and failure criteria.
 - Unit tests will be added to validate the scheduler's handling of updated
   performance scales. They will verify that if a deadline thread is pinned to a
   CPU, and that CPU's performance scale is modified by factor &alpha;, then the
   actual time allotted to the thread is multiplied by 1/&alpha;.

 ## Documentation

 The Zircon syscall documentation will be updated to include the new API.

 ## Drawbacks, alternatives, and unknowns

 ### Generality

 A more general interface was considered, such as a `zx_set_cpu_properties`
 syscall that could eventually handle additional interactions between the kernel
 and CPUs, like offlining. Ultimately, we opted for a narrow interface because
 very few clients of this interface are expected, keeping the cost of future
 changes to the proposed interface relatively small. Requirements placed on a
 more general interface would be largely guesswork at this point.

 ### Alternative call structure

 As an alternative to the set-only operation of `zx_system_set_performance_info`,
 a combined get/set operation was considered that returns the prior performance
 scales for CPUs whose scales were modified. This was intended as a means of
 ensuring that the caller is capable of reverting performance scale changes
 should lower-level execution of the associated frequency change fail.

 However, further consideration revealed that a simple reversion of changes would
 not be sufficient. This resulted in a more complex set of [failure-handling
 recommendations](#intended-usage-set) and led back to the simpler set-only
 operation.

 Finally, `zx_system_get_performance_info` is still needed: it supports hermetic
 testing, in which case direct reversion of changes *is* appropriate, and it
 supports diagnostic use cases.

 ### Alternative CPU indexing

 We considered using an alternative scheme for indexing CPUs, such as referring
 to them by physical CPU number. However, since the kernel has no other need for
 such a scheme, it is most consistent with Zircon's limited scope to have the API
 use the kernel's existing logical CPU numbers. These numbers are consistent on a
 given system, and a client could either maintain a static per-board
 configuration to refer to them or potentially access their configuration data
 from the ZBI.

 ### Alternative to performance scale

 We considered that, rather than referring to performance scale directly, the new
 API might utilize a "speed factor" that the scheduler would apply to the base
 performance scale for a given CPU. Doing so would reduce the amount of
 context-specific information a client would need to know; rather than
 understanding the relative performances between CPUs, it would only need to know
 the ratio between a CPU's new frequency and its nominal frequency.

 We opted against this approach because performance scale is intended to be used
 in a fundamental way for CPU thermal throttling on a heterogeneous system, so
 the one anticipated client of this API would receive no meaningful benefit from
 using speed factors instead. Meanwhile, we would incur the cost both of defining
 the new concept and modifying the scheduler to utilize it.

 ### Maximum performance scale

 This proposal originally represented performance scale using a `uint32_t` that
 represented real values in \[0.0, 1.0\]. In particular, this allowed
 representation of a maximum value of 1.0.

 While 1.0 is the maximum performance scale supported by the Zircon scheduler at
 time of writing, we decided to allow inputs that represent values greater than
 1.0 to support future use cases, such as a turbo mode. Additionally, the
 previous representation was not fixed point, so it led to values
 that could not be directly used by the scheduler.

 #### Representation of `performance_scale`

 `performance_scale` was originally a `uint64_t`, with the upper 32 bits holding
 the integer part and the lower 32 bits holding the fractional part. This would
 have produced 32 bits of padding between fields in `zx_cpu_performance_info_t`,
 which introduced a potential leakage vector. The new representation avoids that
 pitfall.
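
The layout difference can be checked at compile time. The sketch below uses local stand-in structs with hypothetical names; the last assertion only applies on ABIs where `uint64_t` is 8-byte aligned, which is an assumption about the target rather than something this RFC specifies.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in for the adopted representation: two 32-bit fields.
struct ScaleAsPair {
  uint32_t integral_part;
  uint32_t fractional_part;
};
struct InfoWithPair {
  uint32_t logical_cpu_number;
  ScaleAsPair performance_scale;
};

// Stand-in for the original representation: a single 64-bit field.
struct InfoWithU64 {
  uint32_t logical_cpu_number;
  uint64_t performance_scale;
};

// The adopted layout packs three 32-bit fields with no padding.
static_assert(sizeof(InfoWithPair) == 3 * sizeof(uint32_t),
              "unexpected padding in InfoWithPair");
static_assert(offsetof(InfoWithPair, performance_scale) == sizeof(uint32_t),
              "performance_scale should directly follow logical_cpu_number");

// Where uint64_t is 8-byte aligned, the 64-bit field is pushed to offset 8,
// leaving 4 bytes (32 bits) of padding after logical_cpu_number.
static_assert(alignof(uint64_t) != 8 ||
                  offsetof(InfoWithU64, performance_scale) == 8,
              "expected 32 bits of padding before the 64-bit field");
```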

 ### Allowed values for `performance_scale`

 Careful consideration was given to what values `zx_system_set_performance_info`
 should allow as inputs for `performance_scale`. A value representing 0.0 was
 determined to be too easily confused for an instruction to offline a CPU &mdash;
 an action that Zircon does not currently support but is expected to in the
 future using a different API. As such, a value representing 0.0 was determined
 to be an error.

 Very small values warranted special attention as well. For example, an input of
 `{.integral_part = 0, .fractional_part = 1}` would represent 2<sup>-32</sup>,
 which could reasonably be treated as 0.0, effectively rendering the
 corresponding core offline. While this would be possible to address by enforcing
 a minimum allowed value, any such threshold would currently be arbitrary and
 would further complicate the contract between the kernel and userspace. We felt
 it most straightforward to treat the new API as a hinting mechanism and leave
 the kernel with the freedom to override inputs if it needs to do so without
 exposing internal details related to such a choice.

 ### Future work

 #### Configuration management

 Ideally, userspace agents would use the ZBI to share the exact same CPU
 configuration data utilized by the kernel scheduler. It is unclear whether doing
 so is currently practical.

 Additionally, care must be taken to ensure that both the kernel and userspace
 agents associate default performance scales with the same nominal frequencies.

 #### Lower bounds on performance scales

 In principle, the scheduler can determine minimum performance scales that the
 system should maintain based on current deadline threads and CPU load. Dynamic
 versions of these bounds would be an important input to a userspace agent that
 attempts to utilize lower CPU frequencies for energy efficiency. An additional
 option to `zx_system_get_performance_info` would provide a natural means to
 expose them.

 #### CPU attribution

 Some means should be established to associate a thread's attributed CPU time
 with the performance of the CPU on which it was scheduled. Such association is
 already relevant to the establishment of performance metrics that are robust to
 scheduling on big cores versus little cores, and it becomes even more relevant
 as we develop the machinery surrounding frequency modifications, as with this
 proposal.

 #### Guaranteed execution of throttling agent

 Reduction of CPU frequencies when performing thermal throttling may lead to CPU
 starvation, which in turn may make the throttling agent's process less likely to
 be scheduled in a timely fashion. Execution of the throttling agent should be
 prioritized in an appropriate manner.

 ## Prior art and references

 Delegation of responsibility for CPU frequency control to userspace is unusual
 for operating systems, making prior art on this topic unavailable.