# CPU performance monitor

## Introduction

The CPU Performance Monitor Trace Provider gives the user access to the
performance counters built into the CPU using the
[Fuchsia tracing system](/docs/concepts/tracing/README.md).

At present this is only supported for Intel chipsets.

On Intel the Performance Monitor provides the user with statistics regarding
many aspects of the CPU.
For a complete list of the performance events available for, e.g.,
Skylake chips see the Intel Software Developer's Manual, Volume 3, Chapter 19.2,
Performance Monitoring Events For 6th And 7th Generation Processors.
Not all events (or "counters") are currently available, as there are a great
many of them, but a number of useful events are present.

Here are a few examples:

- cache hits/misses, for each of L1, L2, L3
- cycles stalled due to cache misses
- branch mispredicts
- instructions retired

The tracing system uses "categories" to let one specify what trace data
to collect. Cpuperf uses these categories to simplify the specification
of which hardware events to enable. The full set of categories can be found
in the `.inc` files in this directory. A representative set of categories
is described below.

To collect trace data, run `trace record` on your Fuchsia system,
or run it indirectly via the `traceutil` host tool. The latter is recommended
as it automates the download of the collected `trace.json` file to your
desktop.

Example:

```shell
host$ categories="gfx"
host$ categories="$categories,cpu:fixed:unhalted_reference_cycles"
host$ categories="$categories,cpu:fixed:instructions_retired"
host$ categories="$categories,cpu:l2_lines,cpu:sample:10000"
host$ fx traceutil record --buffer-size=64 --duration=2s \
    --categories=$categories
Starting trace; will stop in 2 seconds...
Stopping trace...
Trace file written to /data/trace.json
Downloading trace... done
Converting trace-2017-11-12T17:55:45.json to trace-2017-11-12T17:55:45.html... done.
```
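
The category list in the example above can also be assembled with a small
shell helper instead of repeated string concatenation. The following is a
minimal sketch: `join_categories` is a hypothetical function, not part of
`traceutil`, and the final command is only printed, not run.

```shell
# join_categories is a hypothetical helper (not part of traceutil).
# It joins its arguments with commas.
join_categories() {
  local IFS=,
  printf '%s\n' "$*"
}

categories=$(join_categories \
  gfx \
  cpu:fixed:unhalted_reference_cycles \
  cpu:fixed:instructions_retired \
  cpu:l2_lines \
  cpu:sample:10000)

# Print the command instead of running it, so the sketch is safe to try.
echo "fx traceutil record --buffer-size=64 --duration=2s --categories=$categories"
```

Keeping one category per line makes it easier to comment out individual
counters between trace runs.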

After you have the `.json` file on your desktop you can load it into
`chrome://tracing`. If you are using `traceutil`, an easier way to view
the trace is to load the corresponding `.html` file that `traceutil`
generates. The author finds it easiest to run `traceutil` from the top level
Fuchsia directory, view that directory in Chrome (e.g.,
`file:///home/dje/fnl/ipt/fuchsia`), hit Refresh after each new trace,
and then view the trace file in a separate tab.

## Basic Operation

The basic operation of performance data collection is to allocate a
buffer for trace records for each CPU, and then set a counter (on each CPU)
to trigger an interrupt after a pre-specified number of events occurs.
This interrupt is called the Performance Monitor Interrupt (PMI).
On Intel the interrupt triggers when the counter overflows, at which point
the interrupt service routine writes various information (for example,
timestamp and program counter) to the trace buffer, resets the counter
to trigger another interrupt after the pre-specified number of events,
and returns.
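
The overflow mechanism can be illustrated with a little arithmetic: to get an
interrupt after N events, a counter can be preset to N below the point at
which it wraps, so that it overflows on the Nth event. The sketch below
assumes a 48-bit counter purely for illustration (the real width is reported
by the CPU and is not stated in this document), the variable names are
made up, and the kernel's actual reset logic may differ.

```shell
counter_width=48     # assumption for illustration only
sample_rate=10000    # events between interrupts, as in cpu:sample:10000

# Value at which the counter wraps around (2^counter_width).
wrap_value=$(( 1 << counter_width ))

# Preset the counter so that it overflows after sample_rate more events.
initial_value=$(( wrap_value - sample_rate ))

# Sanity check: number of increments left before the overflow.
events_until_overflow=$(( wrap_value - initial_value ))

printf 'preset counter to 0x%x, overflow after %d events\n' \
  "$initial_value" "$events_until_overflow"
```

The interrupt handler would apply the same preset again after writing each
record, which is what keeps the sampling interval constant.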

When tracing stops, the buffer is read by the Cpuperf Trace Provider and
converted to the trace format used by the Trace Manager.

Tracing also stops when the buffer fills. Note that an internal buffer
is used, and thus circular and streaming modes are not (currently) supported.
How much trace data can be collected depends on several factors:

- duration of the trace
- size of the buffer
- frequency of sampling
- how frequently the counter overflows
- whether program counter information is written to the buffer
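
How these factors interact can be estimated with rough arithmetic. In the
sketch below every number is a placeholder: the record size, the event rate,
and the reading of `--buffer-size=64` as a 64 MB buffer for a single CPU are
assumptions for illustration, not values taken from the implementation.

```shell
buffer_mb=64                    # assumed meaning of --buffer-size=64
record_bytes=16                 # hypothetical size of one trace record
sample_rate=10000               # events per sample (cpu:sample:10000)
events_per_second=1000000000    # hypothetical rate of the counted event

buffer_bytes=$(( buffer_mb * 1024 * 1024 ))
records_that_fit=$(( buffer_bytes / record_bytes ))
records_per_second=$(( events_per_second / sample_rate ))

# Integer estimate of how long one counter could run before the buffer fills.
seconds_until_full=$(( records_that_fit / records_per_second ))

echo "records that fit:   $records_that_fit"
echo "seconds until full: $seconds_until_full"
```

Under these placeholder numbers, lowering the sample rate value (sampling more
often) or enabling more counters shortens the time until the buffer fills and
tracing stops.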

## Data Collection Categories

As stated earlier, the Fuchsia tracing system uses "categories" to
let one specify what data to collect. For CPU tracing, there are categories
to specify which counters to enable, whether to trace the OS, userspace,
or both, and the sampling frequency.

For each performance counter see the Intel documentation for further
information. This document does not attempt to provide detailed information
on each counter.

### Sample Rate

Data for each counter is collected at a rate specified by the user,
expressed as the number of events between samples.
Eventually specifying an arbitrary rate will be possible. In the meantime
the following set of rates is supported:

- cpu:sample:100
- cpu:sample:500
- cpu:sample:1000
- cpu:sample:5000
- cpu:sample:10000
- cpu:sample:50000
- cpu:sample:100000
- cpu:sample:500000
- cpu:sample:1000000
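
Because only this fixed set of rates is accepted, a wrapper script may want
to check a rate before building the category string. A minimal sketch,
assuming the list above is current; `is_supported_sample_rate` is a
hypothetical helper, not something provided by the tracing tools.

```shell
# Hypothetical helper: succeeds only for the rates listed in this document.
is_supported_sample_rate() {
  case "$1" in
    100|500|1000|5000|10000|50000|100000|500000|1000000) return 0 ;;
    *) return 1 ;;
  esac
}

rate=10000
if is_supported_sample_rate "$rate"; then
  echo "cpu:sample:$rate"
else
  echo "unsupported sample rate: $rate" >&2
fi
```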

#### Independent sampling

By default each counter is sampled independently.
For example, if one requests "cpu:fixed:instructions_retired"
and "cpu:arch:llc" (Last Level Cache - L3) with a sampling rate of 10000,
then retired instructions will be sampled every 10000 "instruction retired"
events and LLC operations will be sampled every 10000 "LLC" events,
with the former happening far more frequently than the latter.
Timestamps are collected with each sample so one can know how long it took
to, for example, retire 10000 instructions.
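
As a worked example of using those timestamps, the event rate between two
consecutive samples of the same counter is the sample rate divided by the
elapsed time. The timestamps below are made-up values and are assumed to be
already converted to nanoseconds; the trace itself may record them in
different units.

```shell
sample_rate=10000     # events between samples (cpu:sample:10000)
t0_ns=1000000         # hypothetical timestamp of one sample
t1_ns=1004000         # hypothetical timestamp of the next sample

elapsed_ns=$(( t1_ns - t0_ns ))

# Events per microsecond, using integer arithmetic.
events_per_us=$(( sample_rate * 1000 / elapsed_ns ))

echo "elapsed: ${elapsed_ns} ns, rate: ${events_per_us} events/us"
```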

#### Timebased sampling

A few counters are available to be used as "timebases".
In timebase mode one counter is used to drive data collection of all counters,
as opposed to each counter being collected at its own rate.
This can provide a more consistent view of what's happening. On the other hand,
doing so means we forgo collecting statistical pc data for each event
(since the only pc values we will have are those for the timebase event).
A sample rate must be provided in addition to the timebase counter.

See Timebase Counters below for the set of timebase counters as of this writing,
and `garnet/bin/cpuperf_provider/intel-timebase-categories.inc`
in the source tree for the current set.
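
A timebase request might be put together as follows. This is a sketch under
the assumption that a timebase category is combined with the other categories
in the same comma-separated list; `has_sample_rate` is a hypothetical helper
that only checks the requirement stated above, namely that a sample rate
accompanies the timebase counter.

```shell
categories="cpu:timebase:fixed:unhalted_reference_cycles"
categories="$categories,cpu:arch:llc"
categories="$categories,cpu:sample:10000"

# Hypothetical helper: succeeds if the list contains a cpu:sample:<rate> entry.
has_sample_rate() {
  case ",$1," in
    *,cpu:sample:[0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

if has_sample_rate "$categories"; then
  echo "--categories=$categories"
else
  echo "timebase mode also needs a cpu:sample:<rate> category" >&2
fi
```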

### Tally Mode

Tally mode is a simpler alternative to sampling mode where counts of each
event are collected over the entire trace run and then reported.

Tally mode is enabled via a category of "cpu:tally" instead of one of
the "cpu:sample:*" categories.

Example:

```shell
host$ categories="cpu:l2_summary"
host$ categories="$categories,cpu:fixed:unhalted_reference_cycles"
host$ categories="$categories,cpu:fixed:instructions_retired"
host$ categories="$categories,cpu:mem:bytes,cpu:mem:requests"
host$ categories="$categories,cpu:tally"
host$ fx traceutil record --buffer-size=64 --duration=2s \
    --categories=$categories --report-type=tally --stdout
```

### Options

- cpu:os - collect data for code running in kernelspace.

- cpu:user - collect data for code running in userspace.

- cpu:profile_pc - collect pc data associated with each event.

  This is useful when wanting to know where, for example, cache misses
  are generally occurring (statistically speaking, depending upon the
  sample rate). The address space and program counter of each sample
  are included in the trace output. Doing so doubles the size of each
  trace record though, so there are tradeoffs.
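
As an illustration, the options are given as ordinary categories alongside
the counters and the sample rate. The combination below is a sketch that
requests kernel-space LLC data with program counters; the command is only
printed, and the exact behavior when `cpu:user` is omitted is not covered by
this document.

```shell
categories="cpu:arch:llc,cpu:sample:10000"
categories="$categories,cpu:os"          # data for kernel-space code
categories="$categories,cpu:profile_pc"  # include pc data (doubles record size)

# Print the command instead of running it.
echo "fx traceutil record --duration=2s --categories=$categories"
```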

### Fixed Counters

The Intel Architecture provides three "fixed" counters:

- cpu:fixed:instructions_retired

- cpu:fixed:unhalted_core_cycles

- cpu:fixed:unhalted_reference_cycles

These counters are "fixed" in the sense that each has a fixed use and
does not occupy a programmable counter. That is their advantage:
there are dozens of programmable events but, depending on the model,
typically at most four can be counted at a time.

### Programmable Counters

There are dozens of programmable counters on Skylake (and Kaby Lake) chips.
For a complete list see the Intel Software Developer's Manual, Volume 3, Chapter 19.2,
Performance Monitoring Events For 6th And 7th Generation Processors.
For a list of the ones that are currently supported see
`zircon/system/ulib/zircon-internal/include/lib/zircon-internal/device/cpu-trace/intel-pm-events.inc`
and
`zircon/system/ulib/zircon-internal/include/lib/zircon-internal/device/cpu-trace/skylake-pm-events.inc`
in the source tree.

To simplify specifying the programmable counters they have been grouped
into categories defined in
`garnet/bin/cpuperf_provider/intel-pm-categories.inc`
and
`garnet/bin/cpuperf_provider/skylake-pm-categories.inc`
in the source tree. See these files for a full list.

Only one of these categories may be specified at a time.
(Later we'll provide more control over what data to collect.)
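
A script that builds category lists could guard against requesting more than
one programmable category. The sketch below assumes that programmable
categories can be recognized by the `cpu:arch:` and `cpu:skl:` prefixes used
by the examples in this document; that rule and the `count_programmable`
helper are illustrative assumptions, and the `.inc` files remain the
authoritative list.

```shell
# Hypothetical helper: counts entries with a cpu:arch: or cpu:skl: prefix
# in a comma-separated category list.
count_programmable() {
  local count=0 category
  local IFS=,
  for category in $1; do
    case "$category" in
      cpu:arch:*|cpu:skl:*) count=$(( count + 1 )) ;;
    esac
  done
  echo "$count"
}

categories="cpu:arch:llc,cpu:skl:l2_summary,cpu:sample:10000"
n=$(count_programmable "$categories")
if [ "$n" -gt 1 ]; then
  echo "error: $n programmable categories requested, only one is allowed" >&2
fi
```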

A small selection of useful categories:

- cpu:arch:llc
  - Last Level Cache (L3) references
  - Last Level Cache (L3) misses

- cpu:arch:branch
  - Branch instructions retired
  - Branch instructions mispredicted

- cpu:skl:l1_summary
  - Number of outstanding L1D misses every cycle
  - Number of outstanding L1D misses for any logical thread on this processor core
  - Number of lines brought into L1 data cache

- cpu:skl:l2_summary
  - Demand requests that missed L2
  - All requests that missed L2
  - All Demand Data Read requests to L2
  - All requests to L2

- cpu:skl:l3_summary
  - Requests originating from core that reference cache line in L3
  - Cache miss condition for references to L3

- cpu:skl:offcore_demand_code
  - Incremented each cycle of the number of offcore outstanding Demand Code Read transactions in SQ to uncore
  - Cycles with at least 1 offcore outstanding Demand Code Read transactions in SQ to uncore

- cpu:skl:offcore_demand_data
  - Incremented each cycle of the number of offcore outstanding Demand Data Read transactions in SQ to uncore
  - Cycles with at least 1 offcore outstanding Demand Data Read transactions in SQ to uncore
  - Cycles with at least 6 offcore outstanding Demand Data Read transactions in SQ to uncore

- cpu:skl:l1_miss_cycles
  - Cycles while L1 data miss demand load is outstanding
  - Execution stalls while L1 data miss demand load is outstanding

- cpu:skl:l2_miss_cycles
  - Cycles while L2 miss demand load is outstanding
  - Execution stalls while L2 miss demand load is outstanding

- cpu:skl:l3_miss_cycles
  - Cycles while L3 miss demand load is outstanding
  - Execution stalls while L3 miss demand load is outstanding

- cpu:skl:mem_cycles
  - Cycles while memory subsystem has an outstanding load
  - Execution stalls while memory subsystem has an outstanding load

Note: The wording of some of these events may seem odd.
The author has tried to preserve the wording found
in the Intel manuals, though improvements are welcome.

Note: These categories are just a first pass and will be reworked
as the need arises. Please see the category `.inc` files
in your source tree for an up-to-date list.

### Timebase Counters

These counters may be used as timebases.
More will be added in time.

- cpu:timebase:fixed:instructions_retired
  - same counter as cpu:fixed:instructions_retired

- cpu:timebase:fixed:unhalted_reference_cycles
  - same counter as cpu:fixed:unhalted_reference_cycles