The CPU Performance Monitor Trace Provider gives the user access to the performance counters built into the CPU using the tracing system provided by Fuchsia.
At present this is only supported for Intel chipsets.
On Intel, the Performance Monitor provides the user with statistics on many aspects of the CPU. For a complete list of the performance events available for, e.g., Skylake chips, see Intel Volume 3, Chapter 19.2, "Performance Monitoring Events For 6th And 7th Generation Processors". Not all events (or "counters") are currently available, as there are a lot(!), but a number of useful events are already present.
Here are a few examples: instructions retired, unhalted core cycles, last-level cache (LLC) operations, and branch events.
The tracing system uses "categories" to let one specify what trace data to collect. Cpuperf uses these categories to simplify the specification of which h/w events to enable. The full set of categories can be found in the `.inc` files in this directory. A representative set of categories is described below.
To collect trace data, run `trace record` on your Fuchsia system, or invoke it indirectly via the `traceutil` host tool. The latter is recommended as it automates the download of the collected `trace.json` file to your desktop.
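As a minimal sketch of the direct route, run on the target itself and using category names from this document (the exact `trace record` flags and their formats may vary between releases, so treat this as an approximation):

```
fuchsia$ trace record --buffer-size=64 --duration=2 \
    --categories=cpu:fixed:instructions_retired,cpu:sample:10000
```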
Example:
host$ categories="gfx" host$ categories="$categories,cpu:fixed:unhalted_reference_cycles" host$ categories="$categories,cpu:fixed:instructions_retired" host$ categories="$categories,cpu:l2_lines,cpu:sample:10000" host$ fx traceutil record --buffer-size=64 --duration=2s \ --categories=$categories Starting trace; will stop in 2 seconds... Stopping trace... Trace file written to /data/trace.json Downloading trace... done Converting trace-2017-11-12T17:55:45.json to trace-2017-11-12T17:55:45.html... done.
After you have the `.json` file on your desktop you can load it into `chrome://tracing`. If you are using `traceutil`, an easier way to view the trace is to load the corresponding `.html` file that `traceutil` generates. The author finds it easiest to run `traceutil` from the top-level Fuchsia directory, view that directory in Chrome (e.g., `file:///home/dje/fnl/ipt/fuchsia`), hit Refresh after each new trace, and then view the trace file in a separate tab.
The basic operation of performance data collection is to allocate a buffer of trace records for each CPU, and then set a counter (on each CPU) to trigger an interrupt after a pre-specified number of events occurs. This is called the PMI (Performance Monitor Interrupt). On Intel the interrupt triggers when the counter overflows, at which point the interrupt service routine writes various information (for example, timestamp and program counter) to the trace buffer, resets the counter to trigger another interrupt after the pre-specified number of events, and returns.
When tracing stops, the buffer is read by the Cpuperf Trace Provider and converted to the trace format used by the Trace Manager. Tracing also stops when the buffer fills. Note that an internal buffer is used, and thus circular and streaming modes are not (currently) supported. How much trace data can be collected depends on several factors, including the size of the per-CPU buffers, the sampling rate, and the size of each trace record (enabling `cpu:profile_pc`, for example, doubles the record size).
As stated earlier, the Fuchsia tracing system uses "categories" to let one specify what data to collect. For CPU tracing, there are categories to specify which counters to enable and whether to trace the OS, userspace, or both, as well as to specify the sampling frequency.
See the Intel documentation for further information on each performance counter; this document does not attempt to provide detailed information on each one.
Data for each counter is collected at a rate specified by the user. Eventually specifying a random rate will be possible. In the meantime, sampling is limited to a fixed set of rates, selected via the `cpu:sample:*` categories (for example, `cpu:sample:10000`).
By default each counter is sampled independently. For example, if one requests "cpu:fixed:instructions_retired" and "cpu:arch:llc" (Last Level Cache - L3) with a sampling rate of 10000, then retired instructions will be sampled every 10000 "instruction retired" events and LLC operations will be sampled every 10000 "LLC" events, with the former happening far more frequently than the latter. Timestamps are collected with each sample, so one can know how long it took to, for example, retire 10000 instructions.
A few counters are available to be used as "timebases". In timebase mode one counter is used to drive data collection of all counters, as opposed to each counter being collected at its own rate. This can provide a more consistent view of what's happening. On the other hand, it means we forgo collecting statistical pc data for each event, since the only pc values we will have are those for the timebase event. A sample rate must be provided in addition to the timebase counter.
See below for the set of timebase counters as of this writing, and `garnet/bin/cpuperf_provider/intel-timebase-categories.inc` in the source tree for the current set.
Tally mode is a simpler alternative to sampling mode where counts of each event are collected over the entire trace run and then reported.
Tally mode is enabled via a category of "cpu:tally" instead of one of the "cpu:sample:*" categories.
Example:
host$ categories="cpu:l2_summary" host$ categories="$categories,cpu:fixed:unhalted_reference_cycles" host$ categories="$categories,cpu:fixed:instructions_retired" host$ categories="$categories,cpu:mem:bytes,cpu:mem:requests" host$ categories="$categories,cpu:tally" host$ fx traceutil record --buffer-size=64 --duration=2s \ --categories=$categories --report-type=tally --stdout
- `cpu:os` - collect data for code running in kernelspace.
- `cpu:user` - collect data for code running in userspace.
- `cpu:profile_pc` - collect pc data associated with each event.
This is useful when wanting to know where, for example, cache misses are generally occurring (statistically speaking, depending upon the sample rate). The address space and program counter of each sample are included in the trace output. Doing so doubles the size of each trace record, though, so there are tradeoffs.
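For example, one might add `cpu:profile_pc` and `cpu:user` to an otherwise ordinary sampling run; a minimal sketch, using only category names defined in this document:

```
host$ categories="cpu:fixed:instructions_retired,cpu:sample:10000"
host$ categories="$categories,cpu:user,cpu:profile_pc"
host$ fx traceutil record --buffer-size=64 --duration=2s \
    --categories=$categories
```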
The Intel Architecture provides three “fixed” counters:
- `cpu:fixed:instructions_retired`
- `cpu:fixed:unhalted_core_cycles`
- `cpu:fixed:unhalted_reference_cycles`
These counters are "fixed" in the sense that they don't use the programmable counters. There are three of them, and each has a fixed use. Their advantage is that they don't use up a programmable counter: there are dozens of events to choose from but, depending on the model, typically at most four programmable counters are usable at a time.
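Because the fixed counters don't consume programmable slots, all three can be enabled alongside a programmable category. A sketch, assuming the category names listed in this document:

```
host$ categories="cpu:fixed:instructions_retired"
host$ categories="$categories,cpu:fixed:unhalted_core_cycles"
host$ categories="$categories,cpu:fixed:unhalted_reference_cycles"
host$ categories="$categories,cpu:arch:llc,cpu:sample:10000"
host$ fx traceutil record --buffer-size=64 --duration=2s \
    --categories=$categories
```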
There are dozens of programmable counters on Skylake (and Kaby Lake) chips. For a complete list see Intel Volume 3, Chapter 19.2, "Performance Monitoring Events For 6th And 7th Generation Processors". For a list of the ones that are currently supported, see `zircon/system/ulib/zircon-internal/include/lib/zircon-internal/device/cpu-trace/intel-pm-events.inc` and `zircon/system/ulib/zircon-internal/include/lib/zircon-internal/device/cpu-trace/skylake-pm-events.inc` in the source tree.
To simplify specifying the programmable counters, they have been grouped into categories defined in `garnet/bin/cpuperf_provider/intel-pm-categories.inc` and `garnet/bin/cpuperf_provider/skylake-pm-categories.inc` in the source tree. See these files for a full list.
Only one of these categories may be specified at a time. [Later we'll provide more control over what data to collect.]
A small selection of useful categories:
- `cpu:arch:llc`
- `cpu:arch:branch`
- `cpu:skl:l1_summary`
- `cpu:skl:l2_summary`
- `cpu:skl:l3_summary`
- `cpu:skl:offcore_demand_code`
- `cpu:skl:offcore_demand_data`
- `cpu:skl:l1_miss_cycles`
- `cpu:skl:l2_miss_cycles`
- `cpu:skl:l3_miss_cycles`
- `cpu:skl:mem_cycles`
Note: The wording of some of these events may seem odd. The author has tried to preserve the wording found in the Intel manuals, though improvements are welcome.
Note: This is just a first pass! They'll be reworked as the need arises. Please see the category `.inc` files in your source tree for an up-to-date list.
These counters may be used as timebases. More will be added in time.
- `cpu:timebase:fixed:instructions_retired`
- `cpu:timebase:fixed:unhalted_reference_cycles`
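For example, a timebase run driven by retired instructions might look like the following; a hedged sketch, since per the discussion above a sample rate must be supplied along with the timebase counter:

```
host$ categories="cpu:timebase:fixed:instructions_retired"
host$ categories="$categories,cpu:arch:llc,cpu:sample:10000"
host$ fx traceutil record --buffer-size=64 --duration=2s \
    --categories=$categories
```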