Debugging kernel crashes with JTRACE

Overview

What it is

JTRACE is a simple, lightweight, “just-in-case” tracing system meant to assist developers in debugging kernel crashes and spontaneous reboots by recording a series of trace event records generated by instrumentation added by a developer, and automatically playing them back in the event of a kernel panic. The intent is to give developers the ability to create trace records from almost anywhere in kernel code which will give them additional information, beyond just a register dump and stack trace, to assist in understanding the sequence of events which led up to a crash. Useful properties of JTRACE include:

Lockless ring buffer data structure: Creating a trace event will never block, and will never attempt to acquire any locks, making it available for use during early init, inside of arbitrary locks, during hard IRQ time, and almost anywhere else in the kernel code.
Low overhead: Creating trace events is a relatively cheap operation, making it possible to instrument performance sensitive code with minimal timing disruption.
Optional persistence across reboot: Platform permitting, JTRACE may be configured to use some of the persistent RAM typically used for crashlog storage to assist in debugging issues involving spontaneous reboots, typically caused by triggering a hardware watchdog timer.

What it is not

JTRACE is not a production tracing system. By default, it is completely disabled, and it is never meant to be enabled in a production build. Trace points are not meant to be left in code by developers in changes which are checked in. Typical usage of JTRACE involves a developer adding trace points to their code, reproducing a crash, then modifying existing trace points or adding new ones, and repeating until they have root caused their bug. Once the bug has been found, understood, and fixed, the code which added the trace points is discarded.

Configuration

Configuration of JTRACE is controlled using build arguments set via fx set. There are currently 4 options which can be specified. They are:

Name	Type	Notes
`jtrace_enabled`	tri-state	Enables JTRACE when set to `true` or `"persistent"`. Disabled by default.
`jtrace_target_buffer_size`	uint32_t	Controls the amount of RAM reserved for the trace buffer.
`jtrace_last_entry_storage`	uint32_t	Controls the amount of RAM reserved for per-cpu last event records.
`jtrace_use_large_entries`	bool	Controls whether large or small trace entries are used.

By default, JTRACE is disabled and not included in a kernel build. To enable JTRACE, users may set the jtrace_enabled build argument to a value of either true, or "persistent" to enable JTRACE and configure it for persistent tracing.

If persistent tracing is chosen, jtrace_target_buffer_size must be a multiple of the kernel's configured persistent_ram_allocation_granularity value (defaults to 128 bytes).

If not explicitly specified by the user, defaults values for each of the other configuration parameters will be chosen based on whether or not persistent tracing has been selected.

Name	Non-persistent Default	Persistent Default
`jtrace_target_buffer_size`	32768	4096
`jtrace_last_entry_storage`	0	0
`jtrace_use_large_entries`	true	false

Once the build has been configured, simply rebuild and deploy the new kernel to your target in order to start to use JTRACE.

Examples

To enable non-persistent tracing with large records and 32KB worth of storage, add the following to your fx set line.

--args ‘jtrace_enabled=true’

To expand this storage to 64KB and use small records instead, the arguments would be:

--args ‘jtrace_enabled=true jtrace_target_buffer_size=65536 jtrace_use_large_entries=false’

To enable persistent tracing with large records (overriding the defaults) and storage for 4 per-cpu last event records (also overriding the defaults), requesting 4KB worth of storage, add the following to your fx set line.

--args ‘jtrace_enabled=“persistent” jtrace_last_entry_storage=4 jtrace_use_large_entries=true’

Please note that ""s are required around persistent.

Adding trace points

Requirements

In order to add a trace point to a file, users must include the jtrace header by adding the following #include statement in each file where trace points are to be added.

#include <lib/jtrace/jtrace.h>

If the include file is not found, it may be necessary to add the headers to the deps section of your module's BUILD.gn file.

deps = [ "//zircon/kernel/lib/jtrace:headers" ] # If deps has not been declared yet.
deps += [ "//zircon/kernel/lib/jtrace:headers" ] # If deps has already been declared.

The JTRACE macro.

Trace points may now be added by invoking the JTRACE macro. Every trace point must specify a string literal “tag”, and may specify as many general purpose arguments as the configured trace entry record supports. Tags must be string literals in order to remain valid across reboots when persistent tracing is configured.

General purpose arguments do not need to be specified, and will default to 0 if omitted. Additionally, no casting should be needed when specifying general purpose arguments. If the argument field is large enough to store the value, it will. Small trace entries have space for a single 32 bit argument, while large trace entries have space for four 32 bit args, followed by two 64 bit args.

Here is an example of using tracing, when configured for large entries.

#include <lib/jtrace/jtrace.h>

struct MyStruct {
  void *ptr1, *ptr2;
  uint32_t foo, bar;
};

void MyFunction(const MyStruct* args, uint32_t num_args) {
  JTRACE("(num_args, args)", num_args, 0, 0, 0, args);

  for (uint32_t i = 0; i < num_args; ++i) {
    const auto& a = args[i];
    JTRACE("Before (i, foo, bar, 0, ptr1, ptr2)", i, a.foo, a.bar, 0, a.ptr1, a.ptr2);
    zx_status_t status = DoAThingWithMyStruct(a);
    JTRACE("After (status)", status);
  }

  JTRACE("Finished");
}

And here is what one of the “Before” tagged trace statements would look like when the trace buffer is eventually displayed.

[   7.718998976][cpu 2 tid     3702] : 00000002 00000006 00000001 00000000 ffffffff0024c720 ffffff9519077e60 : (       0.308 uSec) : jtrace_example.cc:MyFunction:12 (Before (i, foo, bar, 0, ptr1, ptr2))

Breaking this down by field, we have:

Field	Description
`[ 7.718998976]`	The clock monotonic timestamp of the trace entry
`[cpu 2 tid 3702]`	The ID of the CPU and thread which was active when the trace entry was recorded
`00000002`	The value of the 1st 32-bit argument (`i`)
`00000006`	The value of the 2nd 32-bit argument (`foo`)
`00000001`	The value of the 3rd 32-bit argument (`bar`)
`00000000`	The value of the 4th 32-bit argument (0)
`ffffffff0024c720`	The value of the 1st 64-bit argument (`ptr1`)
`ffffff9519077e60`	The value of the 2nd 64-bit argument (`ptr2`)
`( 0.308 uSec)`	The time since the last displayed trace entry
`jtrace_example.cc:MyFunction:12`	The filename, function name, and line number of the trace entry
`(Before (i, foo, bar, 0, ptr1, ptr2))`	The user's string literal tag

Displaying the contents of a trace buffer

When JTRACE is enabled, the contents of the trace buffer will be dumped automatically whenever a kernel panic occurs, along with the register dump and backtrace.

Users may also dump the contents of the trace buffer manually using the kernel shell. From a serial console, simply run k jtrace to dump the current trace buffer, or run k jtrace -r to dump the contents of the recovered persistent trace buffer, if available. Note that while the live trace buffer is being dumped, tracing will be temporarily disabled and no new trace entries will be added to the buffer. Invocations of the JTRACE macro will be treated as no-ops.

Examples

Here is an example of a dump of a trace buffer configured to use small entries, with 4kB of storage, and 4 Per-CPU last event records. Syscall entry and exit was instrumented using two JTRACE macros. The tag was used to hold the string literal name of the syscall being made, while the first general purpose argument was used to record a bool indicating whether the trace marks the start, or the end, of the syscall.

JTRACE: Recovered 166/166 entries
[  17.431102506][cpu 0] : 00000001 : (    0.000 uSec) : (futex_wait)
[  17.431107979][cpu 0] : 00000000 : (    5.473 uSec) : (nanosleep)
[  17.431109757][cpu 0] : 00000001 : (    1.778 uSec) : (nanosleep)
[  17.431118119][cpu 0] : 00000000 : (    8.362 uSec) : (futex_wake)
[  17.431131543][cpu 0] : 00000001 : (   13.424 uSec) : (futex_wake)
[  17.431134424][cpu 0] : 00000000 : (    2.881 uSec) : (futex_wake)
<snip>
[  17.437545001][cpu 3] : 00000001 : (    1.443 uSec) : (object_wait_async)
[  17.437547341][cpu 3] : 00000000 : (    2.340 uSec) : (port_wait)
[  17.437548263][cpu 3] : 00000001 : (    0.922 uSec) : (port_wait)
[  17.437666664][cpu 3] : 00000000 : (  118.401 uSec) : (channel_read_etc)
[  17.437668850][cpu 3] : 00000001 : (    2.186 uSec) : (channel_read_etc)
[  17.437678993][cpu 3] : 00000000 : (   10.143 uSec) : (debug_send_command)

JTRACE: Last recorded per-CPU events.
[  17.437001443][cpu 0] : 00000000 : (    0.000 uSec) : (futex_wait)
[  17.436016989][cpu 1] : 00000000 : (    0.000 uSec) : (channel_call_etc_noretry)
[  17.437066990][cpu 2] : 00000000 : (    0.000 uSec) : (futex_wait)
[  17.437678993][cpu 3] : 00000000 : (    0.000 uSec) : (debug_send_command)

JTRACE: Last log timestamp [  17.437678993]

Here is the same example, this time with large records configured.

JTRACE: Recovered 47/47 entries
[  24.270872973][cpu 2 tid     1301] : 00000001 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       0.000 uSec) : syscalls.cc:do_syscall_post:84 (channel_read_etc)
[  24.270875158][cpu 2 tid     1301] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       2.185 uSec) : syscalls.cc:do_syscall_pre:59 (object_wait_async)
[  24.270879083][cpu 2 tid     1301] : 00000001 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       3.925 uSec) : syscalls.cc:do_syscall_post:84 (object_wait_async)
[  24.270886634][cpu 2 tid     1301] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       7.551 uSec) : syscalls.cc:do_syscall_pre:59 (port_wait)
[  24.273031016][cpu 1 tid     1295] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (    2144.382 uSec) : syscalls.cc:do_syscall_pre:59 (channel_write_etc)
[  24.273057126][cpu 1 tid     1295] : 00000001 00000000 00000000 00000000 0000000000000000 0000000000000000 : (      26.110 uSec) : syscalls.cc:do_syscall_post:84 (channel_write_etc)
<snip>
[  24.273871139][cpu 1 tid     2735] : 00000001 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       5.089 uSec) : syscalls.cc:do_syscall_post:84 (object_wait_async)
[  24.273875048][cpu 1 tid     2735] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       3.909 uSec) : syscalls.cc:do_syscall_pre:59 (port_wait)
[  24.273876820][cpu 1 tid     2735] : 00000001 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       1.772 uSec) : syscalls.cc:do_syscall_post:84 (port_wait)
[  24.274000887][cpu 1 tid     2735] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (     124.067 uSec) : syscalls.cc:do_syscall_pre:59 (channel_read_etc)
[  24.274003866][cpu 1 tid     2735] : 00000001 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       2.979 uSec) : syscalls.cc:do_syscall_post:84 (channel_read_etc)
[  24.274019066][cpu 1 tid     2735] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (      15.200 uSec) : syscalls.cc:do_syscall_pre:59 (debug_send_command)

JTRACE: Last recorded per-CPU events.
[  24.269107951][cpu 0 tid     2475] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       0.000 uSec) : syscalls.cc:do_syscall_pre:59 (debug_read)
[  24.274019066][cpu 1 tid     2735] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       0.000 uSec) : syscalls.cc:do_syscall_pre:59 (debug_send_command)
[  24.273785974][cpu 2 tid     1902] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       0.000 uSec) : syscalls.cc:do_syscall_pre:59 (port_wait)
[  24.154375056][cpu 3 tid    28770] : 00000000 00000000 00000000 00000000 0000000000000000 0000000000000000 : (       0.000 uSec) : syscalls.cc:do_syscall_pre:59 (futex_wait)

JTRACE: Last log timestamp [  24.274019066]

Trace Entries

Trace entries may be configured at compile time to be either “large” or “small” records (see Configuration). All records in the trace buffer are of the same configured size. It is not possible to mix small records with large records. Large entries are supersets of small entries. In other words, a large entry contains all of the fields that a small record contains, plus some additional records.

Because of the (typically) limited amount of persistent DRAM available to systems, small records are typically used when JTRACE is configured to be persistent, and large records are used otherwise. That said, users may override this behavior and force either small or large records in either situation.

Small entries

Small trace entries record the following fields.

Timestamp: A timestamp which records when the trace entry was added to the buffer. Recorded in units of ticks, but translated to the clock monotonic timeline when the buffer is dumped.
CPU Id: The ID of the CPU which created the trace entry.
Tag: A pointer to a C-string literal supplied by the developer to assist in debugging.
a: A single 32-bit general purpose field supplied by the developer to assist in debugging.

Small trace entries currently take up 24 bytes of storage each.

Large entries

Large entries extend small entries, adding the following fields.

File: The file name of the location of the trace point.
Line: The line number of in the file of the location of the trace point.
Function: The function name of the location of the trace point.
TID: The numeric thread ID of the thread which created the trace entry.
[a-d]: Four 32-bit general purpose fields (instead of a small record's one) supplied by the developer to assist in debugging.
[e-f]: Two 64-bit general purpose fields supplied by the developer to assist in debugging.

Small trace entries currently take up 72 bytes of storage each.

Per-CPU Last Event Records

In addition to the main trace buffer, JTRACE can be configured to separately store the last events recorded by each CPU. When enabled, these records will be dumped separately, after the main trace buffer dump is complete.

This feature can be useful when attempting to debug hangs of individual CPUs, especially if they lead to the triggering of a HW WDT. If a CPU gets stuck for any significant amount of time, it is likely that trace events from other CPUs will rapidly flush the events from the stuck CPU out of the trace buffer, especially if the buffer is small because it is located in persistent RAM. The separate last-event record for the stuck CPU allows developers to know the last trace point executed by the stuck CPU before it stopped responding.

To enable Per-CPU last event records, set the jtrace_last_entry_storage configuration parameter to be equal to the number of CPUs present in your target hardware. Note that if this number is too low, last event records for CPU IDs >= the configured number never be recorded. If the number is too high, some of the entries will never have a record logged to them and some of the storage reserved in the trace buffer will be wasted.

Persistent Tracing

Overview

Occasionally, a developer might find themselves in the unfortunate position of needing to debug a complete system hang. When these occur, it can be very difficult to gain any insights as all ability to observe the state of the system, or what led up to it, is lost. This is a situation where persistent tracing may be able to help.

When configured for persistent tracing, the trace buffer is placed in special “persistent RAM” provided to the kernel by the bootloader. Typically, this is a region of physical RAM which is guaranteed to not be touched by the bootloader(s) as the system boots, meaning that it is possible to put information in the region which manages to survive a full HW reset of the system.

Usage

In order to enable persistent tracing, set the jtrace_enabled to the value "persistent". After a hang occurs, reboot the system (either via a HW WDT, or a manual reset line in the target). Once the system has rebooted, if the recovered trace buffer was successfully recovered, it may be dumped by executing k jtrace -r from a serial shell. Note that persistent RAM is typically just a section of DRAM which is put aside by the bootloader chain, and it may suffer corruption in the time it takes for the early stage bootloaders to re-enable dynamic RAM refresh. The trace system will make every effort to dump what it can from the recovered buffer, falling back to a hexdump of the region if needed, however some care should be taken when interpreting the results of this dump if there is any possibility of corruption as a result of the reboot.

Not all targets currently support a direct memory-mappable persistent RAM region, and when they do, this region is frequently very small. Currently, only ARM64 targets support these regions, so persistent tracing is not currently available for x64 systems. The jtrace_target_buffer_size value configured by the user is simply a request for storage, there is no guarantee that there will be enough storage to support the request. In the case that there is not, the trace system will claim as much storage as it can, and make due with what it is able to reserve. After booting, users can example the current configuration of the JTRACE subsystem by running k jtrace -i from a serial console. For example on a target with 8kB of persistent RAM, a request for a 32kB buffer might result in the following.

JTRACE configuration
--------------------
Requested Buffer Size  : 32768
Allocated Buffer Size  : 6144
Allocated Buffer Loc   : 0xffff00023fff8000
Per-CPU last entry cnt : 0
Large entries          : no
Persistent             : yes

32kB was requested, but only 8kB was available, and 2kB of that was reserved for the crashlog up front, leaving only 6kB for the trace buffer.

Caveats

By its nature, there are some side effects and limitations of the JTRACE system which users should keep in mind.

Trace records do not have to exist in monotonically increasing timestamp order. Because of the lockless nature of the ring buffer storage, a record slot is first reserved, and then the timestamp is recorded. It is possible for effects like cache misses and IRQs to cause two adjacent trace entries to actually have out of order timestamps.
It is technically possible (but extremely unlikely) for a trace record to be corrupted because of two CPUs writing to the record concurrently. For this to happen, a CPU would need to reserve a slot and begin writing to it, then encounter a long delay (perhaps caused by preemption). During this delay, if sufficient entries are recorded to wrap the trace buffer, a second CPU could claim the same slot in the ring buffer as the first, resulting in a corrupted record with information from both CPUs recorded in it.
Reservation of an entry slot in the ring buffer uses atomics and acquire/release memory order semantics. Depending on the memory ordering details of the target architecture, this might have an affect on the reproducibility of some bugs, as read and write operations from two different CPUs which are also JTRACEing may suddenly may not be able to be re-ordered across the point of the JTRACE.
When using persistent tracing, records written to the JTRACE buffer must be immediately flushed from cache all of the way to physical memory in order to properly survive a spontaneous reboot. A properly configured persistent_ram_allocation_granularity value should prevent these cache cleaning operations from overlapping with anything else, but it should be noted that the write of the entry to the trace buffer will result in a small amount of additional data being forced from cache into physical RAM more frequently than normal.