# Symbolizer markup format #
This document defines a text format for log messages that can be
processed by a _symbolizing filter_. The basic idea is that logging
code emits text that contains raw address values and so forth, without
the logging code doing any real work to convert those values to
human-readable form. Instead, logging text uses the markup format
defined here to identify pieces of information that should be converted
to human-readable form after the fact. As with other markup formats,
the expectation is that most of the text will be displayed as is, while
the markup elements will be replaced with expanded text, or converted
into active UI elements, that present more details in symbolic form.
This means there is no need for symbol tables, DWARF debugging sections,
or similar information to be directly accessible at runtime. There is
also no need at runtime for any logic intended to compute human-readable
presentation of information, such as C++ symbol demangling. Instead,
logging must include markup elements that give the contextual
information necessary to make sense of the raw data, such as memory
layout details.
This format identifies markup elements with a syntax that is both simple
and distinctive. It's simple enough to be matched and parsed with
straightforward code. It's distinctive enough that character sequences
that look like the start or end of a markup element should rarely if
ever appear incidentally in logging text. It's specifically intended
not to require sanitizing plain text, unlike HTML/XML, which require
replacing `<` with `&lt;` and the like.
## Scope and assumptions ##
This specification defines a format standard for Zircon and Fuchsia.
But there is nothing specific to Zircon or Fuchsia about the markup
format. A symbolizing filter implementation will be independent both of
the _target_ operating system and machine architecture where the logs
are generated and of the _host_ operating system and machine
architecture where the filter runs.
This format assumes that the symbolizing filter processes intact whole
lines. If long lines might be split during some stage of a logging
pipeline, they must be reassembled to restore the original line breaks
before feeding lines into the symbolizing filter. Most markup elements
must appear entirely on a single line (often with other text before
and/or after the markup element). There are some markup elements that
are specified to span lines, with line breaks in the middle of the
element. Even in those cases, the filter is not expected to handle line
breaks in arbitrary places inside a markup element, but only inside
certain fields.
This format assumes that the symbolizing filter processes a coherent
stream of log lines from a single process address space context. If a
logging stream interleaves log lines from more than one process, these
must be collated into separate per-process log streams and each stream
processed by a separate instance of the symbolizing filter. Because the
kernel and user processes use disjoint address regions in most operating
systems (including Zircon), a single user process address space plus
the kernel address space can be treated as a single address space for
symbolization purposes if desired.
## Dependence on Build IDs ##
The symbolizer markup scheme relies on contextual information about
runtime memory address layout to make it possible to convert markup
elements into useful symbolic form. This relies on having an
unmistakable identification of which binary was loaded at each address.
An ELF Build ID is the payload of an ELF note with name `"GNU"` and type
`NT_GNU_BUILD_ID`, a unique byte sequence that identifies a particular
binary (executable, shared library, loadable module, or driver module).
The linker generates this automatically based on a hash that includes
the complete symbol table and debugging information, even if this is
later stripped from the binary.
This specification uses the ELF Build ID as the sole means of
identifying binaries. Each binary relevant to the log must have been
linked with a unique Build ID. The symbolizing filter must have some
means of mapping a Build ID back to the original ELF binary (either the
whole unstripped binary, or a stripped binary paired with a separate
debug file).
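One possible way to provide that mapping is a directory laid out in the `.build-id` convention used by GDB for separate debug files, where the first byte of the ID (two hex digits) names a subdirectory and the remaining digits name the file. This specification does not mandate that layout. The function name, the `root` parameter, and the `.debug` suffix below are illustrative assumptions.

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")

def build_id_to_path(root, build_id_hex, suffix=".debug"):
    """Map a hex Build ID to <root>/.build-id/xx/yyyy<suffix> (assumed layout)."""
    hex_id = build_id_hex.lower()
    # A Build ID is a byte sequence, so its hex form has an even number of
    # digits; at least two bytes are needed to split into directory + file.
    if len(hex_id) < 4 or len(hex_id) % 2 != 0 or not set(hex_id) <= HEX_DIGITS:
        raise ValueError("malformed Build ID: %r" % build_id_hex)
    return Path(root) / ".build-id" / hex_id[:2] / (hex_id[2:] + suffix)
```

A filter could check whether the returned path exists and fall back to some other lookup (a symbol server, for instance) when it does not.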
## Colorization ##
The markup format supports a restricted subset of ANSI X3.64 SGR (Select
Graphic Rendition) control sequences. These are unlike other markup
elements:
* They specify presentation details (**bold** or colors) rather than
semantic information. The association of semantic meaning with color
(e.g. red for errors) is chosen by the code doing the logging, rather
than by the UI presentation of the symbolizing filter. This is a
concession to existing code (e.g. LLVM sanitizer runtimes) that uses
specific colors and would require substantial changes to generate
semantic markup instead.
* A single control sequence changes "the state", rather than being a
hierarchical structure that surrounds affected text.
The filter processes ANSI SGR control sequences only within a single
line. If a control sequence to enter a **bold** or color state is
encountered, it's expected that the control sequence to reset to default
state will be encountered before the end of that line. If a "dangling"
state is left at the end of a line, the filter may reset to default
state for the next line.
An SGR control sequence is not interpreted inside any other markup element.
However, other markup elements may appear between SGR control sequences and
the color/**bold** state is expected to apply to the symbolic output that
replaces the markup element in the filter's output.
The accepted SGR control sequences all have the form `"\033[%um"`
(expressed here using C string syntax), where `%u` is one of these:
| Code | Effect | Notes |
|:----:|:------:|-------|
| `0` | Reset to default formatting. | |
| `1` | Use **bold text** | Combines with color states, doesn't reset them.|
| `30` | Black foreground | |
| `31` | Red foreground | |
| `32` | Green foreground | |
| `33` | Yellow foreground | |
| `34` | Blue foreground | |
| `35` | Magenta foreground | |
| `36` | Cyan foreground | |
| `37` | White foreground | |
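As a rough illustration of the per-line state model described above, the following sketch splits one line into runs of text tagged with the bold and color state in effect. It is not a reference implementation: the names `split_sgr` and `COLORS` are hypothetical, and for brevity it does not implement the rule that SGR sequences are ignored inside other markup elements.

```python
import re

# ESC (written "\033" in C, "\x1b" here) followed by "[", one accepted code, "m".
# Codes outside the table above do not match and are left in the text as is.
SGR_RE = re.compile(r"\x1b\[(0|1|3[0-7])m")
COLORS = ["black", "red", "green", "yellow", "blue", "magenta", "cyan", "white"]

def split_sgr(line):
    """Return a list of (text, bold, color) runs for one log line."""
    runs = []
    bold, color = False, None  # every line starts in the default state
    pos = 0
    for m in SGR_RE.finditer(line):
        if m.start() > pos:
            runs.append((line[pos:m.start()], bold, color))
        code = int(m.group(1))
        if code == 0:
            bold, color = False, None
        elif code == 1:
            bold = True  # combines with the current color, does not reset it
        else:
            color = COLORS[code - 30]
        pos = m.end()
    if pos < len(line):
        runs.append((line[pos:], bold, color))
    return runs
```

Because the state is local to the function call, any "dangling" state at the end of a line is dropped before the next line is processed, which is one of the behaviors the text permits.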
## Common markup element syntax ##
{# Disable variable substitution to avoid {{ being interpreted by the template engine #}
{% verbatim %}
All the markup elements share a common syntactic structure to facilitate
simple matching and parsing code. Each element has the form:
```
{{{tag:fields}}}
```
`tag` identifies one of the element types described below, and is always
a short alphabetic string that must be in lower case. The rest of the
element consists of one or more fields. Fields are separated by `:` and
cannot contain any `:` or `}` characters. How many fields must be or
may be present and what they contain is specified for each element type.
No markup elements or ANSI SGR control sequences are interpreted inside the
contents of a field.
In the descriptions of each element type, `printf`-style placeholders
indicate field contents:
* `%s`
A string of printable characters, not including `:` or `}`.
* `%p`
An address value represented by `0x` followed by an even number of
hexadecimal digits (using either lower-case or upper-case for
`A`..`F`). If the digits are all `0` then the `0x` prefix may be
omitted. No more than 16 hexadecimal digits are expected to appear in
a single value (64 bits).
* `%u`
A nonnegative decimal integer.
* `%x`
A sequence of an even number of hexadecimal digits (using either
lower-case or upper-case for `A`..`F`), with no `0x` prefix.
This represents an arbitrary sequence of bytes, such as an ELF Build ID.
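The common syntax is simple enough that a single regular expression can locate elements in a line. The sketch below is one way to do it, not the canonical parser: `find_elements` and `parse_address` are hypothetical names, and it handles only elements that sit entirely on one line.

```python
import re

# Three opening braces, a lower-case alphabetic tag, zero or more
# ":field" parts containing no ':' or '}', then three closing braces.
ELEMENT_RE = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")

# A %p field: "0x" plus an even number of hex digits (at most 16),
# or a run of "0" digits with no prefix.
ADDRESS_RE = re.compile(r"0x((?:[0-9a-fA-F]{2}){1,8})|0+")

def find_elements(line):
    """Yield (tag, fields, start, end) for each markup element in a line."""
    for m in ELEMENT_RE.finditer(line):
        raw = m.group(2)
        fields = raw[1:].split(":") if raw else []
        yield m.group(1), fields, m.start(), m.end()

def parse_address(field):
    """Convert a %p field to an integer, rejecting malformed values."""
    m = ADDRESS_RE.fullmatch(field)
    if m is None:
        raise ValueError("not a valid address field: %r" % field)
    return int(m.group(1) or "0", 16)
```

The `start` and `end` offsets let a filter copy the surrounding text through unchanged and substitute only the matched span.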
## Presentation elements ##
These are elements that convey a specific program entity to be displayed
in human-readable symbolic form.
* `{{{symbol:%s}}}`
Here `%s` is the linkage name for a symbol or type. It may require
demangling according to language ABI rules. Even for unmangled names,
it's recommended that this markup element be used to identify a symbol
name so that it can be presented distinctively.
Examples:
```
{{{symbol:_ZN7Mangled4NameEv}}}
{{{symbol:foobar}}}
```
* `{{{pc:%p}}}`
Here `%p` is the memory address of a code location.
It might be presented as a function name and source location.
Examples:
```
{{{pc:0x12345678}}}
{{{pc:0xffffffff9abcdef0}}}
```
* `{{{data:%p}}}`
Here `%p` is the memory address of a data location.
It might be presented as the name of a global variable at that location.
Examples:
```
{{{data:0x12345678}}}
{{{data:0xffffffff9abcdef0}}}
```
* `{{{bt:%u:%p}}}`
This represents one frame in a backtrace. It usually appears on a
line by itself (surrounded only by whitespace), in a sequence of such
lines with ascending frame numbers. The human-readable output may
therefore be formatted on that assumption, so that it looks good for a
sequence of `bt` elements, each alone on its line with uniform
indentation. But it can appear anywhere, so the filter
should not remove any non-whitespace text surrounding the element.
Here `%u` is the frame number, which starts at zero for the location
of the fault being identified, increments to one for the caller of
frame zero's call frame, to two for the caller of frame one, etc.
`%p` is the memory address of a code location.
In frames after frame zero, this code location identifies a call site.
Some emitters may subtract one byte or one instruction length from the
actual return address for the call site, with the intent that the
address logged can be translated directly to a source location for the
call site and not for the apparent return site thereafter (which can
be confusing). It's recommended that emitters _not_ do this, so that
each frame's code location is the exact return address given to its
callee and e.g. could be highlighted in instruction-level disassembly.
The symbolizing filter can do the adjustment to the address it
translates into a source location. Assuming that a call instruction
is longer than one byte on all supported machines, applying the
"subtract one byte" adjustment a second time still results in an
address somewhere in the call instruction, so a little sloppiness here
does no harm.
Examples:
```
{{{bt:0:0x12345678}}}
{{{bt:1:0xffffffff9abcdef0}}}
```
* `{{{hexdict:...}}}`
This element can span multiple lines. Here `...` is a sequence of
key-value pairs where a single `:` separates each key from its value,
and arbitrary whitespace separates the pairs. The value (right-hand
side) of each pair either is one or more `0` digits, or is `0x`
followed by hexadecimal digits. Each value might be a memory address
or might be some other integer (including an integer that looks like a
likely memory address but actually has an unrelated purpose). When
the contextual information about the memory layout suggests that a
given value could be a code location or a global variable data
address, it might be presented as a source location or variable name
or with active UI that makes such interpretation optionally visible.
The intended use is for things like register dumps, where the emitter
doesn't know which values might have a symbolic interpretation but a
presentation that makes plausible symbolic interpretations available
might be very useful to someone reading the log. At the same time,
a flat text presentation should usually avoid interfering too much
with the original contents and formatting of the dump. For example,
it might use footnotes with source locations for values that appear
to be code locations. An active UI presentation might show the dump
text as is, but highlight values with symbolic information available
and pop up a presentation of symbolic details when a value is selected.
Example:
```
{{{hexdict:
CS: 0 RIP: 0x6ee17076fb80 EFL: 0x10246 CR2: 0
RAX: 0xc53d0acbcf0 RBX: 0x1e659ea7e0d0 RCX: 0 RDX: 0x6ee1708300cc
RSI: 0 RDI: 0x6ee170830040 RBP: 0x3b13734898e0 RSP: 0x3b13734898d8
R8: 0x3b1373489860 R9: 0x2776ff4f R10: 0x2749d3e9a940 R11: 0x246
R12: 0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14: 0x1e659ea7e108 R15: 0xc53d0acbcf0
}}}
```
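Two of the elements above involve a little logic on the filter side: the return-address adjustment for `bt` frames and the key-value parsing for `hexdict`. The sketch below illustrates both under stated assumptions. The function names are hypothetical, the adjustment assumes the emitter logged exact return addresses as recommended, and the `hexdict` pattern allows whitespace between the `:` and the value because the example register dump is aligned that way.

```python
import re

def lookup_address(frame_number, address):
    """Address to translate into a source location for a `bt` frame."""
    # Frame zero is the exact fault location. Later frames hold return
    # addresses, so step back one byte to land inside the call instruction.
    if frame_number > 0 and address > 0:
        return address - 1
    return address

# A key, a single ':', optional whitespace, then either "0x" plus hex
# digits or one or more "0" digits, ending at whitespace or end of text.
PAIR_RE = re.compile(r"([^\s:]+):\s*(0x[0-9a-fA-F]+|0+)(?!\S)")

def parse_hexdict(body):
    """Return the (key, integer value) pairs of a hexdict body, in order."""
    return [(key, int(value, 16)) for key, value in PAIR_RE.findall(body)]
```

The adjusted address is only for the source-location lookup; a presentation would normally still show the address as logged.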
## Trigger elements ##
These elements cause an external action and will be presented to the
user in a human-readable form. Generally they trigger an external
action that results in a linkable page. The link, or other useful
information about the external action, can then be presented to the
user.
* `{{{dumpfile:%s:%s}}}`
Here the first `%s` is an identifier for a type of dump and the
second `%s` is an identifier for a particular dump that's just been
published. The types of dumps, the exact meaning of "published",
and the nature of the identifier are outside the scope of the markup
format per se. In general it might correspond to writing a file by
that name or something similar.
This element may trigger additional post-processing work beyond
symbolizing the markup. It indicates that a dump file of some sort
has been published. Some logic attached to the symbolizing filter may
understand certain types of dump file and trigger additional
post-processing of the dump file upon encountering this element (e.g.
generating visualizations, symbolization). The expectation is that the
information collected from contextual elements (described below) in the
logging stream may be necessary to decode the content of the dump. So
if the symbolizing filter triggers other processing, it may need to
feed some distilled form of the contextual information to those
processes.
On Zircon and Fuchsia in particular, "publish" means to call the
`__sanitizer_publish_data` function from `<zircon/sanitizer.h>`
with the "type" identifier as the "sink name" string. The "dump
identifier" is the name attached to the Zircon VMO whose handle
was passed in the call to `__sanitizer_publish_data`.
**TODO(mcgrathr): Link to docs about `__sanitizer_publish_data` and
getting data dumps off the device.**
An example of a type identifier is `sancov`, for dumps from LLVM
[SanitizerCoverage](https://clang.llvm.org/docs/SanitizerCoverage.html).
Example:
```
{{{dumpfile:sancov:sancov.8675}}}
```
## Contextual elements ##
These are elements that supply information necessary to convert
presentation elements to symbolic form. Unlike presentation elements,
they are not directly related to the surrounding text. Contextual
elements should appear alone on lines with no other non-whitespace
text, so that the symbolizing filter might elide the whole line from
its output without hiding any other log text.
The contextual elements themselves do not necessarily need to be
presented in human-readable output. However, the information they
impart may be essential to understanding the logging text even after
symbolization. So it's recommended that this information be preserved
in some form when the original raw log with markup may no longer be
readily accessible for whatever reason.
Contextual elements should appear in the logging stream before they are
needed. That is, if some piece of context may affect how the
symbolizing filter would interpret or present a later presentation
element, the necessary contextual elements should have appeared
somewhere earlier in the logging stream. It should always be possible
for the symbolizing filter to be implemented as a single pass over the
raw logging stream, accumulating context and massaging text as it goes.
* `{{{reset}}}`
This should be output before any other contextual element. It exists
to support implementations that handle logs coming from multiple
processes. Such implementations might not know when a new process
starts or ends. Because some identifying
information (like process IDs) might be the same between old and new
processes, a way is needed to distinguish two processes with such
identical identifying information. This element tells such
implementations to reset the state of a filter so that information
from a previous process's contextual elements is not assumed for a new
process that just happens to have the same identifying information.
* `{{{module:%i:%s:%s:...}}}`
This element represents a so-called "module". A "module" is a single
linked binary, such as a loaded ELF file. Usually each module occupies
a contiguous range of memory (it always does on Zircon).
Here `%i` is the module ID which is used by other contextual elements to
refer to this module. The first `%s` is a human-readable identifier for
the module, such as an ELF `DT_SONAME` string or a file name; but it
might be empty. It's only for casual information. Only the module ID is
used to refer to this module in other contextual elements, never the `%s`
string. The `module` element defining a module ID must always be emitted
before any other elements that refer to that module ID, so that a filter
never needs to keep track of dangling references. The second `%s` is the
module type and it determines what the remaining fields are. The
following module types are supported:
* `elf:%x`
Here `%x` encodes an ELF Build ID. The Build ID should refer to a
single linked binary. The Build ID string is the sole way to identify
the binary from which this module was loaded.
Example:
```
{{{module:1:libc.so:elf:83238ab56ba10497}}}
```
* `{{{mmap:%p:%x:...}}}`
This contextual element is used to give information about a particular
region in memory. `%p` is the starting address and `%x` gives the size
in hex of the region of memory. The `...` part can take different forms
to give different information about the specified region of memory. The
allowed forms are the following:
* `load:%i:%s:%p`
This subelement informs the filter that a segment was loaded from a
module. The module is identified by its module ID `%i`. The `%s` is
one or more of the letters 'r', 'w', and 'x' (in that order and in
either upper or lower case) to indicate this segment of memory is
readable, writable, and/or executable. The symbolizing filter can use
this information to guess whether an address is a likely code address
or a likely data address in the given module. The remaining `%p` gives
the module-relative address. For ELF files the module-relative address
will be the `p_vaddr` of the associated program header. For example, if
your module's executable segment has `p_vaddr=0x1000`, `p_memsz=0x1234`,
and was loaded at 0x7acba69d5000, then you need to subtract 0x7acba69d4000
from any address between 0x7acba69d5000 and 0x7acba69d6234 to get the
module-relative address. The starting address will usually have been
rounded down to the active page size, and the size rounded up.
Example:
```
{{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}}
```
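To make the address arithmetic concrete, here is a minimal sketch of the context a filter might accumulate from `reset`, `module`, and `mmap` elements in a single pass, and of how it could turn a runtime address into a module-relative one. The class and method names are hypothetical. Each method takes the list of fields that follow the tag. Module IDs are kept as the strings that appear in the log, and the region size is parsed as hexadecimal with or without a `0x` prefix, since the example above writes it with one.

```python
import bisect

class AddressMap:
    """Accumulates module and load-region context for one process."""

    def __init__(self):
        self.reset()

    def reset(self):
        # Called for {{{reset}}}: forget everything from a previous process.
        self.modules = {}  # module ID -> (name, Build ID hex)
        self.starts = []   # sorted region start addresses
        self.regions = {}  # start -> (size, module ID, flags, module-relative start)

    def add_module(self, fields):
        # Fields of {{{module:%i:%s:elf:%x}}}
        module_id, name, kind, build_id = fields
        if kind != "elf":
            raise ValueError("unsupported module type: %r" % kind)
        self.modules[module_id] = (name, build_id.lower())

    def add_load(self, fields):
        # Fields of {{{mmap:%p:%x:load:%i:%s:%p}}}
        start, size, kind, module_id, flags, vaddr = fields
        if kind != "load":
            raise ValueError("unsupported mmap form: %r" % kind)
        if module_id not in self.modules:
            raise KeyError("mmap refers to undefined module %r" % module_id)
        start = int(start, 16)
        if start not in self.regions:
            bisect.insort(self.starts, start)
        self.regions[start] = (int(size, 16), module_id, flags.lower(), int(vaddr, 16))

    def resolve(self, address):
        """Return (module ID, module-relative address, flags), or None."""
        i = bisect.bisect_right(self.starts, address) - 1
        if i < 0:
            return None
        start = self.starts[i]
        size, module_id, flags, vaddr = self.regions[start]
        if address >= start + size:
            return None
        return module_id, address - start + vaddr, flags
```

With the `module` and `mmap` examples above loaded, `resolve(0x7acba69d5234)` yields module `"1"` and module-relative address `0x1234`, which is the same as subtracting 0x7acba69d4000 from the runtime address.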
{# Re-enable variable substitution #}
{% endverbatim %}