| # Symbolizer markup format # |
| |
| This document defines a text format for log messages that can be |
| processed by a _symbolizing filter_. The basic idea is that logging |
| code emits text that contains raw address values and so forth, without |
| the logging code doing any real work to convert those values to |
| human-readable form. Instead, logging text uses the markup format |
| defined here to identify pieces of information that should be converted |
| to human-readable form after the fact. As with other markup formats, |
| the expectation is that most of the text will be displayed as is, while |
| the markup elements will be replaced with expanded text, or converted |
| into active UI elements, that present more details in symbolic form. |
| |
| This means there is no need for symbol tables, DWARF debugging sections, |
| or similar information to be directly accessible at runtime. There is |
| also no need at runtime for any logic intended to compute human-readable |
| presentation of information, such as C++ symbol demangling. Instead, |
| logging must include markup elements that give the contextual |
| information necessary to make sense of the raw data, such as memory |
| layout details. |
| |
| This format identifies markup elements with a syntax that is both simple |
| and distinctive. It's simple enough to be matched and parsed with |
| straightforward code. It's distinctive enough that character sequences |
| that look like the start or end of a markup element should rarely if |
| ever appear incidentally in logging text. It's specifically intended |
| not to require sanitizing plain text, such as the HTML/XML requirement |
| to replace `<` with `&lt;` and the like. |
| |
| ## Scope and assumptions ## |
| |
| This specification defines a format standard for Zircon and Fuchsia. |
| But there is nothing specific to Zircon or Fuchsia about the markup |
| format. A symbolizing filter implementation will be independent both of |
| the _target_ operating system and machine architecture where the logs |
| are generated and of the _host_ operating system and machine |
| architecture where the filter runs. |
| |
| This format assumes that the symbolizing filter processes intact whole |
| lines. If long lines might be split during some stage of a logging |
| pipeline, they must be reassembled to restore the original line breaks |
| before feeding lines into the symbolizing filter. Most markup elements |
| must appear entirely on a single line (often with other text before |
| and/or after the markup element). There are some markup elements that |
| are specified to span lines, with line breaks in the middle of the |
| element. Even in those cases, the filter is not expected to handle line |
| breaks in arbitrary places inside a markup element, but only inside |
| certain fields. |
| |
| This format assumes that the symbolizing filter processes a coherent |
| stream of log lines from a single process address space context. If a |
| logging stream interleaves log lines from more than one process, these |
| must be collated into separate per-process log streams and each stream |
| processed by a separate instance of the symbolizing filter. Because the |
| kernel and user processes use disjoint address regions in most operating |
| systems (including Zircon), a single user process address space plus |
| the kernel address space can be treated as a single address space for |
| symbolization purposes if desired. |
| |
| ## Dependence on Build IDs ## |
| |
| The symbolizer markup scheme relies on contextual information about |
| runtime memory address layout to make it possible to convert markup |
| elements into useful symbolic form. This relies on having an |
| unmistakable identification of which binary was loaded at each address. |
| |
| An ELF Build ID is the payload of an ELF note with name `"GNU"` and type |
| `NT_GNU_BUILD_ID`, a unique byte sequence that identifies a particular |
| binary (executable, shared library, loadable module, or driver module). |
| The linker generates this automatically based on a hash that includes |
| the complete symbol table and debugging information, even if this is |
| later stripped from the binary. |
| |
| This specification uses the ELF Build ID as the sole means of |
| identifying binaries. Each binary relevant to the log must have been |
| linked with a unique Build ID. The symbolizing filter must have some |
| means of mapping a Build ID back to the original ELF binary (either the |
| whole unstripped binary, or a stripped binary paired with a separate |
| debug file). |
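
As a minimal sketch of one such mapping, the Python below computes where a separate debug file would live under the `.build-id` directory layout used by GDB and similar tools. That layout, the default root directory, and the function name `debug_file_path` are assumptions made for illustration; this specification does not mandate any particular lookup scheme.

```python
from pathlib import PurePosixPath

HEX_DIGITS = set("0123456789abcdef")


def debug_file_path(build_id_hex, root="/usr/lib/debug"):
    """Return the conventional debug-file path for a Build ID.

    Assumed layout: <root>/.build-id/<first byte>/<remaining bytes>.debug
    """
    build_id = build_id_hex.lower()
    # A Build ID is a byte sequence, so expect whole bytes and at least two.
    if len(build_id) < 4 or len(build_id) % 2 or not set(build_id) <= HEX_DIGITS:
        raise ValueError("not a Build ID: %r" % build_id_hex)
    return PurePosixPath(root) / ".build-id" / build_id[:2] / (build_id[2:] + ".debug")
```

A filter could try several roots in turn and fall back to some other index when no file exists at the computed path.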
| |
| ## Colorization ## |
| |
| The markup format supports a restricted subset of ANSI X3.64 SGR (Select |
| Graphic Rendition) control sequences. These are unlike other markup |
| elements: |
| * They specify presentation details (**bold** or colors) rather than |
| semantic information. The association of semantic meaning with color |
| (e.g. red for errors) is chosen by the code doing the logging, rather |
| than by the UI presentation of the symbolizing filter. This is a |
| concession to existing code (e.g. LLVM sanitizer runtimes) that uses |
| specific colors and would require substantial changes to generate |
| semantic markup instead. |
| * A single control sequence changes "the state", rather than being a |
| hierarchical structure that surrounds affected text. |
| |
| The filter processes ANSI SGR control sequences only within a single |
| line. If a control sequence to enter a **bold** or color state is |
| encountered, it's expected that the control sequence to reset to default |
| state will be encountered before the end of that line. If a "dangling" |
| state is left at the end of a line, the filter may reset to default |
| state for the next line. |
| |
| An SGR control sequence is not interpreted inside any other markup element. |
| However, other markup elements may appear between SGR control sequences and |
| the color/**bold** state is expected to apply to the symbolic output that |
| replaces the markup element in the filter's output. |
| |
| The accepted SGR control sequences all have the form `"\033[%um"` |
| (expressed here using C string syntax), where `%u` is one of these: |
| |
| | Code | Effect | Notes | |
| |:----:|:------:|-------| |
| | `0` | Reset to default formatting. | | |
| | `1` | Use **bold text**. | Combines with color states, doesn't reset them. | |
| | `30` | Black foreground | | |
| | `31` | Red foreground | | |
| | `32` | Green foreground | | |
| | `33` | Yellow foreground | | |
| | `34` | Blue foreground | | |
| | `35` | Magenta foreground | | |
| | `36` | Cyan foreground | | |
| | `37` | White foreground | | |
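
The per-line handling described above could be sketched in Python as follows. The names `scan_sgr` and `has_dangling_state` are hypothetical, and leaving unaccepted sequences in the text is an assumption of this sketch, not a requirement of the format.

```python
import re

# Codes accepted by the table above: reset, bold, and eight foreground colors.
ACCEPTED_SGR_CODES = {0, 1, *range(30, 38)}

# ESC, "[", decimal digits, "m".
SGR_PATTERN = re.compile(r"\x1b\[([0-9]+)m")


def scan_sgr(line):
    """Return (text without accepted SGR sequences, list of their codes)."""
    codes = []

    def replace(match):
        code = int(match.group(1))
        if code not in ACCEPTED_SGR_CODES:
            return match.group(0)  # not an accepted code: leave the text as is
        codes.append(code)
        return ""

    return SGR_PATTERN.sub(replace, line), codes


def has_dangling_state(codes):
    """True if the line ends with bold or a color still in effect."""
    # Only code 0 resets, so the state is default exactly when the last
    # accepted code on the line is 0 (or there were none at all).
    return bool(codes) and codes[-1] != 0
```

A filter built this way would call `scan_sgr` on each line and, when `has_dangling_state` is true, reset to default formatting before starting the next line.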
| |
| ## Common markup element syntax ## |
| |
| {# Disable variable substitution to avoid {{ being interpreted by the template engine #} |
| {% verbatim %} |
| |
| All the markup elements share a common syntactic structure to facilitate |
| simple matching and parsing code. Each element has the form: |
| |
| ``` |
| {{{tag:fields}}} |
| ``` |
| |
| `tag` identifies one of the element types described below, and is always |
| a short alphabetic string that must be in lower case. The rest of the |
| element consists of one or more fields. Fields are separated by `:` and |
| cannot contain any `:` or `}` characters. How many fields must be or |
| may be present and what they contain is specified for each element type. |
| |
| No markup elements or ANSI SGR control sequences are interpreted inside the |
| contents of a field. |
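
To illustrate how little code the common syntax needs, here is a minimal Python sketch that finds single-line elements and splits them into a tag and fields. The name `find_elements` is hypothetical, elements that span lines (such as `hexdict`) are deliberately not handled, and an element with no fields at all (such as `reset`) is accepted.

```python
import re

# "{{{", a lower-case tag, zero or more ":field" parts, "}}}".
# A field may be empty but never contains ":" or "}".
ELEMENT_PATTERN = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")


def find_elements(line):
    """Return a list of (tag, fields) for each markup element on one line."""
    elements = []
    for match in ELEMENT_PATTERN.finditer(line):
        tag, rest = match.groups()
        # rest is "" or ":field1:field2..."; drop the leading separator.
        fields = rest[1:].split(":") if rest else []
        elements.append((tag, fields))
    return elements
```

Text that does not match the pattern, including a tag in upper case, is simply left alone as plain log text.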
| |
| In the descriptions of each element type, `printf`-style placeholders |
| indicate field contents: |
| |
| * `%s` |
| |
| A string of printable characters, not including `:` or `}`. |
| |
| * `%p` |
| |
| An address value represented by `0x` followed by an even number of |
| hexadecimal digits (using either lower-case or upper-case for |
| `A`..`F`). If the digits are all `0` then the `0x` prefix may be |
| omitted. No more than 16 hexadecimal digits are expected to appear in |
| a single value (64 bits). |
| |
| * `%u` |
| |
| A nonnegative decimal integer. |
| |
| * `%x` |
| |
| A sequence of an even number of hexadecimal digits (using either |
| lower-case or upper-case for `A`..`F`), with no `0x` prefix. |
| This represents an arbitrary sequence of bytes, such as an ELF Build ID. |
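
The placeholder rules above translate fairly directly into validators. The Python below is a sketch under two assumptions: the function names are hypothetical, and a bare run of `0` digits of any length is accepted for `%p`.

```python
import re

# %p: "0x" plus 1 to 8 whole bytes of hex digits, or a bare run of zeros.
ADDRESS = re.compile(r"0x(?:[0-9A-Fa-f]{2}){1,8}|0+")
# %u: nonnegative decimal integer.
DECIMAL = re.compile(r"[0-9]+")
# %x: an even number of hex digits with no prefix.
HEX_BYTES = re.compile(r"(?:[0-9A-Fa-f]{2})+")


def parse_address(field):
    """Parse a %p field into an integer address."""
    if not ADDRESS.fullmatch(field):
        raise ValueError("bad %%p field: %r" % field)
    return int(field, 16)  # base 16 accepts the optional "0x" prefix


def parse_decimal(field):
    """Parse a %u field into an integer."""
    if not DECIMAL.fullmatch(field):
        raise ValueError("bad %%u field: %r" % field)
    return int(field, 10)


def parse_hex_bytes(field):
    """Parse a %x field into the byte sequence it encodes."""
    if not HEX_BYTES.fullmatch(field):
        raise ValueError("bad %%x field: %r" % field)
    return bytes.fromhex(field)
```

Rejecting malformed fields with `ValueError` lets a filter fall back to printing the element text unchanged.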
| |
| ## Presentation elements ## |
| |
| These are elements that convey a specific program entity to be displayed |
| in human-readable symbolic form. |
| |
| * `{{{symbol:%s}}}` |
| |
| Here `%s` is the linkage name for a symbol or type. It may require |
| demangling according to language ABI rules. Even for unmangled names, |
| it's recommended that this markup element be used to identify a symbol |
| name so that it can be presented distinctively. |
| |
| Examples: |
| |
| ``` |
| {{{symbol:_ZN7Mangled4NameEv}}} |
| {{{symbol:foobar}}} |
| ``` |
| |
| * `{{{pc:%p}}}` |
| |
| Here `%p` is the memory address of a code location. |
| It might be presented as a function name and source location. |
| |
| Examples: |
| |
| ``` |
| {{{pc:0x12345678}}} |
| {{{pc:0xffffffff9abcdef0}}} |
| ``` |
| |
| * `{{{data:%p}}}` |
| |
| Here `%p` is the memory address of a data location. |
| It might be presented as the name of a global variable at that location. |
| |
| Examples: |
| |
| ``` |
| {{{data:0x12345678}}} |
| {{{data:0xffffffff9abcdef0}}} |
| ``` |
| |
| * `{{{bt:%u:%p}}}` |
| |
| This represents one frame in a backtrace. It usually appears on a |
| line by itself (surrounded only by whitespace), in a sequence of such |
| lines with ascending frame numbers. The human-readable output can |
| therefore be formatted on that assumption, so that it looks good for a |
| sequence of `bt` elements each alone on its line with uniform |
| indentation of each line. But the element can appear anywhere, so the |
| filter should not remove any non-whitespace text surrounding it. |
| |
| Here `%u` is the frame number, which starts at zero for the location |
| of the fault being identified, then is one for the caller of frame |
| zero, two for the caller of frame one, etc. |
| `%p` is the memory address of a code location. |
| |
| In frames after frame zero, this code location identifies a call site. |
| Some emitters may subtract one byte or one instruction length from the |
| actual return address for the call site, with the intent that the |
| address logged can be translated directly to a source location for the |
| call site and not for the apparent return site thereafter (which can |
| be confusing). It's recommended that emitters _not_ do this, so that |
| each frame's code location is the exact return address given to its |
| callee and e.g. could be highlighted in instruction-level disassembly. |
| The symbolizing filter can do the adjustment to the address it |
| translates into a source location. Assuming that a call instruction |
| is longer than one byte on all supported machines, applying the |
| "subtract one byte" adjustment a second time still results in an |
| address somewhere in the call instruction, so a little sloppiness here |
| does no harm. |
| |
| Examples: |
| |
| ``` |
| {{{bt:0:0x12345678}}} |
| {{{bt:1:0xffffffff9abcdef0}}} |
| ``` |
| |
| * `{{{hexdict:...}}}` |
| |
| This element can span multiple lines. Here `...` is a sequence of |
| key-value pairs where a single `:` separates each key from its value, |
| and arbitrary whitespace separates the pairs. The value (right-hand |
| side) of each pair either is one or more `0` digits, or is `0x` |
| followed by hexadecimal digits. Each value might be a memory address |
| or might be some other integer (including an integer that looks like a |
| likely memory address but actually has an unrelated purpose). When |
| the contextual information about the memory layout suggests that a |
| given value could be a code location or a global variable data |
| address, it might be presented as a source location or variable name |
| or with active UI that makes such interpretation optionally visible. |
| |
| The intended use is for things like register dumps, where the emitter |
| doesn't know which values might have a symbolic interpretation but a |
| presentation that makes plausible symbolic interpretations available |
| might be very useful to someone reading the log. At the same time, |
| a flat text presentation should usually avoid interfering too much |
| with the original contents and formatting of the dump. For example, |
| it might use footnotes with source locations for values that appear |
| to be code locations. An active UI presentation might show the dump |
| text as is, but highlight values with symbolic information available |
| and pop up a presentation of symbolic details when a value is selected. |
| |
| Example: |
| |
| ``` |
| {{{hexdict: |
| CS: 0 RIP: 0x6ee17076fb80 EFL: 0x10246 CR2: 0 |
| RAX: 0xc53d0acbcf0 RBX: 0x1e659ea7e0d0 RCX: 0 RDX: 0x6ee1708300cc |
| RSI: 0 RDI: 0x6ee170830040 RBP: 0x3b13734898e0 RSP: 0x3b13734898d8 |
| R8: 0x3b1373489860 R9: 0x2776ff4f R10: 0x2749d3e9a940 R11: 0x246 |
| R12: 0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14: 0x1e659ea7e108 R15: 0xc53d0acbcf0 |
| }}} |
| ``` |
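
A `hexdict` element like the one above could be parsed along these lines. This is a sketch only: `parse_hexdict` is a hypothetical name, the whole element is assumed to have been gathered into one string already, and whitespace is tolerated between the `:` and the value because the example dump has it.

```python
import re

# The whole element, allowing line breaks inside the body.
HEXDICT_PATTERN = re.compile(r"\{\{\{hexdict:(.*?)\}\}\}", re.DOTALL)
# key ":" value, where the value is "0x" plus hex digits or a run of zeros.
PAIR_PATTERN = re.compile(r"([^\s:]+):\s*(0x[0-9A-Fa-f]+|0+)(?=\s|$)")


def parse_hexdict(text):
    """Return {key: integer value} for the first hexdict element, or None."""
    match = HEXDICT_PATTERN.search(text)
    if match is None:
        return None
    pairs = PAIR_PATTERN.findall(match.group(1))
    return {key: int(value, 16) for key, value in pairs}
```

Each resulting value can then be checked against the known memory layout to decide whether it deserves a symbolic annotation.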
| |
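The return-address adjustment recommended for `bt` elements can be done on the filter side roughly as follows. The sketch assumes the emitter logged exact return addresses, and the name `bt_lookup_address` is hypothetical.

```python
import re

BT_PATTERN = re.compile(r"\{\{\{bt:([0-9]+):(0x[0-9A-Fa-f]+|0+)\}\}\}")


def bt_lookup_address(line):
    """Return (frame number, logged address, address to symbolize), or None."""
    match = BT_PATTERN.search(line)
    if match is None:
        return None
    frame = int(match.group(1), 10)
    address = int(match.group(2), 16)
    # Frame zero is the fault location itself. Later frames hold return
    # addresses, so step back one byte to land inside the call instruction.
    lookup = address if frame == 0 else address - 1
    return frame, address, lookup
```

The logged address stays available for display or disassembly, while only the lookup address is used to find the source location.
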
| ## Trigger elements ## |
| |
| These elements cause an external action and are presented to the |
| user in human-readable form. Generally the action produces a |
| linkable page. The link, or other information about the external |
| action, can then be presented to the user. |
| |
| * `{{{dumpfile:%s:%s}}}` |
| |
| Here the first `%s` is an identifier for a type of dump and the |
| second `%s` is an identifier for a particular dump that's just been |
| published. The types of dumps, the exact meaning of "published", |
| and the nature of the identifier are outside the scope of the markup |
| format per se. In general it might correspond to writing a file by |
| that name or something similar. |
| |
| This element may trigger additional post-processing work beyond |
| symbolizing the markup. It indicates that a dump file of some sort |
| has been published. Some logic attached to the symbolizing filter may |
| understand certain types of dump file and trigger additional |
| post-processing of the dump file upon encountering this element (e.g. |
| generating visualizations, symbolization). The expectation is that the |
| information collected from contextual elements (described below) in the |
| logging stream may be necessary to decode the content of the dump. So |
| if the symbolizing filter triggers other processing, it may need to |
| feed some distilled form of the contextual information to those |
| processes. |
| |
| On Zircon and Fuchsia in particular, "publish" means to call the |
| `__sanitizer_publish_data` function from `<zircon/sanitizer.h>` |
| with the "type" identifier as the "sink name" string. The "dump |
| identifier" is the name attached to the Zircon VMO whose handle |
| was passed in the call to `__sanitizer_publish_data`. |
| **TODO(mcgrathr): Link to docs about `__sanitizer_publish_data` and |
| getting data dumps off the device.** |
| |
| An example of a type identifier is `sancov`, for dumps from LLVM |
| [SanitizerCoverage](https://clang.llvm.org/docs/SanitizerCoverage.html). |
| |
| Example: |
| |
| ``` |
| {{{dumpfile:sancov:sancov.8675}}} |
| ``` |
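
One way to attach post-processing to the filter is a table of handlers keyed by dump type. The Python below is a hypothetical sketch: `dispatch_dumpfiles` and the shape of the handler callables are assumptions, not part of the format.

```python
import re

DUMPFILE_PATTERN = re.compile(r"\{\{\{dumpfile:([^:}]*):([^:}]*)\}\}\}")


def dispatch_dumpfiles(line, handlers):
    """Run the handler registered for each dumpfile element on the line.

    handlers maps a dump type (such as "sancov") to a callable that takes
    the dump identifier and returns text to present to the user.
    """
    results = []
    for dump_type, dump_name in DUMPFILE_PATTERN.findall(line):
        handler = handlers.get(dump_type)
        if handler is not None:  # unknown dump types are left alone
            results.append(handler(dump_name))
    return results
```

For example, a handler registered under `sancov` might fetch the named dump and return a link to a generated coverage report.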
| |
| ## Contextual elements ## |
| |
| These are elements that supply information necessary to convert |
| presentation elements to symbolic form. Unlike presentation elements, |
| they are not directly related to the surrounding text. Contextual |
| elements should appear alone on lines with no other non-whitespace |
| text, so that the symbolizing filter might elide the whole line from |
| its output without hiding any other log text. |
| |
| The contextual elements themselves do not necessarily need to be |
| presented in human-readable output. However, the information they |
| impart may be essential to understanding the logging text even after |
| symbolization. So it's recommended that this information be preserved |
| in some form when the original raw log with markup may no longer be |
| readily accessible for whatever reason. |
| |
| Contextual elements should appear in the logging stream before they are |
| needed. That is, if some piece of context may affect how the |
| symbolizing filter would interpret or present a later presentation |
| element, the necessary contextual elements should have appeared |
| somewhere earlier in the logging stream. It should always be possible |
| for the symbolizing filter to be implemented as a single pass over the |
| raw logging stream, accumulating context and massaging text as it goes. |
| |
| * `{{{reset}}}` |
| |
| This should be output before any other contextual element. It exists |
| to support implementations that handle logs coming from multiple |
| processes. Such implementations might not know when a new process |
| starts or ends. Because some identifying information (like process |
| IDs) might be the same between old and new processes, a way is needed |
| to distinguish two processes with such identical identifying |
| information. This element tells such implementations to reset the |
| state of the filter so that information from a previous process's |
| contextual elements is not assumed for a new process that just |
| happens to have the same identifying information. |
| |
| * `{{{module:%i:%s:%s:...}}}` |
| |
| This element represents a so-called "module". A "module" is a single |
| linked binary, such as a loaded ELF file. Usually each module occupies |
| a contiguous range of memory (on Zircon it always does). |
| |
| Here `%i` is the module ID which is used by other contextual elements to |
| refer to this module. The first `%s` is a human-readable identifier for |
| the module, such as an ELF `DT_SONAME` string or a file name; but it |
| might be empty. It's only for casual information. Only the module ID is |
| used to refer to this module in other contextual elements, never the `%s` |
| string. The `module` element defining a module ID must always be emitted |
| before any other elements that refer to that module ID, so that a filter |
| never needs to keep track of dangling references. The second `%s` is the |
| module type and it determines what the remaining fields are. The |
| following module types are supported: |
| |
| * `elf:%x` |
| |
| Here `%x` encodes an ELF Build ID. The Build ID should refer to a |
| single linked binary. The Build ID string is the sole way to identify |
| the binary from which this module was loaded. |
| |
| Example: |
| |
| ``` |
| {{{module:1:libc.so:elf:83238ab56ba10497}}} |
| ``` |
| |
| * `{{{mmap:%p:%x:...}}}` |
| |
| This contextual element is used to give information about a particular |
| region in memory. `%p` is the starting address and `%x` gives the size |
| in hex of the region of memory. The `...` part can take different forms |
| to give different information about the specified region of memory. The |
| allowed forms are the following: |
| |
| * `load:%i:%s:%p` |
| |
| This subelement informs the filter that a segment was loaded from a |
| module. The module is identified by its module ID `%i`. The `%s` is |
| one or more of the letters 'r', 'w', and 'x' (in that order and in |
| either upper or lower case) to indicate this segment of memory is |
| readable, writable, and/or executable. The symbolizing filter can use |
| this information to guess whether an address is a likely code address |
| or a likely data address in the given module. The remaining `%p` gives |
| the module-relative address. For ELF files the module-relative address |
| will be the `p_vaddr` of the associated program header. For example, if |
| your module's executable segment has `p_vaddr=0x1000`, `p_memsz=0x1234`, |
| and was loaded at `0x7acba69d5000`, then you need to subtract `0x7acba69d4000` |
| from any address between `0x7acba69d5000` and `0x7acba69d6234` to get the |
| module-relative address. The starting address will usually have been |
| rounded down to the active page size, and the size rounded up. |
| |
| Example: |
| |
| ``` |
| {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}} |
| ``` |
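
Putting the contextual elements together, a single-pass filter might accumulate state roughly like this. The sketch rests on several assumptions: the class name `Context` and its methods are hypothetical, the module ID is read as decimal unless prefixed with `0x`, the region size is accepted with or without a `0x` prefix, and only `elf` modules and `load` mappings are handled.

```python
import re

ELEMENT = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")


def parse_int(text):
    # Assumption: integers are decimal unless prefixed with "0x".
    return int(text, 16) if text.lower().startswith("0x") else int(text, 10)


class Context:
    """Accumulates contextual elements in one pass over the log lines."""

    def __init__(self):
        self.modules = {}   # module ID -> (name, Build ID as lower-case hex)
        self.segments = []  # (start, size, module ID, flags, module-relative start)

    def feed(self, line):
        for match in ELEMENT.finditer(line):
            tag, rest = match.groups()
            fields = rest[1:].split(":") if rest else []
            if tag == "reset":
                self.modules.clear()
                self.segments.clear()
            elif tag == "module" and len(fields) == 4 and fields[2] == "elf":
                self.modules[parse_int(fields[0])] = (fields[1], fields[3].lower())
            elif tag == "mmap" and len(fields) == 6 and fields[2] == "load":
                self.segments.append((
                    int(fields[0], 16),    # starting address
                    int(fields[1], 16),    # size of the region
                    parse_int(fields[3]),  # module ID
                    fields[4].lower(),     # some of "r", "w", "x"
                    int(fields[5], 16),    # module-relative address
                ))

    def resolve(self, address):
        """Return (module name, Build ID, module-relative address) or None."""
        for start, size, module_id, _flags, relative in self.segments:
            if start <= address < start + size and module_id in self.modules:
                name, build_id = self.modules[module_id]
                return name, build_id, address - start + relative
        return None
```

Fed the `module` and `mmap` examples above, `resolve(0x7acba69d5234)` yields the module-relative address `0x1234`, which matches subtracting `0x7acba69d4000` by hand.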
| |
| {# Re-enable variable substitution #} |
| {% endverbatim %} |