| # Symbolizer markup format # | 
 |  | 
 | This document defines a text format for log messages that can be | 
 | processed by a _symbolizing filter_.  The basic idea is that logging | 
 | code emits text that contains raw address values and so forth, without | 
 | the logging code doing any real work to convert those values to | 
 | human-readable form.  Instead, logging text uses the markup format | 
 | defined here to identify pieces of information that should be converted | 
 | to human-readable form after the fact.  As with other markup formats, | 
 | the expectation is that most of the text will be displayed as is, while | 
 | the markup elements will be replaced with expanded text, or converted | 
 | into active UI elements, that present more details in symbolic form. | 
 |  | 
 | This means there is no need for symbol tables, DWARF debugging sections, | 
 | or similar information to be directly accessible at runtime.  There is | 
 | also no need at runtime for any logic intended to compute human-readable | 
 | presentation of information, such as C++ symbol demangling.  Instead, | 
 | logging must include markup elements that give the contextual | 
 | information necessary to make sense of the raw data, such as memory | 
 | layout details. | 
 |  | 
 | This format identifies markup elements with a syntax that is both simple | 
 | and distinctive.  It's simple enough to be matched and parsed with | 
 | straightforward code.  It's distinctive enough that character sequences | 
 | that look like the start or end of a markup element should rarely if | 
 | ever appear incidentally in logging text.  It's specifically intended | 
 | not to require sanitizing plain text, such as the HTML/XML requirement | 
 | to replace `<` with `&lt;` and the like. | 
 |  | 
 | ## Scope and assumptions ## | 
 |  | 
 | This specification defines a format standard for Zircon and Fuchsia. | 
 | But there is nothing specific to Zircon or Fuchsia about the markup | 
 | format.  A symbolizing filter implementation will be independent both of | 
 | the _target_ operating system and machine architecture where the logs | 
 | are generated and of the _host_ operating system and machine | 
 | architecture where the filter runs. | 
 |  | 
 | This format assumes that the symbolizing filter processes intact whole | 
 | lines.  If long lines might be split during some stage of a logging | 
 | pipeline, they must be reassembled to restore the original line breaks | 
 | before feeding lines into the symbolizing filter.  Most markup elements | 
 | must appear entirely on a single line (often with other text before | 
 | and/or after the markup element).  There are some markup elements that | 
 | are specified to span lines, with line breaks in the middle of the | 
 | element.  Even in those cases, the filter is not expected to handle line | 
 | breaks in arbitrary places inside a markup element, but only inside | 
 | certain fields. | 
 |  | 
 | This format assumes that the symbolizing filter processes a coherent | 
 | stream of log lines from a single process address space context.  If a | 
 | logging stream interleaves log lines from more than one process, these | 
 | must be collated into separate per-process log streams and each stream | 
 | processed by a separate instance of the symbolizing filter.  Because the | 
 | kernel and user processes use disjoint address regions in most operating | 
 | systems (including Zircon), a single user process address space plus | 
 | the kernel address space can be treated as a single address space for | 
 | symbolization purposes if desired. | 
 |  | 
 | ## Dependence on Build IDs ## | 
 |  | 
 | The symbolizer markup scheme relies on contextual information about | 
 | runtime memory address layout to make it possible to convert markup | 
 | elements into useful symbolic form.  This relies on having an | 
 | unmistakable identification of which binary was loaded at each address. | 
 |  | 
 | An ELF Build ID is the payload of an ELF note with name `"GNU"` and type | 
 | `NT_GNU_BUILD_ID`, a unique byte sequence that identifies a particular | 
 | binary (executable, shared library, loadable module, or driver module). | 
 | The linker generates this automatically based on a hash that includes | 
 | the complete symbol table and debugging information, even if this is | 
 | later stripped from the binary. | 
 |  | 
 | This specification uses the ELF Build ID as the sole means of | 
 | identifying binaries.  Each binary relevant to the log must have been | 
 | linked with a unique Build ID.  The symbolizing filter must have some | 
 | means of mapping a Build ID back to the original ELF binary (either the | 
 | whole unstripped binary, or a stripped binary paired with a separate | 
 | debug file). | 
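
As one illustration of such a mapping, a filter could look up separate debug files in a directory tree keyed by Build ID. The sketch below assumes the `.build-id/xx/yyyy.debug` layout used by some debugger setups. That layout, the `root` directory, and the helper name `debug_file_path` are assumptions for illustration and are not required by this format.

```python
import os

def debug_file_path(root, build_id_hex):
    """Return the conventional debug-file path for a Build ID.

    Assumes a `.build-id` tree: the first byte (two hex digits) names a
    subdirectory and the remaining digits name the file.
    """
    build_id_hex = build_id_hex.lower()
    if len(build_id_hex) < 4 or len(build_id_hex) % 2 != 0:
        raise ValueError("Build ID must be an even number of hex digits")
    int(build_id_hex, 16)  # raises ValueError if not hexadecimal
    return os.path.join(root, ".build-id",
                        build_id_hex[:2], build_id_hex[2:] + ".debug")
```

A real symbol store might instead index unstripped binaries directly or query a remote service; only the Build ID key is fixed by this specification.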
 |  | 
 | ## Colorization ## | 
 |  | 
 | The markup format supports a restricted subset of ANSI X3.64 SGR (Select | 
 | Graphic Rendition) control sequences.  These are unlike other markup | 
 | elements: | 
 |  * They specify presentation details (**bold** or colors) rather than | 
 |    semantic information.  The association of semantic meaning with color | 
 |    (e.g. red for errors) is chosen by the code doing the logging, rather | 
 |    than by the UI presentation of the symbolizing filter.  This is a | 
 |    concession to existing code (e.g. LLVM sanitizer runtimes) that uses | 
 |    specific colors and would require substantial changes to generate | 
 |    semantic markup instead. | 
 |  * A single control sequence changes "the state", rather than being a | 
 |    hierarchical structure that surrounds affected text. | 
 |  | 
 | The filter processes ANSI SGR control sequences only within a single | 
 | line.  If a control sequence to enter a **bold** or color state is | 
 | encountered, it's expected that the control sequence to reset to default | 
 | state will be encountered before the end of that line.  If a "dangling" | 
 | state is left at the end of a line, the filter may reset to default | 
 | state for the next line. | 
 |  | 
 | An SGR control sequence is not interpreted inside any other markup element. | 
 | However, other markup elements may appear between SGR control sequences and | 
 | the color/**bold** state is expected to apply to the symbolic output that | 
 | replaces the markup element in the filter's output. | 
 |  | 
 | The accepted SGR control sequences all have the form `"\033[%um"` | 
 | (expressed here using C string syntax), where `%u` is one of these: | 
 |  | 
 | | Code | Effect | Notes | | 
 | |:----:|:------:|-------| | 
 | | `0`  | Reset to default formatting. | | | 
 | | `1`  | Use **bold text**  | Combines with color states, doesn't reset them.| | 
 | | `30` | Black foreground   | | | 
 | | `31` | Red foreground     | | | 
 | | `32` | Green foreground   | | | 
 | | `33` | Yellow foreground  | | | 
 | | `34` | Blue foreground    | | | 
 | | `35` | Magenta foreground | | | 
 | | `36` | Cyan foreground    | | | 
 | | `37` | White foreground   | | | 
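
The table above can be matched with a single regular expression. The following is a minimal sketch, not a reference implementation; the names `SGR_RE` and `split_sgr` are hypothetical. Sequences with codes outside the table are left in place as ordinary text.

```python
import re

# Matches only the SGR codes listed in the table: 0, 1, and 30 through 37.
SGR_RE = re.compile(r"\x1b\[(0|1|3[0-7])m")

def split_sgr(line):
    """Split a line into ("text", str) and ("sgr", int) pieces, in order."""
    pieces = []
    pos = 0
    for m in SGR_RE.finditer(line):
        if m.start() > pos:
            pieces.append(("text", line[pos:m.start()]))
        pieces.append(("sgr", int(m.group(1))))
        pos = m.end()
    if pos < len(line):
        pieces.append(("text", line[pos:]))
    return pieces
```

A filter that tracks the bold and color state would walk these pieces in order and could reset the state at the end of each line, as described above.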
 |  | 
 | ## Common markup element syntax ## | 
 |  | 
 | All the markup elements share a common syntactic structure to facilitate | 
 | simple matching and parsing code.  Each element has the form: | 
 |  | 
 | ``` | 
 | {{{tag:fields}}} | 
 | ``` | 
 |  | 
 | `tag` identifies one of the element types described below, and is always | 
 | a short alphabetic string that must be in lower case.  The rest of the | 
 | element consists of one or more fields.  Fields are separated by `:` and | 
 | cannot contain any `:` or `}` characters.  How many fields must be or | 
 | may be present and what they contain is specified for each element type. | 
 |  | 
 | No markup elements or ANSI SGR control sequences are interpreted inside the | 
 | contents of a field. | 
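
The common structure lends itself to a small matcher. This is a sketch under stated assumptions: it handles only elements that fit on one line, it accepts an element with no fields at all (as `{{{reset}}}` requires), and the names `ELEMENT_RE` and `find_elements` are hypothetical.

```python
import re

# tag: lower-case letters; each field: any characters except ':' and '}'.
ELEMENT_RE = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")

def find_elements(line):
    """Return a list of (tag, [fields...]) for each markup element in line."""
    found = []
    for m in ELEMENT_RE.finditer(line):
        tag = m.group(1)
        raw = m.group(2)  # empty, or ':' followed by ':'-separated fields
        fields = raw[1:].split(":") if raw else []
        found.append((tag, fields))
    return found
```

Checking the number and contents of the fields is left to per-element code, since those rules differ for each element type.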
 |  | 
 | In the descriptions of each element type, `printf`-style placeholders | 
 | indicate field contents: | 
 |  | 
 | * `%s` | 
 |  | 
 |   A string of printable characters, not including `:` or `}`. | 
 |  | 
 | * `%p` | 
 |  | 
 |   An address value represented by `0x` followed by an even number of | 
 |   hexadecimal digits (using either lower-case or upper-case for | 
 |   `A`..`F`).  If the digits are all `0` then the `0x` prefix may be | 
 |   omitted.  No more than 16 hexadecimal digits are expected to appear in | 
 |   a single value (64 bits). | 
 |  | 
 | * `%u` | 
 |  | 
 |   A nonnegative decimal integer. | 
 |  | 
 | * `%x` | 
 |  | 
 |   A sequence of an even number of hexadecimal digits (using either | 
 |   lower-case or upper-case for `A`..`F`), with no `0x` prefix. | 
 |   This represents an arbitrary sequence of bytes, such as an ELF Build ID. | 
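
The placeholder rules above could be checked as follows. This is a sketch, and the helper name `field_matches` is hypothetical. Two choices here are assumptions rather than requirements of the text: `%p` is rejected beyond 16 digits (the text only says more are not expected), and an unprefixed all-zero `%p` may have any number of digits.

```python
import re

# %p: "0x" plus 1 to 8 pairs of hex digits, or a bare run of zeros.
_P_RE = re.compile(r"0x(?:[0-9a-fA-F]{2}){1,8}|0+")
# %u: nonnegative decimal integer.
_U_RE = re.compile(r"[0-9]+")
# %x: an even number of hex digits, no "0x" prefix.
_X_RE = re.compile(r"(?:[0-9a-fA-F]{2})+")

def field_matches(placeholder, text):
    """Report whether text is a valid field for the given placeholder."""
    if placeholder == "%s":
        return text.isprintable() and ":" not in text and "}" not in text
    pattern = {"%p": _P_RE, "%u": _U_RE, "%x": _X_RE}[placeholder]
    return pattern.fullmatch(text) is not None
```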
 |  | 
 | ## Presentation elements ## | 
 |  | 
 | These are elements that convey a specific program entity to be displayed | 
 | in human-readable symbolic form. | 
 |  | 
 | * `{{{symbol:%s}}}` | 
 |  | 
 |   Here `%s` is the linkage name for a symbol or type.  It may require | 
 |   demangling according to language ABI rules.  Even for unmangled names, | 
 |   it's recommended that this markup element be used to identify a symbol | 
 |   name so that it can be presented distinctively. | 
 |  | 
 |   Examples: | 
 |   ``` | 
 |   {{{symbol:_ZN7Mangled4NameEv}}} | 
 |   {{{symbol:foobar}}} | 
 |   ``` | 
 |  | 
 | * `{{{pc:%p}}}` | 
 |  | 
 |   Here `%p` is the memory address of a code location. | 
 |   It might be presented as a function name and source location. | 
 |  | 
 |   Examples: | 
 |   ``` | 
 |   {{{pc:0x12345678}}} | 
 |   {{{pc:0xffffffff9abcdef0}}} | 
 |   ``` | 
 |  | 
 | * `{{{data:%p}}}` | 
 |  | 
 |   Here `%p` is the memory address of a data location. | 
 |   It might be presented as the name of a global variable at that location. | 
 |  | 
 |   Examples: | 
 |   ``` | 
 |   {{{data:0x12345678}}} | 
 |   {{{data:0xffffffff9abcdef0}}} | 
 |   ``` | 
 |  | 
 | * `{{{bt:%u:%p}}}` | 
 |  | 
 |   This represents one frame in a backtrace.  It usually appears on a | 
 |   line by itself (surrounded only by whitespace), in a sequence of such | 
 |   lines with ascending frame numbers.  The human-readable output can | 
 |   therefore be formatted on that assumption, so that it looks good for a | 
 |   sequence of `bt` elements each alone on its line with uniform | 
 |   indentation of each line.  But the element can appear anywhere, so the | 
 |   filter should not remove any non-whitespace text surrounding it. | 
 |  | 
 |   Here `%u` is the frame number, which starts at zero for the location | 
 |   of the fault being identified, increments to one for the caller of | 
 |   frame zero's call frame, to two for the caller of frame one, etc. | 
 |   `%p` is the memory address of a code location. | 
 |  | 
 |   In frames after frame zero, this code location identifies a call site. | 
 |   Some emitters may subtract one byte or one instruction length from the | 
 |   actual return address for the call site, with the intent that the | 
 |   address logged can be translated directly to a source location for the | 
 |   call site and not for the apparent return site thereafter (which can | 
 |   be confusing).  It's recommended that emitters _not_ do this, so that | 
 |   each frame's code location is the exact return address given to its | 
 |   callee and e.g. could be highlighted in instruction-level disassembly. | 
 |   The symbolizing filter can do the adjustment to the address it | 
 |   translates into a source location.  Assuming that a call instruction | 
 |   is longer than one byte on all supported machines, applying the | 
 |   "subtract one byte" adjustment a second time still results in an | 
 |   address somewhere in the call instruction, so a little sloppiness here | 
 |   does no harm. | 
 |  | 
 |   Examples: | 
 |   ``` | 
 |   {{{bt:0:0x12345678}}} | 
 |   {{{bt:1:0xffffffff9abcdef0}}} | 
 |   ``` | 
 |  | 
 | * `{{{hexdict:...}}}` | 
 |  | 
 |   This element can span multiple lines.  Here `...` is a sequence of | 
 |   key-value pairs where a single `:` separates each key from its value, | 
 |   and arbitrary whitespace separates the pairs.  The value (right-hand | 
 |   side) of each pair either is one or more `0` digits, or is `0x` | 
 |   followed by hexadecimal digits.  Each value might be a memory address | 
 |   or might be some other integer (including an integer that looks like a | 
 |   likely memory address but actually has an unrelated purpose).  When | 
 |   the contextual information about the memory layout suggests that a | 
 |   given value could be a code location or a global variable data | 
 |   address, it might be presented as a source location or variable name | 
 |   or with active UI that makes such interpretation optionally visible. | 
 |  | 
 |   The intended use is for things like register dumps, where the emitter | 
 |   doesn't know which values might have a symbolic interpretation but a | 
 |   presentation that makes plausible symbolic interpretations available | 
 |   might be very useful to someone reading the log.  At the same time, | 
 |   a flat text presentation should usually avoid interfering too much | 
 |   with the original contents and formatting of the dump.  For example, | 
 |   it might use footnotes with source locations for values that appear | 
 |   to be code locations.  An active UI presentation might show the dump | 
 |   text as is, but highlight values with symbolic information available | 
 |   and pop up a presentation of symbolic details when a value is selected. | 
 |  | 
 |   Example: | 
 |   ``` | 
 |   {{{hexdict: | 
 |     CS:                   0 RIP:     0x6ee17076fb80 EFL:            0x10246 CR2:                  0 | 
 |     RAX:      0xc53d0acbcf0 RBX:     0x1e659ea7e0d0 RCX:                  0 RDX:     0x6ee1708300cc | 
 |     RSI:                  0 RDI:     0x6ee170830040 RBP:     0x3b13734898e0 RSP:     0x3b13734898d8 | 
 |      R8:     0x3b1373489860  R9:         0x2776ff4f R10:     0x2749d3e9a940 R11:              0x246 | 
 |     R12:     0x1e659ea7e0f0 R13: 0xd7231230fd6ff2e7 R14:     0x1e659ea7e108 R15:      0xc53d0acbcf0 | 
 |   }}} | 
 |   ``` | 
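
Two of the elements above involve a little logic on the filter side, sketched here. The first function applies the "subtract one byte" adjustment for `bt` frames after frame zero, assuming the emitter logged exact return addresses as recommended. The second parses the body of a `hexdict` element, allowing whitespace between the `:` and the value as in the register dump example. The names `bt_lookup_address`, `HEXDICT_PAIR_RE`, and `parse_hexdict` are hypothetical.

```python
import re

def bt_lookup_address(frame, address):
    """Address to use for the source lookup of a bt frame.

    Frame 0 is the exact fault location.  Later frames hold return
    addresses, so step back one byte to land inside the call instruction.
    The address shown to the user should stay unadjusted.
    """
    return address if frame == 0 else address - 1

# Key: non-whitespace run without ':'.  Value: "0x" hex digits, or zeros.
HEXDICT_PAIR_RE = re.compile(r"([^\s:]+):\s*(0x[0-9a-fA-F]+|0+)(?=\s|$)")

def parse_hexdict(body):
    """Parse the text between '{{{hexdict:' and '}}}' into {key: int}."""
    return {key: int(value, 16)
            for key, value in HEXDICT_PAIR_RE.findall(body)}
```

Whether a parsed value is then treated as a code or data address depends on the memory layout described by the contextual elements.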
 |  | 
 | ## Trigger elements ## | 
 |  | 
 | These elements cause an external action and are presented to the | 
 | user in human-readable form.  Generally the external action produces | 
 | a linkable page.  The link, or other information about the external | 
 | action, can then be presented to the user. | 
 |  | 
 | * `{{{dumpfile:%s:%s}}}` | 
 |  | 
 |   Here the first `%s` is an identifier for a type of dump and the | 
 |   second `%s` is an identifier for a particular dump that's just been | 
 |   published.  The types of dumps, the exact meaning of "published", | 
 |   and the nature of the identifier are outside the scope of the markup | 
 |   format per se.  In general it might correspond to writing a file by | 
 |   that name or something similar. | 
 |  | 
 |   This element may trigger additional post-processing work beyond | 
 |   symbolizing the markup. It indicates that a dump file of some sort | 
 |   has been published.  Some logic attached to the symbolizing filter may | 
 |   understand certain types of dump file and trigger additional | 
 |   post-processing of the dump file upon encountering this element (e.g. | 
 |   generating visualizations, symbolization).  The expectation is that the | 
 |   information collected from contextual elements (described below) in the | 
 |   logging stream may be necessary to decode the content of the dump.  So | 
 |   if the symbolizing filter triggers other processing, it may need to | 
 |   feed some distilled form of the contextual information to those | 
 |   processes. | 
 |  | 
 |   On Zircon and Fuchsia in particular, "publish" means to call the | 
 |   `__sanitizer_publish_data` function from `<zircon/sanitizer.h>` | 
 |   with the "type" identifier as the "sink name" string.  The "dump | 
 |   identifier" is the name attached to the Zircon VMO whose handle | 
 |   was passed in the call to `__sanitizer_publish_data`. | 
 |   **TODO(mcgrathr): Link to docs about `__sanitizer_publish_data` and | 
 |   getting data dumps off the device.** | 
 |  | 
 |   An example of a type identifier is `sancov`, for dumps from LLVM | 
 |   [SanitizerCoverage](https://clang.llvm.org/docs/SanitizerCoverage.html). | 
 |  | 
 |   Example: | 
 |   ``` | 
 |   {{{dumpfile:sancov:sancov.8675}}} | 
 |   ``` | 
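
One way to attach post-processing to `dumpfile` elements is a table of handlers keyed by dump type. This is only a sketch: the `handlers` mapping, its callback signature, and the name `handle_dumpfiles` are hypothetical and not part of the markup format.

```python
import re

DUMPFILE_RE = re.compile(r"\{\{\{dumpfile:([^:}]*):([^:}]*)\}\}\}")

def handle_dumpfiles(line, handlers):
    """Invoke the handler for each dumpfile element whose type is known.

    handlers maps a dump type (e.g. "sancov") to a callable taking the
    dump identifier.  Returns the handlers' results in order.
    """
    results = []
    for dump_type, dump_name in DUMPFILE_RE.findall(line):
        handler = handlers.get(dump_type)
        if handler is not None:
            results.append(handler(dump_name))
    return results
```

Elements with an unrecognized dump type are ignored here; a real filter would probably still present them to the user in some form.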
 |  | 
 | ## Contextual elements ## | 
 |  | 
 | These are elements that supply information necessary to convert | 
 | presentation elements to symbolic form.  Unlike presentation elements, | 
 | they are not directly related to the surrounding text.  Contextual | 
 | elements should appear alone on lines with no other non-whitespace | 
 | text, so that the symbolizing filter might elide the whole line from | 
 | its output without hiding any other log text. | 
 |  | 
 | The contextual elements themselves do not necessarily need to be | 
 | presented in human-readable output.  However, the information they | 
 | impart may be essential to understanding the logging text even after | 
 | symbolization.  So it's recommended that this information be preserved | 
 | in some form when the original raw log with markup may no longer be | 
 | readily accessible for whatever reason. | 
 |  | 
 | Contextual elements should appear in the logging stream before they are | 
 | needed.  That is, if some piece of context may affect how the | 
 | symbolizing filter would interpret or present a later presentation | 
 | element, the necessary contextual elements should have appeared | 
 | somewhere earlier in the logging stream.  It should always be possible | 
 | for the symbolizing filter to be implemented as a single pass over the | 
 | raw logging stream, accumulating context and massaging text as it goes. | 
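
The single-pass structure might look like the sketch below. It accumulates contextual elements as it goes, elides lines that held nothing but contextual elements, and hands every other element to a caller-supplied `present` callback. The class name, the callback signature, and the choice to store context as a plain list are all assumptions made for illustration.

```python
import re

ELEMENT_RE = re.compile(r"\{\{\{([a-z]+)((?::[^:}]*)*)\}\}\}")
CONTEXTUAL_TAGS = {"reset", "module", "mmap"}

class SinglePassFilter:
    def __init__(self, present):
        # present(tag, fields, context) -> replacement text (hypothetical).
        self.present = present
        self.context = []  # contextual elements seen so far, in order

    def feed(self, line):
        """Process one whole line; return output text, or None to elide."""
        saw_context = False

        def replace(match):
            nonlocal saw_context
            tag = match.group(1)
            raw = match.group(2)
            fields = raw[1:].split(":") if raw else []
            if tag == "reset":
                self.context.clear()
            elif tag in CONTEXTUAL_TAGS:
                self.context.append((tag, fields))
            else:
                return self.present(tag, fields, self.context)
            saw_context = True
            return ""

        out = ELEMENT_RE.sub(replace, line)
        if saw_context and not out.strip():
            return None
        return out
```

Because context is only ever appended to or cleared, each presentation element sees exactly the contextual elements that preceded it in the stream.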
 |  | 
 | * `{{{reset}}}` | 
 |  | 
 |   This should be output before any other contextual element. This | 
 |   element exists to support implementations that handle logs coming | 
 |   from multiple processes. Such implementations might not know when a | 
 |   new process starts or ends. Because some identifying information | 
 |   (like process IDs) might be the same between old and new processes, | 
 |   a way is needed to distinguish two processes with such identical | 
 |   identifying information. This element tells such implementations to | 
 |   reset the state of the filter, so that information from a previous | 
 |   process's contextual elements is not assumed for a new process that | 
 |   just happens to have the same identifying information. | 
 |  | 
 | * `{{{module:%i:%s:%s:...}}}` | 
 |  | 
 |   This element represents a so-called "module". A "module" is a single | 
 |   linked binary, such as a loaded ELF file. Usually each module occupies | 
 |   a contiguous range of memory (it always does on Zircon). | 
 |   | 
 |   Here `%i` is the Module ID which is used by other contextual elements | 
 |   to refer to this module. The first `%s` is a human-readable identifier | 
 |   for the module, such as an ELF `DT_SONAME` string or a file name; but | 
 |   it might be empty. It is informational only. The Module ID | 
 |   is used exclusively to refer to this module in other contextual | 
 |   elements. The second `%s` is the module type and it determines what | 
 |   the remaining fields are. The following module types are supported: | 
 |  | 
 |   * `elf:%x` | 
 |  | 
 |     Here `%x` encodes an ELF Build ID. The Build ID should refer to a | 
 |     single linked binary. The Build ID string is the sole way to identify | 
 |     the binary from which this module was loaded. | 
 |  | 
 |   Example: | 
 |   ``` | 
 |   {{{module:1:libc.so:elf:83238ab56ba10497}}} | 
 |   ``` | 
 |  | 
 | * `{{{mmap:%p:%x:...}}}` | 
 |  | 
 |   This contextual element is used to give information about a particular | 
 |   region in memory. `%p` is the starting address and `%x` gives the size | 
 |   in hex of the region of memory. The `...` part can take different forms | 
 |   to give different information about the specified region of memory. The | 
 |   allowed forms are the following: | 
 |  | 
 |   * `load:%i:%s:%p` | 
 |  | 
 |     This subelement informs the filter that a segment was loaded from a | 
 |     module. The module is identified by its Module ID `%i`. The `%s` is | 
 |     one or more of the letters 'r', 'w', and 'x' (in that order and in | 
 |     either upper or lower case) to indicate this segment of memory is | 
 |     readable, writable, and/or executable. The symbolizing filter can use | 
 |     this information to guess whether an address is a likely code address | 
 |     or a likely data address in the given module. The remaining `%p` gives | 
 |     the module-relative address. For ELF files the module-relative address | 
 |     will be the `p_vaddr` of the associated program header. For example, if | 
 |     a module's executable segment has `p_vaddr=0x1000`, `p_memsz=0x1234`, | 
 |     and was loaded at 0x7acba69d5000, then subtracting 0x7acba69d4000 | 
 |     from any address between 0x7acba69d5000 and 0x7acba69d6234 yields the | 
 |     module-relative address. The starting address will usually have been | 
 |     rounded down to the active page size, and the size rounded up. | 
 |  | 
 |   Example: | 
 |   ``` | 
 |   {{{mmap:0x7acba69d5000:0x5a000:load:1:rx:0x1000}}} | 
 |   ``` |
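
The address arithmetic described for `load` can be sketched as follows, using the field list of an `mmap` element already split on `:`. The names `Mapping`, `parse_mmap_load`, and `to_module_relative` are hypothetical. The Module ID is kept as an opaque string, and the size is parsed as hexadecimal with or without a `0x` prefix; both are assumptions made so that the example element above parses.

```python
from collections import namedtuple

Mapping = namedtuple("Mapping", "start size module_id perms vaddr")

def parse_mmap_load(fields):
    """Build a Mapping from the fields of a {{{mmap:...:load:...}}} element."""
    start, size, kind, module_id, perms, vaddr = fields
    if kind != "load":
        raise ValueError("unsupported mmap form: " + kind)
    # int(text, 16) accepts hex digits with or without a "0x" prefix.
    return Mapping(int(start, 16), int(size, 16), module_id,
                   perms.lower(), int(vaddr, 16))

def to_module_relative(mappings, address):
    """Return (module_id, module-relative address), or None if unmapped."""
    for m in mappings:
        if m.start <= address < m.start + m.size:
            return m.module_id, address - m.start + m.vaddr
    return None
```

For the example element, an address of 0x7acba69d5234 falls inside the region and maps to module-relative address 0x1234 in module `1`, which matches subtracting 0x7acba69d4000 as described above.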