Contributors: cphoenix@
This codelab explains the Triage utility:
The source files and examples to which this document refers are available at:
Triage allows you to scan Fuchsia snapshots (snapshot.zip files) for predefined conditions.
The Triage system makes it easy to configure new conditions, increasing the usefulness of Triage for everyone.
Before you start on this codelab, make sure you have completed the following:
fx build
To run Triage:
fx triage
This command downloads a fresh snapshot.zip file using the fx snapshot command, then runs the default rules, which are located in the source tree:
//src/diagnostics/config/triage/*.triage
To analyze a specific snapshot.zip file, use --data.
- You can specify at most one --data argument.
- The argument to --data can also be a path to a directory containing an unzipped snapshot.
- If you run fx triage without specifying a --data option, it runs a fresh fx snapshot and analyzes its files.
$ fx triage --data my/foo/snapshot.zip
To use a specific configuration file or all *.triage files in a specific directory, use --config.
- You can use multiple --config arguments.
- If a --config argument is used, the default rules will not be automatically loaded.
fx triage --config my/directory --config my/file.triage
The rest of this codelab explains how to configure new behavior in Triage.
Triage condenses the mass of Diagnostic data into useful information, through the following steps:
1. Select values from the inspect.json file using selector strings in the select section of the config file.
2. Compute new values from the selected values, as specified in the eval section of the config file.
3. Take actions, as specified in the act section of the config file.
4. Test your actions and computations with entries in the test section.
Navigate to the examples/diagnostics/triage directory in your source tree. The following command is intended to run from that directory:
fx triage --config . --data snapshot
Running this command in the sample directory with the unmodified codelab files prints a line from a pre-supplied action:
Warning: 'always_triggered' in 'rules' detected 'Triage is running': 'always_true' was true
This codelab includes an inspect.json file with Inspect data to make the exercises work predictably. This file is in the sample directory under snapshot/inspect.json.
Note: inspect.json files are normally packaged in the snapshot.zip file produced by fx snapshot. Either use unzip to unpack these files or give the .zip file as the argument to --data. For this codelab the snapshot has already been unzipped.
The Triage program uses configuration loaded from one or more .triage files. This codelab uses the rules.triage file located in the sample directory.
.triage files use JSON5, which is easier to write and read than JSON: for example, JSON5 allows comments, trailing commas, and unquoted keys.
The inspect.json file in the sample directory indicates a couple of problems with the system. You're going to configure the triage system to detect those problems.
This step configures Triage to extract values from the data in the inspect.json file.
The rules.triage file contains a key-value section called select. Each key (name) can be used in the body of other config entries. Each value is a selector string. In effect, each entry in the select section (and the eval section, described below) defines a variable.
The selector string is a colon-separated string that tells where in the Inspect data to find the value you need.
select: {
    disk_total: "INSPECT:bootstrap/fshost:root/data_stats/stats:total_bytes",
    // "data_stat" is an intentional typo to fix later in the codelab.
    disk_used: "INSPECT:bootstrap/fshost:root/data_stat/stats:used_bytes",
}
Inspect data published by a component is organized as a tree of nodes with values (properties) at the leaves. The inspect.json file is an array of these trees, each with a moniker that identifies the source component.
The portion of the selector string between the INSPECT: and the second colon should match one of the moniker strings in the inspect.json file.
The portion between the second and third colons is a /-separated list of node names.
The portion after the last colon is the property name.
The disk_used selector above indicates a component whose moniker matches the string bootstrap/fshost and whose Inspect tree contains the path root/data_stat/stats. It selects the used_bytes property of the stats node at the end of that path.
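As an illustration of that structure, the following Python sketch splits a selector string into its parts. The function name parse_selector and the field names are hypothetical, chosen only for this example. This is not how Triage itself parses selectors, and it ignores any escaping or wildcard syntax that real selectors may support.

```python
from typing import List, NamedTuple


class Selector(NamedTuple):
    source: str            # e.g. "INSPECT"
    moniker: str           # component moniker, e.g. "bootstrap/fshost"
    node_path: List[str]   # node names, e.g. ["root", "data_stats", "stats"]
    prop: str              # property name, e.g. "total_bytes"


def parse_selector(text: str) -> Selector:
    """Split 'SOURCE:moniker:node/path:property' on its three colons."""
    parts = text.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected 4 colon-separated parts, got {len(parts)}")
    source, moniker, nodes, prop = parts
    return Selector(source, moniker, nodes.split("/"), prop)


sel = parse_selector("INSPECT:bootstrap/fshost:root/data_stats/stats:total_bytes")
print(sel.moniker)    # bootstrap/fshost
print(sel.node_path)  # ['root', 'data_stats', 'stats']
print(sel.prop)       # total_bytes
```

Keeping the node path as a list makes it easy to compare, node by node, against the tree in inspect.json when a selector fails to match.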
Put the above selectors into the “select” section of the rules.triage file.
Entering selectors by hand is error-prone, so fx triage --select can be used to output valid selectors.
$ fx triage --data snapshot --select total_bytes
INSPECT:bootstrap/fshost:root/data_stats/stats:total_bytes
Multiple --select parameters can be used. The program will generate all possible selectors for the snapshot's Diagnostic data, then filter (grep) through all --select parameters.
After selecting values from the inspect.json file, you need to do some logic, and probably some arithmetic, to see whether those values indicate a condition worth flagging.
Copy and add the following line to the eval section of the rules.triage file to calculate how full the disk is:
eval: {
    ....
    disk_percentage: "disk_used / disk_total",
}
eval entries use ordinary infix math expressions. See the Details section for more information.
In the “act” part of the config file, add an action that prints a warning when the disk is 98% full. Use the following lines:
act: {
    ...
    disk_full: {
        type: "Warning",
        trigger: "disk_percentage > 0.98",
        print: "Disk reached 98% full",
    },
}
Note the following:
- print is the only available action.
An action can also be a gauge, which displays a value:
act: {
    ...
    disk_display: {
        type: "Gauge",
        value: "disk_percentage",
        format: "percentage",
    }
}
Gauges always display the supplied value.
The format field is optional. A value of “percentage” tells the output formatter to expect a value between 0 and 1 and display it as a percentage.
The following command will run Triage against the local config file.
fx triage --config . --data snapshot
You will get an error that looks like the following:
[ERROR] In config 'rules': No value found matching selector bootstrap/fshost:root/data_stat/stats:used_bytes
There was a typo in the selector rules. Triage could not find values needed to evaluate a rule. In fact, the correct selector is “data_stats” not “data_stat.” Fix it in your selector rules and try again.
fx triage --config . --data snapshot
Now what happened? Nothing, right? So, how do you know whether there was no problem in the inspect.json file, or a bug in your rule?
You can (and should!) add tests for your actions. For each test, specify values and whether or not those values should trigger your rule.
To test the rule you've added, add the following to the test section of the rules.triage file:
test: {
    ....
    is_full: {
        yes: ["disk_full"],
        values: {
            disk_used: 98,
            disk_total: 100,
        }
    }
}
Note: Unlike the right hand side of eval entries, the values entries are parsed as JSON values, not expression strings. Numbers should not be quoted.
The keys in the values section should be the names of eval or select entries. The values supplied will override the value that the entry would have selected or calculated.
You can also test conditions in which actions should not trigger:
test: {
    ....
    not_full: {
        no: ["disk_full"],
        values: {
            disk_used: 97,
            disk_total: 100,
        }
    }
}
To run the test, just run Triage. It automatically self-tests each time it's run.
fx triage --config . --data snapshot
Whoops! That should signal an error:
Test is_full failed: trigger 'disk_percentage > 0.98' of action disk_full returned Bool(false), expected true
You want to trigger when the disk is 98% or more full, but that's not quite what you wrote, and your test caught the problem. Modify the > in your action to be a >=:
trigger: "disk_percentage >= 0.98",
Run Triage again. The error should disappear, replaced by a warning that your inspect.json file does in fact indicate a full disk.
Warning: 'disk_full' in 'rules' detected 'Disk reached 98% full': 'disk_percentage >= 0.98' was true
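The difference between the > and >= triggers can be checked with plain arithmetic. The Python sketch below is only an illustration using the values from the is_full and not_full tests. The helper name disk_percentage mirrors the eval entry but is otherwise hypothetical, and Triage's own evaluator is not involved.

```python
def disk_percentage(disk_used: float, disk_total: float) -> float:
    # Mirrors the eval entry "disk_used / disk_total".
    return disk_used / disk_total


full = disk_percentage(98, 100)      # values from the is_full test
not_full = disk_percentage(97, 100)  # values from the not_full test

print(full > 0.98)       # False: exactly 98% does not satisfy ">"
print(full >= 0.98)      # True: ">=" fires at exactly 98%
print(not_full >= 0.98)  # False: 97% stays below the threshold
```

Testing at the exact boundary value, as is_full does, is what exposes an off-by-one comparison like this.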
You can write expressions to test whether a line of a log file (syslog.txt, klog.txt, bootlog.txt) is matched by a regular expression. It looks like this:
eval: {
    syslog_has_not_found: "SyslogHas('ERROR.*not found')",
    ...
}
act: {
    something_not_found: {
        trigger: "SyslogHas('ERROR.*not found')",
        ...
    }
}
Note: To nest quotation marks you can use either single quote ' or escaped double quote \".
The functions are SyslogHas(), KlogHas(), BootlogHas(). If a log file is missing (for example, some snapshots contain no bootlog.txt) it is treated the same as an empty file.
To test this, you can include log contents in a test entry like this:
test: {
    test_error: {
        yes: ["error_scan"],
        syslog: "ERROR: file Foo not found\nSecond line OK",
    }
}
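The line-matching behavior described above can be sketched in Python. The helper name log_has is hypothetical, and the sketch makes two assumptions: that a match anywhere within a line counts (an unanchored search), and that Python's re module behaves like Triage's regex engine for simple patterns such as this one. It is not Triage's implementation.

```python
import re
from typing import Optional


def log_has(log_text: Optional[str], pattern: str) -> bool:
    """Return True if any line of log_text contains a match for pattern.

    A missing log (None) is treated like an empty file.
    """
    if log_text is None:
        return False
    regex = re.compile(pattern)
    return any(regex.search(line) for line in log_text.splitlines())


syslog = "ERROR: file Foo not found\nSecond line OK"
print(log_has(syslog, "ERROR.*not found"))  # True
print(log_has(syslog, "WARNING"))           # False
print(log_has(None, "ERROR.*not found"))    # False
```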
The snapshot contains a file annotations.json, which contains information on the build, board, uptime, and so on.
Values can be fetched from this file by using the function Annotation() with a single string parameter, which is a key of the JSON object in the file. For example,
eval: {
    using_chromebook: "Annotation('build.board') == 'chromebook-x64'",
}
You can add any number of Triage configuration files, and even use variables defined in one file in another file. This has lots of applications:
Add a file “product.triage” containing the following:
{
    eval: {
        max_components: "4",
    },
}
Note the following:
- This file has no select, act, or test entries.
- The value 4 is a math expression string, so it does need to be quoted.
Add the following entries to the rules.triage file:
select: {
    ...
    actual_components: "INSPECT:bootstrap/archivist:root/event_stats:components_started",
}
That will extract how many components were active in the device.
eval: {
    ...
    too_many_components: "actual_components > product::max_components",
}
That compares the actual components with the theoretical maximum for the product.
Note: To use variable names from another file, combine the file name, two colons, and the variable name.
Finally, add an action:
act: {
    ...
    component_overflow: {
        type: "Warning",
        trigger: "too_many_components",
        print: "Too many components!",
    },
}
Unfortunately, this device tried to use too many components, so this warning should trigger when “fx triage” is run.
Note: The trigger of an action can also use file::name syntax to refer to a variable from another file.
In a production environment, several “product.triage” files could be maintained in different directories, and Triage could be directed to use any of them with the “--config” command line argument.
Tests use only the metrics within the file where the test occurs, plus the values supplied by the test. An expression (eval or test trigger) that uses namespaced values like “a::b” must have those values supplied by an “a::b” entry in the test's values.
Note: Unlike most keys in .triage files, namespaced names must be double-quoted when used as keys.
test: {
    component_max_ok: {
        no: [
            "component_overflow",
        ],
        values: {
            actual_components: 17,
            "product::max_components": 17,
        },
    },
},
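One way to picture the file::name lookup is as a two-level mapping from file name to variable name. The Python sketch below uses the values from the component_max_ok test. The function resolve and the dictionary layout are hypothetical illustrations, not Triage's internal data structures.

```python
def resolve(name: str, current_file: str, files: dict) -> object:
    """Look up 'name' or 'file::name' in a {file: {variable: value}} mapping."""
    if "::" in name:
        namespace, variable = name.split("::", 1)
    else:
        # Un-namespaced names refer to the file being evaluated.
        namespace, variable = current_file, name
    return files[namespace][variable]


files = {
    "rules": {"actual_components": 17},
    "product": {"max_components": 17},
}

actual = resolve("actual_components", "rules", files)
maximum = resolve("product::max_components", "rules", files)
print(actual > maximum)  # False: 17 > 17 does not hold, so no warning
```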
Names (of selectors, expressions, actions, and tests, as well as the basenames of config files) can be any letter or underscore, followed by any number of letters, numbers, or underscores.
Names beginning with underscores may have special meaning in future versions of Triage. They're not forbidden, but it's best to avoid them.
The name of each .triage file establishes its namespace. Loading two .triage files with the same name from different directories is not allowed.
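The naming rule above maps naturally onto a regular expression. The Python sketch below assumes that "letter" means an ASCII letter, which the text does not state explicitly. The names NAME_PATTERN and is_valid_name are hypothetical.

```python
import re

# Letter or underscore first, then letters, digits, or underscores.
# Assumption: only ASCII letters count as letters.
NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_name(name: str) -> bool:
    return NAME_PATTERN.fullmatch(name) is not None


print(is_valid_name("disk_percentage"))  # True
print(is_valid_name("_reserved"))        # True (allowed, but best avoided)
print(is_valid_name("98_full"))          # False: starts with a digit
print(is_valid_name("disk-full"))        # False: hyphen is not allowed
```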
- + - * / // operators, with ordinary order and precedence of operations.
  - / produces a float value.
  - // produces an int value, truncating the result toward 0, even with float arguments. (Note this is different from Python 3, where // rounds downward.)
  - + - * preserve the type of their operands (mixed promotes to float).
- Comparison operators: > >= < <= == !=
- ... eval rule.
- Names of eval and select entries can be used as variables.
- filename::variable namespaced variables.
Triage provides predefined functions for use in eval expressions:
- Max(value1, value2, value3...) returns the largest value, with type promotion to float.
- Min(value1, value2, value3...) returns the smallest value, with type promotion to float.
- And(value1, value2, value3...) takes Boolean arguments and returns the logical AND of the values.
- Or(value1, value2, value3...) takes Boolean arguments and returns the logical OR of the values.
- Not(value) takes one Boolean argument and returns the logical NOT of it.
- SyslogHas(matcher), KlogHas(matcher), BootlogHas(matcher) return true if the corresponding log file has a line matching matcher, which is a string containing a regular expression.
- Annotation(key) returns the corresponding value from the annotations.json file.
- Option(value1, value2, value3...) returns the first useful value, to support selector migrations and defaults: the first non-empty-list, non-Missing value if any; or empty list if one was given; or Missing.
- Missing(value) returns true if the value is an error indication.
- Days(), Hours(), Minutes(), Seconds(), Millis(), Micros(), and Nanos() calculate values for comparison with monotonic timestamps.
- Now() returns the approximate timestamp when the Diagnostic data was created.
- StringMatches(value, regex) applies the given regex to the given value and returns true if there is a match. The regex syntax is that supported by the Rust regex crate.
Note: Since logs are not structured, selectors can't be applied to them, so we supply regex matching functions instead.
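Returning to the // operator described earlier in this section: because it truncates toward 0 while Python's // rounds downward, the two disagree on negative results. The Python sketch below emulates the described truncating behavior. The function name triage_int_div is hypothetical, and the sketch does not model overflow or very large integers, where going through float division would lose precision.

```python
import math


def triage_int_div(a: float, b: float) -> int:
    """Divide, then truncate toward zero, as described for Triage's // operator."""
    return math.trunc(a / b)


print(triage_int_div(7, 2))    # 3
print(triage_int_div(-7, 2))   # -3 (truncated toward zero)
print(-7 // 2)                 # -4 (Python rounds downward instead)
print(triage_int_div(7.5, 2))  # 3 (int result even with a float argument)
```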
Triage can apply functions to vectors of values. Vectors have the format "[expr, expr, expr...]". Some selectors return multi-element vectors.
Triage provides the functions Map(), Fold(), Filter(), and Count() to process vectors, Fn() to define functions or lambdas for Map, Fold, and Filter to apply, and Apply() to apply a Fn() to arguments.
For more information see Configuring fx triage.
See fx triage for the latest features and options - Triage will keep improving!