blob: 72edd8b40e6df7dd8ee68a5c5af3672095fb6fae [file] [log] [blame] [view]
# Build graph convergence
Build graph convergence is that property that a single build invocation will
perform all the actions necessary and in the right order so that every action's
outputs are _newer_ than its inputs.
Fuchsia uses the Ninja build system, which is timestamp-driven. Ninja expresses
the build as a graph of input/output files and actions that take inputs and
produce outputs.
When you run a build, e.g. with `fx build`, Ninja will traverse the build graph
and perform any actions whose outputs are not present or whose inputs have
changed since they last run, all in topological order (dependencies before
dependents).
However, build graph _actions_ are not verified to fulfill the promise that
outputs are newer than inputs, which can lead to convergence issues.
## Common root causes
There are infinitely many ways to create Ninja convergence issues. That said,
prior experience taught us that there are common root causes for these problems.
### An output isn't generated
If a build action is declared to produce an output but doesn't actually produce
that output (in some circumstances, or ever) then this will cause convergence
issues. For instance, an action might declare that it generates a stamp file on
success but fail to generate or touch this stamp file, or save it to the wrong
location.
### An output is stale (not newer than all inputs)
Ninja knows that an output is fresh if it's newer than all inputs. If one or
more inputs have changed since the output was saved, then Ninja will repeat the
step(s) necessary to generate the output.
However if the action that generates the output doesn't update the output when
inputs have changed, this creates the appearance of a perpetual state of
staleness.
A common mistake that causes this is when actions review their inputs, decide
they have nothing to do/change with the contents of their outputs, but fail to
update the modification timestamp on their outputs (i.e. "touch" or "stamp"
their outputs).
### Modifying inputs
It is possible for an action to modify its inputs. Typically inputs to an action
should be opened with read access only, however it's not out of the question to
write to them. That said, if your action needs to modify an input, it should do
so before writing any outputs. Or if you must modify inputs after writing
outputs, be sure to update the timestamp on your outputs before exiting the
action. Otherwise you will have updated one or more of your inputs to be newer
than one or more of your outputs, and thus confused Ninja into thinking that
your outputs are stale.
Modifying inputs in actions can also introduce race conditions that make
reproducing the problem non-deterministic. If multiple actions depend on the
same input and one of them modifies the input, then the build will fail to
converge if one of those actions results in an input timestamp that is newer
than any of the actions' outputs. In dependency-ordered execution, the relative
ordering of independent actions cannot be guaranteed.
_Avoid modifying inputs._
### Symbolic links and hard links
The Ninja build system follows symbolic links to determine timestamps. This can
have surprising consequences when soft symlinks participate in Ninja rules as
input dependencies or outputs. The timestamps of symlinks themselves (as opposed
to their destinations) are not considered for staleness and freshness. See
[ninja#1186] for an explanation in terms of `stat()` and `lstat()` and a
demonstration. Hard links (`ln` without `-s`), have the problem where multiple
references point to the same filesystem object, and hence, have the same
timestamp.
Even a simple link action can cause issues. Consider a simple action with the
input `src`, the output `$target_out_dir/dst`, and the invoked action being `ln
src $target_out_dir/dst`. At face value this action converges correctly. But the
behavior of `action()` may be overridden elsewhere in the build system, such as
to wrap actions with other actions. As a result your inner action may not
converge to no-op when `src` has an older timestamp than the wrapper action's
script, which will then in turn be considered older than `dst` (its output,
which carries the timestamp of the input). [copy()] doesn't suffer from the same
problem because it's never wrapped.
_Avoid symlinks and hard-links in action inputs and outputs._
_For making copies, prefer the built-in [copy()] target._
### Timestamp granularity
Modern filesystems store timestamps on files (such as the time of last
modification) in nanosecond resolution. Some older runtimes, such as Python 2.7,
persist file timestamps in lower resolution, for instance milliseconds. It is
therefore possible for an action to read an input and write an output with a
timestamp that it considers to be "now" but is actually older than the timestamp
of the input, if for instance the input and output were both written at the same
millisecond and the output's timestamp is truncated after the millisecond
digits.
At the time of this writing we have mechanisms in place to ensure that all
Python actions in the build run with Python 3.x, in part to avoid this problem.
# Build convergence diagnostics
We have the following tools to diagnose build convergence issues:
* Ninja no-op check in the Commit Queue
* Filesystem access action tracing
## Ninja no-op check
Fuchsia's Commit Queue (CQ) verifies that changes not only build successfully,
but also keep the build system in a state that it converges to no-op in a single
build invocation.
Example of a build convergence error from CQ:
```
fuchsia confirm no-op
ninja build does not converge to a no-op
```
The same build is run in CQ before changes can be merged into the source tree,
to ensure that changes don't break the build. After completing a build
successfully, CQ will invoke Ninja again and expect Ninja to report `"no work to
do"`. This serves as a soundness check, since a correct build graph is expected
to "converge" to no-op.
If this soundness check fails then CQ will report a failure on a step named
`fuchsia confirm no-op`.
### Reproducing Ninja convergence issues
With a source tree synced to your change, simply try the following:
```posix-terminal
fx build
```
This command should print:
```
ninja: no work to do.
```
If this is not the case, and actual build actions are being performed, run the
same command again. If the second invocation still didn't produce "no work",
then you've reproduced the issue. If you've arrived at "no work" still, try the
following:
```posix-terminal
# Clean your build cache
rm -rf out
# Set up the build specification again
fx set ...
# Build
fx build
# Build again, expecting no-op
fx build
```
### Troubleshooting Ninja convergence issues
In the CQ results page, under the failed step `confirm no-op`, you will see
several links:
* execution details
* ninja -d explain -n -v
* dirty paths
The link to `ninja -d explain -n -v` shows information that you should be able
to reproduce locally with the following command:
```posix-terminal
fx ninja -C $(fx get-build-dir) -d explain -n -v
```
This link to "dirty paths" shows the most relevant subset of the same
information. You will see a text file that will most likely begin as follows:
```
ninja explain: output <...> doesn't exist
...
```
Every line in this file is like a domino brick. You should begin troubleshooting
the problem by looking at the first domino brick that started the chain reaction
of extra work being done. For instance in the example above a particular output
file doesn't exist, which causes Ninja to re-run the build action that's
supposed to produce this output, and then subsequently rerun dependent actions.
## Filesystem access tracing
There are also builders that trace actions' file system accesses. Diagnostics
for stale or missing outputs look like:
```text
Not all outputs of //your:label were written or touched, which can cause subsequent
build invocations to re-execute actions due to a missing file or old timestamp.
Required writes:
...
Missing outputs:
...
Stale outputs:
...
```
Diagnostics for unallowed writes to inputs look like:
```text
Unexpected file accesses building //your/target:label, following the order they are accessed:
(FileAccessType.WRITE /path/to/input-that-should-not-be-touched.txt)
```
Compared to the Ninja no-op check, this check is done on every individual action
and diagnoses one of the many causes of convergence issues immediately as the
action happens, rather than later through a full `fx build` command. This
approach can catch some issues that would otherwise be hard to reproduce due to
race conditions.
### Troubleshooting traced action failures
To locally enable action tracing, do one of the following:
* Run `fx set ... --args=build_should_trace_actions=true`
* Run `fx args`, add `build_should_trace_actions=true` in the editor,
save-and-exit
and then `fx build //your/failing:target`.
Examine the files in the message in the context of the action's script or
command, and see if they fall in one of the categories of
[common issues](#common-root-causes).
[ninja#1186]: https://github.com/ninja-build/ninja/issues/1186
[copy()]: https://gn.googlesource.com/gn/+/master/docs/reference.md#func_copy