docs/development/build/ninja_no_op.md - fuchsia - Git at Google

 # Build graph convergence

 Build graph convergence is that property that a single build invocation will
 perform all the actions necessary and in the right order so that every action's
 outputs are _newer_ than its inputs.

 Fuchsia uses the Ninja build system, which is timestamp-driven. Ninja expresses
 the build as a graph of input/output files and actions that take inputs and
 produce outputs.

 When you run a build, e.g. with `fx build`, Ninja will traverse the build graph
 and perform any actions whose outputs are not present or whose inputs have
 changed since they last run, all in topological order (dependencies before
 dependents).

 However, build graph _actions_ are not verified to fulfill the promise that
 outputs are newer than inputs, which can lead to convergence issues.

 ## Common root causes

 There are infinitely many ways to create Ninja convergence issues. That said,
 prior experience taught us that there are common root causes for these problems.

 ### An output isn't generated

 If a build action is declared to produce an output but doesn't actually produce
 that output (in some circumstances, or ever) then this will cause convergence
 issues. For instance, an action might declare that it generates a stamp file on
 success but fail to generate or touch this stamp file, or save it to the wrong
 location.

 ### An output is stale (not newer than all inputs)

 Ninja knows that an output is fresh if it's newer than all inputs. If one or
 more inputs have changed since the output was saved, then Ninja will repeat the
 step(s) necessary to generate the output.

 However if the action that generates the output doesn't update the output when
 inputs have changed, this creates the appearance of a perpetual state of
 staleness.

 A common mistake that causes this is when actions review their inputs, decide
 they have nothing to do/change with the contents of their outputs, but fail to
 update the modification timestamp on their outputs (i.e. "touch" or "stamp"
 their outputs).

 ### Modifying inputs

 It is possible for an action to modify its inputs. Typically inputs to an action
 should be opened with read access only, however it's not out of the question to
 write to them. That said, if your action needs to modify an input, it should do
 so before writing any outputs. Or if you must modify inputs after writing
 outputs, be sure to update the timestamp on your outputs before exiting the
 action. Otherwise you will have updated one or more of your inputs to be newer
 than one or more of your outputs, and thus confused Ninja into thinking that
 your outputs are stale.

 Modifying inputs in actions can also introduce race conditions that make
 reproducing the problem non-deterministic. If multiple actions depend on the
 same input and one of them modifies the input, then the build will fail to
 converge if one of those actions results in an input timestamp that is newer
 than any of the actions' outputs. In dependency-ordered execution, the relative
 ordering of independent actions cannot be guaranteed.

 _Avoid modifying inputs._

 ### Symbolic links and hard links

 The Ninja build system follows symbolic links to determine timestamps. This can
 have surprising consequences when soft symlinks participate in Ninja rules as
 input dependencies or outputs. The timestamps of symlinks themselves (as opposed
 to their destinations) are not considered for staleness and freshness. See
 [ninja#1186] for an explanation in terms of `stat()` and `lstat()` and a
 demonstration. Hard links (`ln` without `-s`), have the problem where multiple
 references point to the same filesystem object, and hence, have the same
 timestamp.

 Even a simple link action can cause issues. Consider a simple action with the
 input `src`, the output `$target_out_dir/dst`, and the invoked action being `ln
 src $target_out_dir/dst`. At face value this action converges correctly. But the
 behavior of `action()` may be overridden elsewhere in the build system, such as
 to wrap actions with other actions. As a result your inner action may not
 converge to no-op when `src` has an older timestamp than the wrapper action's
 script, which will then in turn be considered older than `dst` (its output,
 which carries the timestamp of the input). [copy()] doesn't suffer from the same
 problem because it's never wrapped.

 _Avoid symlinks and hard-links in action inputs and outputs._

 _For making copies, prefer the built-in [copy()] target._

 ### Timestamp granularity

 Modern filesystems store timestamps on files (such as the time of last
 modification) in nanosecond resolution. Some older runtimes, such as Python 2.7,
 persist file timestamps in lower resolution, for instance milliseconds. It is
 therefore possible for an action to read an input and write an output with a
 timestamp that it considers to be "now" but is actually older than the timestamp
 of the input, if for instance the input and output were both written at the same
 millisecond and the output's timestamp is truncated after the millisecond
 digits.

 At the time of this writing we have mechanisms in place to ensure that all
 Python actions in the build run with Python 3.x, in part to avoid this problem.

 # Build convergence diagnostics

 We have the following tools to diagnose build convergence issues:

 *   Ninja no-op check in the Commit Queue
 *   Filesystem access action tracing

 ## Ninja no-op check

 Fuchsia's Commit Queue (CQ) verifies that changes not only build successfully,
 but also keep the build system in a state that it converges to no-op in a single
 build invocation.

 Example of a build convergence error from CQ:

 ```
 fuchsia confirm no-op
 ninja build does not converge to a no-op
 ```

 The same build is run in CQ before changes can be merged into the source tree,
 to ensure that changes don't break the build. After completing a build
 successfully, CQ will invoke Ninja again and expect Ninja to report `"no work to
 do"`. This serves as a soundness check, since a correct build graph is expected
 to "converge" to no-op.

 If this soundness check fails then CQ will report a failure on a step named
 `fuchsia confirm no-op`.

 ### Reproducing Ninja convergence issues

 With a source tree synced to your change, simply try the following:

 ```posix-terminal
 fx build
 ```

 This command should print:

 ```
 ninja: no work to do.
 ```

 If this is not the case, and actual build actions are being performed, run the
 same command again. If the second invocation still didn't produce "no work",
 then you've reproduced the issue. If you've arrived at "no work" still, try the
 following:

 ```posix-terminal
 # Clean your build cache
 rm -rf out
 # Set up the build specification again
 fx set ...
 # Build
 fx build
 # Build again, expecting no-op
 fx build
 ```

 ### Troubleshooting Ninja convergence issues

 In the CQ results page, under the failed step `confirm no-op`, you will see
 several links:

 *   execution details
 *   ninja -d explain -n -v
 *   dirty paths

 The link to `ninja -d explain -n -v` shows information that you should be able
 to reproduce locally with the following command:

 ```posix-terminal
 fx ninja -C $(fx get-build-dir) -d explain -n -v
 ```

 This link to "dirty paths" shows the most relevant subset of the same
 information. You will see a text file that will most likely begin as follows:

 ```
 ninja explain: output <...> doesn't exist
 ...
 ```

 Every line in this file is like a domino brick. You should begin troubleshooting
 the problem by looking at the first domino brick that started the chain reaction
 of extra work being done. For instance in the example above a particular output
 file doesn't exist, which causes Ninja to re-run the build action that's
 supposed to produce this output, and then subsequently rerun dependent actions.

 ## Filesystem access tracing

 There are also builders that trace actions' file system accesses. Diagnostics
 for stale or missing outputs look like:

 ```text
 Not all outputs of //your:label were written or touched, which can cause subsequent
 build invocations to re-execute actions due to a missing file or old timestamp.

 Required writes:
 ...

 Missing outputs:
 ...

 Stale outputs:
 ...
 ```

 Diagnostics for unallowed writes to inputs look like:

 ```text
 Unexpected file accesses building //your/target:label, following the order they are accessed:
 (FileAccessType.WRITE /path/to/input-that-should-not-be-touched.txt)
 ```

 Compared to the Ninja no-op check, this check is done on every individual action
 and diagnoses one of the many causes of convergence issues immediately as the
 action happens, rather than later through a full `fx build` command. This
 approach can catch some issues that would otherwise be hard to reproduce due to
 race conditions.

 ### Troubleshooting traced action failures

 To locally enable action tracing, do one of the following:

 *   Run `fx set ... --args=build_should_trace_actions=true`
 *   Run `fx args`, add `build_should_trace_actions=true` in the editor,
     save-and-exit

 and then `fx build //your/failing:target`.

 Examine the files in the message in the context of the action's script or
 command, and see if they fall in one of the categories of
 [common issues](#common-root-causes).

 [ninja#1186]: https://github.com/ninja-build/ninja/issues/1186
 [copy()]: https://gn.googlesource.com/gn/+/master/docs/reference.md#func_copy
	# Build graph convergence

	Build graph convergence is that property that a single build invocation will
	perform all the actions necessary and in the right order so that every action's
	outputs are _newer_ than its inputs.

	Fuchsia uses the Ninja build system, which is timestamp-driven. Ninja expresses
	the build as a graph of input/output files and actions that take inputs and
	produce outputs.

	When you run a build, e.g. with `fx build`, Ninja will traverse the build graph
	and perform any actions whose outputs are not present or whose inputs have
	changed since they last run, all in topological order (dependencies before
	dependents).

	However, build graph _actions_ are not verified to fulfill the promise that
	outputs are newer than inputs, which can lead to convergence issues.

	## Common root causes

	There are infinitely many ways to create Ninja convergence issues. That said,
	prior experience taught us that there are common root causes for these problems.

	### An output isn't generated

	If a build action is declared to produce an output but doesn't actually produce
	that output (in some circumstances, or ever) then this will cause convergence
	issues. For instance, an action might declare that it generates a stamp file on
	success but fail to generate or touch this stamp file, or save it to the wrong
	location.

	### An output is stale (not newer than all inputs)

	Ninja knows that an output is fresh if it's newer than all inputs. If one or
	more inputs have changed since the output was saved, then Ninja will repeat the
	step(s) necessary to generate the output.

	However if the action that generates the output doesn't update the output when
	inputs have changed, this creates the appearance of a perpetual state of
	staleness.

	A common mistake that causes this is when actions review their inputs, decide
	they have nothing to do/change with the contents of their outputs, but fail to
	update the modification timestamp on their outputs (i.e. "touch" or "stamp"
	their outputs).

	### Modifying inputs

	It is possible for an action to modify its inputs. Typically inputs to an action
	should be opened with read access only, however it's not out of the question to
	write to them. That said, if your action needs to modify an input, it should do
	so before writing any outputs. Or if you must modify inputs after writing
	outputs, be sure to update the timestamp on your outputs before exiting the
	action. Otherwise you will have updated one or more of your inputs to be newer
	than one or more of your outputs, and thus confused Ninja into thinking that
	your outputs are stale.

	Modifying inputs in actions can also introduce race conditions that make
	reproducing the problem non-deterministic. If multiple actions depend on the
	same input and one of them modifies the input, then the build will fail to
	converge if one of those actions results in an input timestamp that is newer
	than any of the actions' outputs. In dependency-ordered execution, the relative
	ordering of independent actions cannot be guaranteed.

	_Avoid modifying inputs._

	### Symbolic links and hard links

	The Ninja build system follows symbolic links to determine timestamps. This can
	have surprising consequences when soft symlinks participate in Ninja rules as
	input dependencies or outputs. The timestamps of symlinks themselves (as opposed
	to their destinations) are not considered for staleness and freshness. See
	[ninja#1186] for an explanation in terms of `stat()` and `lstat()` and a
	demonstration. Hard links (`ln` without `-s`), have the problem where multiple
	references point to the same filesystem object, and hence, have the same
	timestamp.

	Even a simple link action can cause issues. Consider a simple action with the
	input `src`, the output `$target_out_dir/dst`, and the invoked action being `ln
	src $target_out_dir/dst`. At face value this action converges correctly. But the
	behavior of `action()` may be overridden elsewhere in the build system, such as
	to wrap actions with other actions. As a result your inner action may not
	converge to no-op when `src` has an older timestamp than the wrapper action's
	script, which will then in turn be considered older than `dst` (its output,
	which carries the timestamp of the input). [copy()] doesn't suffer from the same
	problem because it's never wrapped.

	_Avoid symlinks and hard-links in action inputs and outputs._

	_For making copies, prefer the built-in [copy()] target._

	### Timestamp granularity

	Modern filesystems store timestamps on files (such as the time of last
	modification) in nanosecond resolution. Some older runtimes, such as Python 2.7,
	persist file timestamps in lower resolution, for instance milliseconds. It is
	therefore possible for an action to read an input and write an output with a
	timestamp that it considers to be "now" but is actually older than the timestamp
	of the input, if for instance the input and output were both written at the same
	millisecond and the output's timestamp is truncated after the millisecond
	digits.

	At the time of this writing we have mechanisms in place to ensure that all
	Python actions in the build run with Python 3.x, in part to avoid this problem.

	# Build convergence diagnostics

	We have the following tools to diagnose build convergence issues:

	* Ninja no-op check in the Commit Queue
	* Filesystem access action tracing

	## Ninja no-op check

	Fuchsia's Commit Queue (CQ) verifies that changes not only build successfully,
	but also keep the build system in a state that it converges to no-op in a single
	build invocation.

	Example of a build convergence error from CQ:

	```
	fuchsia confirm no-op
	ninja build does not converge to a no-op
	```

	The same build is run in CQ before changes can be merged into the source tree,
	to ensure that changes don't break the build. After completing a build
	successfully, CQ will invoke Ninja again and expect Ninja to report `"no work to
	do"`. This serves as a soundness check, since a correct build graph is expected
	to "converge" to no-op.

	If this soundness check fails then CQ will report a failure on a step named
	`fuchsia confirm no-op`.

	### Reproducing Ninja convergence issues

	With a source tree synced to your change, simply try the following:

	```posix-terminal
	fx build
	```

	This command should print:

	```
	ninja: no work to do.
	```

	If this is not the case, and actual build actions are being performed, run the
	same command again. If the second invocation still didn't produce "no work",
	then you've reproduced the issue. If you've arrived at "no work" still, try the
	following:

	```posix-terminal
	# Clean your build cache
	rm -rf out
	# Set up the build specification again
	fx set ...
	# Build
	fx build
	# Build again, expecting no-op
	fx build
	```

	### Troubleshooting Ninja convergence issues

	In the CQ results page, under the failed step `confirm no-op`, you will see
	several links:

	* execution details
	* ninja -d explain -n -v
	* dirty paths

	The link to `ninja -d explain -n -v` shows information that you should be able
	to reproduce locally with the following command:

	```posix-terminal
	fx ninja -C $(fx get-build-dir) -d explain -n -v
	```

	This link to "dirty paths" shows the most relevant subset of the same
	information. You will see a text file that will most likely begin as follows:

	```
	ninja explain: output <...> doesn't exist
	...
	```

	Every line in this file is like a domino brick. You should begin troubleshooting
	the problem by looking at the first domino brick that started the chain reaction
	of extra work being done. For instance in the example above a particular output
	file doesn't exist, which causes Ninja to re-run the build action that's
	supposed to produce this output, and then subsequently rerun dependent actions.

	## Filesystem access tracing

	There are also builders that trace actions' file system accesses. Diagnostics
	for stale or missing outputs look like:

	```text
	Not all outputs of //your:label were written or touched, which can cause subsequent
	build invocations to re-execute actions due to a missing file or old timestamp.

	Required writes:
	...

	Missing outputs:
	...

	Stale outputs:
	...
	```

	Diagnostics for unallowed writes to inputs look like:

	```text
	Unexpected file accesses building //your/target:label, following the order they are accessed:
	(FileAccessType.WRITE /path/to/input-that-should-not-be-touched.txt)
	```

	Compared to the Ninja no-op check, this check is done on every individual action
	and diagnoses one of the many causes of convergence issues immediately as the
	action happens, rather than later through a full `fx build` command. This
	approach can catch some issues that would otherwise be hard to reproduce due to
	race conditions.

	### Troubleshooting traced action failures

	To locally enable action tracing, do one of the following:

	* Run `fx set ... --args=build_should_trace_actions=true`
	* Run `fx args`, add `build_should_trace_actions=true` in the editor,
	save-and-exit

	and then `fx build //your/failing:target`.

	Examine the files in the message in the context of the action's script or
	command, and see if they fall in one of the categories of
	[common issues](#common-root-causes).

	[ninja#1186]: https://github.com/ninja-build/ninja/issues/1186
	[copy()]: https://gn.googlesource.com/gn/+/master/docs/reference.md#func_copy