{% set rfcid = "RFC-0148" %}
{% include "docs/contribute/governance/rfcs/_common/_rfc_header.md" %}
# {{ rfc.name }}: {{ rfc.title }}
## Summary
Guidelines for project and infrastructure owners in the Fuchsia ecosystem to
create sustainable CI (Continuous Integration) experiences.
## Motivation
Until mid-2021, we kept most of our source code and prebuilts centralized in one
"Fuchsia tree". Accordingly, the infrastructure and its owners have been mostly
dedicated towards supporting that one tree.
As new out-of-tree projects (e.g. [RFC-0095][rfc-0095]) are brought up, in-tree
contributors may also become out-of-tree contributors. The out-of-tree CI
systems should deliver a comparable or better experience than the in-tree
experience, and the experience should be familiar enough such that switching
between projects is low-friction. Otherwise, working out-of-tree represents a
productivity loss which can discourage evolution of the platform.
At the same time, the infrastructure team size won't be able to scale linearly
relative to the number of out-of-tree projects. We need to generalize our CI
capabilities from "mostly tailored to the Fuchsia project" to "usable by many
projects in the Fuchsia ecosystem". Otherwise, each project will demand custom
infrastructure and its own dedicated maintainers.
The lessons learned from building and maintaining Fuchsia's CI over the last
several years offer us a foundation for what to do, continue, and/or avoid with
respect to project-infrastructure integration going forward. Ultimately, the
goals of our CI systems are to make our projects easy to change, hard to break,
and efficient to ship: this RFC gives high-level recommendations to project and
infrastructure owners such that said systems can best achieve these goals.
## Stakeholders
*Facilitator:*
* Hunter Freyer (hjfreyer@google.com)
*Reviewers:*
* Aidan Wolter (awolter@google.com) - Product Assembly
* Chase Latta (chaselatta@google.com) - Product Development Kit
* David Gilhooley (dgilhooley@google.com) - Drivers
* Jiaming Li (lijiaming@google.com) - Product Development Kit, Workstation OOT
* Marc-Antoine Ruel (maruel@google.com) - Engineering Productivity
* Nicolas Sylvain (nsylvain@google.com) - Engineering Productivity
* Renato Mangini Dias (mangini@google.com) - Bazel
*Consulted:*
* Anirudh Mathukumilli (rudymathu@google.com) - Foundation Infrastructure
* Nathan Mulcahey (nmulcahey@google.com) - Foundation Infrastructure
* Oliver Newman (olivernewman@google.com) - Platform Infrastructure
* Petr Hosek (phosek@google.com) - Toolchain
* Sébastien Marchand (sebmarchand@google.com) - 1P Infrastructure
*Socialization:*
This design was initially socialized with the Fuchsia engineering productivity
mailing list, iterated on in a Google doc, and shared with relevant stakeholders
to identify the reviewers listed in the above section. It was then converted to
markdown following the RFC template and moved to the RFC "Iterate" stage.
## Design
The "Avoid" subsections below enumerate common pitfalls which negatively impact
a project's CI, the project's contributors, and/or the infrastructure owners.
Conversely, the "Must Have" and "Consider" subsections are guidelines to help
navigate said pitfalls and more. They do not form an exhaustive list: they do
not include considerations for performance tracking, flake detection, etc. which
may also improve long-term project health but aren't required for a minimally
viable CI implementation.
### Avoid: Infrastructure dependence on project internals
When the infrastructure depends on project internals, both sides become harder
to change. Hitting infrastructure sharp edges when making seemingly benign
changes has been a long-standing pain point when working in Fuchsia, and is one
of the bigger complaints that contributors have about the engineering process.
For example, the infrastructure used to know many (and still knows some)
internal details of the Fuchsia build system, which created sharp edges in
development: the Fuchsia build was not free to change in ways that violated any
of the infrastructure's expectations. The infrastructure code does not live
alongside the Fuchsia code and thus its expectations can be hard to discover:
they are often only made known at presubmit or postsubmit runtime when something
fails. Other harmful examples include the infrastructure hardcoding paths in the
checkout, the names of tests, etc. Such references tend to organically
accumulate, progressively creating more and more friction over time.
Keeping the infrastructure compatible with the project becomes increasingly
difficult the more branches are involved and/or the longer they live. Either the
infrastructure is versioned in the project's history, or the live version of the
infrastructure must maintain compatibility with all active branches of the
project.
Also, when the infrastructure is encoding a lot of project-specific knowledge,
it's likely that each project has its own accompanying set of tailored CI
scripts, which has linearly-scaling implementation and maintenance costs.
### Avoid: Non-trivial reproduction of infrastructure behavior
When contributors cannot reproduce what the infrastructure is doing, the
infrastructure's results become much less actionable.
To debug unreproducible test failures, one will need to repeatedly submit
patches to the infrastructure until the test(s) pass, which is generally slower
and more resource-intensive than debugging locally. It also feeds the notion
that testing locally is pointless because the pass/fail correlation to
infrastructure-run tests is low.
The same goes for builds which are difficult to reproduce or cannot be
reproduced locally. The infrastructure should not be configuring builds in a way
that diverges heavily from developer workflows in non-obvious ways. For example,
as of this writing, the
[Fuchsia SDK remains difficult to build locally](https://fxbug.dev/44889#c18).
The infrastructure maintains its own [logic][sdk-recipe] which significantly
differs from the internal-only fx script, and there is no automation which
checks that they produce the same output.
In degenerate cases, unreproducible infrastructure behavior can force
"temporary" disabling of failing builds or tests to unblock submission and
recover the CI. In this state, they can further degrade from stacking breakages,
effectively becoming permanently disabled due to the impracticality of a fix.
### Avoid: Floating dependencies
Projects should avoid using floating dependencies, e.g. "fetch the latest
version of Bazel on the fly". Floating dependencies include the machine's
pre-installed software.
Any floating dependencies can flow into builds and tests, rendering them
[non-hermetic][hermeticity]. With floating dependencies, the infrastructure's
results cannot be fully attributable to the exact CL or commit(s) under test,
because they are not the only possible sources of change. Note that parts of the
infrastructure itself can often effectively be floating dependencies. Network
flakiness is an example of a common source of unpredictability in test results.
Floating dependencies create correspondingly larger headaches the more stable
the build is expected to be. For example, release branches typically only accept
hotfixes to minimize the risk of introducing new bugs, but floating dependencies
always represent such a risk.
They also contribute to the mysterious "it works locally, but not in the
infrastructure" phenomenon and vice-versa.
### Must Have: Reproducible checkout
A project's checkout must be fully reproducible with a simple series of steps on
a "clean" workspace. That workspace could be a developer's machine or an
infrastructure machine. An "update" of an existing checkout at a
[commit-ish][commit-ish] must always yield the same result as if the checkout
was created freshly from that commit-ish, at any point in time. This means that
all fetched dependencies must be pinned. A pinned (non-floating) dependency is
ideally cryptographic and deterministic e.g. a content hash. An immutable
reference can also be acceptable e.g. a semantic version as a git tag, though
the former is preferred.
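
The distinction between content-hash pins, immutable references, and floating references can be sketched as a small manifest check. This is a minimal illustration, not the behavior of Jiri or any Fuchsia tool: the `classify_pin` and `verify_content` helpers and the `sha256:` prefix convention are assumptions made for the example.

```python
import hashlib
import re

# Hypothetical conventions for this sketch: a "sha256:<hex>" content digest,
# a 40-character Git commit hash, or a semantic-version tag such as "v1.2.3".
_SHA256 = re.compile(r"^sha256:[0-9a-f]{64}$")
_GIT_SHA1 = re.compile(r"^[0-9a-f]{40}$")
_SEMVER_TAG = re.compile(r"^v?\d+\.\d+\.\d+$")


def classify_pin(ref: str) -> str:
    """Classify a dependency reference by how strongly it is pinned."""
    if _SHA256.match(ref) or _GIT_SHA1.match(ref):
        return "content-hash"  # preferred: derived from the content itself
    if _SEMVER_TAG.match(ref):
        # Acceptable only if the hosting location never moves the tag.
        return "immutable-ref"
    return "floating"  # e.g. "latest", "main", "stable"


def verify_content(data: bytes, ref: str) -> bool:
    """Check fetched bytes against a "sha256:<hex>" pin."""
    if not _SHA256.match(ref):
        raise ValueError(f"not a sha256 pin: {ref!r}")
    return hashlib.sha256(data).hexdigest() == ref.split(":", 1)[1]
```

A checkout tool following this sketch would reject any reference classified as `floating` and would compare every fetched prebuilt against its digest before use.
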
Not only does a reproducible checkout provide a great experience for developers
getting started with a project, but it also makes the infrastructure's view of
the project less likely to diverge from the developer's view.
Non-reproducibility can also come from source code or binaries being deleted
and/or made inaccessible at any point in time. Hosting locations must be
approved by the Fuchsia infrastructure owners before they are integrated into a
project's checkout.
### Must Have: Clear separations between checkout, build, and test
A project must have clear separations of its checkout, build, and test phases.
This is necessary for the infrastructure to enforce security boundaries, as well
as optimize checkout, build, and test runtimes and resource usage. Clearly
separated phases also allow for better attribution of failures, especially
infrastructure failures versus user errors. For example, a failing build should
be attributable to a code issue and not, say, a timeout when fetching a remote
dependency.
The checkout phase fetches the source code and any dependencies. After the
checkout phase, one must have everything required to build. This means that the
build phase is hermetic i.e. cannot fetch any dependencies on the fly.
A build must be able to run without internet access. In practice, it still may
access the internet when using a remote distributed compiler, but only as a
performance optimization (it should not change the result of the build). This
requirement also benefits users working offline or with limited internet access
e.g. airborne users.
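
One way to picture "the build phase cannot fetch dependencies" is a guard that turns any attempt to open a socket into an error. The sketch below is an in-process illustration only, and `no_network` and `NetworkAccessError` are hypothetical names. It does not affect subprocesses, so real enforcement would more plausibly live in a sandbox or at the operating system level.

```python
import socket
from contextlib import contextmanager


class NetworkAccessError(RuntimeError):
    """Raised when code tries to open a socket inside a no-network phase."""


@contextmanager
def no_network():
    """Block socket creation in this Python process for the enclosed phase."""
    original = socket.socket

    def _blocked(*args, **kwargs):
        raise NetworkAccessError(
            "network access is not allowed during the build phase"
        )

    socket.socket = _blocked
    try:
        yield
    finally:
        # Always restore the real socket class, even if the phase failed.
        socket.socket = original
```

Wrapping a build step in `with no_network():` makes a stray fetch fail loudly at the point of the fetch, rather than surfacing later as an unexplained difference between two builds.
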
A project must not assume that the build and test phases are run on the same
machine in the infrastructure. For example, Fuchsia builds are run on separate
machines (with more cores) from test orchestrators and executors. This allows
the infrastructure to allocate machine resources more efficiently and speed up
builds.
Similarly, tests should be hermetic i.e. their inputs are explicitly mapped. See
[Testing scope][testing-scope] for more information. Tests shouldn't assume the
existence of a full checkout or build on the machine they are being run on, and
should not depend on other tests running on the same machine. The infrastructure
may shard tests onto separate machines, passing over only the explicitly mapped
inputs.
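
To illustrate why explicitly mapped inputs matter for sharding, the following sketch assigns tests to shards and computes, for each shard, the only files that need to be copied to that machine. The `ShardableTest` type, the round-robin policy, and the paths are assumptions for the example, not a description of how Fuchsia's infrastructure shards tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ShardableTest:
    name: str
    inputs: frozenset[str]  # explicitly declared paths the test may read


def shard_tests(tests: list[ShardableTest], num_shards: int) -> list[dict]:
    """Distribute tests round-robin and collect each shard's input files."""
    shards = [{"tests": [], "inputs": set()} for _ in range(num_shards)]
    # Sort by name so the assignment is deterministic across runs.
    for index, test in enumerate(sorted(tests, key=lambda t: t.name)):
        shard = shards[index % num_shards]
        shard["tests"].append(test.name)
        shard["inputs"] |= test.inputs
    return shards
```

A test that reads a file it did not declare would pass on a machine with a full checkout and fail on a shard, which is the kind of hidden dependency the explicit mapping is meant to expose.
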
As for linters, they may be run post-checkout or post-build to provide hints,
rather than binary pass/fail results, in the context of code analysis and/or
code review.
Linters which operate on the checkout can be considered part of the checkout
phase; likewise, linters which operate on build outputs can be considered part
of the build phase. They can be assumed to run on the same machine as their
associated phase.
### Consider: Reproducible build
Any two builds, given the same checkout and dependencies, should ideally yield
bit-for-bit identical outputs whether on a developer's machine or on an
infrastructure machine. If not bit-for-bit identical, builds should at
minimum be functionally equivalent. [Reproducible builds][reproducible-builds],
like reproducible checkouts, help to create consistent views of the project
across users and across time.
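
A bit-for-bit comparison of two builds can be sketched by hashing every file in each output directory and listing the paths that differ. The `digest_tree` and `diff_builds` helpers are hypothetical names for this illustration; they are not an existing Fuchsia tool.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def diff_builds(build_a: Path, build_b: Path) -> list[str]:
    """Return paths that are missing from one build or differ in content."""
    a, b = digest_tree(build_a), digest_tree(build_b)
    return sorted(p for p in a.keys() | b.keys() if a.get(p) != b.get(p))
```

An empty result from `diff_builds` means the two output trees are bit-for-bit identical. A non-empty result points at the specific outputs, such as embedded timestamps, that need attention.
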
Build reproducibility includes not depending on system-provisioned tools or
services, e.g. not depending on curl, ping, ip, etc. from the system. The build
should depend only on the checkout, which is thus responsible for vendoring all
build dependencies. Along similar lines, projects should be wary of using any
technologies which are not easily portable across platforms. Ideally, a project
should be runnable on vanilla installations of Debian/Ubuntu Linux, macOS, or
Windows.
Note that the minimal set of dependencies required to actually bootstrap a
checkout should never flow beyond the checkout. For example, if bash is required
to perform the checkout, and bash is also required by the build, the checkout
should be pulling in a vendored bash. The build should then use that vendored
bash, *not* the bash used to bootstrap the checkout.
To speed up the build in presubmit, the infrastructure may seed the build
directory from a cache during the checkout phase. If incremental builds are not
always handled correctly, this strategy can create non-deterministic behavior.
In presubmit, the occasional incremental build issue can often be worth the
tradeoff for build speed. However, this optimization should not be used beyond
presubmit, and absolutely never for official builds where correctness and
security cannot be compromised.
### Consider: Clear layering of project and infrastructure
The infrastructure is responsible for automating builds and tests for projects
at scale. Emphasis on "automation at scale": a project should support performing
these tasks locally, mostly or entirely independently of the infrastructure.
This implies that the infrastructure holds very little logic to build and test
any specific project. These capabilities should be surfaced by the projects
themselves, and invoked by the infrastructure without knowledge aside from
well-known entrypoints, outputs, and configurations. A useful mental model is to
view the infrastructure as a new contributor going through a project's "Getting
Started" guide on building and testing.
For example, [fint][fint] is an abstraction over Fuchsia's build system which
obscures its internals from the infrastructure's view. With fint, the
infrastructure does not even know or care that Fuchsia uses [GN][gn]. This
reduces the number of sharp edges that Fuchsia contributors can encounter when
modifying the build.
The infrastructure should also not be holding the configuration to fetch any
project dependencies, e.g. Bazel, Python 3, miscellaneous toolchains, etc. The
dependencies should be declared by the projects themselves. Infrastructure
machines should not be assumed to include any dependencies by default aside from
the minimal set of tools required to bootstrap a checkout. Project owners should
expect the available pre-installed set of tools to be reduced in the future.
There are still some cases where a project needs to know infrastructure
expectations. Some special kinds of outputs which are post-processed by the
infrastructure should follow an infrastructure-defined contract. For example,
binary size reports or code coverage reports to be displayed in Gerrit should
conform to the expected formats. This way, the infrastructure doesn't need
custom handling for each project which uses a particular infrastructure feature.
### Consider: Favoring CI configuration over code
In order to scale the number of supported projects, the infrastructure should
favor new configuration over new code. As an example, the CI code used to build
a class of similar projects should mostly be shared either at the scripting or
library levels. Configuration can account for any necessary differences between
projects e.g. repository URL, service accounts, checkout strategy, build
entrypoint, artifact upload destination, etc.
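
As a rough sketch of "configuration over code", a single shared planning function can serve many projects, with the per-project differences carried entirely by a configuration record. The `ProjectCIConfig` fields mirror the differences listed above. The type, the `plan_ci` function, and the abstract step tuples are assumptions for illustration and do not reflect the actual infrastructure configuration format.

```python
from dataclasses import dataclass

SUPPORTED_CHECKOUTS = {"jiri", "git"}


@dataclass(frozen=True)
class ProjectCIConfig:
    repository_url: str
    checkout_strategy: str  # one of SUPPORTED_CHECKOUTS
    build_entrypoint: tuple[str, ...]  # command the project exposes
    upload_destination: str


def plan_ci(config: ProjectCIConfig) -> list[tuple[str, ...]]:
    """Produce the same three-phase plan for any project from its config."""
    if config.checkout_strategy not in SUPPORTED_CHECKOUTS:
        raise ValueError(
            f"unsupported checkout strategy: {config.checkout_strategy!r}"
        )
    return [
        ("checkout", config.checkout_strategy, config.repository_url),
        ("build", *config.build_entrypoint),
        ("upload", config.upload_destination),
    ]
```

Onboarding a new project under this model means adding one `ProjectCIConfig` value, while `plan_ci` and everything behind it stay shared.
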
We support two checkout tools: [Jiri][jiri] and Git (with or without submodules).
Projects should use one of these options. Prebuilt dependencies should be hosted
by [Git-on-Borg][gob] or [CIPD][cipd]. The infrastructure code for building
should also be mostly shared if the logic to build each project is
well-abstracted per the section above.
By favoring configuration, the implementation cost for new CIs should be lower
than writing new CI code from scratch, which benefits projects needing to spin
up quickly. They also benefit from ongoing support and maintenance of the shared
infrastructure codebase and services.
### Consider: Build output abstraction
To facilitate the consumption of build artifacts, the build should have a
well-documented contract for its output surface area. The infrastructure is
likely to be a consumer of this surface area in order to perform various
post-build actions, e.g. uploading data to [BigQuery][bigquery], sharding and
running tests, or running binary size checks. This is in contrast to
"intermediate" build outputs, which should be considered internals and not
depended on directly by downstream consumers.
Project-defined tools can also be consumers of the build output. For example,
the [artifactory tool][artifactory] reads Fuchsia's build output to locate and
organize build artifacts in cloud storage. The infrastructure is only
responsible for invoking the tool with the infrastructure-specific arguments
i.e. a storage bucket name and a unique build identifier.
The build contract may adhere to some common infrastructure APIs. This helps
keep integrations robust, e.g. integration with the infrastructure's code
coverage service. Changes to the build internals of generating code coverage
metrics shouldn't require code changes on the infrastructure side.
The build contract should be tested so that, for example, schema changes don't
result in hard transitions for downstream consumers.
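
A contract test of this kind could, for example, compare the previous and proposed schemas of a build output and flag changes that would break existing consumers. The sketch below assumes a deliberately simple schema representation (field name mapped to a type name) and a hypothetical `breaking_changes` helper. Real build outputs would likely need a richer schema language.

```python
def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """List changes in `new` that would break consumers written against `old`.

    Adding fields is treated as compatible. Removing a field or changing
    its type is treated as breaking.
    """
    problems = []
    for field, field_type in sorted(old.items()):
        if field not in new:
            problems.append(f"removed field: {field}")
        elif new[field] != field_type:
            problems.append(
                f"type changed: {field} ({field_type} -> {new[field]})"
            )
    return problems
```

Running such a check in presubmit lets a schema change land in compatible steps (add the new field, migrate consumers, then remove the old field) instead of as a single hard transition.
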
### Consider: Main-first development
Projects should aim to keep the build healthy at tip-of-tree. This lets all
contributors live near the latest version of the code without needing to spin
off branches or work on an older version of the tree to sidestep bugs. This
helps reduce merge conflicts and prevents contributors from having significantly
different views of the project at any given time.
By default, the infrastructure's presubmit will try to rebase CLs onto
tip-of-tree (as this is a proxy for testing a clean submission), so it is
practical for a contributor's workflow to be as close as possible to this
behavior. Just as developers have similar views of the codebase, so should the
infrastructure.
The infrastructure's postsubmit facilitates keeping the build healthy at
tip-of-tree by continually testing tip-of-tree as new CLs land. If the build
goes red at tip-of-tree, this should be quickly reported by the infrastructure
and actioned by developers.
Sandbox branches may be used for code which is not intended to be submitted.
Note that their use is generally an exception to the norm, and not a first-class
flow backed by the infrastructure.
### Consider: Fast roll and release cadences
Each project should attempt to roll its dependencies at a fast cadence. The
infrastructure should facilitate this by automating the process of rolling
dependencies, and project owners should fix failing roll attempts with high
priority. Ideally, dependencies are rolled within O(hours) of release. The
staler a dependency is, the harder it becomes to roll forward and/or apply
cherry-picks. This is especially critical for security patches which are
time-sensitive.
In the same vein, each project should attempt to release at a fast cadence. The
infrastructure should facilitate this by automating the release process after
code integrates cleanly into mainline (commonly referred to as "continuous
deployment"). Project owners should invest heavily in writing automated tests
such that releases from near-tip-of-tree can be reliably integrated downstream,
following the main-first development model.
The infrastructure should also provide visibility into the dependency graph of
projects, where projects form the "nodes", and rolls and releases form the
"edges". Project owners should be able to trace CLs flowing through the graph
and discover where CLs have landed, or have gotten stuck, etc.
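
Tracing a CL through the project graph can be sketched as a breadth-first walk over roll and release edges that stops wherever the CL has not yet arrived. The graph shape, the `trace_cl` function, and the project names in the usage note are hypothetical. How the infrastructure would actually record which projects contain a CL is left open here.

```python
from collections import deque


def trace_cl(
    graph: dict[str, list[str]],
    origin: str,
    landed_in: set[str],
) -> tuple[list[str], list[tuple[str, str]]]:
    """Follow a CL from its origin project through downstream projects.

    `graph` maps each project to the projects that roll or release from it.
    `landed_in` is the set of projects that already contain the CL.
    Returns the projects reached and the edges where the CL is stuck.
    """
    reached, stuck = [], []
    seen = {origin}
    queue = deque([origin])
    while queue:
        project = queue.popleft()
        reached.append(project)
        for downstream in graph.get(project, []):
            if downstream in seen:
                continue
            seen.add(downstream)
            if downstream in landed_in:
                queue.append(downstream)
            else:
                # The roll or release along this edge has not happened yet.
                stuck.append((project, downstream))
    return reached, stuck
```

For a graph where `sdk` feeds `workstation` and `drivers`, and `workstation` feeds `product`, a CL that has landed only in `sdk` and `workstation` is reported as stuck on the `sdk` to `drivers` and `workstation` to `product` edges.
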
## Implementation
This RFC gives high-level guidelines on how projects should interface with the
infrastructure, but is intentionally light on implementation details. Each
project may follow the guidelines in any number of ways, and we don't want to
create artificial constraints by prescribing specifics. New out-of-tree projects
are still getting off the ground at this time, and anything we map out here is
likely to go stale as the projects evolve.
## Security considerations
While projects are encouraged to own their build and test logic, the
infrastructure must still own the security boundaries. Source code and/or
artifacts for each project must be able to securely flow into the next in order
for the many-project ecosystem to ultimately ship onto products.
The inputs to a CI task must be trusted: all source code and binaries must be
fetched from hosting locations which are approved by the Fuchsia infrastructure
owners. After the checkout phase is complete, there can be no more inputs, and
this should be enforced by the infrastructure e.g. attempting to fetch a
dependency during the build phase should result in an error.
Any outputs of the task should provide provenance i.e. *artifact* was built from
*project* at *revision:X*. When artifacts are uploaded, the infrastructure
should enforce that the artifacts are uploaded to storage with appropriate
scope. For instance, a project which depends on internal source code must be
prevented from uploading artifacts to a public bucket.
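
The provenance statement and the upload scope rule can be sketched together as follows. The `Provenance` record, the `upload_allowed` function, and the two-level "public"/"internal" visibility model are simplifying assumptions for the example. An actual implementation would depend on the infrastructure's access-control model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    artifact: str
    project: str
    revision: str

    def describe(self) -> str:
        return (
            f"{self.artifact} was built from {self.project} "
            f"at {self.revision}"
        )


def upload_allowed(source_visibilities: set[str], bucket_visibility: str) -> bool:
    """Allow a public upload only when every input source is public."""
    if bucket_visibility == "public":
        return source_visibilities <= {"public"}
    return True
```

Because the check is driven by the visibility of every input rather than by the project's own label, a project that pulls in even one internal dependency is kept out of public buckets.
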
## Testing
The CI systems referred to in this RFC will enable building and testing new
projects at scale in a similar fashion as they do for the Fuchsia project today.
This reduces the amount of manual testing and debugging that project
contributors will need to do at their desks, in favor of offloading work to
infrastructure machines.
On the infrastructure side, Fuchsia's CI has already been worked on extensively
to enable automated testing at scale of its own code: in other words, the CI is
capable of testing changes to itself. Though some generalization may be needed,
we will largely inherit these capabilities when building new CIs.
## Documentation
This RFC will serve as a reference for new and existing projects.
On the infrastructure side, we will write documentation on new CI configuration
once we have generalized those capabilities, such that the process can be mostly
self-service. We will also generalize the existing documentation to account for
new out-of-tree projects rather than only applying to the in-tree
infrastructure.
## Drawbacks, alternatives, and unknowns
Like many software development best practices, these guidelines may require
more upfront effort from project contributors. For example, tracking
floating dependencies is a commonly used shortcut for quick iteration on the
cutting edge without the need for rollers. It can be argued that they are a
useful hack in the short term, but they should be considered technical debt,
among the other discouraged practices in this RFC.
The best balance of technical debt for each new project is unknown, as it was
during the development of Fuchsia. We continue to pay down build, test, and
infrastructure technical debt over time, which was often taken on to meet
project goals. This RFC does not seek to prevent technical debt, but rather to
make such tradeoffs more informed and intentional.
[rfc-0095]: /docs/contribute/governance/rfcs/0095_build_and_assemble_workstation_out_of_tree.md
[sdk-recipe]: https://fuchsia.googlesource.com/infra/recipes/+/179288fb999a853cc4cf78ccbf3de2ee5be57707/recipes/sdk.py
[commit-ish]: https://git-scm.com/docs/gitglossary#Documentation/gitglossary.txt-aiddefcommit-ishacommit-ishalsocommittish
[hermeticity]: https://docs.bazel.build/versions/main/hermeticity.html
[fint]: /tools/integration/fint/README.md
[artifactory]: /tools/artifactory/README.md
[gerrit-size-plugin]: https://chromium.googlesource.com/infra/gerrit-plugins/binary-size/+/HEAD/README.md
[testing-scope]: /docs/contribute/testing/scope.md
[reproducible-builds]: https://reproducible-builds.org
[jiri]: https://fuchsia.googlesource.com/jiri
[repo]: https://gerrit.googlesource.com/git-repo
[gob]: https://opensource.google/docs/glossary/#gitonborg
[cipd]: https://chromium.googlesource.com/infra/luci/luci-go/+/refs/heads/main/cipd
[gn]: https://gn.googlesource.com/gn
[bigquery]: https://cloud.google.com/bigquery/docs