{% set rfcid = "RFC-0005" %}
{% include "docs/contribute/governance/rfcs/_common/_rfc_header.md" %}
# {{ rfc.name }}: {{ rfc.title }}
<!-- SET the `rfcid` VAR ABOVE. DO NOT EDIT ANYTHING ELSE ABOVE THIS LINE. -->
## Summary
Note: This RFC has been withdrawn. It was originally Accepted on 2020-09-21. See [Rationale for
withdrawal](#rationale-for-withdrawal). This RFC is otherwise retained in its original state for
historical purposes.
This RFC describes a simple snapshot mechanism that gives increased resilience
to bugs in the upgrade process. Changes to the Fuchsia Volume Manager (FVM)
allow a snapshot of the Blobfs partition to be taken that can be reverted to at
any stage during the upgrade.
## Motivation
At the time of writing, a failed upgrade that causes corruption of a Blobfs
partition can leave devices in states that are hard to recover from. The
recovery partition currently lacks the ability to restore devices in this state,
so the only supported way of restoring in these cases is via bootloaders using a
process that is not friendly to end-users.
A snapshot mechanism would reduce the risk of us ending up in this state.
## Design
The basic concept is to support a primitive snapshot mechanism within FVM that
presents two partitions for the duration of an upgrade while still allowing data
to be shared between them.
At this time, FVM is a simple volume manager: it can map slices from arbitrary
slice-aligned logical offsets to specific offsets on the underlying device, and
it keeps mappings from different partitions separate.
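To make the existing mapping concrete, a slice allocation entry can be thought of as carrying a partition index and a slice-aligned logical offset. The following is a minimal sketch of that idea; the type, field, and function names are illustrative, not FVM's actual on-disk format, and the sketch ignores FVM's own metadata region:
```
#include <cstdint>
#include <vector>

// Illustrative only: conceptual slice allocation entry (not FVM's real format).
struct SliceEntry {
  uint32_t partition_index;  // Which partition owns this physical slice (0 = free).
  uint32_t logical_slice;    // Slice-aligned logical offset within that partition.
};

constexpr uint64_t kSliceSize = 1ull << 20;  // Currently 1 MiB slices.

// Translate (partition, logical slice) to a physical byte offset, where the
// index into the table is the physical slice number. Returns false if no
// mapping exists.
bool LogicalToPhysical(const std::vector<SliceEntry>& table, uint32_t partition,
                       uint64_t logical_slice, uint64_t* physical_offset) {
  for (uint64_t i = 0; i < table.size(); ++i) {
    if (table[i].partition_index == partition &&
        table[i].logical_slice == logical_slice) {
      *physical_offset = i * kSliceSize;
      return true;
    }
  }
  return false;
}
```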
Blobfs consists of the following distinct regions:
Region |
----------------- |
Superblock |
Allocation bitmap |
Inodes |
Journal |
Data |
To support the proposal here, we could allow different _slice types_ within
FVM[^1]. The types would apply to _extents_ of slices:
Type | Description |
-------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
A/B slices | This would be an extent of slices that have an alternate copy. |
A/B bitmap[^2] | This would be an extent of slices that have an alternate copy of a bitmap that represents allocations in a shared data extent. |
Shared data | This would be an extent of slices whose allocation is managed by an A/B bitmap extent. |
Shared | This would be an extent that is shared between the two partitions, but only one of the partitions could write to the region at a time. |
With these slice types, it would then be possible for FVM to present _two_
partitions showing the A/B variations of the extents. So, going back to the
Blobfs regions, we would have:
Region | Type
----------------- | -----------
Superblock | A/B slices
Allocation bitmap | A/B bitmap
Inodes | A/B slices
Journal | Shared[^3]
Data | Shared data
Most of the time, only one of the partitions would be active, and the system
would appear just as it is today.
During an upgrade, the second partition can be activated, at which point the first
partition becomes _locked_ and no further writes are allowed to it, but reads
would continue to be served. The second partition can be prepared, potentially
in just the same way as it is now, but throughout the upgrade period there is
always the option to go back to the first partition, which is guaranteed to
remain untouched.
For the A/B extents, it's easy to see how the first partition's data is
preserved; the second partition wouldn't see the first partition's data. For the
journal, which is a shared region, only the writable partition (i.e. the second
partition) would be able to write to it. For the shared data region, the bitmap would
indicate which of the blocks could be written to. Any blocks marked as used by
the first partition would appear to be read-only to both partitions.
To facilitate this scheme, the second partition would also need to be able to
read the alternate bitmap so that it knows which blocks it is allowed to
allocate. To allow for this, the alternate bitmap could be presented in the
logical address space at some currently unused offset. A strawman proposal is
that all of the alternate A/B extents would appear at the same offset but with
the top bit set (read-only).
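As a rough sketch of that strawman (the constant and helper names below are hypothetical, not an existing FVM API), the alternate copy of an A/B extent could be addressed by setting the top bit of the logical offset, with FVM rejecting writes through that alias:
```
#include <cstdint>

// Hypothetical convention: the top bit of a 64-bit logical offset selects the
// alternate (read-only) copy of an A/B extent.
constexpr uint64_t kAlternateBit = 1ull << 63;

// Logical address of the alternate partition's copy of the extent that starts
// at `logical_offset` in the primary address space.
constexpr uint64_t AlternateOffset(uint64_t logical_offset) {
  return logical_offset | kAlternateBit;
}

// FVM would reject writes to any offset with the alternate bit set.
constexpr bool IsReadOnlyAlias(uint64_t logical_offset) {
  return (logical_offset & kAlternateBit) != 0;
}
```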
The following diagram illustrates how each of the partitions would
appear:
![Partition Arrangement](resources/0005_blobfs_snapshots/fig_1.png){:#fig-1}
**Figure 1: Partition arrangement.**
Notes:
* It gets us some of the resilience that we would have got from the simple A/B
partition approach, but _not all._
* We can keep the current incremental approach to updates (i.e. only update
blobs that have changed) at the expense of not ending up with a predictable
layout. On user builds, we would have the option of completely rewriting all
the blobs, but we would still be at the mercy of fragmentation.
* It adds complexity to FVM.
### New Upgrade flow
The upgrade flow must be modified to facilitate the snapshotting interactions.
The current flow is shown in [Figure 2](#fig-2), and the proposed alternative in
[Figure 3](#fig-3). New APIs and interactions are colored.
![Current OTA](resources/0005_blobfs_snapshots/fig_2.png){:#fig-2}
**Figure 2: Current upgrade implementation (high-level)**
![Proposed OTA](resources/0005_blobfs_snapshots/fig_3.png){:#fig-3}
**Figure 3: Proposed upgrade implementation (high-level)**
### New FVM operations
Several new FVM operations must be implemented and integrated into the Software
Delivery (SWD) stack. These APIs are used to drive a state machine ([Figure
4](#fig-4)), which ultimately switches the system between partitions.
![Snapshot state machine](resources/0005_blobfs_snapshots/fig_4.png){:#fig-4}
**Figure 4: State machine for snapshotting.**
#### TakeSnapshot
**Snapshots the active partition's metadata into the alternate partition, which
was previously cleared (see "DeleteSnapshot"). The active partition becomes
read-only, and all subsequent writes must now go to the inactive partition.**
* FVM makes the active partition read-only.
* Pending journal entries must be flushed.
* FVM creates the inactive partition.
* FVM copies over the metadata from the active->inactive partition.
Writing new blobs for the duration of this multi-step process would not be
possible, and half-written blobs would have to be abandoned. This should not be
limiting, given that the component responsible for writing blobs should be the
same component responsible for requesting the snapshot.
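The sequence might look roughly like the following; this is a sketch only, and none of the types or methods shown are real FVM or Blobfs interfaces:
```
// Hypothetical sketch of the TakeSnapshot sequence; illustrative names only.
class FvmSnapshotOps {
 public:
  virtual ~FvmSnapshotOps() = default;
  virtual void SetActivePartitionReadOnly() = 0;
  virtual bool FlushActiveJournal() = 0;
  virtual bool CreateInactivePartition() = 0;
  virtual bool CopyMetadataToInactive() = 0;
  virtual void SetWritablePartitionToInactive() = 0;
};

enum class SnapshotError { kOk, kFlushFailed, kCreateFailed, kCopyFailed };

SnapshotError TakeSnapshot(FvmSnapshotOps& fvm) {
  // 1. Make the active partition read-only; new blob writes are rejected and
  //    half-written blobs are abandoned by the caller.
  fvm.SetActivePartitionReadOnly();
  // 2. Flush pending journal entries so the on-disk metadata is consistent.
  if (!fvm.FlushActiveJournal()) return SnapshotError::kFlushFailed;
  // 3. Create the inactive partition (its A/B extents).
  if (!fvm.CreateInactivePartition()) return SnapshotError::kCreateFailed;
  // 4. Copy the metadata (superblock, allocation bitmap, inodes) from the
  //    active partition into the inactive one.
  if (!fvm.CopyMetadataToInactive()) return SnapshotError::kCopyFailed;
  // Subsequent writes now go to the inactive (soon-to-be-new) partition.
  fvm.SetWritablePartitionToInactive();
  return SnapshotError::kOk;
}
```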
#### CancelSnapshot
**Cancels the population of a snapshot created by TakeSnapshot, clearing the
inactive partition and allowing another snapshot to be created.**
* At this time, all read connections to the inactive partition must be closed.
* The inactive partition will be deleted by FVM. The active partition will
become writable again.
#### SetWritablePartition
**Switches which partition is writable.**
* The journal must be flushed at this point (all pending operations must
complete). The fsync call in the diagram above can facilitate this, but
ideally the journal flushing is done transactionally with the rest of this
operation so no new writes can "sneak in".
This will likely be rarely used since TakeSnapshot will automatically switch the
writable partition, but if there is a need to return and make the active
partition writable (in order to garbage collect unused blobs, for example), then
this API can be used.
#### SetBootPartition
**Changes which partition is bootable.**
Normally, the bootable partition will change depending on which ZBI slot is
active, but it will also be possible to separately switch which partition is
bootable. This will likely be rarely used.
#### DeleteSnapshot
**Marks the alternate partition as cleared. FVM may choose to delete the
metadata therein.**
#### ListSnapshotPartitions
**Queries FVM for partitions that are configured for snapshotting.**
#### QuerySnapshotPartition
**Queries FVM for information about one of the partitions that supports
snapshotting.**
* Identifies the state of the A/B partitions, such as which is active.
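Taken together, these operations might be surfaced to the SWD stack as something like the interface sketched below. The names and signatures are purely illustrative; the real API would presumably be a FIDL protocol whose shape is not specified in this RFC:
```
#include <cstdint>
#include <vector>

// Illustrative description of a snapshot-capable partition's state.
struct SnapshotPartitionInfo {
  uint64_t partition_id;
  bool active;    // Currently serving the running system.
  bool writable;  // Receives new writes.
  bool bootable;  // Will be used on the next boot.
};

// Hypothetical sketch of the new FVM operations; not a real Fuchsia API.
class FvmSnapshotManager {
 public:
  virtual ~FvmSnapshotManager() = default;
  // Snapshot the active partition's metadata into the (cleared) alternate
  // partition; the active partition becomes read-only.
  virtual bool TakeSnapshot() = 0;
  // Abandon an in-progress snapshot; the active partition becomes writable again.
  virtual bool CancelSnapshot() = 0;
  // Switch which partition receives writes.
  virtual bool SetWritablePartition(uint64_t partition_id) = 0;
  // Switch which partition will be used on the next boot.
  virtual bool SetBootPartition(uint64_t partition_id) = 0;
  // Mark the alternate partition as cleared; FVM may discard its metadata.
  virtual bool DeleteSnapshot() = 0;
  // Enumerate partitions configured for snapshotting.
  virtual std::vector<uint64_t> ListSnapshotPartitions() = 0;
  // Describe the state of one snapshot-capable partition.
  virtual SnapshotPartitionInfo QuerySnapshotPartition(uint64_t partition_id) = 0;
};
```
In practice, the state machine in Figure 4 would constrain which of these calls are valid from which states.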
### Failure Modes
The system may encounter a failure from any of the states described in the state
machine. This section describes the appropriate actions to take if the system
encounters a failure.
Note that failures may be voluntary (where the system actively decides to cancel
an ongoing update) or involuntary (where the system fails due to external
factors, such as losing power). Both cases must be considered.
Note that blobfs has a journalling mechanism that protects against metadata
corruption in cases of involuntary failure during modification. No additional
work is required to make blobfs robust to involuntary failures during
modification.
Any of the new metadata operations in FVM should be made transactional where
necessary, to prevent FVM from becoming corrupted by an involuntary failure
during modification.
#### State 1: Before TakeSnapshot
There are no changes necessary for failure handling in this state; behaviour is
identical to the current system behaviour.
#### State 2: After TakeSnapshot, Before reboot
* For voluntary failures, the CancelSnapshot API can be invoked to delete the
inactive partition and return the system back into State 1.
* For involuntary failures, the system can either decide to simply abort the
update once it comes back online (by invoking CancelSnapshot), or the system
may choose to attempt to resume the update.
#### State 3: After reboot, before TakeSnapshot
Equivalent to State 1.
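As an illustration, boot-time recovery logic might look roughly like the following sketch; the state and action enums are hypothetical, and whether to resume rather than abort is a policy choice, as described above:
```
// Hypothetical sketch of recovery after an involuntary failure. The states
// mirror Figure 4; none of these names are real APIs.
enum class SnapshotState {
  kNoSnapshot,        // State 1: before TakeSnapshot.
  kSnapshotTaken,     // State 2: after TakeSnapshot, before reboot.
  kRebootedToTarget,  // State 3: after reboot, before the next TakeSnapshot.
};

enum class RecoveryAction { kNone, kCancelSnapshot, kResumeUpdate };

RecoveryAction RecoverAfterCrash(SnapshotState state, bool resume_supported) {
  switch (state) {
    case SnapshotState::kNoSnapshot:
    case SnapshotState::kRebootedToTarget:
      // Equivalent to today's behaviour: blobfs journalling already protects
      // metadata against involuntary failures, so nothing extra is needed.
      return RecoveryAction::kNone;
    case SnapshotState::kSnapshotTaken:
      // Either abort the update (CancelSnapshot) or, if supported, resume it.
      return resume_supported ? RecoveryAction::kResumeUpdate
                              : RecoveryAction::kCancelSnapshot;
  }
  return RecoveryAction::kNone;
}
```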
### Supporting ephemeral packages
Ephemeral packages are those that are not included in the base set of packages
for a given system version.
This proposal imposes few additional restrictions on ephemeral packages. The
section [Routing of newly created files](#routing-of-newly-created-files) below
describes how ephemeral packages can continue to be supported at any state
during the OTA, with one caveat: ephemeral packages must be deleted if the
snapshot is aborted while the new base partition is being prepared.
Ephemeral packages may persist across updates: those written into the active
partition before the update begins will be copied into the inactive partition
when TakeSnapshot is called, and after that point, all ephemeral packages are
written into the new partition, which is readable and writable by the system
(and will become the new active partition after the update completes).
### Routing of newly created files
There are three cases to consider when deciding where a new file is
installed. To simplify the discussion, assume that partition A is active and
partition B is inactive.
#### Case 1: Before TakeSnapshot
* Base packages: Not written.
* Ephemeral packages: Written to partition A.
#### Case 2: After TakeSnapshot, before reboot
* Base packages: Written to partition B.
* Ephemeral packages: Written to partition B. Note that if the snapshot is
  aborted, these packages will be deleted before the next snapshot is attempted.
#### Case 3: After remount (NB: equivalent to "Before TakeSnapshot")
* Base packages: Not written.
* Ephemeral packages: Written to partition B.
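The three cases above amount to a small routing decision, sketched below; the enums and function are illustrative only and not part of the actual SWD stack:
```
#include <optional>

// Illustrative sketch of where a newly written blob lands, mirroring Cases 1-3.
enum class UpdatePhase { kBeforeTakeSnapshot, kAfterTakeSnapshot, kAfterRemount };
enum class PackageKind { kBase, kEphemeral };
enum class Partition { kA, kB };

// Returns the partition a new blob is written to, or nullopt if it is not
// written at all (base packages outside of an update).
std::optional<Partition> RouteNewBlob(UpdatePhase phase, PackageKind kind) {
  switch (phase) {
    case UpdatePhase::kBeforeTakeSnapshot:
      // Case 1: only ephemeral packages are written, into active partition A.
      if (kind == PackageKind::kEphemeral) return Partition::kA;
      return std::nullopt;
    case UpdatePhase::kAfterTakeSnapshot:
      // Case 2: all packages go to partition B; ephemeral blobs written here
      // are deleted if the snapshot is aborted.
      return Partition::kB;
    case UpdatePhase::kAfterRemount:
      // Case 3: equivalent to Case 1, with B now the active partition.
      if (kind == PackageKind::kEphemeral) return Partition::kB;
      return std::nullopt;
  }
  return std::nullopt;
}
```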
### Changes to FVM Metadata
FVM's metadata has the following structure:
Region | Description |
---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Superblock | What you'd expect. |
Partition table | An array of entries, one for each partition, containing things like name of partition, type, etc. |
Slice allocation | An array of entries, one for each allocatable slice that indicates which partition it is allocated to (if any) and the logical offset within that partition. |
To facilitate the proposal, additional metadata is required to record slice
types for extents, so something like the following needs to be stored somewhere:
```
enum class SliceType : uint32_t {
kNormal,
kAB,
kABBitmap,
kSharedData,
kShared,
};
struct {
uint32_t slice_offset; // Offset within the partition
SliceType slice_type; // The slice type
} extents[8];
```
This metadata could be added to each partition entry. A better approach might be
to add a separate partition containing this metadata (i.e. a snapshot metadata
partition). The precise location and structure of this metadata are not discussed
here and are left as an implementation detail.
With this structure, the extents for Blobfs would be:
```
[
/* super block: */ { 0, SliceType::kAB },
/* allocation bitmap: */ { 0x10000 / kSliceSize, SliceType::kABBitmap },
/* inodes: */ { 0x20000 / kSliceSize, SliceType::kAB },
/* journal: */ { 0x30000 / kSliceSize, SliceType::kShared },
/* data: */ { 0x40000 / kSliceSize, SliceType::kSharedData }
]
```
Some state is required to indicate which of the two partitions is currently
writable, whether both partitions are active (or just one), and which partition
should be considered bootable[^4].
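A minimal sketch of what that state could look like follows; the layout is purely illustrative, since the actual representation is an implementation detail:
```
#include <cstdint>

// Purely illustrative layout for the snapshot state; the real encoding and its
// location within FVM's metadata are implementation details.
struct SnapshotConfig {
  uint8_t writable_partition;  // 0 = A, 1 = B: which copy receives writes.
  uint8_t bootable_partition;  // 0 = A, 1 = B: which copy is used at boot.
  uint8_t both_active;         // 1 while a snapshot/upgrade is in progress.
  uint8_t reserved;            // Padding / future use.
};
```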
No changes would be required to slice allocation, except that slices at
alternate offsets would need to be allocated.
There might be other minor changes required for the super block (e.g. a bump in
the version).
### Supporting blobfs format evolution
This proposal substantially simplifies blobfs format evolution since the
alternative partition can be completely deleted and re-created with little cost
on each update.
That said, there are still two challenges to deal with when evolving the blobfs
format under this proposal.
* The block allocation map cannot change, because it is a structure shared
between both active/inactive partitions. (Given how simple the allocation
map is, this seems perfectly acceptable.)
* The active partition cannot overwrite any extents that are also allocated
by the inactive partition. However, this is fairly simple to deal with: if
the internal format of some data in an extent needs to change, the system
can simply allocate new extents and move the data over, during the
TakeSnapshot call.
## Implementation
The implementation will require the following changes, which are roughly
dependent on the changes that precede them:
1. Changes to FVM, and partition set up.
1. Changes to Blobfs allocation.
1. Changes to early bootstrap code.
1. Changes to the upgrade process to use the new APIs.
The majority of changes are required by #1 and #4. #1 will involve an on-disk
format change and migrating will be supported with a clean install. Reverting
will also require a clean install. This is the critical step that involves most
risk, but note that only the format change needs to be in place; any code that
uses the new FVM metadata can remain dormant until later phases.
The other steps can all be landed without requiring a clean install and can be
reverted likewise.
## Performance
This should have a negligible impact on performance. During upgrades there might
be a small impact due to costs involved in snapshotting, but this is likely
insignificant relative to other upgrade activities. At other times, there should
be no change.
## Space requirements
Space needs to be reserved for extra copies of Blobfs regions: the Superblock,
Inode table, and Bitmap. Exactly how much depends on the configuration
for the device, but it should be relatively small compared with the total amount
of space available to Blobfs.
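As a purely illustrative calculation (every figure below is an assumption, not a measurement of any real configuration), a device with an 8 KiB superblock, 32768 inodes of 64 bytes each, and a 4 GiB data region with one bitmap bit per 8 KiB block would need roughly 2 MiB of extra space before rounding up to FVM slice boundaries:
```
#include <cstdint>
#include <cstdio>

// Purely illustrative figures; actual sizes depend on device configuration.
constexpr uint64_t kSuperblockBytes = 8 * 1024;             // One 8 KiB block.
constexpr uint64_t kInodeCount = 32768;                     // Assumed inode count.
constexpr uint64_t kInodeBytes = 64;                        // Assumed inode size.
constexpr uint64_t kDataBytes = 4ull * 1024 * 1024 * 1024;  // 4 GiB data region.
constexpr uint64_t kBlockBytes = 8 * 1024;                  // Blobfs block size.

constexpr uint64_t kExtraCopyBytes =
    kSuperblockBytes +               // Alternate superblock.
    kInodeCount * kInodeBytes +      // Alternate inode table.
    (kDataBytes / kBlockBytes) / 8;  // Alternate allocation bitmap.

int main() {
  // Prints roughly 2 MiB for these figures, before rounding up to FVM slices.
  std::printf("extra bytes: %llu\n",
              static_cast<unsigned long long>(kExtraCopyBytes));
}
```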
## Security considerations
None.
## Privacy considerations
None.
## Testing
Standard Fuchsia test practices will be used. Existing system tests should
already be testing upgrades. These will be expanded to include tests that
deliberately corrupt the new Blobfs partition and tests that attempt to corrupt
the snapshot partition.
## Documentation
The new architecture and features of FVM will be described under
[Fuchsia > Concepts > Filesystem Architecture](/docs/concepts/filesystems/filesystems.md).
## Drawbacks, alternatives, and unknowns
### Full A/B proposal
A full A/B proposal was considered. While that proposal is conceptually simple,
it has some significant downsides:
* Each partition can only use 50% of the available disk space.
* This is currently a _soft_ constraint on our system updates, which are
budgeted to only use 50% of the available Blobfs space, but the A/B
proposal would make this a hard constraint.
* Engineering builds already exceed the 50% budget, so they do not
support upgrades that modify many files. Engineers rely heavily on the
ability to do incremental, small updates; breaking this workflow is a
non-starter.
* There is no mechanism to share files between the partitions, thus making
every update rewrite every file.
* This implies additional flash wear, and slower upgrades. Every update
would essentially be a maximal update.
### Full FVM Snapshot Feature
There are challenges with developing a full FVM snapshot feature. Traditional
snapshot mechanisms are typically dynamic in nature, which means that metadata
needs to be updated as writes arrive. Furthermore, there is a mismatch between
FVM's slice size (currently 1 MiB, soon to be 32 KiB) and Blobfs's block size (8
KiB). Addressing this would involve a substantial increase in complexity to FVM
and there are also edge cases where it's possible to run out of space. A scheme
with static mappings could perhaps be developed, but before too long it would
converge on a proposal not too dissimilar from the one presented here.
Altogether, this would likely take much longer to implement, would potentially
have some serious downsides (write amplification, complexity), and offers no
clear benefits that we foresee needing in the near future. It is possible that a
full snapshot feature would help in the longer term, so the precise design of
FVM's metadata should provide room for expansion to support such use cases in
future.
## Prior art and references
Making upgrades reliable and resilient is a common problem, typically solved in
the following ways:
1. A/B copies: keep functionally equivalent copies and switch between them as
required. Simple, but costs space.
1. A/R copies: keep a recovery copy, which is a stripped down version that only
supports restoring software. More complicated, lower space requirements,
slightly degraded user experience.
1. A/B/R: a combination of #1 & #2.
1. A + snapshot: most of the time, have only one copy available. At upgrade
time, take a snapshot of A and apply the update as deltas on the snapshot.
At any time, provide the option to roll back to the snapshot. Often
complicated, but flexible.
The authors believe Android uses #3, while iOS and macOS use #2 and #4.
This RFC is a simplified version of #4.
## Rationale for withdrawal {#rationale-for-withdrawal}
Development on this RFC proceeded for several months before we decided to discontinue
the work. There were several factors in this decision, the main ones being:
* Technical debt in the FVM codebase led to slow progress and risky changes. Lack
of test coverage, long-latent bugs, and widespread assumptions about the format layout of FVM
(due to lack of encapsulation for the FVM format) were the main hindrances.
* FVM was under-documented and poorly understood by the team. Organizational knowledge about FVM
had decayed over time, and the initial assumption that FVM would be a relatively simple and
appropriate place to build this feature was incorrect.
* The impact of rolling out the feature was higher than originally understood, since the
feature would require an FVM major format revision which was determined to be highly
disruptive to engineering efforts (since it requires reimaging devices, and since it
also requires rolling the Zedboot version, which itself is a highly disruptive operation).
Given the high risk of developing this feature, and the high likelihood of impact on the growing
Fuchsia developer community, it no longer made sense to pursue this feature. Instead, the storage
team will be focusing efforts on improving test coverage and automation to mitigate the risks
described in the motivation for this RFC, proceeding with a rewrite of the FVM host tooling (a
substantial source of unexpected complexity), and evaluating the possibility of reducing reliance
on particular FVM/Zedboot versions to reduce impact to developers when either of these need to be
changed.
[^1]: Note that these additional slice types do not necessarily need
to be added to the FVM format; there are a number of ways of expressing this
metadata and the precise format is left as an implementation
detail.
[^2]: We could, as a possible simplification, leave out the A/B
bitmap and shared data types and trust that Blobfs behaves
correctly. However, including this within FVM gives us an extra level of
protection against bugs in the Blobfs implementation. There is also the
option of leaving room and adding this at a later
stage.
[^3]: The journal's region can be shared. When the second partition is
    activated, the journal can be flushed, at which point it is no longer needed
    for the locked, read-only partition; it is only needed to prevent
    inconsistencies on the writable partition.
[^4]: It is possible this bootable state could be stored elsewhere
and passed to FVM at bind time, but it's likely easier to just store this
state within FVM.