{% set rfcid = “RFC-0005” %} {% include “docs/contribute/governance/rfcs/_common/_rfc_header.md” %}
Note: This RFC has been withdrawn. It was originally Accepted on 2020-09-21. See Rationale for withdrawal. This RFC is otherwise retained in its original state for historical purposes.
This RFC describes a simple snapshot mechanism that gives increased resilience to bugs in the upgrade process. Changes to the Fuchsia Volume Manager (FVM), allow a snapshot of the Blobfs partition to be taken that can be reverted to at any stage during the upgrade.
At time of writing, a failed upgrade that causes corruption of a Blobfs partition can leave devices in states that are hard to recover from. The recovery partition currently lacks the ability to restore devices in this state, so the only supported way of restoring in these cases is via bootloaders using a process that is not friendly to end-users.
A snapshot mechanism would reduce the risk of us ending up in this state.
The basic concept is to support a primitive snapshot mechanism within FVM, that allows for the appearance of two partitions for the duration of an upgrade, but allows for sharing data between the partitions.
At this time, FVM is a simple volume-manager, it has the ability to map slices from arbitrary slice-aligned logical offsets, to specific offsets on the underlying device, and it keeps mappings from different partitions separate.
Blobfs consists of the following distinct regions:
Region |
---|
Superblock |
Allocation bitmap |
Inodes |
Journal |
Data |
To support the proposal here, we could allow different slice types within FVM[^1]. The types would apply to extents of slices:
Type | Description |
---|---|
A/B slices | This would be an extent of slices that have an alternate copy. |
A/B bitmap[^2] | This would be an extent of slices that have an alternate copy of a bitmap that represents allocations in a shared data extent. |
Shared data | This would be an extent of slices whose allocation is managed by an A/B bitmap extent. |
Shared | This would be an extent that is shared between the two partitions, but only one of the partitions could write to the region at a time. |
With these slice types, it would then be possible for FVM to present two partitions showing the A/B variations of the extents. So, going back to the Blobfs regions, we would have:
Region | Type |
---|---|
Superblock | A/B slices |
Allocation bitmap | A/B bitmap |
Inodes | A/B slices |
Journal | Shared[^3] |
Data | Shared data |
Most of the time, only one of the partitions would be active, and the system would appear just as it is today.
During upgrade, the second partition can be activated at which point the first partition becomes locked and no further writes are allowed to it, but reads would continue to be served. The second partition can be prepared, potentially in just the same way as it is now, but throughout the upgrade period there is always the option to go back to the first partition, which is guaranteed to remain untouched.
For the A/B extents, it‘s easy to see how the first partition’s data is preserved; the second partition wouldn‘t see the first partition’s data. For the journal, the shared region — only the writable partition would be able to write to it; i.e. the second partition. For the shared data region, the bitmap would indicate which of the blocks could be written to. Any blocks marked as used by the first partition would appear to be read-only to both partitions.
To facilitate this scheme the second partition would also need to be able to read the alternate bitmap, so that it could know which blocks it is allowed to allocate, so to allow for this, it could be presented in the logical address space at some currently unused offset. A strawman proposal is that all of the alternate A/B extents would appear at the same offset but with the top bit set (read-only).
The following diagram hopefully illustrates how each of the partitions would appear:
{:#fig-1}
Figure 1: Partition arrangement.
Notes:
The upgrade flow must be modified to facilitate the snapshotting interactions. The current flow is shown in Figure 2, and the proposed alternative in Figure 3. New APIs and interactions are colored.
{:#fig-2}
Figure 2: Current upgrade implementation (high-level)
{:#fig-3}
Figure 3: Proposed upgrade implementation (high-level)
Several new FVM operations must be implemented and integrated into the Software Delivery (SWD) stack. These APIs are used to drive a state machine (Figure 4), which ultimately switches the system between partitions.
{:#fig-4}
Figure 4: State machine for snapshotting.
Snapshots the active partition's metadata into the alternate partition, which was previously cleared (see “DeleteSnapshot”). The active partition becomes read-only, and all subsequent writes must now go to the inactive partition.
Writing new blobs for the duration of this multi-step process would not be possible, and half written blobs would have to be abandoned, which should not be limiting given the component responsible for writing the blobs should be the same component responsible for asking for the snapshot.
Cancels the population of a snapshot created by TakeSnapshot, clearing the inactive partition and allowing another snapshot to be created.
Switches which partition is writable.
This will likely be rarely used since TakeSnapshot will automatically switch the writable partition, but if there is a need to return and make the active partition writable (in order to garbage collect unused blobs, for example), then this API can be used.
Changes which partition is bootable.
Normally, the bootable partition will change depending on which ZBI slot is active, but it will also be possible to separately switch which partition is bootable. This will likely be rarely used.
Marks the alternate partition as cleared. FVM may choose to delete the metadata therein.
Queries FVM for partitions that are configured for snapshotting.
Queries FVM for information about the of partitions that supports snapshotting.
The system may encounter a failure from any of the states described in the state machine. This section describes the appropriate actions to take if the system encounters a failure.
Note that failures may be voluntary (where the system actively decides to cancel an ongoing update) or involuntary (where the system fails due to external factors, such as losing power). Both cases must be considered.
Note that blobfs has a journalling mechanism that protects against metadata corruption in cases of involuntary failure during modification. No additional work is required to make blobfs robust to involuntary failures during modification.
Any of the new metadata operations in FVM should be made transactional where necessary, to prevent FVM from becoming corrupted by an involuntary failure during modification.
There are no changes necessary for failure handling in this state; behaviour is identical to the current system behaviour.
Equivalent to State 1.
Ephemeral packages are those that are not included in the base set of packages for a given system version.
This proposal imposes few additional restrictions on ephemeral packages; the below section Routing of newly created files describes how ephemeral packages can continue to be supported at any state during the OTA, with one caveat where ephemeral packages must be deleted if the snapshot is aborted while the new base partition was being prepared.
Ephemeral packages may persist across updates, since those written before the update begins into the active partition will be copied into the inactive partition when TakeSnapshot is called, and after that point, all ephemeral packages are written into the new partition, which is readable and writable to the system (and will become the new active partition after the update completes).
There are three cases to consider when deciding where a new file is installed. To simplify the discussion, assume that partition A is active and partition B is inactive.
FVM's metadata has the following structure:
Region | Description |
---|---|
Superblock | What you'd expect. |
Partition table | An array of entries, one for each partition, containing things like name of partition, type, etc. |
Slice allocation | An array of entries, one for each allocatable slice that indicates which partition it is allocated to (if any) and the logical offset within that partition. |
To facilitate the proposal, additional metadata is required to record slice types for extents, so something like the following needs to be stored somewhere:
enum class uint32_t SliceType { kNormal, kAB, kABBitmap, kSharedData, kShared, }; struct { uint32_t slice_offset; // Offset within the partition SliceType slice_type; // The slice type } extents[8];
This metadata could be added to each partition entry. A better approach might be to add a separate partition containing this metadata (i.e. a snapshot metadata partition). The precise location and structure of this metadata is not discussed here and left as an implementation detail.
With this structure, the extents for Blobfs would be:
[ /* super block: */ { 0, SliceType::kAB }, /* allocation bitmap: */ { 0x10000 / kSliceSize, SliceType::kABBitmap }, /* inodes: */ { 0x20000 / kSliceSize, SliceType::kAB }, /* journal: */ { 0x30000 / kSliceSize, SliceType::kShared }, /* data: */ { 0x40000 / kSliceSize, SliceType::kSharedData } ]
Some state is required, to indicates which of the two partitions is currently writable, whether both partitions are active (or just one) and which partition should be considered bootable[^4].
No changes would be required to slice allocation, except that slices at alternate offsets would need to be allocated.
There might be other minor changes required for the super block (e.g. a bump in the version).
This proposal substantially simplifies blobfs format evolution since the alternative partition can be completely deleted and re-created with little cost on each update.
That said, there are still two challenges to deal with when evolving the blobfs format under this proposal.
The implementation will require the following changes, which are roughly dependent on the changes that precede them:
The majority of changes are required by #1 and #4. #1 will involve an on-disk format change and migrating will be supported with a clean install. Reverting will also require a clean install. This is the critical step that involves most risk, but note that only the format change needs to be in place; any code that uses the new FVM metadata can remain dormant until later phases.
The other steps can all be landed without requiring a clean install and can be reverted likewise.
This should have a negligible impact on performance. During upgrades there might be a small impact due to costs involved in snapshotting, but this is likely insignificant relative to other upgrade activities. At other times, there should be no change.
Space needs to be reserved for extra copies of Blobfs regions: the Superblock, Inode table and Bitmap. How much this is exactly depends on the configuration for the device, but it should be relatively small compared with the total amount of space available to Blobfs.
None.
None.
Standard Fuchsia test practices will be used. Existing system tests should already be testing upgrades. These will be expanded to include tests that deliberately corrupt the new Blobfs partition and tests that try to deliberately corrupt the snapshot partition.
The new architecture and features of FVM will be described under Fuchsia > Concepts > Filesystem Architecture.
A full A/B proposal was considered. While that proposal is conceptually simple, it has some significant downsides:
There are challenges with developing a full FVM snapshot feature. Traditional snapshot mechanisms are typically dynamic in nature, which means that metadata needs to be updated as writes arrive. Furthermore, there is a mismatch between FVM‘s slice size (currently 1 MiB, soon to be 32 KiB) and Blobfs’s block size (8 KiB). Addressing this would involve a substantial increase in complexity to FVM and there are also edge cases where it‘s possible to run out of space. Maybe a scheme could be developed that had static mappings, but before too long, you would end up with a proposal not too dissimilar from the one presented here. Altogether, this would likely take much longer to implement, would potentially have some serious downsides (write amplification, complexity), and offers no clear benefits that we see needing in the near future. It is possible that a full snapshot feature would help in the longer term, so the precise design of FVM’s metadata should provide room for expansion to support such use cases in future.
Reliable and resilient upgrades is a common problem typically solved in the following ways:
The authors believe Android uses #3, iOS and macOS use #2 & #4.
This RFC is a simplified version of #4.
Development on this RFC proceeded for several months before we decided to discontinue work on this RFC. There were several factors in this decision, the main being:
Technical debt in the FVM codebase led to slow progress and risky changes. Lack of test coverage, long-latent bugs, and widespread assumptions about the format layout of FVM (due to lack of encapsulation for the FVM format) were the main hindrances.
FVM was under-documented and poorly understood by the team. Organizational knowledge about FVM had decayed over time, and the initial assumption that FVM would be a relatively simple and appropriate place to build this feature was incorrect.
The impact of rolling out the feature was higher than originally understood, since the feature would require an FVM major format revision which was determined to be highly disruptive to engineering efforts (since it requires reimaging devices, and since it also requires rolling the Zedboot version, which itself is a highly disruptive operation).
Given the high risk of developing this feature, and the high likelihood of impact on the growing Fuchsia developer community, it no longer made sense to pursue this feature. Instead, the storage team will be focusing efforts on improving test coverage and automation to mitigate the risks described in the motivation for this RFC, proceeding with a rewrite of the FVM host tooling (a substantial source of unexpected complexity), and evaluating the possibility of reducing reliance on particular FVM/Zedboot versions to reduce impact to developers when either of these need to be changed.
[^1]: Note that these additional slice types do not necessarily need to be added to the FVM format; there are a number of ways of expressing this metadata and the precise format is left as an implementation detail.
[^2]: We could, as a possible simplification, leave out the A/B bitmap and shared data types and trust that Blobfs behaves correctly. However, including this within FVM gives us an extra level of protection against bugs in the Blobfs implementation. There is also the option of leaving room and adding this at a later stage.
[^3]: The journal's region can be shared. At the time at which the second partition is activated, the journal can be flushed at which time it is no longer needed for the locked, read-only partition; it is only needed to prevent inconsistencies on the writable partition.
[^4]: It is possible this bootable state could be stored elsewhere and passed to FVM at bind time, but it's likely easier to just store this state within FVM.