{% set rfcid = "RFC-0005" %}
{% include "docs/contribute/governance/rfcs/_common/_rfc_header.md" %}
# {{ rfc.name }}: {{ rfc.title }}
<!-- SET the `rfcid` VAR ABOVE. DO NOT EDIT ANYTHING ELSE ABOVE THIS LINE. -->
## Summary
Note: This RFC has been withdrawn. It was originally Accepted on 2020-09-21. See [Rationale for
withdrawal](#rationale-for-withdrawal). This RFC is otherwise retained in its original state for
historical purposes.
This RFC describes a simple snapshot mechanism that gives increased resilience
to bugs in the upgrade process. Changes to the Fuchsia Volume Manager (FVM)
allow a snapshot of the Blobfs partition to be taken that can be reverted to at
any stage during the upgrade.
## Motivation
At the time of writing, a failed upgrade that causes corruption of a Blobfs
partition can leave devices in states that are hard to recover from. The
recovery partition currently lacks the ability to restore devices in this state,
so the only supported way of restoring in these cases is via bootloaders using a
process that is not friendly to end-users.
A snapshot mechanism would reduce the risk of us ending up in this state.
## Design
The basic concept is to support a primitive snapshot mechanism within FVM that
presents two partitions for the duration of an upgrade while still allowing data
to be shared between them.
At this time, FVM is a simple volume manager: it can map slices from arbitrary
slice-aligned logical offsets to specific offsets on the underlying device, and
it keeps mappings from different partitions separate.
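To make the existing mapping concrete, a slice allocation entry can be thought of as carrying a partition index and a slice-aligned logical offset. The following is a minimal sketch of that idea; the type, field, and function names are illustrative, not FVM's actual on-disk format, and the sketch ignores FVM's own metadata region:
```
#include <cstdint>
#include <vector>

// Illustrative only: conceptual slice allocation entry (not FVM's real format).
struct SliceEntry {
  uint32_t partition_index;  // Which partition owns this physical slice (0 = free).
  uint32_t logical_slice;    // Slice-aligned logical offset within that partition.
};

constexpr uint64_t kSliceSize = 1ull << 20;  // Currently 1 MiB slices.

// Translate (partition, logical slice) to a physical byte offset, where the
// index into the table is the physical slice number. Returns false if no
// mapping exists.
bool LogicalToPhysical(const std::vector<SliceEntry>& table, uint32_t partition,
                       uint64_t logical_slice, uint64_t* physical_offset) {
  for (uint64_t i = 0; i < table.size(); ++i) {
    if (table[i].partition_index == partition &&
        table[i].logical_slice == logical_slice) {
      *physical_offset = i * kSliceSize;
      return true;
    }
  }
  return false;
}
```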
Blobfs consists of the following distinct regions:
Region |
----------------- |
Superblock |
Allocation bitmap |
Inodes |
Journal |
Data |
To support the proposal here, we could allow different _slice types_ within
FVM[^1]. The types would apply to _extents_ of slices:
Type | Description |
-------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- |
A/B slices | This would be an extent of slices that have an alternate copy. |
A/B bitmap[^2] | This would be an extent of slices that have an alternate copy of a bitmap that represents allocations in a shared data extent. |
Shared data | This would be an extent of slices whose allocation is managed by an A/B bitmap extent. |
Shared | This would be an extent that is shared between the two partitions, but only one of the partitions could write to the region at a time. |
With these slice types, it would then be possible for FVM to present _two_
partitions showing the A/B variations of the extents. So, going back to the
Blobfs regions, we would have:
Region | Type
----------------- | -----------
Superblock | A/B slices
Allocation bitmap | A/B bitmap
Inodes | A/B slices
Journal | Shared[^3]
Data | Shared data
Most of the time, only one of the partitions would be active, and the system
would appear just as it is today.
During an upgrade, the second partition can be activated, at which point the first
partition becomes _locked_ and no further writes are allowed to it, but reads
would continue to be served. The second partition can be prepared, potentially
in just the same way as it is now, but throughout the upgrade period there is
always the option to go back to the first partition, which is guaranteed to
remain untouched.
For the A/B extents, it's easy to see how the first partition's data is
preserved; the second partition wouldn't see the first partition's data. For the
journal, which is a shared region, only the writable partition (i.e. the second
partition) would be able to write to it. For the shared data region, the bitmap would
indicate which of the blocks could be written to. Any blocks marked as used by
the first partition would appear to be read-only to both partitions.
To facilitate this scheme, the second partition would also need to be able to
read the alternate bitmap so that it knows which blocks it is allowed to
allocate. To allow for this, the alternate bitmap could be presented in the
logical address space at some currently unused offset. A strawman proposal is
that all of the alternate A/B extents would appear at the same offset but with
the top bit set (read-only).
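As a rough sketch of that strawman (the constant and helper names below are hypothetical, not an existing FVM API), the alternate copy of an A/B extent could be addressed by setting the top bit of the logical offset, with FVM rejecting writes through that alias:
```
#include <cstdint>

// Hypothetical convention: the top bit of a 64-bit logical offset selects the
// alternate (read-only) copy of an A/B extent.
constexpr uint64_t kAlternateBit = 1ull << 63;

// Logical address of the alternate partition's copy of the extent that starts
// at `logical_offset` in the primary address space.
constexpr uint64_t AlternateOffset(uint64_t logical_offset) {
  return logical_offset | kAlternateBit;
}

// FVM would reject writes to any offset with the alternate bit set.
constexpr bool IsReadOnlyAlias(uint64_t logical_offset) {
  return (logical_offset & kAlternateBit) != 0;
}
```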
The following diagram illustrates how each of the partitions would
appear:
![Partition Arrangement](resources/0005_blobfs_snapshots/fig_1.png){:#fig-1}
**Figure 1: Partition arrangement.**
Notes:
* It gets us some of the resilience that we would have got from the simple A/B
partition approach, but _not all._
* We can keep the current incremental approach to updates (i.e. only update
blobs that have changed) at the expense of not ending up with a predictable
layout. On user builds, we would have the option of completely rewriting all
the blobs, but we would still be at the mercy of fragmentation.
* It adds complexity to FVM.
### New Upgrade flow
The upgrade flow must be modified to facilitate the snapshotting interactions.
The current flow is shown in [Figure 2](#fig-2), and the proposed alternative in
[Figure 3](#fig-3). New APIs and interactions are colored.
![Current OTA](resources/0005_blobfs_snapshots/fig_2.png){:#fig-2}
**Figure 2: Current upgrade implementation (high-level)**
![Proposed OTA](resources/0005_blobfs_snapshots/fig_3.png){:#fig-3}
**Figure 3: Proposed upgrade implementation (high-level)**
### New FVM operations
Several new FVM operations must be implemented and integrated into the Software
Delivery (SWD) stack. These APIs are used to drive a state machine ([Figure
4](#fig-4)), which ultimately switches the system between partitions.
![Snapshot state machine](resources/0005_blobfs_snapshots/fig_4.png){:#fig-4}
**Figure 4: State machine for snapshotting.**
#### TakeSnapshot
**Snapshots the active partition's metadata into the alternate partition, which
was previously cleared (see "DeleteSnapshot"). The active partition becomes
read-only, and all subsequent writes must now go to the inactive partition.**
* FVM makes the active partition read-only.
* Pending journal entries must be flushed.
* FVM creates the inactive partition.
* FVM copies over the metadata from the active->inactive partition.
Writing new blobs for the duration of this multi-step process would not be
possible, and half-written blobs would have to be abandoned. This should not be
limiting, given that the component responsible for writing blobs should be the
same component responsible for requesting the snapshot.
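The sequence might look roughly like the following; this is a sketch only, and none of the types or methods shown are real FVM or Blobfs interfaces:
```
// Hypothetical sketch of the TakeSnapshot sequence; illustrative names only.
class FvmSnapshotOps {
 public:
  virtual ~FvmSnapshotOps() = default;
  virtual void SetActivePartitionReadOnly() = 0;
  virtual bool FlushActiveJournal() = 0;
  virtual bool CreateInactivePartition() = 0;
  virtual bool CopyMetadataToInactive() = 0;
  virtual void SetWritablePartitionToInactive() = 0;
};

enum class SnapshotError { kOk, kFlushFailed, kCreateFailed, kCopyFailed };

SnapshotError TakeSnapshot(FvmSnapshotOps& fvm) {
  // 1. Make the active partition read-only; new blob writes are rejected and
  //    half-written blobs are abandoned by the caller.
  fvm.SetActivePartitionReadOnly();
  // 2. Flush pending journal entries so the on-disk metadata is consistent.
  if (!fvm.FlushActiveJournal()) return SnapshotError::kFlushFailed;
  // 3. Create the inactive partition (its A/B extents).
  if (!fvm.CreateInactivePartition()) return SnapshotError::kCreateFailed;
  // 4. Copy the metadata (superblock, allocation bitmap, inodes) from the
  //    active partition into the inactive one.
  if (!fvm.CopyMetadataToInactive()) return SnapshotError::kCopyFailed;
  // Subsequent writes now go to the inactive (soon-to-be-new) partition.
  fvm.SetWritablePartitionToInactive();
  return SnapshotError::kOk;
}
```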
#### CancelSnapshot
**Cancels the population of a snapshot created by TakeSnapshot, clearing the
inactive partition and allowing another snapshot to be created.**
* At this time, all read connections to the inactive partition must be closed.
* The inactive partition will be deleted by FVM. The active partition will
become writable again.
#### SetWritablePartition
**Switches which partition is writable.**
* The journal must be flushed at this point (all pending operations must
complete). The fsync call in the diagram above can facilitate this, but
ideally the journal flushing is done transactionally with the rest of this
operation so no new writes can "sneak in".
This will likely be rarely used since TakeSnapshot will automatically switch the
writable partition, but if there is a need to return and make the active
partition writable (in order to garbage collect unused blobs, for example), then
this API can be used.
#### SetBootPartition
**Changes which partition is bootable.**
Normally, the bootable partition will change depending on which ZBI slot is
active, but it will also be possible to separately switch which partition is
bootable. This will likely be rarely used.
#### DeleteSnapshot
**Marks the alternate partition as cleared. FVM may choose to delete the
metadata therein.**
#### ListSnapshotPartitions
**Queries FVM for partitions that are configured for snapshotting.**
#### QuerySnapshotPartition
**Queries FVM for information about one of the partitions that supports
snapshotting.**
* Identifies the state of the A/B partitions, such as which is active.
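Taken together, these operations might be surfaced to the SWD stack as something like the interface sketched below. The names and signatures are purely illustrative; the real API would presumably be a FIDL protocol whose shape is not specified in this RFC:
```
#include <cstdint>
#include <vector>

// Illustrative description of a snapshot-capable partition's state.
struct SnapshotPartitionInfo {
  uint64_t partition_id;
  bool active;    // Currently serving the running system.
  bool writable;  // Receives new writes.
  bool bootable;  // Will be used on the next boot.
};

// Hypothetical sketch of the new FVM operations; not a real Fuchsia API.
class FvmSnapshotManager {
 public:
  virtual ~FvmSnapshotManager() = default;
  // Snapshot the active partition's metadata into the (cleared) alternate
  // partition; the active partition becomes read-only.
  virtual bool TakeSnapshot() = 0;
  // Abandon an in-progress snapshot; the active partition becomes writable again.
  virtual bool CancelSnapshot() = 0;
  // Switch which partition receives writes.
  virtual bool SetWritablePartition(uint64_t partition_id) = 0;
  // Switch which partition will be used on the next boot.
  virtual bool SetBootPartition(uint64_t partition_id) = 0;
  // Mark the alternate partition as cleared; FVM may discard its metadata.
  virtual bool DeleteSnapshot() = 0;
  // Enumerate partitions configured for snapshotting.
  virtual std::vector<uint64_t> ListSnapshotPartitions() = 0;
  // Describe the state of one snapshot-capable partition.
  virtual SnapshotPartitionInfo QuerySnapshotPartition(uint64_t partition_id) = 0;
};
```
In practice, the state machine in Figure 4 would constrain which of these calls are valid from which states.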
### Failure Modes
The system may encounter a failure from any of the states described in the state
machine. This section describes the appropriate actions to take if the system
encounters a failure.
Note that failures may be voluntary (where the system actively decides to cancel
an ongoing update) or involuntary (where the system fails due to external
factors, such as losing power). Both cases must be considered.
Note that blobfs has a journalling mechanism that protects against metadata
corruption in cases of involuntary failure during modification. No additional
work is required to make blobfs robust to involuntary failures during
modification.
Any of the new metadata operations in FVM should be made transactional where
necessary, to prevent FVM from becoming corrupted by an involuntary failure
during modification.
#### State 1: Before TakeSnapshot
There are no changes necessary for failure handling in this state; behaviour is
identical to the current system behaviour.
#### State 2: After TakeSnapshot, Before reboot
* For voluntary failures, the CancelSnapshot API can be invoked to delete the
inactive partition and return the system back into State 1.
* For involuntary failures, the system can either decide to simply abort the
update once it comes back online (by invoking CancelSnapshot), or the system
may choose to attempt to resume the update.
#### State 3: After reboot, before TakeSnapshot
Equivalent to State 1.
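As an illustration, boot-time recovery logic might look roughly like the following sketch; the state and action enums are hypothetical, and whether to resume rather than abort is a policy choice, as described above:
```
// Hypothetical sketch of recovery after an involuntary failure. The states
// mirror Figure 4; none of these names are real APIs.
enum class SnapshotState {
  kNoSnapshot,        // State 1: before TakeSnapshot.
  kSnapshotTaken,     // State 2: after TakeSnapshot, before reboot.
  kRebootedToTarget,  // State 3: after reboot, before the next TakeSnapshot.
};

enum class RecoveryAction { kNone, kCancelSnapshot, kResumeUpdate };

RecoveryAction RecoverAfterCrash(SnapshotState state, bool resume_supported) {
  switch (state) {
    case SnapshotState::kNoSnapshot:
    case SnapshotState::kRebootedToTarget:
      // Equivalent to today's behaviour: blobfs journalling already protects
      // metadata against involuntary failures, so nothing extra is needed.
      return RecoveryAction::kNone;
    case SnapshotState::kSnapshotTaken:
      // Either abort the update (CancelSnapshot) or, if supported, resume it.
      return resume_supported ? RecoveryAction::kResumeUpdate
                              : RecoveryAction::kCancelSnapshot;
  }
  return RecoveryAction::kNone;
}
```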
### Supporting ephemeral packages
Ephemeral packages are those that are not included in the base set of packages
for a given system version.
This proposal imposes few additional restrictions on ephemeral packages. The
section [Routing of newly created files](#routing-of-newly-created-files) below
describes how ephemeral packages can continue to be supported at any state
during the OTA, with one caveat: ephemeral packages must be deleted if the
snapshot is aborted while the new base partition is being prepared.
Ephemeral packages may persist across updates: those written into the active
partition before the update begins will be copied into the inactive partition
when TakeSnapshot is called, and after that point, all ephemeral packages are
written into the new partition, which is readable and writable by the system
(and will become the new active partition after the update completes).
### Routing of newly created files
There are three cases to consider when deciding where a new file is
installed. To simplify the discussion, assume that partition A is active and
partition B is inactive.
#### Case 1: Before TakeSnapshot
* Base packages: Not written.
* Ephemeral packages: Written to partition A.
#### Case 2: After TakeSnapshot, before reboot
* Base packages: Written to partition B.
* Ephemeral packages: Written to partition B. Note that if the snapshot is
  aborted, these packages will be deleted before the next snapshot is attempted.
#### Case 3: After remount (NB: equivalent to "Before TakeSnapshot")
* Base packages: Not written.
* Ephemeral packages: Written to partition B.
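The three cases above amount to a small routing decision, sketched below; the enums and function are illustrative only and not part of the actual SWD stack:
```
#include <optional>

// Illustrative sketch of where a newly written blob lands, mirroring Cases 1-3.
enum class UpdatePhase { kBeforeTakeSnapshot, kAfterTakeSnapshot, kAfterRemount };
enum class PackageKind { kBase, kEphemeral };
enum class Partition { kA, kB };

// Returns the partition a new blob is written to, or nullopt if it is not
// written at all (base packages outside of an update).
std::optional<Partition> RouteNewBlob(UpdatePhase phase, PackageKind kind) {
  switch (phase) {
    case UpdatePhase::kBeforeTakeSnapshot:
      // Case 1: only ephemeral packages are written, into active partition A.
      if (kind == PackageKind::kEphemeral) return Partition::kA;
      return std::nullopt;
    case UpdatePhase::kAfterTakeSnapshot:
      // Case 2: all packages go to partition B; ephemeral blobs written here
      // are deleted if the snapshot is aborted.
      return Partition::kB;
    case UpdatePhase::kAfterRemount:
      // Case 3: equivalent to Case 1, with B now the active partition.
      if (kind == PackageKind::kEphemeral) return Partition::kB;
      return std::nullopt;
  }
  return std::nullopt;
}
```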
### Changes to FVM Metadata
FVM's metadata has the following structure:
Region | Description |
---------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
Superblock | What you'd expect. |
Partition table | An array of entries, one for each partition, containing things like name of partition, type, etc. |
Slice allocation | An array of entries, one for each allocatable slice that indicates which partition it is allocated to (if any) and the logical offset within that partition. |
To facilitate the proposal, additional metadata is required to record slice
types for extents, so something like the following needs to be stored somewhere:
```
enum class SliceType : uint32_t {
kNormal,
kAB,
kABBitmap,
kSharedData,
kShared,
};
struct {
uint32_t slice_offset; // Offset within the partition
SliceType slice_type; // The slice type
} extents[8];
```
This metadata could be added to each partition entry. A better approach might be
to add a separate partition containing this metadata (i.e. a snapshot metadata
partition). The precise location and structure of this metadata are not discussed
here and are left as an implementation detail.
With this structure, the extents for Blobfs would be:
```
[
/* super block: */ { 0, SliceType::kAB },
/* allocation bitmap: */ { 0x10000 / kSliceSize, SliceType::kABBitmap },
/* inodes: */ { 0x20000 / kSliceSize, SliceType::kAB },
/* journal: */ { 0x30000 / kSliceSize, SliceType::kShared },
/* data: */ { 0x40000 / kSliceSize, SliceType::kSharedData }
]
```
Some state is required to indicate which of the two partitions is currently
writable, whether both partitions are active (or just one), and which partition
should be considered bootable[^4].
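A minimal sketch of what that state could look like follows; the layout is purely illustrative, since the actual representation is an implementation detail:
```
#include <cstdint>

// Purely illustrative layout for the snapshot state; the real encoding and its
// location within FVM's metadata are implementation details.
struct SnapshotConfig {
  uint8_t writable_partition;  // 0 = A, 1 = B: which copy receives writes.
  uint8_t bootable_partition;  // 0 = A, 1 = B: which copy is used at boot.
  uint8_t both_active;         // 1 while a snapshot/upgrade is in progress.
  uint8_t reserved;            // Padding / future use.
};
```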
No changes would be required to slice allocation, except that slices at
alternate offsets would need to be allocated.
There might be other minor changes required for the super block (e.g. a bump in
the version).
### Supporting blobfs format evolution
This proposal substantially simplifies blobfs format evolution since the
alternative partition can be completely deleted and re-created with little cost
on each update.
That said, there are still two challenges to deal with when evolving the blobfs
format under this proposal.
* The block allocation map cannot change, because it is a structure shared
between both active/inactive partitions. (Given how simple the allocation
map is, this seems perfectly acceptable.)
* The active partition cannot overwrite any extents that are also allocated
by the inactive partition. However, this is fairly simple to deal with: if
the internal format of some data in an extent needs to change, the system
can simply allocate new extents and move the data over, during the
TakeSnapshot call.
## Implementation
The implementation will require the following changes, which are roughly
dependent on the changes that precede them:
1. Changes to FVM, and partition set up.
1. Changes to Blobfs allocation.
1. Changes to early bootstrap code.
1. Changes to the upgrade process to use the new APIs.
The majority of changes are required by #1 and #4. #1 will involve an on-disk
format change and migrating will be supported with a clean install. Reverting
will also require a clean install. This is the critical step that involves most
risk, but note that only the format change needs to be in place; any code that
uses the new FVM metadata can remain dormant until later phases.
The other steps can all be landed without requiring a clean install and can be
reverted likewise.
## Performance
This should have a negligible impact on performance. During upgrades there might
be a small impact due to costs involved in snapshotting, but this is likely
insignificant relative to other upgrade activities. At other times, there should
be no change.
## Space requirements
Space needs to be reserved for extra copies of Blobfs regions: the Superblock,
Inode table, and Bitmap. Exactly how much depends on the configuration
for the device, but it should be relatively small compared with the total amount
of space available to Blobfs.
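As a purely illustrative calculation (every figure below is an assumption, not a measurement of any real configuration), a device with an 8 KiB superblock, 32768 inodes of 64 bytes each, and a 4 GiB data region with one bitmap bit per 8 KiB block would need roughly 2 MiB of extra space before rounding up to FVM slice boundaries:
```
#include <cstdint>
#include <cstdio>

// Purely illustrative figures; actual sizes depend on device configuration.
constexpr uint64_t kSuperblockBytes = 8 * 1024;             // One 8 KiB block.
constexpr uint64_t kInodeCount = 32768;                     // Assumed inode count.
constexpr uint64_t kInodeBytes = 64;                        // Assumed inode size.
constexpr uint64_t kDataBytes = 4ull * 1024 * 1024 * 1024;  // 4 GiB data region.
constexpr uint64_t kBlockBytes = 8 * 1024;                  // Blobfs block size.

constexpr uint64_t kExtraCopyBytes =
    kSuperblockBytes +               // Alternate superblock.
    kInodeCount * kInodeBytes +      // Alternate inode table.
    (kDataBytes / kBlockBytes) / 8;  // Alternate allocation bitmap.

int main() {
  // Prints roughly 2 MiB for these figures, before rounding up to FVM slices.
  std::printf("extra bytes: %llu\n",
              static_cast<unsigned long long>(kExtraCopyBytes));
}
```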
## Security considerations
None.
## Privacy considerations
None.
## Testing
Standard Fuchsia test practices will be used. Existing system tests should
already be testing upgrades. These will be expanded to include tests that
deliberately corrupt the new Blobfs partition and tests that attempt to corrupt
the snapshot partition.
## Documentation
The new architecture and features of FVM will be described under
[Fuchsia > Concepts > Filesystem Architecture](/docs/concepts/filesystems/filesystems.md).
## Drawbacks, alternatives, and unknowns
### Full A/B proposal
A full A/B proposal was considered. While that proposal is conceptually simple,
it has some significant downsides:
* Each partition can only use 50% of the available disk space.
* This is currently a _soft_ constraint on our system updates, which are
budgeted to only use 50% of the available Blobfs space, but the A/B
proposal would make this a hard constraint.
* Engineering builds already exceed the 50% budget, so they do not
support upgrades that modify many files. Engineers rely heavily on the
ability to do incremental, small updates; breaking this workflow is a
non-starter.
* There is no mechanism to share files between the partitions, thus making
every update rewrite every file.
* This implies additional flash wear, and slower upgrades. Every update
would essentially be a maximal update.
### Full FVM Snapshot Feature
There are challenges with developing a full FVM snapshot feature. Traditional
snapshot mechanisms are typically dynamic in nature, which means that metadata
needs to be updated as writes arrive. Furthermore, there is a mismatch between
FVM's slice size (currently 1 MiB, soon to be 32 KiB) and Blobfs's block size (8
KiB). Addressing this would involve a substantial increase in complexity to FVM
and there are also edge cases where it's possible to run out of space. A scheme
with static mappings could perhaps be developed, but before too long it would
converge on a proposal not too dissimilar from the one presented here.
Altogether, this would likely take much longer to implement, would potentially
have some serious downsides (write amplification, complexity), and offers no
clear benefits that we foresee needing in the near future. It is possible that a
full snapshot feature would help in the longer term, so the precise design of
FVM's metadata should provide room for expansion to support such use cases in
future.
## Prior art and references
Making upgrades reliable and resilient is a common problem, typically solved in
the following ways:
1. A/B copies: keep functionally equivalent copies and switch between them as
required. Simple, but costs space.
1. A/R copies: keep a recovery copy, which is a stripped down version that only
supports restoring software. More complicated, lower space requirements,
slightly degraded user experience.
1. A/B/R: a combination of #1 & #2.
1. A + snapshot: most of the time, have only one copy available. At upgrade
time, take a snapshot of A and apply the update as deltas on the snapshot.
At any time, provide the option to roll back to the snapshot. Often
complicated, but flexible.
The authors believe Android uses #3, while iOS and macOS use #2 and #4.
This RFC is a simplified version of #4.
## Rationale for withdrawal {#rationale-for-withdrawal}
Development on this RFC proceeded for several months before we decided to discontinue
the work. There were several factors in this decision, the main ones being:
* Technical debt in the FVM codebase led to slow progress and risky changes. Lack
of test coverage, long-latent bugs, and widespread assumptions about the format layout of FVM
(due to lack of encapsulation for the FVM format) were the main hindrances.
* FVM was under-documented and poorly understood by the team. Organizational knowledge about FVM
had decayed over time, and the initial assumption that FVM would be a relatively simple and
appropriate place to build this feature was incorrect.
* The impact of rolling out the feature was higher than originally understood, since the
feature would require an FVM major format revision which was determined to be highly
disruptive to engineering efforts (since it requires reimaging devices, and since it
also requires rolling the Zedboot version, which itself is a highly disruptive operation).
Given the high risk of developing this feature, and the high likelihood of impact on the growing
Fuchsia developer community, it no longer made sense to pursue this feature. Instead, the storage
team will be focusing efforts on improving test coverage and automation to mitigate the risks
described in the motivation for this RFC, proceeding with a rewrite of the FVM host tooling (a
substantial source of unexpected complexity), and evaluating the possibility of reducing reliance
on particular FVM/Zedboot versions to reduce impact to developers when either of these need to be
changed.
[^1]: Note that these additional slice types do not necessarily need
to be added to the FVM format; there are a number of ways of expressing this
metadata and the precise format is left as an implementation
detail.
[^2]: We could, as a possible simplification, leave out the A/B
bitmap and shared data types and trust that Blobfs behaves
correctly. However, including this within FVM gives us an extra level of
protection against bugs in the Blobfs implementation. There is also the
option of leaving room and adding this at a later
stage.
[^3]: The journal's region can be shared. When the second partition is
    activated, the journal can be flushed, at which point it is no longer needed
    for the locked, read-only partition; it is only needed to prevent
    inconsistencies on the writable partition.
[^4]: It is possible this bootable state could be stored elsewhere
and passed to FVM at bind time, but it's likely easier to just store this
state within FVM.