| # BlobFS |
| |
| **BlobFS** is a content-addressable filesystem optimized for write-once, |
| read-often files, such as binaries and libraries. On Fuchsia, BlobFS is the |
| storage system used for all software packages. |
| |
| When mounted, BlobFS presents a single logical directory containing all files |
| (a.k.a., blobs): |
| |
| ``` |
| blob/ |
| ├── 00aeb9b5652a4adbf630d04a6ca22668f9c8469746f3f175687b3c0ff6699a49 |
| ├── 01289d3e1d2cdbc7d1b4977210877c5bbdffdbad463d992badc149152962a205 |
| ├── 018951bcf92091fd5d294cbd1f3a48d6ca59be7759587f28077b2eb754b437c0 |
| └── 01bad8536a7aee498ffd323f53e06232b8a81edd507ac2a95bd0e819c4983138 |
| ``` |
| |
| Files in BlobFS are: |
| |
| * **Immutable**: Once created, a blob cannot be modified (except removal). |
| * **Content-Addressable**: Blob names are deterministically derived from their |
| contents. |
| * **Verified**: Cryptographic checksums are used to ensure integrity of blob |
| data. |
| |
| These properties of blobs make BlobfS a key component of Fuchsia's security |
| posture, ensuring that software packages' contents can be verified before they |
| are executed. |
| |
| ## Design and implementation of BlobFS |
| |
| ### On-disk format |
| |
| BlobFS stores each blob in a linked list of non-adjacent extents (a contiguous |
| range of data blocks). Each blob has an associated Inode, which describes where |
| the block's data starts on disk and other metadata about the blob. |
| |
| BlobFS divides a disk (or a partition thereof) into five chunks: |
| |
| * The **Superblock** storing filesystem-wide metadata, |
| * The **Block Map**, a bitmap used to keep track of free and allocated data |
| blocks, |
| * The **Node Map**, a flat array of Inodes (reference to where a blob's data |
| starts on disk) or ExtentContainers (reference to several extents containing |
| some of a blob's data). |
| * The **Journal**, a log of filesystem operations that ensures filesystem |
| integrity, even if the device reboots or loses power during an operation, |
| and |
| * The **Data Blocks**, where blob contents and their verification metadata are |
| stored in a series of extents. |
| |
| ![BlobFS disk layout](images/blobfs-disk-format.svg "disk_format") |
| |
| Figure 1: BlobFS disk layout |
| |
| #### Superblock |
| |
| The superblock is the first block in a BlobFS-formatted partition. It describes |
| the location and size of the other chunks of the filesystem, as well as other |
| filesystem-level metadata. |
| |
| When a BlobFS-formatted filesystem is mounted, this block is mapped into memory |
| and parsed to determine where the rest of the filesystem lives. The block is |
| modified whenever a new blob is created, and (for FVM-managed BlobFS instances) |
| whenever the size of the BlobFS filesystem shrinks or grows. |
| |
| ![BlobFS superblock](images/blobfs-superblock-layout.svg "superblock_layout") |
| |
| Figure 2: BlobFS superblock |
| |
| When BlobFS is managed by FVM, the superblock contains some additional metadata |
| describing the FVM slices that contain the BlobFS filesystem. These fields |
| (yellow in the above diagram) are ignored for non-FVM, fixed-size BlobFS images. |
| |
| #### Block map |
| |
| The block map is a simple bit-map that marks each data block as allocated or |
| not. This map is used during block allocation to find contiguous ranges of |
| blocks, known as _extents_, to store blob contents in. |
| |
| ![Example block map](images/blobfs-example-block-map.svg "block_map_example") |
| |
| Figure 3: An example block-map with several free extents of varying size. |
| |
| When a BlobFS image is mounted, the block map is mapped into memory where it can |
| be read by the block allocator. The block map is written back to disk whenever a |
| block is allocated (during blob creation) or deallocated (during blob deletion). |
| |
| #### Node map |
| |
| The node map is an array of all nodes on the filesystem, which can come in two |
| variations: |
| |
| * **Inodes**, which describe a single blob on the filesystem, or |
| * **ExtentContainers**, which point to an extent containing part of a blob's |
| data. |
| |
| Nodes of both types are stored together in a single flat array. Each node has a |
| common header that describes what type the node is, and whether the node is |
| allocated. Both node types are the same size, so there is no internal |
| fragmentation of the array. |
| |
| ##### Inodes |
| |
| Each blob in the filesystem has a corresponding Inode, which describes where the |
| blob's data starts and some other metadata about the blob. |
| |
| ![Layout of a BlobFS Inode](images/blobfs-inode-layout.svg "inode_layout") |
| |
| Figure 4: Layout of a BlobFS Inode. |
| |
| For small blobs, the Inode may be the only node necessary to describe where the |
| blob is on disk. In this case `extent_count` is one, `next_node` must not be |
| used, and `inline_extent` describes the blob's single extent. |
| |
| Larger blobs will likely occupy multiple extents, especially on a fragmented |
| BlobFS image. In this case, the first extent of the blob is stored in |
| `inline_extent`, and all subsequent extents are stored in a linked list of |
| ExtentContainers starting at `next_node.` |
| |
| ![Format of an Extent](images/blobfs-extent-format.svg "extent_format") |
| |
| Figure 5: Format of an Extent (occupying 64 bits). This format is used both in |
| Inodes and ExtentContainers. |
| |
| Note that this representation of extents implies that an extent can have at most |
| 2\*\*16 blocks in it (the maximum value of Extent Size). |
| |
| ##### ExtentContainers |
| |
| An ExtentContainer holds references to several (up to 6) extents, which store |
| some of the contents of a blob. |
| |
| The extents in an ExtentContainer are logically contiguous (i.e. the logical |
| addressable chunk of the blob stored in `extents[0]` is before `extents[1]`) and are |
| filled in order. If `next_node` is set, then the ExtentContainer must be full. |
| |
| ![Layout of a BlobFS ExtentContainer](images/blobfs-extentcontainer-layout.svg "extentcontainer_layout") |
| |
| Figure 6: Layout of a BlobFS ExtentContainer. |
| |
| ##### Properties of the node linked-list |
| |
| A blob's extents are held in a linked-list of a single Inode (which holds the |
| first extent) and zero or more ExtentContainers (each of which holds up to 6 |
| extents). |
| |
| This linked list has the following properties. Violating any of these properties |
| results in blobfs treating the blob as corrupted. |
| |
| * Extents are logically contiguous: |
| * If Node A precedes Node B in the list, then all extents in Node A have |
| lower logical offsets into the blob's contents. |
| * Within a given ExtentContainer, for extents 𝑥 and 𝑦, if 𝑥 < 𝑦, then |
| extent 𝑥 has a lower logical offset into the blob's contents than extent |
| 𝑦. |
| * Nodes are packed before a new node is linked. That is, if a Node has a |
| non-null `next_node`, then it must be full of extents (*extent for Inodes |
| and 6 extents for ExtentContainers). |
| * The total number of extents in the linked-list must equal to the Inode's |
| `extent_count`. |
| * The sum of the size of all extents in the linked-list must equal to the |
| Inode's `block_count`. |
| * The end of the list is determined based on the `extent_count` in the Inode |
| being satisfied. `next_node` in the final node should not be used. |
| |
| ##### Example Node layouts |
| |
| This section contains some examples of different ways a blob's Nodes may be |
| formatted. |
| |
| * [Example: Single-extent blob](#example-single-extent-blob) |
| * [Example: Multiple-extent blob](#example-multiple-extent-blob) |
| |
| ###### Example: Single-extent blob {: #example-single-extent-blob } |
| |
| ![Example: Single-extent blob](images/blobfs-example-node-1.svg "node_example_1") |
| |
| Figure 7: Node layout for a blob stored in a single extent |
| |
| ###### Example: Multiple-extent blob {: #example-multiple-extent-blob } |
| |
| ![Example: Multiple-extent blob](images/blobfs-example-node-2.svg "node_example_2") |
| |
| Figure 8: Node layout for a blob stored in several extents. Note that a blob's |
| extents may be scattered throughout the disk. |
| |
| ##### Blob fragmentation |
| |
| A newly created BlobFS image has all of its data blocks free. Extents of |
| arbitrary size can easily be found, and blobs tend to be stored in a single |
| large extent (or a few large extents). |
| |
| Over time, as blobs are allocated and deallocated, the block map will become |
| **fragmented** into many smaller extents. Newly created blobs will have to be |
| stored in multiple smaller extents. |
| |
| ![A fragmented block map](images/blobfs-fragmentation.svg "fragmentation") |
| |
| Figure 9: A fragmented block map. While there are plenty of free blocks, there |
| are few large extents available. |
| |
| Fragmentation is undesirable for several reasons: |
| |
| * **Slower Reads**: Reading a fragmented blob requires chasing pointers in the |
| Node Map. This affects both sequential reads and random-access reads |
| * **Slower Creation and Deletion:** Creating a blob requires finding free |
| extents for it; this takes longer if many small extents must be found. |
| Similarly, deleting a fragmented blob requires chasing down and freeing many |
| extents. |
| * **Metadata Overhead:** Storing fragmented blobs requires more nodes. There |
| are a finite number of nodes in the Node Map, which can be exhausted, |
| preventing blobs from being created. |
| |
| Currently BlobFS does not perform defragmentation. |
| |
| #### Journal |
| |
| TODO |
| |
| #### Data blocks |
| |
| Finally, the actual contents of the blobs must be stored somewhere. The |
| remaining storage blocks in the BlobFS image are designated for this purpose. |
| |
| Each blob is allocated enough extents to contain all of its data, as well as a |
| number of data blocks reserved for storing verification metadata of the blob. |
| This metadata is always stored in the first blocks of the blob. Metadata is |
| padded so that the actual data always starts at a block-aligned address. |
| |
| This verification metadata is called a **Merkle Tree**, a data structure that |
| uses cryptographic hashes to guarantee the integrity of the blob's contents. |
| |
| ##### Merkle tree |
| |
| A blob's Merkle Tree is constructed as follows (for more details, see |
| [Fuchsia Merkle Roots](/docs/concepts/packages/merkleroot.md)): |
| |
| * Each leaf node is a sha256 hash of a single block's worth of data. |
| * Each non-leaf node is a sha256 hash combining its children's hashes. |
| * The tree terminates at the level where there is a single sha256 hash. |
| |
| The hash value at the top-most node is known as the **Merkle Root** of the blob. |
| This value is used as the name of the blob. |
| |
| ![A simplified example Merkle Tree](images/blobfs-example-merkle.svg "example_merkle") |
| |
| Figure 10: A simplified example Merkle Tree. Note that in practice more |
| information is included in each hash value (such as the block offset and |
| length), and each non-leaf node is significantly wider (in particular, each |
| non-leaf node can contain up to 8192 / 32 == 256 children). |
| |
| ### Implementation of BlobFS |
| |
| Like other Fuchsia filesystems, BlobFS is implemented as a userspace process |
| that serves clients through a FIDL interface. |
| |
| <!-- |
| |
| #### Startup and initialization |
| |
| TODO |
| |
| #### Blob lifecycle |
| |
| ##### Creation |
| |
| TODO |
| |
| ##### Deletion |
| |
| TODO |
| |
| ## BlobFS and Fuchsia |
| |
| TODO: Finish this section describing Blobfs' role in the Fuchsia system and its |
| relationship to other components, such as pkgfs. |
| |
| --> |
| |