blob: 5dc5b5a51b1259b7df039e5ad2d0155858cca0bd [file] [log] [blame] [view]
# BlobFS
**BlobFS** is a content-addressable filesystem optimized for write-once,
read-often files, such as binaries and libraries. On Fuchsia, BlobFS is the
storage system used for all software packages.
When mounted, BlobFS presents a single logical directory containing all files
(a.k.a., blobs):
```
blob/
├── 00aeb9b5652a4adbf630d04a6ca22668f9c8469746f3f175687b3c0ff6699a49
├── 01289d3e1d2cdbc7d1b4977210877c5bbdffdbad463d992badc149152962a205
├── 018951bcf92091fd5d294cbd1f3a48d6ca59be7759587f28077b2eb754b437c0
└── 01bad8536a7aee498ffd323f53e06232b8a81edd507ac2a95bd0e819c4983138
```
Files in BlobFS are:
* **Immutable**: Once created, a blob cannot be modified (except removal).
* **Content-Addressable**: Blob names are deterministically derived from their
contents.
* **Verified**: Cryptographic checksums are used to ensure integrity of blob
data.
These properties of blobs make BlobfS a key component of Fuchsia's security
posture, ensuring that software packages' contents can be verified before they
are executed.
## Design and implementation of BlobFS
### On-disk format
BlobFS stores each blob in a linked list of non-adjacent extents (a contiguous
range of data blocks). Each blob has an associated Inode, which describes where
the block's data starts on disk and other metadata about the blob.
BlobFS divides a disk (or a partition thereof) into five chunks:
* The **Superblock** storing filesystem-wide metadata,
* The **Block Map**, a bitmap used to keep track of free and allocated data
blocks,
* The **Node Map**, a flat array of Inodes (reference to where a blob's data
starts on disk) or ExtentContainers (reference to several extents containing
some of a blob's data).
* The **Journal**, a log of filesystem operations that ensures filesystem
integrity, even if the device reboots or loses power during an operation,
and
* The **Data Blocks**, where blob contents and their verification metadata are
stored in a series of extents.
![BlobFS disk layout](images/blobfs-disk-format.svg "disk_format")
Figure 1: BlobFS disk layout
#### Superblock
The superblock is the first block in a BlobFS-formatted partition. It describes
the location and size of the other chunks of the filesystem, as well as other
filesystem-level metadata.
When a BlobFS-formatted filesystem is mounted, this block is mapped into memory
and parsed to determine where the rest of the filesystem lives. The block is
modified whenever a new blob is created, and (for FVM-managed BlobFS instances)
whenever the size of the BlobFS filesystem shrinks or grows.
![BlobFS superblock](images/blobfs-superblock-layout.svg "superblock_layout")
Figure 2: BlobFS superblock
When BlobFS is managed by FVM, the superblock contains some additional metadata
describing the FVM slices that contain the BlobFS filesystem. These fields
(yellow in the above diagram) are ignored for non-FVM, fixed-size BlobFS images.
#### Block map
The block map is a simple bit-map that marks each data block as allocated or
not. This map is used during block allocation to find contiguous ranges of
blocks, known as _extents_, to store blob contents in.
![Example block map](images/blobfs-example-block-map.svg "block_map_example")
Figure 3: An example block-map with several free extents of varying size.
When a BlobFS image is mounted, the block map is mapped into memory where it can
be read by the block allocator. The block map is written back to disk whenever a
block is allocated (during blob creation) or deallocated (during blob deletion).
#### Node map
The node map is an array of all nodes on the filesystem, which can come in two
variations:
* **Inodes**, which describe a single blob on the filesystem, or
* **ExtentContainers**, which point to an extent containing part of a blob's
data.
Nodes of both types are stored together in a single flat array. Each node has a
common header that describes what type the node is, and whether the node is
allocated. Both node types are the same size, so there is no internal
fragmentation of the array.
##### Inodes
Each blob in the filesystem has a corresponding Inode, which describes where the
blob's data starts and some other metadata about the blob.
![Layout of a BlobFS Inode](images/blobfs-inode-layout.svg "inode_layout")
Figure 4: Layout of a BlobFS Inode.
For small blobs, the Inode may be the only node necessary to describe where the
blob is on disk. In this case `extent_count` is one, `next_node` must not be
used, and `inline_extent` describes the blob's single extent.
Larger blobs will likely occupy multiple extents, especially on a fragmented
BlobFS image. In this case, the first extent of the blob is stored in
`inline_extent`, and all subsequent extents are stored in a linked list of
ExtentContainers starting at `next_node.`
![Format of an Extent](images/blobfs-extent-format.svg "extent_format")
Figure 5: Format of an Extent (occupying 64 bits). This format is used both in
Inodes and ExtentContainers.
Note that this representation of extents implies that an extent can have at most
2\*\*16 blocks in it (the maximum value of Extent Size).
##### ExtentContainers
An ExtentContainer holds references to several (up to 6) extents, which store
some of the contents of a blob.
The extents in an ExtentContainer are logically contiguous (i.e. the logical
addressable chunk of the blob stored in `extents[0]` is before `extents[1]`) and are
filled in order. If `next_node` is set, then the ExtentContainer must be full.
![Layout of a BlobFS ExtentContainer](images/blobfs-extentcontainer-layout.svg "extentcontainer_layout")
Figure 6: Layout of a BlobFS ExtentContainer.
##### Properties of the node linked-list
A blob's extents are held in a linked-list of a single Inode (which holds the
first extent) and zero or more ExtentContainers (each of which holds up to 6
extents).
This linked list has the following properties. Violating any of these properties
results in blobfs treating the blob as corrupted.
* Extents are logically contiguous:
* If Node A precedes Node B in the list, then all extents in Node A have
lower logical offsets into the blob's contents.
* Within a given ExtentContainer, for extents 𝑥 and 𝑦, if 𝑥 < 𝑦, then
extent 𝑥 has a lower logical offset into the blob's contents than extent
𝑦.
* Nodes are packed before a new node is linked. That is, if a Node has a
non-null `next_node`, then it must be full of extents (*extent for Inodes
and 6 extents for ExtentContainers).
* The total number of extents in the linked-list must equal to the Inode's
`extent_count`.
* The sum of the size of all extents in the linked-list must equal to the
Inode's `block_count`.
* The end of the list is determined based on the `extent_count` in the Inode
being satisfied. `next_node` in the final node should not be used.
##### Example Node layouts
This section contains some examples of different ways a blob's Nodes may be
formatted.
* [Example: Single-extent blob](#example-single-extent-blob)
* [Example: Multiple-extent blob](#example-multiple-extent-blob)
###### Example: Single-extent blob {: #example-single-extent-blob }
![Example: Single-extent blob](images/blobfs-example-node-1.svg "node_example_1")
Figure 7: Node layout for a blob stored in a single extent
###### Example: Multiple-extent blob {: #example-multiple-extent-blob }
![Example: Multiple-extent blob](images/blobfs-example-node-2.svg "node_example_2")
Figure 8: Node layout for a blob stored in several extents. Note that a blob's
extents may be scattered throughout the disk.
##### Blob fragmentation
A newly created BlobFS image has all of its data blocks free. Extents of
arbitrary size can easily be found, and blobs tend to be stored in a single
large extent (or a few large extents).
Over time, as blobs are allocated and deallocated, the block map will become
**fragmented** into many smaller extents. Newly created blobs will have to be
stored in multiple smaller extents.
![A fragmented block map](images/blobfs-fragmentation.svg "fragmentation")
Figure 9: A fragmented block map. While there are plenty of free blocks, there
are few large extents available.
Fragmentation is undesirable for several reasons:
* **Slower Reads**: Reading a fragmented blob requires chasing pointers in the
Node Map. This affects both sequential reads and random-access reads
* **Slower Creation and Deletion:** Creating a blob requires finding free
extents for it; this takes longer if many small extents must be found.
Similarly, deleting a fragmented blob requires chasing down and freeing many
extents.
* **Metadata Overhead:** Storing fragmented blobs requires more nodes. There
are a finite number of nodes in the Node Map, which can be exhausted,
preventing blobs from being created.
Currently BlobFS does not perform defragmentation.
#### Journal
TODO
#### Data blocks
Finally, the actual contents of the blobs must be stored somewhere. The
remaining storage blocks in the BlobFS image are designated for this purpose.
Each blob is allocated enough extents to contain all of its data, as well as a
number of data blocks reserved for storing verification metadata of the blob.
This metadata is always stored in the first blocks of the blob. Metadata is
padded so that the actual data always starts at a block-aligned address.
This verification metadata is called a **Merkle Tree**, a data structure that
uses cryptographic hashes to guarantee the integrity of the blob's contents.
##### Merkle tree
A blob's Merkle Tree is constructed as follows (for more details, see
[Fuchsia Merkle Roots](/docs/concepts/packages/merkleroot.md)):
* Each leaf node is a sha256 hash of a single block's worth of data.
* Each non-leaf node is a sha256 hash combining its children's hashes.
* The tree terminates at the level where there is a single sha256 hash.
The hash value at the top-most node is known as the **Merkle Root** of the blob.
This value is used as the name of the blob.
![A simplified example Merkle Tree](images/blobfs-example-merkle.svg "example_merkle")
Figure 10: A simplified example Merkle Tree. Note that in practice more
information is included in each hash value (such as the block offset and
length), and each non-leaf node is significantly wider (in particular, each
non-leaf node can contain up to 8192 / 32 == 256 children).
### Implementation of BlobFS
Like other Fuchsia filesystems, BlobFS is implemented as a userspace process
that serves clients through a FIDL interface.
<!--
#### Startup and initialization
TODO
#### Blob lifecycle
##### Creation
TODO
##### Deletion
TODO
## BlobFS and Fuchsia
TODO: Finish this section describing Blobfs' role in the Fuchsia system and its
relationship to other components, such as pkgfs.
-->