BlobFS is a content-addressable filesystem optimized for write-once, read-often files, such as binaries and libraries. On Fuchsia, BlobFS is the storage system used for all software packages.
When mounted, BlobFS presents a single logical directory containing all files (a.k.a. blobs):
blob/
├── 00aeb9b5652a4adbf630d04a6ca22668f9c8469746f3f175687b3c0ff6699a49
├── 01289d3e1d2cdbc7d1b4977210877c5bbdffdbad463d992badc149152962a205
├── 018951bcf92091fd5d294cbd1f3a48d6ca59be7759587f28077b2eb754b437c0
└── 01bad8536a7aee498ffd323f53e06232b8a81edd507ac2a95bd0e819c4983138
Files in BlobFS are:
These properties of blobs make BlobFS a key component of Fuchsia's security posture, ensuring that software packages' contents can be verified before they are executed.
BlobFS stores each blob in a linked list of non-adjacent extents (each a contiguous range of data blocks). Each blob has an associated Inode, which describes where the blob's data starts on disk and holds other metadata about the blob.
BlobFS divides a disk (or a partition thereof) into five chunks:
Figure 1: BlobFS disk layout
The superblock is the first block in a BlobFS-formatted partition. It describes the location and size of the other chunks of the filesystem, as well as other filesystem-level metadata.
When a BlobFS-formatted filesystem is mounted, this block is mapped into memory and parsed to determine where the rest of the filesystem lives. The block is modified whenever a new blob is created, and (for FVM-managed BlobFS instances) whenever the size of the BlobFS filesystem shrinks or grows.
Figure 2: BlobFS superblock
When BlobFS is managed by FVM, the superblock contains some additional metadata describing the FVM slices that contain the BlobFS filesystem. These fields (yellow in the above diagram) are ignored for non-FVM, fixed-size BlobFS images.
The block map is a simple bitmap that marks each data block as allocated or not. This map is used during block allocation to find contiguous ranges of free blocks, known as extents, in which to store blob contents.
Figure 3: An example block-map with several free extents of varying size.
When a BlobFS image is mounted, the block map is mapped into memory where it can be read by the block allocator. The block map is written back to disk whenever a block is allocated (during blob creation) or deallocated (during blob deletion).
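As a rough illustration of how a block allocator can read such a bitmap, the sketch below scans a list of allocated/free flags and reports every maximal run of free blocks as a (start, length) extent. The function name `free_extents` and the list-of-booleans representation are hypothetical and chosen for readability; this is not BlobFS's actual allocator.

```python
from typing import Iterator, List, Tuple


def free_extents(block_map: List[bool]) -> Iterator[Tuple[int, int]]:
    """Yield (start_block, length) for every maximal run of free blocks.

    block_map[i] is True when data block i is allocated.
    """
    run_start = None
    for index, allocated in enumerate(block_map):
        if not allocated and run_start is None:
            run_start = index                    # a free run begins here
        elif allocated and run_start is not None:
            yield run_start, index - run_start   # the free run just ended
            run_start = None
    if run_start is not None:                    # run reaches the end of the map
        yield run_start, len(block_map) - run_start


# 1 = allocated, 0 = free
example_map = [bool(bit) for bit in [1, 0, 0, 1, 1, 0, 0, 0, 1, 0]]
extents = list(free_extents(example_map))
```

Here `extents` holds three free extents: two blocks starting at block 1, three blocks starting at block 5, and a single block at block 9.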
The node map is an array of all nodes on the filesystem. Nodes come in two variations: Inodes and ExtentContainers.
Nodes of both types are stored together in a single flat array. Each node has a common header that describes what type the node is, and whether the node is allocated. Both node types are the same size, so there is no internal fragmentation of the array.
Each blob in the filesystem has a corresponding Inode, which describes where the blob's data starts and some other metadata about the blob.
Figure 4: Layout of a BlobFS Inode.
For small blobs, the Inode may be the only node necessary to describe where the blob is on disk. In this case extent_count is one, next_node must not be used, and inline_extent describes the blob's single extent.
Larger blobs will likely occupy multiple extents, especially on a fragmented BlobFS image. In this case, the first extent of the blob is stored in inline_extent, and all subsequent extents are stored in a linked list of ExtentContainers starting at next_node.
Figure 5: Format of an Extent (occupying 64 bits). This format is used both in Inodes and ExtentContainers.
Note that this representation of extents implies that an extent can have at most 2**16 - 1 blocks in it (the maximum value of the 16-bit Extent Size field).
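To make the size limit concrete, here is a minimal sketch of packing an extent into a single 64-bit value. It assumes a 48-bit start block in the low bits and a 16-bit length in the high bits; that bit split and the helper names `pack_extent` and `unpack_extent` are assumptions made for illustration, since the figure defining the exact layout is not reproduced here.

```python
from typing import Tuple

START_BITS = 48    # assumed width of the start-block field
LENGTH_BITS = 16   # assumed width of the Extent Size field
START_MASK = (1 << START_BITS) - 1
MAX_LENGTH = (1 << LENGTH_BITS) - 1   # largest value a 16-bit field can hold


def pack_extent(start: int, length: int) -> int:
    """Pack (start block, length in blocks) into one 64-bit integer."""
    if not 0 <= start <= START_MASK:
        raise ValueError("start block does not fit in the start field")
    if not 0 <= length <= MAX_LENGTH:
        raise ValueError("extent length does not fit in the size field")
    return (length << START_BITS) | start


def unpack_extent(raw: int) -> Tuple[int, int]:
    """Recover (start block, length in blocks) from a packed extent."""
    return raw & START_MASK, (raw >> START_BITS) & MAX_LENGTH


packed = pack_extent(start=1000, length=12)
start, length = unpack_extent(packed)
```

Because the length field is bounded, a blob larger than one maximum-size extent needs several extents even on an unfragmented image.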
An ExtentContainer holds references to several (up to 6) extents, which store some of the contents of a blob.
The extents in an ExtentContainer are logically contiguous (i.e., the logical addressable chunk of the blob stored in extents[0] is before extents[1]) and are filled in order. If next_node is set, then the ExtentContainer must be full.
Figure 6: Layout of a BlobFS ExtentContainer.
A blob's extents are held in a linked-list of a single Inode (which holds the first extent) and zero or more ExtentContainers (each of which holds up to 6 extents).
This linked list has the following properties. Violating any of these properties results in BlobFS treating the blob as corrupted.
- If a node sets next_node, then it must be full of extents (1 extent for Inodes and 6 extents for ExtentContainers).
- The total number of extents in the list must match the Inode's extent_count.
- The total number of blocks covered by the extents in the list must match the Inode's block_count.
- The end of the list is determined by extent_count in the Inode being satisfied. next_node in the final node should not be used.

This section contains some examples of different ways a blob's Nodes may be formatted.
Figure 7: Node layout for a blob stored in a single extent
Figure 8: Node layout for a blob stored in several extents. Note that a blob's extents may be scattered throughout the disk.
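The node list and its consistency rules can be sketched as a small traversal routine. The classes below are a simplified in-memory model with hypothetical names; they keep only the fields mentioned in this document (extent_count, block_count, inline_extent, extents, next_node) and do not reproduce the on-disk layout or BlobFS's real code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Extent = Tuple[int, int]   # (start_block, length_in_blocks); hypothetical model
CONTAINER_CAPACITY = 6     # extents per ExtentContainer, as stated in the text


@dataclass
class Inode:
    extent_count: int
    block_count: int
    inline_extent: Optional[Extent] = None
    next_node: Optional[int] = None


@dataclass
class ExtentContainer:
    extents: List[Extent] = field(default_factory=list)
    next_node: Optional[int] = None


Node = Union[Inode, ExtentContainer]


class CorruptBlobError(Exception):
    """Raised when a node list violates one of the expected properties."""


def collect_extents(node_map: List[Node], inode_index: int) -> List[Extent]:
    """Walk the list starting at an Inode and return its extents in order."""
    inode = node_map[inode_index]
    if not isinstance(inode, Inode):
        raise CorruptBlobError("list must start at an Inode")

    extents: List[Extent] = []
    if inode.extent_count > 0:
        if inode.inline_extent is None:
            raise CorruptBlobError("Inode is missing its inline extent")
        extents.append(inode.inline_extent)

    # next_node is only followed while extent_count is not yet satisfied,
    # so whatever value sits in the final node is never consulted.
    next_index = inode.next_node
    while len(extents) < inode.extent_count:
        if next_index is None or not 0 <= next_index < len(node_map):
            raise CorruptBlobError("list ends before extent_count is satisfied")
        node = node_map[next_index]
        if not isinstance(node, ExtentContainer):
            raise CorruptBlobError("expected an ExtentContainer")
        held = len(node.extents)
        remaining = inode.extent_count - len(extents)
        if held == 0 or held > CONTAINER_CAPACITY or held > remaining:
            raise CorruptBlobError("unexpected number of extents in container")
        if held < remaining and held != CONTAINER_CAPACITY:
            raise CorruptBlobError("container must be full before chaining")
        extents.extend(node.extents)
        next_index = node.next_node

    if sum(length for _, length in extents) != inode.block_count:
        raise CorruptBlobError("extent sizes do not add up to block_count")
    return extents


# A blob with 8 extents: 1 inline, 6 in a full container, 1 in a final container.
node_map: List[Node] = [
    Inode(extent_count=8, block_count=19, inline_extent=(10, 4), next_node=2),
    ExtentContainer(extents=[(500, 3)]),
    ExtentContainer(extents=[(100 + 10 * i, 2) for i in range(6)], next_node=1),
]
blob_extents = collect_extents(node_map, 0)
```

Note that the nodes of one blob need not be adjacent in the node map: in this example the Inode at index 0 chains to index 2 and then to index 1.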
A newly created BlobFS image has all of its data blocks free. Extents of arbitrary size can easily be found, and blobs tend to be stored in a single large extent (or a few large extents).
Over time, as blobs are allocated and deallocated, the block map will become fragmented into many smaller extents. Newly created blobs will have to be stored in multiple smaller extents.
Figure 9: A fragmented block map. While there are plenty of free blocks, there are few large extents available.
Fragmentation is undesirable for several reasons:
Currently BlobFS does not perform defragmentation.
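The effect of fragmentation on a new blob can be shown with a naive first-fit allocation over a list of free extents. This is a hypothetical sketch (the function `allocate` and its greedy strategy are assumptions for illustration), not a description of how BlobFS actually chooses extents.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # (start_block, length_in_blocks)


def allocate(free: List[Extent], blocks_needed: int) -> List[Extent]:
    """Greedily take blocks from free extents, in order, until satisfied."""
    chosen: List[Extent] = []
    remaining = blocks_needed
    for start, length in free:
        if remaining == 0:
            break
        take = min(length, remaining)   # use part of an extent if it is larger
        chosen.append((start, take))
        remaining -= take
    if remaining:
        raise RuntimeError("not enough free blocks")
    return chosen


# Fresh image: one large free extent, so the blob fits in a single extent.
fresh = allocate([(0, 100)], blocks_needed=10)

# Fragmented image: the same 10 blocks are spread over four extents.
fragmented = allocate([(1, 2), (5, 3), (9, 1), (20, 8)], blocks_needed=10)
```

In the fragmented case the blob needs four extents, so its Inode alone is no longer enough and at least one ExtentContainer is required to describe it.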
TODO
Finally, the actual contents of the blobs must be stored somewhere. The remaining storage blocks in the BlobFS image are designated for this purpose.
Each blob is allocated enough extents to contain all of its data, as well as a number of data blocks reserved for storing verification metadata of the blob. This metadata is always stored in the first blocks of the blob. Metadata is padded so that the actual data always starts at a block-aligned address.
This verification metadata is called a Merkle Tree, a data structure that uses cryptographic hashes to guarantee the integrity of the blob's contents.
A blob's Merkle Tree is constructed as follows (for more details, see Fuchsia Merkle Roots):
The hash value at the top-most node is known as the Merkle Root of the blob. This value is used as the name of the blob.
Figure 10: A simplified example Merkle Tree. Note that in practice more information is included in each hash value (such as the block offset and length), and each non-leaf node is significantly wider (in particular, each non-leaf node can contain up to 8192 / 32 == 256 children).
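The general shape of such a tree can be sketched in a few lines. The version below is deliberately simplified in the same way as the figure: it assumes SHA-256 as the hash function and 8192-byte nodes, and it omits the extra information (such as block offset and length) that real hashes include. As a result it does not reproduce actual BlobFS blob names; the function name `simplified_merkle_root` is hypothetical.

```python
import hashlib

BLOCK_SIZE = 8192                   # bytes covered by each node
HASH_SIZE = 32                      # digest size of SHA-256 (assumed hash)
FAN_OUT = BLOCK_SIZE // HASH_SIZE   # 256 child hashes fit in one node


def simplified_merkle_root(data: bytes) -> str:
    """Return the hex root of a simplified Merkle tree over data."""
    # Leaf level: one hash per BLOCK_SIZE chunk (an empty blob gets one leaf).
    chunks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    level = [hashlib.sha256(chunk).digest() for chunk in chunks or [b""]]

    # Upper levels: hash groups of up to FAN_OUT child hashes until one remains.
    while len(level) > 1:
        level = [
            hashlib.sha256(b"".join(level[i:i + FAN_OUT])).digest()
            for i in range(0, len(level), FAN_OUT)
        ]
    return level[0].hex()


blob = bytes(range(256)) * 100          # 25,600 bytes, i.e. four leaf chunks
root = simplified_merkle_root(blob)
tampered_root = simplified_merkle_root(blob[:-1] + b"\x00")
```

Changing any byte of the blob changes a leaf hash and therefore the root, which is why naming a blob by its root lets its contents be verified when it is read.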
Like other Fuchsia filesystems, BlobFS is implemented as a userspace process that serves clients through a FIDL interface.