{% set rfcid = "RFC-0136" %} {% include "docs/contribute/governance/rfcs/_common/_rfc_header.md" %}
Fxfs is a new Fuchsia filesystem that has been under development within the storage team. This proposal outlines the motivation and higher level design decisions for Fxfs.
Fuchsia's existing mutable filesystem, Minfs, has some limitations that mean it will be difficult to add the features we want to add going forward and achieve our performance goals. Fxfs is being developed as a candidate for replacing Minfs to achieve these goals.
There are some features that will almost certainly be required, such as:

* inotify support
* flock support

There are other features that we might eventually want:
We also have a goal of attaining filesystem performance which is comparable to other operating systems. Of all of the considerations, this is likely to be the most difficult to achieve and where most effort will be spent.
There are basic filesystem goals that all filesystems need to consider:
Much of the work required to achieve these goals will be covered by work elsewhere in the storage stack and in the kernel and will not be covered here; this RFC focuses on just the filesystem implementation, Fxfs.
While Fxfs is well suited to run on flash storage devices, there is nothing in the design that precludes its use on spinning storage devices, and we expect that Fxfs would run well on spinning storage devices as well.
Facilitator:
Reviewers: abarth@google.com, brettw@google.com, palmer@google.com
Consulted:
Fuchsia's security team was consulted for questions related to encryption.
Socialization:
A detailed design review of Fxfs was held within the Local Storage team.
An overview of log structured merge (LSM) trees can be found on Wikipedia. Fxfs uses LSM trees to provide a persistent key-value data structure that is used within Fxfs.
LSM trees offer some appealing properties:
There are some downsides:
It is worth noting that even with the choice of LSM trees, there is still the option of using a mutable persistent layer format if we find that a hybrid approach makes sense.
The API to Fxfs's LSM tree implementation requires a merge function which dictates how records are to be merged.
Fxfs's in-memory layer is represented by an efficient, concurrency-friendly data structure (currently a skip-list).
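To make the merge-function idea concrete, here is a minimal sketch in Rust of merging a mutable in-memory layer over one persistent layer. The record type, the tombstone convention, and the "newest wins" policy are all illustrative assumptions, not Fxfs's actual types; a `BTreeMap` stands in for both the skip-list and the persistent layer.

```rust
use std::collections::BTreeMap;

// Hypothetical record type: a key mapped to an optional value, where
// `None` acts as a tombstone (a deletion marker).
type Layer = BTreeMap<String, Option<u64>>;

// A simple merge policy: the newer layer's record wins outright. Real merge
// functions can be richer (e.g. combining adjacent extent records).
fn merge(newer: Option<u64>, _older: Option<u64>) -> Option<u64> {
    newer
}

// Merge the in-memory (mutable) layer over a persistent layer, dropping
// tombstones from the final view, as an LSM-tree lookup would.
fn merged_view(mutable: &Layer, persistent: &Layer) -> BTreeMap<String, u64> {
    let mut out = BTreeMap::new();
    for key in mutable.keys().chain(persistent.keys()) {
        let value = match (mutable.get(key), persistent.get(key)) {
            (Some(&n), Some(&o)) => merge(n, o),
            (Some(&n), None) => n,
            (None, Some(&o)) => o,
            (None, None) => unreachable!(),
        };
        if let Some(v) = value {
            out.insert(key.clone(), v);
        }
    }
    out
}

fn main() {
    let mut mem = Layer::new();
    mem.insert("a".to_string(), Some(1)); // overwrites the persistent "a"
    mem.insert("b".to_string(), None);    // tombstone deletes persistent "b"
    let mut disk = Layer::new();
    disk.insert("a".to_string(), Some(10));
    disk.insert("b".to_string(), Some(20));
    disk.insert("c".to_string(), Some(30));
    let view = merged_view(&mem, &disk);
    assert_eq!(view.get("a"), Some(&1));
    assert_eq!(view.get("b"), None);
    assert_eq!(view.get("c"), Some(&30));
}
```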
Fxfs is comprised of a hierarchy of object stores. Object stores provide a file-like API keyed by object ID, a 64-bit unsigned integer. Simple namespace features (i.e. directory support or similar) are supported, but not mandatory; it is possible to use an object store and just refer to objects using object IDs. Object stores keep their metadata in a persistent key-value data structure (LSM trees). The metadata comprises typical file information such as the size of the object and extents which map to device offsets.
Fxfs's LSM trees use object stores to store their persistent layers which presents a cyclic dependency: object stores use LSM trees to store their metadata and then LSM trees use object stores to store their persistent layers. To resolve this, object stores are arranged in a hierarchy:
Root Parent Store
The root parent object store is backed by an in-memory only LSM tree, thus has no dependencies on object stores.
The root parent store only contains the layer files for the root store and the journal file (discussed later). Note that the LSM tree only contains metadata (e.g. extent information) and it is only that information that is permanently resident in memory.
Root Store
The root store uses the root parent store to store its persistent layer files.
The root store contains all other objects to support the filesystem, such as objects used by the allocator and the super-blocks.
Child Stores
Child stores use the root store to store their persistent layer files. They store user data. There can be many child stores, but they can only use the root store as their parent, which limits the hierarchy to three levels. Note that we could support deeper hierarchies of object stores if needed.
Objects are identified by a 64-bit unsigned integer that is unique within the store that they are stored in, so to uniquely identify an object within the filesystem, you need to specify the store.
Zero is reserved and used to indicate an invalid object ID.
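Since object IDs are only unique within their store, a fully qualified identifier pairs the store with the 64-bit object ID. The following Rust sketch illustrates this; the type and field names are assumptions for illustration, not Fxfs's actual definitions.

```rust
// Zero is reserved as the invalid object ID.
const INVALID_OBJECT_ID: u64 = 0;

// Illustrative: a filesystem-wide identifier must name both the store and
// the object within it. Stores are themselves objects in their parent store.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct GlobalObjectId {
    store_object_id: u64,
    object_id: u64,
}

impl GlobalObjectId {
    fn is_valid(&self) -> bool {
        self.object_id != INVALID_OBJECT_ID
    }
}

fn main() {
    let id = GlobalObjectId { store_object_id: 3, object_id: 17 };
    assert!(id.is_valid());
    let bad = GlobalObjectId { store_object_id: 3, object_id: INVALID_OBJECT_ID };
    assert!(!bad.is_valid());
}
```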
Objects support multiple attributes, indexed by a 64-bit attribute ID which is unique within the object. Attributes have a file-like API — it is possible to read and write arbitrary amounts at arbitrary offsets within an attribute. Initially, Fxfs will only expose access to a single attribute for objects (i.e. the contents of the file). Later, additional attributes might be used to support extended attributes. Attributes can be sparse.
There are two modes available for writing to objects: a copy-on-write mode (the same as Minfs) and an overwrite mode. With the copy-on-write mode, any writes to blocks that are already allocated will involve writing to newly allocated blocks, updating metadata to point at the new blocks and then freeing the old blocks. The overwrite mode will overwrite existing blocks and thus will not require updating the same metadata. Most objects will be written using the copy-on-write mode and this will be the only mode initially exposed to external users.
Fxfs's journal serves two purposes:
All mutations to in-memory data structures must go through the journal. Unlike Minfs' journal, the journal is a logical journal rather than a physical journal. This substantially reduces the amount of space taken by the journal — an insertion record can be represented by a few bytes, whereas changes to data structures within Minfs might involve several blocks.
Note that Fxfs' journal (like Minfs') does not include data. However, Fxfs' journal does include checksums on data written in COW mode, and Fxfs verifies these checksums during journal replay for data written since the last known device flush, which means that Fxfs can detect a torn COW write and will revert the file contents to its last completely written version. Minfs does not have this data-checksumming feature.
The journal exists as a single ever-growing file. Mutations are streamed to the file. As trees are compacted, extents at the beginning of the journal file can be freed which means the journal will have a sparse prefix. All objects have 64-bit sizes and the write rates we need to support will mean wrapping will not be a concern in our lifetime. To counter the fact that devices are free to reorder writes, a checksum will be placed at the end of every 4 KiB chunk. At replay time, replay will end at the first chunk in the journal that has a checksum mismatch. The checksum of one chunk is used as the seed for the next chunk which should guard against inadvertently accepting blocks intended for a different offset or that are remnants from a previous instance of Fxfs.
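The chained-checksum scheme above can be sketched as follows: each chunk's checksum is computed from a seed that is the previous chunk's checksum, and replay accepts chunks only up to the first mismatch. The toy checksum below is purely illustrative (Fxfs's actual checksum algorithm is not specified here), and small chunks stand in for the 4 KiB chunks.

```rust
// Toy checksum for illustration only; not the algorithm Fxfs uses.
fn checksum(seed: u64, data: &[u8]) -> u64 {
    data.iter().fold(seed, |acc, &b| acc.rotate_left(7) ^ (b as u64))
}

// Each journal chunk carries its payload and the checksum stored at its end.
// Returns how many chunks replay would accept before the first mismatch.
fn valid_prefix_chunks(journal: &[(Vec<u8>, u64)], initial_seed: u64) -> usize {
    let mut seed = initial_seed;
    for (i, (data, stored)) in journal.iter().enumerate() {
        let sum = checksum(seed, data);
        if sum != *stored {
            return i; // replay ends at the first checksum mismatch
        }
        seed = sum; // this chunk's checksum seeds the next chunk
    }
    journal.len()
}

fn main() {
    let seed0 = 42;
    let c1 = vec![1u8; 16];
    let s1 = checksum(seed0, &c1);
    let c2 = vec![2u8; 16];
    let s2 = checksum(s1, &c2);
    let c3 = vec![3u8; 16];
    let s3_corrupt = checksum(s2, &c3) ^ 1; // a torn or stale write
    let journal = vec![(c1, s1), (c2, s2), (c3, s3_corrupt)];
    assert_eq!(valid_prefix_chunks(&journal, seed0), 2);
}
```

Because the seed chains from chunk to chunk, a chunk written by a previous filesystem instance (or intended for a different offset) will not checksum correctly even if the chunk itself is internally consistent.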
When replaying the journal, there is the challenge of knowing the extents for the journal file that aren't covered by the super-block. To solve this issue, the journal will use the overwrite mode of operation and will preallocate extents. During replay, any extents that are in the journal stream are used to read later offsets in the journal file.
Fxfs's super-blocks are more than just blocks, but they serve similar purposes to conventional super-blocks, so the same name is used. The super-blocks exist as objects in the root object store. Unlike other objects, their first extent is placed at a fixed location on the device (all other objects have no restrictions on location). The first part of the super-block contains a serialized structure that contains similar information to that found on other filesystems (e.g. block sizes, object numbers for important objects, etc.). After that, the super-block contains all the records contained in the root parent object store (at the time the super-block was written). The super-block is written in the same way that the journal is written: the end of each 4 KiB chunk has a checksum.
There are two super-blocks. During mount, the super-block with the latest sequence number is chosen.
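The selection between the two super-block copies can be sketched as below. The struct fields and the handling of an unreadable copy are illustrative assumptions, not Fxfs's actual behaviour.

```rust
// Illustrative super-block header; real contents include block sizes,
// object IDs of important objects, etc.
#[derive(Clone, Copy, Debug, PartialEq)]
struct SuperBlock {
    sequence: u64,
}

// Pick the copy with the latest sequence number; tolerate one copy
// failing to read (assumption: one valid copy is sufficient to mount).
fn choose_super_block(a: Option<SuperBlock>, b: Option<SuperBlock>) -> Option<SuperBlock> {
    match (a, b) {
        (Some(x), Some(y)) => Some(if x.sequence >= y.sequence { x } else { y }),
        (x, y) => x.or(y),
    }
}

fn main() {
    let older = Some(SuperBlock { sequence: 5 });
    let newer = Some(SuperBlock { sequence: 7 });
    assert_eq!(choose_super_block(older, newer), newer);
    assert_eq!(choose_super_block(None, newer), newer);
    assert_eq!(choose_super_block(None, None), None);
}
```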
Mounting the filesystem involves:
The allocator uses an LSM tree to store reference-counted extents (this can effectively be seen as persistently storing `<device-range> => <reference-count>`). Initially, Fxfs will only support reference counts of one, but in future it is possible that Fxfs will support file clones or extent sharing, which will lead to reference counts of more than one.
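Viewed that way, the allocator's records behave like a map from device range to reference count, as in this simplified Rust sketch. It assumes non-overlapping ranges for brevity; handling the splitting and coalescing of overlapping ranges is where the real merge function earns its keep.

```rust
use std::collections::BTreeMap;

// Illustrative: (start, end) byte range on the device => reference count.
type Allocations = BTreeMap<(u64, u64), u64>;

// Bump the reference count for a range, inserting it if new.
fn reference(allocs: &mut Allocations, range: (u64, u64)) {
    *allocs.entry(range).or_insert(0) += 1;
}

// Drop a reference; the extent is freed when the count reaches zero.
fn dereference(allocs: &mut Allocations, range: (u64, u64)) {
    if let Some(count) = allocs.get_mut(&range) {
        *count -= 1;
        if *count == 0 {
            allocs.remove(&range);
        }
    }
}

fn main() {
    let mut allocs = Allocations::new();
    reference(&mut allocs, (0, 4096));
    reference(&mut allocs, (0, 4096)); // e.g. a future file clone
    assert_eq!(allocs.get(&(0, 4096)), Some(&2));
    dereference(&mut allocs, (0, 4096));
    assert_eq!(allocs.get(&(0, 4096)), Some(&1));
    dereference(&mut allocs, (0, 4096));
    assert_eq!(allocs.get(&(0, 4096)), None); // extent is free again
}
```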
Object stores support directories by storing records of the form `<object-id><child-name> => <child-object-id><child-type>`. This information is recorded in the same tree that is used for other object metadata.
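A minimal sketch of that record shape in Rust follows. The key pairs the directory's object ID with the child name; the value carries the child's object ID and type. All names here are illustrative, and a `BTreeMap` stands in for the tree so that all of a directory's entries sort together by key.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq, Clone, Copy)]
enum ObjectType {
    File,
    Directory,
}

// Illustrative: (directory object ID, child name) => (child ID, child type),
// living in the same tree as other object metadata records.
type Tree = BTreeMap<(u64, String), (u64, ObjectType)>;

fn insert_child(tree: &mut Tree, dir: u64, name: &str, child: u64, kind: ObjectType) {
    tree.insert((dir, name.to_string()), (child, kind));
}

fn lookup(tree: &Tree, dir: u64, name: &str) -> Option<(u64, ObjectType)> {
    tree.get(&(dir, name.to_string())).copied()
}

fn main() {
    let mut tree = Tree::new();
    insert_child(&mut tree, 5, "hello.txt", 17, ObjectType::File);
    insert_child(&mut tree, 5, "subdir", 18, ObjectType::Directory);
    assert_eq!(lookup(&tree, 5, "hello.txt"), Some((17, ObjectType::File)));
    assert_eq!(lookup(&tree, 5, "missing"), None);
}
```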
Initially compactions will be driven by the need to release space in the journal. A reasonable policy will be chosen initially and will evolve as required; whatever we do is not tied to the format so this is easily changeable.
The specific encoding which Fxfs uses for filenames is left unspecified for now, but we note that the encoding choice will be minimally compatible with UTF-8, which is the encoding used for FIDL strings and Rust strings. Fxfs is case-sensitive and performs no normalisation. This may need to change if compatibility issues arise.
* Maximum file size: 2^63 bytes (the maximum value of a 64-bit signed integer). Internally, Fxfs can represent file sizes up to 2^64 bytes, but the existing filesystem API limits this (e.g. the `off_t` type, which is signed).
* Timestamps: 64-bit seconds and 32-bit nanoseconds, which can represent UNIX epoch timestamps far into the future.

Fxfs has a basic implementation of fsck, which verifies allocations and object reference counts. More work will be done in the near future to expand the coverage of fsck.
Some limitations of Minfs guide our design choices for Fxfs:
Prototyping of Fxfs has been underway for some time; it already exists in-tree (//src/storage/fxfs), and options exist for using it instead of Minfs. Minfs will remain the default mutable filesystem for the time being. The decision to change the default is outside the scope of this RFC.
Fxfs uses Rust. We believe Rust offers some compelling benefits:
One of the motivating reasons for pursuing Fxfs is to improve performance of our mutable filesystems. The benchmarks we are using to evaluate performance are being developed independently. The precise nature of these benchmarks is not within the scope of this RFC.
Fxfs will initially exist as an alternative to Minfs with limited support (not suitable for production) for migrating from Minfs.
We may consider supporting more robust migrations in future.
Initially Fxfs can be used with Zxcrypt for volume-level encryption (just as Minfs is), so there should be little change in this area. Fxfs is developed in Rust which should, in theory, reduce some security risk.
Fxfs built-in encryption will require a thorough consideration of security aspects, which is outside the scope of this RFC.
N/A
Fxfs uses the same comprehensive filesystem test suite that Minfs uses. Where appropriate, additional unit and integration tests specific to Fxfs will exist.
Fxfs should be interchangeable with Minfs, so no additional external documentation is necessary; our APIs will not be changing. This RFC itself is the primary artifact documenting the high level design of Fxfs.
Developing a new filesystem is a significant undertaking. Whilst we have managed to implement a prototype that is close to Minfs in features and performance, there is some way to go before Fxfs is production ready.
Before embarking on this project we considered other strategies such as porting an existing open-source filesystem, or using the design from an existing filesystem. Some considerations were:
With that said, we have chosen to support external contributions to the porting of F2fs, with the goal of getting to the point where Fuchsia works well with multiple different filesystems.
LSM trees are not new and are widely used in various applications although examples of their use in smaller scale filesystems is limited.
Other aspects of Fxfs design such as a logical journal format and multi-volume support can be found in other filesystems.