| # Filesystem Architecture |
| |
| This document gives a high-level view of the Fuchsia filesystems: their |
| initialization, standard filesystem operations (such as Open, Read, and Write), |
| and the quirks of implementing user-space filesystems. It also describes the |
| VFS-level walk through a namespace, which can be used to communicate with |
| non-storage entities (such as system services). |
| |
| ## Filesystems are Services |
| |
| Unlike filesystems in more common monolithic kernels, Fuchsia’s filesystems live |
| entirely within userspace. They are neither linked nor loaded with the kernel; |
| they are simply userspace processes that implement servers that can appear as |
| filesystems. As a consequence, Fuchsia’s filesystems themselves can be changed |
| with ease -- modifications don’t require recompiling the kernel. |
| |
| ![Filesystem block diagram](images/filesystem.svg "filesystem") |
| |
| Figure 1: Typical filesystem process block diagram. |
| |
| Like other native servers on Fuchsia, the primary mode of interaction with a |
| filesystem server is achieved using the handle primitive rather than system |
| calls. The kernel has no knowledge about files, directories, or filesystems. As |
| a consequence, filesystem clients cannot ask the kernel for “filesystem access” |
| directly. |
| |
| This architecture implies that the interaction with filesystems is limited to |
| the following interface: |
| |
| * The messages sent on communication channels established with the filesystem |
| server. These communication channels may be local for a client-side |
| filesystem, or remote. |
| * The initialization routine (which is expected to be configured heavily on a |
| per-filesystem basis; a networking filesystem would require network access, |
| persistent filesystems may require block device access, in-memory filesystems |
| would only require a mechanism to allocate new temporary pages). |
| |
| As a benefit of this interface, any resources accessible via a channel can make |
| themselves appear like filesystems by implementing the expected protocols for |
| files or directories. For example, components serve their |
| [outgoing directory][glossary.outgoing-directory], which contains their |
| [capabilities][capabilities-overview], in a filesystem-like structure. |
| |
| ## File Lifecycle |
| |
| ### Establishing a Connection |
| |
| To open a file, Fuchsia programs (clients) send RPC requests to filesystem |
| servers using FIDL. |
| |
| FIDL defines the wire-format for transmitting messages and handles between a |
| filesystem client and server. Instead of interacting with a kernel-implemented |
| VFS layer, Fuchsia processes send requests to filesystem services which |
| implement protocols for Files, Directories, and Devices. To send one of these |
| open requests, a Fuchsia process must transmit an RPC message over an existing |
| handle to a directory; for more detail on this process, refer to the [life of an |
| open document](/docs/concepts/filesystems/life_of_an_open.md). |
| |
| ### Namespaces |
| |
| On Fuchsia, a [namespace](/docs/concepts/process/namespaces.md) is a small filesystem that exists |
| entirely within the client. At the most basic level, the idea of the client |
| saving “/” as root and associating a handle with it is a very primitive |
| namespace. Instead of a typical singular "global" filesystem namespace, Fuchsia |
| processes can be provided an arbitrary directory handle to represent "root", |
| limiting the scope of their namespace. In order to limit this scope, Fuchsia |
| filesystems [intentionally do not allow access to parent directories via |
| dotdot](/docs/concepts/filesystems/dotdot.md). |
| |
| Fuchsia processes may additionally redirect certain path operations to separate |
| filesystem servers. When a client refers to “/bin”, the client may opt to |
| redirect these requests to a local handle representing the “/bin” directory, |
| rather than sending a request directly to the “bin” directory within the “root” |
| directory. Namespaces, like all filesystem constructs, are not visible from the |
| kernel: rather, they are implemented in client-side runtimes (such as |
| [libfdio](/docs/concepts/filesystems/life_of_an_open.md#Fdio)) and are interposed between most client code |
| and the handles to remote filesystems. |
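
The prefix redirection described above can be sketched as a small client-side table. This is a minimal model, not libfdio's implementation: the `Namespace` class, its method names, and the string "handles" are all hypothetical stand-ins, and the longest-prefix matching rule is an assumption made for illustration.

```python
class Namespace:
    """Hypothetical client-side table mapping path prefixes to directory handles."""

    def __init__(self):
        self._bindings = {}  # tuple of path components -> handle

    @staticmethod
    def _split(path):
        parts = tuple(p for p in path.split("/") if p)
        # Mirrors the rule that parent directories are not reachable via "..".
        if ".." in parts:
            raise ValueError("'..' is not supported")
        return parts

    def bind(self, prefix, handle):
        """Associate a handle with a path prefix such as "/" or "/bin"."""
        self._bindings[self._split(prefix)] = handle

    def resolve(self, path):
        """Return (handle, remaining path) using the longest bound prefix."""
        parts = self._split(path)
        for n in range(len(parts), -1, -1):
            handle = self._bindings.get(parts[:n])
            if handle is not None:
                return handle, "/".join(parts[n:]) or "."
        raise LookupError(f"no binding covers {path!r}")


ns = Namespace()
ns.bind("/", "root-handle")   # strings stand in for real directory handles
ns.bind("/bin", "bin-handle")
print(ns.resolve("/bin/ls"))        # request goes to the local "/bin" handle
print(ns.resolve("/data/log.txt"))  # request falls back to the root handle
```

In this model, a process that was never given a binding for "/" simply cannot name anything outside the prefixes it was handed, which is the scoping property the section describes.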
| |
| Since namespaces operate on handles, and most Fuchsia resources and services |
| are accessible through handles, they are extremely powerful concepts. |
| Filesystem objects (such as directories and files), services, devices, |
| packages, and environments (visible by privileged processes) all are usable |
| through handles, and may be composed arbitrarily within a child process. As a |
| result, namespaces allow for customizable resource discovery within |
| applications. The services that one process observes within “/svc” may or may |
| not match what other processes see, and can be restricted or redirected |
| according to application-launching policy. |
| |
| For more detail on the mechanisms and policies applied to restricting process |
| capability, refer to the documentation on |
| [sandboxing](/docs/concepts/process/sandboxing.md). |
| |
| ### Passing Data |
| |
| Once a connection has been established, either to a file, directory, device, |
| or service, subsequent operations are also transmitted using RPC messages. |
| These messages are transmitted on one or more handles, using a wire format that |
| the server validates and understands. |
| |
| In the case of files, directories, devices, and services, these operations use the |
| FIDL protocol. |
| |
| As an example, to seek within a file, a client would send a `Seek` |
| message with the desired position and “whence” within the FIDL message, and the |
| new seek position would be returned. To truncate a file, a `Truncate` |
| message could be sent with the new desired file size, and a status message |
| would be returned. To read a directory, a `ReadDirents` message could be |
| sent, and a list of direntries would be returned. If these requests were sent to |
| a filesystem entity that can’t handle them (like a `ReadDirents` message sent |
| to a text file), an error would be sent, and the operation would not be |
| executed. |
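
The request and response pattern above can be sketched as a small dispatcher. This is a toy model under stated assumptions: messages are plain dictionaries rather than FIDL wire-format messages, and the class names, the `handle_*` methods, and the status strings are hypothetical.

```python
class FileNode:
    """Hypothetical in-memory file that understands seek and truncate."""

    def __init__(self, data=b""):
        self.data = bytearray(data)
        self.position = 0

    def handle_seek(self, offset, whence):
        base = {"start": 0, "current": self.position, "end": len(self.data)}[whence]
        if base + offset < 0:
            return {"status": "INVALID_ARGS"}
        self.position = base + offset
        return {"status": "OK", "position": self.position}

    def handle_truncate(self, length):
        if length <= len(self.data):
            del self.data[length:]
        else:
            self.data.extend(bytes(length - len(self.data)))  # zero-fill growth
        return {"status": "OK"}


class DirectoryNode:
    """Hypothetical directory that understands read_dirents only."""

    def __init__(self, names):
        self.names = list(names)

    def handle_read_dirents(self):
        return {"status": "OK", "dirents": sorted(self.names)}


def dispatch(node, message):
    """Route a message to the node, or reject it without executing anything."""
    handler = getattr(node, "handle_" + message["op"], None)
    if handler is None:
        return {"status": "NOT_SUPPORTED"}
    return handler(**message.get("args", {}))


text_file = FileNode(b"hello world")
directory = DirectoryNode(["b.txt", "a.txt"])
print(dispatch(text_file, {"op": "seek", "args": {"offset": 0, "whence": "end"}}))
print(dispatch(text_file, {"op": "truncate", "args": {"length": 5}}))
print(dispatch(directory, {"op": "read_dirents"}))
print(dispatch(text_file, {"op": "read_dirents"}))  # rejected: not a directory
```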
| |
| ### Memory Mapping |
| |
| For filesystems capable of supporting it, memory mapping files is slightly more |
| complicated. To actually “mmap” part of a file, a client sends a “GetVmo” |
| message, and receives a Virtual Memory Object, or VMO, in response. This object |
| is then typically mapped into the client’s address space using a Virtual Memory |
| Address Region, or VMAR. Transmitting a limited view of the file’s internal |
| “VMO” back to the client requires extra work by the intermediate message |
| passing layers, so they can be aware they’re passing back a server-vendored |
| object handle. |
| |
| By passing back these virtual memory objects, clients can quickly access the |
| internal bytes representing the file without actually undergoing the cost of a |
| round-trip IPC message. This feature makes mmap an attractive option for |
| clients that need high-throughput filesystem interaction. |
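
The saving can be illustrated with a rough analogy in which Python's standard `mmap` module stands in for a VMO. This is only a sketch: a real VMO is a kernel object mapped through a VMAR, whereas here the "server" and "client" share one anonymous mapping inside a single process. The `FileServer` class and its `messages_handled` counter are hypothetical.

```python
import mmap


class FileServer:
    """Hypothetical server whose file contents live in an anonymous mapping."""

    def __init__(self, contents):
        self._vmo = mmap.mmap(-1, len(contents))  # stand-in for a VMO
        self._vmo[:] = contents
        self.messages_handled = 0

    def read(self, offset, length):
        """Message-based read: every call costs one round trip."""
        self.messages_handled += 1
        return bytes(self._vmo[offset:offset + length])

    def get_vmo(self):
        """One round trip that hands the backing memory to the client."""
        self.messages_handled += 1
        return self._vmo


server = FileServer(b"hello, fuchsia")
view = server.get_vmo()          # a single message to obtain the mapping
first = view[0:5]                # no further messages are needed
second = view[7:14]
print(first, second, server.messages_handled)
```

After the single `get_vmo` call, the client reads any number of byte ranges while the server's message counter stays unchanged.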
| |
| ### Other Operations Acting on Paths |
| |
| In addition to the “open” operation, there are a couple of other path-based |
| operations worth discussing: “rename” and “link”. Unlike “open”, these |
| operations actually act on multiple paths at once, rather than a single |
| location. This complicates their usage: if a call to “rename(‘/foo/bar’, |
| ‘baz’)” is made, the filesystem needs to figure out a way to: |
| |
| * Traverse both paths, even when they have distinct starting points (which is the |
| case here; one path starts at root, and the other starts at the CWD) |
| * Open the parent directories of both paths |
| * Operate on both parent directories and trailing pathnames simultaneously |
| |
| To satisfy this behavior, the VFS layer takes advantage of a Zircon concept |
| called “cookies”. These cookies allow client-side operations to store open |
| state on a server, using a handle, and refer to it later using that same |
| handle. Fuchsia filesystems use this ability to refer to one Vnode while |
| acting on the other. |
| |
| These multi-path operations do the following: |
| |
| * Open the parent source vnode (for “/foo/bar”, this means opening “/foo”) |
| * Open the target parent vnode (for “baz”, this means opening the current |
| working directory) and acquire a vnode token using the operation |
| `GetToken`, which is a handle to a filesystem cookie. |
| * Send a “rename” request to the source parent vnode, along with the source |
| and destination paths (“bar” and “baz”), along with the vnode token acquired |
| earlier. This provides a mechanism for the filesystem to safely refer to the |
| destination vnode indirectly -- if the client provides an invalid handle, the |
| kernel will reject the request to access the cookie, and the server can return |
| an error. |
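
The three steps above can be sketched as follows. This is a simplified model, not the VFS implementation: in the real system the token is a kernel handle and the kernel rejects invalid ones, whereas here a random string and a server-side dictionary play that role. The `Server` and `Directory` classes and the status strings are hypothetical.

```python
import secrets


class Server:
    """Hypothetical filesystem server that remembers which token names which directory."""

    def __init__(self):
        self._tokens = {}

    def issue_token(self, directory):
        token = secrets.token_hex(16)  # stand-in for an unforgeable handle
        self._tokens[token] = directory
        return token

    def lookup_token(self, token):
        return self._tokens.get(token)


class Directory:
    def __init__(self, server):
        self._server = server
        self.entries = {}

    def get_token(self):
        """Step 2: obtain a token that refers to this directory."""
        return self._server.issue_token(self)

    def rename(self, src_name, dst_token, dst_name):
        """Step 3: sent to the source parent, naming the destination by token."""
        dst_dir = self._server.lookup_token(dst_token)
        if dst_dir is None:
            return "INVALID_TOKEN"
        if src_name not in self.entries:
            return "NOT_FOUND"
        dst_dir.entries[dst_name] = self.entries.pop(src_name)
        return "OK"


server = Server()
foo = Directory(server)   # step 1: the source parent, "/foo"
cwd = Directory(server)   # the target parent, the current working directory
foo.entries["bar"] = b"contents"

token = cwd.get_token()
print(foo.rename("bar", token, "baz"))          # moves the entry into cwd
print(foo.rename("bar", "forged-token", "baz")) # rejected before touching anything
```

The point of the indirection is that the source directory never receives a path to the destination; it only receives something the server itself issued and can verify.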
| |
| ## Filesystem Lifecycle |
| |
| ### Mounting |
| |
| Fshost is responsible for mounting filesystems on the system. At the time of |
| writing, changes are underway to make filesystems run as components (although |
| fshost will still control mounting of these filesystems). Where possible, |
| static routing will be used. See the fuchsia.fs.startup/Startup protocol. |
| |
| ### Filesystem Management |
| |
| There is a collection of filesystem operations that are considered related to |
| "administration", including "unmounting the current filesystem". These |
| operations are defined by the fs.Admin interface within |
| [admin.fidl](/sdk/fidl/fuchsia.fs/admin.fidl). Filesystems export this service |
| alongside access to the root of the filesystem. |
| |
| ## Current Filesystems |
| |
| Due to the modular nature of Fuchsia’s architecture, it is straightforward to |
| add filesystems to the system. At the moment, a handful of filesystems exist, |
| intending to satisfy a variety of distinct needs. |
| |
| ### MemFS: An in-memory filesystem |
| |
| [MemFS](/src/storage/memfs) |
| is used to implement requests to temporary filesystems like `/tmp`, where files |
| exist entirely in RAM, and are not transmitted to an underlying block device. |
| This filesystem is also currently used for the “bootfs” protocol, where a |
| large, read-only VMO representing a collection of files and directories is |
| unwrapped into user-accessible Vnodes at boot (these files are accessible in |
| `/boot`). |
| |
| ### MinFS: A persistent filesystem |
| |
| [MinFS](/src/storage/minfs/bin/) |
| is a simple, traditional filesystem that is capable of storing files |
| persistently. Like MemFS, it makes extensive use of the VFS layers mentioned |
| earlier, but unlike MemFS, it requires an additional handle to a block device |
| (which is transmitted on startup to a new MinFS process). For ease of use, |
| MinFS also supplies a variety of tools: “mkfs” for formatting, “fsck” for |
| verification, as well as “mount” and “umount” for adding MinFS filesystems to |
| and removing them from a namespace from the command line. |
| |
| ### Blobfs: An immutable, integrity-verifying package storage filesystem |
| |
| [Blobfs](/src/storage/blobfs/bin/) |
| is a simple, flat filesystem optimized for “write-once, then read-only” [signed |
| data](/docs/concepts/packages/merkleroot.md), such as |
| [packages](/docs/concepts/packages/package.md). |
| Other than two small prerequisites (file names that are deterministic, |
| content-addressable hashes of a file’s Merkle Tree root, for integrity |
| verification, and forward knowledge of file size, identified to Blobfs by a |
| call to “ftruncate” before writing a blob to storage), Blobfs appears like a |
| typical filesystem. It can be mounted and unmounted, it appears to contain a |
| single flat directory of hashes, and blobs can be accessed by operations like |
| “open”, “read”, “stat” and “mmap”. |
| |
| ### FVM |
| |
| [Fuchsia Volume Manager](/src/storage/fvm/driver/) |
| is a "logical volume manager" that adds flexibility on top of existing block |
| devices. Current features include the ability to add, remove, extend, and |
| shrink virtual partitions. To make these features possible, FVM internally |
| maintains a virtual-to-physical mapping from (virtual partition, block) to |
| (slice, physical block). To keep maintenance overhead minimal, it allows |
| partitions to shrink and grow in chunks called slices. A slice is a multiple of the |
| native block size. Metadata aside, the rest of the device is divided into |
| slices. Each slice is either free or it belongs to one and only one partition. |
| If a slice belongs to a partition, FVM maintains metadata about which |
| partition is using the slice, and the virtual address of the slice within |
| that partition. |
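
The slice bookkeeping described above can be sketched as follows. This is a minimal in-memory model, not the FVM driver: the `VolumeManager` class, the first-free allocation policy, and the 0-based physical slice numbering are assumptions made for illustration.

```python
class VolumeManager:
    """Hypothetical model of per-slice ownership and address translation."""

    def __init__(self, slice_count, blocks_per_slice):
        self.blocks_per_slice = blocks_per_slice
        # Physical slice index -> (partition, virtual slice), or None when free.
        self._entries = [None] * slice_count
        # Reverse index: (partition, virtual slice) -> physical slice index.
        self._lookup = {}

    def extend(self, partition, vslice):
        """Give the partition a physical slice backing the given virtual slice."""
        if (partition, vslice) in self._lookup:
            raise ValueError("virtual slice is already allocated")
        for pslice, owner in enumerate(self._entries):
            if owner is None:
                self._entries[pslice] = (partition, vslice)
                self._lookup[(partition, vslice)] = pslice
                return pslice
        raise RuntimeError("no free slices")

    def shrink(self, partition, vslice):
        """Return the slice backing the given virtual slice to the free pool."""
        pslice = self._lookup.pop((partition, vslice))
        self._entries[pslice] = None

    def translate(self, partition, vblock):
        """Map (partition, virtual block) to (physical slice, block within slice)."""
        vslice, offset = divmod(vblock, self.blocks_per_slice)
        return self._lookup[(partition, vslice)], offset


fvm = VolumeManager(slice_count=4, blocks_per_slice=8)
fvm.extend("minfs", 0)
fvm.extend("blobfs", 0)
fvm.extend("minfs", 1)
print(fvm.translate("minfs", 10))  # virtual block 10 lives in virtual slice 1
```

Because each physical slice records exactly one owner, growing or shrinking a partition only touches the entries for the slices involved.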
| |
| The on-disk layout of the FVM looks like the following, and is declared |
| [here](/src/storage/fvm/format.h#27). |
| |
| ```c |
| +---------------------------------+ <- Physical block 0 |
| | metadata | |
| | +-----------------------------+ | |
| | | metadata copy 1 | | |
| | | +------------------------+ | | |
| | | | superblock | | | |
| | | +------------------------+ | | |
| | | | partition table | | | |
| | | +------------------------+ | | |
| | | | slice allocation table | | | |
| | | +------------------------+ | | |
| | +-----------------------------+ | <- Size of metadata is described by |
| | | metadata copy 2 | | superblock |
| | +-----------------------------+ | |
| +---------------------------------+ <- Superblock describes start of |
| | | slices |
| | Slice 1 | |
| +---------------------------------+ |
| | | |
| | Slice 2 | |
| +---------------------------------+ |
| | | |
| | Slice 3 | |
| +---------------------------------+ |
| | | |
| ``` |
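
Reading the diagram literally, the byte offset of a slice can be sketched as below. This assumes that the two metadata copies are the same size, that slices begin immediately after the second copy, and that slices are numbered from 1 as drawn. The actual start of slices is whatever the superblock records, so the function name, parameters, and sizes here are purely illustrative.

```python
def slice_start_offset(pslice, metadata_copy_size, slice_size):
    """Byte offset of a 1-based physical slice under the assumptions above."""
    if pslice < 1:
        raise ValueError("physical slices are numbered from 1 in the diagram")
    slices_start = 2 * metadata_copy_size  # two metadata copies precede slice 1
    return slices_start + (pslice - 1) * slice_size


# Illustrative sizes only: 8 KiB per metadata copy, 32 KiB slices.
print(slice_start_offset(1, 8192, 32768))
print(slice_start_offset(3, 8192, 32768))
```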
| |
| The partition table is made of several virtual partition |
| entries (`VPartitionEntry`). In addition to containing name and partition |
| identifiers, each of these vpart entries contains the number of allocated |
| slices for this partition. |
| |
| The slice allocation table is made up of tightly packed slice entries |
| (`SliceEntry`). Each entry contains: |
| |
| * allocation status |
| * if it is allocated, |
| * what partition it belongs to and |
| * what logical slice within the partition the slice maps to |
| |
| The FVM library can be found |
| [here](/src/storage/fvm/). During |
| [paving](/docs/development/build/fx.md#what-is-paving), |
| some partitions are copied from host to target, so the partitions and the FVM |
| file itself may be created on the host. A host-side utility for this purpose is available |
| [here](/src/storage/bin/fvm). |
| The integrity of the FVM device or file can be verified, with verbose output, using |
| [fvm-check](/src/devices/block/bin/fvm-check). |
| |
| [glossary.outgoing-directory]: /docs/glossary/README.md#outgoing-directory |
| [capabilities-overview]: /docs/concepts/components/v2/capabilities/README.md |