| # Virtualization Overview |
| |
| Fuchsia’s virtualization stack provides the ability to run guest operating |
| systems. Zircon implements a [Type-2 hypervisor][define.type-2-hypervisor] that |
| exposes syscalls to enable userspace components to create and configure CPU and |
| memory virtualization. The Virtual Machine Manager (`VMM`) component builds on |
| top of the hypervisor to assemble a virtual machine by defining a memory map, |
| setting up traps, and emulating various devices and peripherals. Guest manager |
| components then sit atop the `VMM` to provide guest-specific binaries and |
configuration. Fuchsia supports three guest packages today: an unmodified
Debian guest, a Zircon guest, and a [Termina][define.termina]-based Linux
guest.
| |
| Fuchsia virtualization is supported on Intel-based x64 devices that have VMX |
| enabled and most arm64 (ARMv8.0 and above) devices that can boot into EL2. |
| Notably, AMD SVM is not currently supported. |
| |
| ![Diagram showing virtualization components][image.overview] |
| |
| ## Hypervisor |
| |
| The hypervisor exposes syscalls to allow creation of kernel objects to support |
| virtualization. Syscalls that create new hypervisor objects require that the |
| caller has access to the hypervisor resource so that a component’s ability to |
| create a virtual machine may be controlled by the [product][define.product]. In |
other words, a Fuchsia component must be granted the capability to create a
guest operating system, which allows products to limit which components can use
these features.
| |
| ### CPU Virtualization |
| |
| The `zx_vcpu_create` syscall creates a new virtual-CPU (VCPU) object and binds |
| that VCPU to the calling thread. The `VMM` can then use the |
| `zx_vcpu_{read|write}_state` syscalls to read and write the architectural |
| registers for that VCPU. The `zx_vcpu_enter` syscall is a blocking syscall used |
to context switch into the guest, and a return from `zx_vcpu_enter` represents
a context switch back to the host. In other words, if there are no threads
| currently inside `zx_vcpu_enter` then there will be nothing executing within the |
| guest context. All of `zx_vcpu_read_state`, `zx_vcpu_write_state`, and |
| `zx_vcpu_enter` _must_ be called from the same thread that called |
| `zx_vcpu_create`. |
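
Below is a minimal sketch of this lifecycle, not the `VMM`’s actual loop: error
handling is reduced to early returns, and the `guest` handle and `entry`
address are assumed to have been set up already.

```c++
#include <zircon/syscalls.h>
#include <zircon/syscalls/port.h>

// Minimal sketch of a VCPU thread. `guest` is a handle from zx_guest_create
// and `entry` is the guest-physical address of the first instruction.
zx_status_t RunVcpu(zx_handle_t guest, zx_vaddr_t entry) {
  zx_handle_t vcpu;
  // Creates the VCPU and binds it to the calling thread.
  zx_status_t status = zx_vcpu_create(guest, 0, entry, &vcpu);
  if (status != ZX_OK) {
    return status;
  }
  while (true) {
    zx_port_packet_t packet;
    // Blocks while the guest executes; returns on a trap or zx_vcpu_kick.
    status = zx_vcpu_enter(vcpu, &packet);
    if (status != ZX_OK) {
      return status;
    }
    // Emulate the access described by `packet` (possibly using
    // zx_vcpu_read_state / zx_vcpu_write_state), then loop to re-enter.
  }
}
```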
| |
The `zx_vcpu_kick` syscall exists to allow the host to explicitly request that
a VCPU exit guest context, causing any in-flight call to `zx_vcpu_enter` to
return.
| |
| ### Memory & IO Virtualization |
| |
| The `zx_guest_create` syscall creates a new guest kernel object. Critically, |
| this syscall returns a [Virtual Memory Address Region][define.vmar] (`vmar`) |
| handle that represents the Guest’s Physical Address Space. The `VMM` is then |
| able to supply the guest ‘physical memory’ by mapping a [Virtual Memory |
| Object][define.vmo] (`vmo`) into this vmar. Since this `vmar` represents the |
| Guest-Physical Address space, offsets into this `vmar` will correspond to |
| Guest-Physical Addresses. For example, if the `VMM` wishes to expose 1GiB of |
| memory at Guest-Physical address range `[0x00000000 - 0x40000000)`, the `VMM` |
| would create a 1GiB `vmo` and map it into the Guest-Physical `vmar` at offset 0. |
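
The sketch below shows that setup. It assumes the caller already holds the
hypervisor resource handle; the names are illustrative and error handling is
reduced to early returns.

```c++
#include <zircon/syscalls.h>

// Sketch: create a guest and back guest-physical range
// [0x00000000 - 0x40000000) with 1GiB of RAM.
zx_status_t CreateGuestWithRam(zx_handle_t hypervisor_resource,
                               zx_handle_t* guest_out) {
  zx_handle_t guest, guest_physical_vmar;
  zx_status_t status =
      zx_guest_create(hypervisor_resource, 0, &guest, &guest_physical_vmar);
  if (status != ZX_OK) {
    return status;
  }

  constexpr uint64_t kRamSize = 1ull << 30;  // 1GiB
  zx_handle_t ram_vmo;
  status = zx_vmo_create(kRamSize, 0, &ram_vmo);
  if (status != ZX_OK) {
    return status;
  }

  // Offset 0 into this vmar corresponds to guest-physical address 0.
  zx_vaddr_t addr;
  status = zx_vmar_map(
      guest_physical_vmar,
      ZX_VM_PERM_READ | ZX_VM_PERM_WRITE | ZX_VM_PERM_EXECUTE | ZX_VM_SPECIFIC,
      /*vmar_offset=*/0, ram_vmo, /*vmo_offset=*/0, kRamSize, &addr);
  if (status != ZX_OK) {
    return status;
  }
  *guest_out = guest;
  return ZX_OK;
}
```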
| |
This Guest-Physical `vmar` is implemented using [Second Level Address
Translation][define.slat] (SLAT), which allows the hypervisor to define
translations from a Guest-Physical Address (GPA) to a Host-Physical Address
(HPA). The guest operating system is then able to install its own page tables
that handle translations from a Guest-Virtual Address (GVA) to a Guest-Physical
Address.
| |
| ![Diagram Showing 2-Level Address Translation][image.slat] |
| |
The `zx_guest_set_trap` syscall allows the `VMM` to install traps that are
| used for device emulation. Guests can interface with hardware using |
| [Memory-Mapped I/O][define.mmio] (MMIO) which involves the guest reading and |
| writing the device using the same instructions that are used for memory |
| accesses. For MMIO, there will be no mapping present in the SLAT for the |
| device's GPA which causes the guest to trap into the hypervisor. |
| |
| x86 provides an alternate way of addressing IO devices called [Port-Mapped |
| I/O][define.pio] (PIO). With PIO the guest will use alternate instructions to |
| access a device, but these instructions will still cause the guest to trap into |
| the hypervisor for handling. |
| |
The details of how a trap is handled are specific to the type of trap that was
created:
| |
`ZX_GUEST_TRAP_MEM` - Sets a trap for MMIO. Read or write operations to the
address range in Guest-Physical Address Space associated with this trap will
cause the `zx_vcpu_enter` syscall to return to the `VMM`, which is then
responsible for emulating the access, updating the VCPU register state, and
then calling `zx_vcpu_enter` again to return back to the guest.
| |
| `ZX_GUEST_TRAP_IO` - Similar to `ZX_GUEST_TRAP_MEM`, except instead of setting |
| the trap in guest-physical address space, the trap will be installed into the IO |
| space of the processor. This will fail if the architecture does not support PIO. |
| |
| `ZX_GUEST_TRAP_BELL` - Sets an async trap for MMIO. When a guest writes to the |
| guest-physical address range associated with this trap, instead of causing |
| `zx_vcpu_enter` to return to the `VMM`, the hypervisor will instead queue a |
| message on the port associated with this trap and immediately resume VCPU |
| execution without returning to userspace. This can be used to emulate devices |
| that are designed to work with this pattern. For example, `Virtio` devices allow |
| the guest driver to notify the virtual device that there is work to be done by |
| writing to a special page in Guest-Physical Memory. |
| |
| Setting an async trap in `IO` space is not supported. Reads from a region with a |
| `ZX_GUEST_TRAP_BELL` set are not supported. |
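
To make the distinction concrete, the sketch below installs one trap of each
of these kinds; the address ranges and port key are purely illustrative.

```c++
#include <zircon/syscalls.h>

// Sketch: install a synchronous MMIO trap and an asynchronous bell trap.
zx_status_t InstallTraps(zx_handle_t guest, zx_handle_t bell_port) {
  // Synchronous trap: accesses make zx_vcpu_enter return to the VMM, so no
  // port is supplied.
  zx_status_t status =
      zx_guest_set_trap(guest, ZX_GUEST_TRAP_MEM, /*addr=*/0x808300000,
                        /*size=*/0x1000, ZX_HANDLE_INVALID, /*key=*/0);
  if (status != ZX_OK) {
    return status;
  }
  // Asynchronous trap: writes queue a packet on `bell_port` and the VCPU
  // resumes immediately without returning to userspace.
  return zx_guest_set_trap(guest, ZX_GUEST_TRAP_BELL, /*addr=*/0x900000000,
                           /*size=*/0x1000, bell_port, /*key=*/1);
}
```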
| |
| ### Trap Handling |
| |
A VCPU thread will typically spend most of its time blocked on `zx_vcpu_enter`,
meaning it’s executing within the guest context. A return from this syscall to
the `VMM` indicates either that an error has occurred or, more typically, that
the `VMM` needs to intervene to emulate some behavior.
| |
To demonstrate, we consider a couple of specific examples of how traps can be
handled by the `VMM`.
| |
| #### MMIO Sync Trap Example |
| |
Consider the ARM PL011 serial port emulation. Note that while the PL011 is an
ARM-specific device, in practice the trap handling happens similarly on both
ARM and x86.
| |
| First, the `VMM` [registers a synchronous MMIO trap][example.register_sync_trap] |
| on the Guest-Physical Address range of `[0x808300000 - 0x808301000)`, which |
| tells the hypervisor that any access to this region must cause `zx_vcpu_enter` |
| to return control flow to the `VMM`. |
| |
Next, the `VMM` will call `zx_vcpu_enter` on one or more VCPUs to context
switch into the guest. At some point, the PL011 driver will attempt to read the
control register (`UARTCR`) of the `PL011` device. This register is located at
offset `0x30`, which corresponds to Guest-Physical Address `0x808300030` in
this example.
| |
Since a trap is registered for Guest-Physical Address `0x808300030`, this read
causes the guest to trap into the hypervisor for handling. The hypervisor can
observe that this access has an associated `ZX_GUEST_TRAP_MEM` and passes
| control flow to the `VMM` by returning from `zx_vcpu_enter` with details about |
| the trap contained within the `zx_port_packet_t`. The `VMM` can then use the |
| Guest-Physical Address of the access to associate it with the [corresponding |
| virtual device logic][example.pl011_cr_handler]. In this situation, the device |
| is maintaining the register value in a member variable. |
| |
| ```c++ |
| // `relative_addr` is relative to the base address of the trapped region. |
| zx_status_t Pl011::Read(uint64_t relative_addr, IoValue* value) { |
| switch (static_cast<Pl011Register>(relative_addr)) { |
| case Pl011Register::CR: { |
| std::lock_guard<std::mutex> lock(mutex_); |
| value->u16 = control_; |
| return ZX_OK; |
| } |
    // Handle other registers...
    default:
      // Fail accesses to unimplemented registers rather than falling off the
      // end of the function.
      return ZX_ERR_IO;
  }
}
| ``` |
| |
This returns a 16-bit value, but we still need to expose this result to the
guest. Since the guest has performed an MMIO read, the guest will be expecting
the result to be in whatever register was specified in the load instruction.
| This is accomplished by using the `zx_vcpu_read_state` and `zx_vcpu_write_state` |
| syscalls to [update the value of the target register][example.update_registers] |
| with the result of the emulated MMIO. |
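
A simplified sketch of that last step on x64, assuming instruction decoding has
already determined that the load targets `rax`:

```c++
#include <zircon/syscalls.h>
#include <zircon/syscalls/hypervisor.h>

// Sketch (x64): complete an emulated MMIO read by placing the result in the
// guest's destination register. Assumes the instruction decoder determined
// that the load targets rax.
zx_status_t CompleteMmioRead(zx_handle_t vcpu, uint64_t result) {
  zx_vcpu_state_t state;
  zx_status_t status =
      zx_vcpu_read_state(vcpu, ZX_VCPU_STATE, &state, sizeof(state));
  if (status != ZX_OK) {
    return status;
  }
  state.rax = result;
  return zx_vcpu_write_state(vcpu, ZX_VCPU_STATE, &state, sizeof(state));
}
```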
| |
| ![Diagram showing a synchronous MMIO trap][image.mmio_sync] |
| |
| #### Bell Trap Example |
| |
Next we demonstrate the operation of a Bell trap. In this situation we have a
`Virtio Device` implemented in a component outside of the main `VMM`. During
initialization, the `VMM` requests that the `Virtio Device` register its own
Bell traps so that the traps will be delivered to the `Virtio Device` component
and not the `VMM`. Once the `Virtio Device` completes setting up any traps, the
`VMM` begins executing VCPUs with `zx_vcpu_enter` and control flow is
transferred into the guest.
| |
At some point a guest driver will issue an MMIO write to a Guest-Physical
Address that has been trapped by the `Virtio Device`. At this point the guest
will trap out of guest context into the hypervisor, which will cause a
notification to be delivered to the `Virtio Device` using a `zx_port_packet_t`.
Notably, in this situation `zx_vcpu_enter` never returns during the handling of
this trap and the hypervisor can quickly context switch back into the guest,
minimizing the amount of time the VCPU spends blocked.
| |
| Once the `Virtio Device` receives the `zx_port_packet_t`, it will take |
| device-specific steps to handle that trap. Typically this involves reading and |
| writing directly to Guest-Physical memory, but it can do this without blocking |
| VCPU execution. Once the device has completed the request it can notify the |
| driver in the guest by sending an interrupt using `zx_vcpu_interrupt`. |
| |
Since the vast majority of communication is done using shared memory rather
than synchronous traps, `Virtio` devices are much more efficient than devices
that rely heavily on synchronous traps.
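
The sketch below shows the shape of such a device loop; the handle plumbing and
interrupt vector are simplified for illustration.

```c++
#include <zircon/syscalls.h>
#include <zircon/syscalls/port.h>

// Sketch: a device component's wait loop for bell traps. `port` is the port
// that was passed to zx_guest_set_trap; the interrupt vector is
// device-specific.
void BellTrapLoop(zx_handle_t port, zx_handle_t vcpu, uint32_t vector) {
  while (true) {
    zx_port_packet_t packet;
    if (zx_port_wait(port, ZX_TIME_INFINITE, &packet) != ZX_OK) {
      break;
    }
    if (packet.type != ZX_PKT_TYPE_GUEST_BELL) {
      continue;
    }
    // packet.guest_bell.addr identifies which trapped address was written.
    // ... process the shared structures in guest-physical memory ...

    // Notify the guest driver that the work is complete.
    zx_vcpu_interrupt(vcpu, vector);
  }
}
```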
| |
| ![Diagram showing an async MMIO trap][image.mmio_bell] |
| |
| #### Architectural Differences in Trap Handling |
| |
| While much of the trap handling is the same, there are some important |
| differences in what needs to be done in response to a trap depending on the |
| underlying hardware support. Most notably, on ARM, the underlying data abort |
| that is generated by the hardware provides some decoded information about the |
| access that we can forward to userspace (ex: access size, read or write, target |
| register, etc). On Intel this does not occur and as a result the `VMM` needs to |
| do some [instruction decoding][refer.instruction_decode] to infer this same |
| information. |
| |
| ### Interrupt Virtualization |
| |
Fuchsia implements what some platforms call a ‘split irqchip’, with emulation
of the LAPIC/GICC done in the kernel and the IOAPIC/GICD emulation occurring in
userspace. The userspace IOAPIC and GICD forward interrupts to a target VCPU
using the `zx_vcpu_interrupt` syscall.
| |
| ## Virtual Machine Manager (`VMM`) |
| |
The [Virtual Machine Manager][code.vmm] (`VMM`) is the userspace component
that uses the hypervisor syscalls to build and manage a virtual machine and
perform device emulation. The `VMM` constructs the virtual machine using the
[GuestConfig][code.guest_config] FIDL structure provided to it, which contains
both configuration about which devices should be provided to the virtual
machine as well as resources for the guest kernel, ramdisks, and block devices.
| |
At a high level, the `VMM` assembles the virtual machine by using the
hypervisor syscalls to create the guest and VCPU kernel objects. It allocates
guest RAM by creating a `vmo` and maps it into the Guest-Physical Memory
`vmar`. It uses `zx_guest_set_trap` to register MMIO and PIO handlers for
virtual hardware emulation. The `VMM` emulates a PCI bus and can connect
devices to that bus. It loads the guest kernel into memory and sets up boot
data with various resources needed by the guest kernel, such as device tree
blobs or ACPI tables.
| |
| ### Memory |
| |
The `VMM` will allocate a `vmo` to use as guest-physical memory and map this
`vmo` into the Guest-Physical Memory `vmar` (created by `zx_guest_create`).
When addressing memory in the guest-physical memory `vmar` we call these
addresses ‘Guest-Physical Addresses’ (GPA). The `VMM` will also map the same
`vmo` into its own process address space so that it can directly access this
memory. When addressing memory in the `VMM`’s `vmar` we call these addresses
‘Host-Virtual Addresses’ (HVA). The `VMM` is able to translate a GPA into an
HVA since it knows both the guest memory map and the address in its own `vmar`
at which the guest memory is mapped.
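
For a single contiguous RAM region this translation is simple arithmetic, as in
the sketch below (the names are illustrative).

```c++
#include <cstdint>

// Sketch: translate a guest-physical address (GPA) to a host-virtual address
// (HVA) for one contiguous RAM region. `host_base` is where the RAM vmo is
// mapped in the VMM's own address space; `guest_base` is the GPA at which it
// was mapped into the Guest-Physical Memory vmar.
uint64_t GpaToHva(uint64_t gpa, uint64_t guest_base, uint64_t guest_size,
                  uint64_t host_base) {
  if (gpa < guest_base || gpa - guest_base >= guest_size) {
    return 0;  // Not backed by this region; a real VMM would signal an error.
  }
  return host_base + (gpa - guest_base);
}
```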
| |
| ### Virtio Devices & Components |
| |
| Many devices are exposed to the guest using [Virtual I/O][refer.virtio] |
| (`Virtio`) over PCI. The `Virtio` specification defines a set of devices that |
| are designed to run efficiently in a virtualization context by relying heavily |
| on DMA accesses to Guest-Physical memory and minimizing the number of |
| synchronous IO traps. To increase security and isolation between devices, we run |
each `Virtio` device in its own Zircon process and only route the capabilities
| needed by that component. For example, a `Virtio` Block device is only provided |
| a handle to the specific file(s) or device that backs the virtual disk, and a |
| `Virtio` Console only has access to the zx::socket for the serial stream. |
| |
| Communication between the `VMM` and devices is done using the |
| `fuchsia.virtualization.hardware` [FIDL library][code.hardware_fidl]. For each |
| device, there is a small piece of code that is linked into the `VMM`, called the |
| [controller][code.device_controller], that acts as the client to these FIDL |
| services and connects to the [component that implements the device][code.device] |
during startup. There is one process per device instance, so if a virtual
machine has three `Virtio` Block devices, there will be three controller
instances and three `Virtio` Block components in three Zircon processes.
| |
| `Virtio` devices operate on the concept of shared data structures that reside in |
| Guest-Physical memory. The guest driver will allocate and initialize these |
| structures at boot and provide the `VMM` with pointers to these structures in |
| Guest-Physical Memory. When the driver wants to notify the device that it has |
| published new work to these structures, it will write to a special |
| device-specific ‘notify’ page in Guest-Physical Memory and the device can infer |
| specific events based on the offset of the write into this ‘notify’ page. Each |
| device component will register a `ZX_GUEST_TRAP_BELL` for this region so that |
| the hypervisor can forward these events directly to the target component, |
without needing to bounce through the `VMM`. The device components can then
read and write these structures directly using their HVA.
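
As an illustration, assuming the common `Virtio` layout where queue `n`
notifies at `notify_base + n * multiplier`, the device can recover the queue
index directly from the trapped address:

```c++
#include <cstdint>

// Sketch: infer which virtqueue was notified from a bell trap address.
// Assumes each queue notifies at notify_base + queue_index * multiplier.
uint16_t QueueFromNotifyAddress(uint64_t trap_addr, uint64_t notify_base,
                                uint64_t multiplier) {
  return static_cast<uint16_t>((trap_addr - notify_base) / multiplier);
}
```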
| |
| ### Booting |
| |
The `VMM` does not provide any guest BIOS or firmware but instead loads the
guest resources into memory directly and configures the boot VCPU to jump
directly to the kernel entry point. The details of this vary depending on which
kernel is being loaded.
| |
| #### Linux Guests |
| |
| For x64 Linux guests, the `VMM` loads a bootable kernel image (ex: bzImage) into |
| Guest-Physical Memory in accordance with the Linux [boot |
| protocol][refer.linux_x64_boot_protocol] and updates the Real-Mode Kernel Header |
| and [Zero Page][refer.linux_zero_page] with other kernel resources (ramdisk, |
| kernel command-line). The `VMM` will also generate and load a set of [ACPI |
| Tables][code.setup_acpi] that describe the emulated hardware offered to the |
| guest. |
| |
| Arm64 Linux guests behave similarly, except we follow the arm64 [boot |
| protocol][refer.linux_arm64_boot_protocol] and offer a device tree blob (DTB) |
| instead of ACPI tables. |
| |
#### Zircon Guests

The `VMM` also supports booting Zircon guests according to the Zircon boot
requirements. Some details of how Zircon boots can be found
[here][code.zircon_loader].
| |
| ## Guest Managers |
| |
| The role of the Guest Manager components is to package up the guest binaries |
| (kernel, ramdisk, disk images) with configuration (which devices to enable, |
| guest kernel configuration options) and provide these to a `VMM` at startup. |
| |
There are three Guest Managers available in-tree, two of which are fairly
simple and one more advanced. The simple guest managers don’t have any
guest-specific code, only configuration and binaries that are passed along to
the `VMM`. These guests are then used over the virtual console or virtual
framebuffer.
| |
Simple Guest Managers:

* `ZirconGuestManager`
* `DebianGuestManager`
| |
The more advanced Guest Manager is `TerminaGuestManager`, which exposes
additional functionality using gRPC services running over `Virtio` Vsock.
`TerminaGuestManager` connects to these services to run commands in the guest,
mount filesystems, and launch applications.
| |
| For more information on how to launch and use virtualization on Fuchsia, see |
| [Getting Started with Fuchsia Virtualization][refer.virtualization_get_started]. |
| |
| [define.type-2-hypervisor]: |
| https://en.wikipedia.org/wiki/Hypervisor#Classification |
| [define.termina]: |
| https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/master/project-termina/ |
| [define.product]: /docs/development/build/build_system/boards_and_products.md |
| [define.vmar]: /docs/reference/kernel_objects/vm_address_region.md |
| [define.vmo]: /docs/reference/kernel_objects/vm_object.md |
| [define.mmio]: https://en.wikipedia.org/wiki/Memory-mapped_I/O |
| [define.pio]: https://wiki.osdev.org/Port_IO |
| [define.slat]: https://en.wikipedia.org/wiki/Second_Level_Address_Translation |
| [example.pl011_cr_handler]: https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/arm64/pl011.cc;l=52;drc=9fcf7ef29730fa8ecc2f1cdbe025bb3ab9741a90 |
| [example.register_sync_trap]: https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/arm64/pl011.cc;l=47 |
| [example.update_registers]: |
| https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/arm64/vcpu.cc;l=56;drc=8c5ab55a0467643618ef12e0d2b987f9f3d24acd |
| |
| [refer.linux_arm64_boot_protocol]: |
| https://www.kernel.org/doc/Documentation/arm/Booting |
| [refer.linux_x64_boot_protocol]: |
| https://www.kernel.org/doc/Documentation/x86/boot.txt |
| [refer.linux_zero_page]: |
| https://www.kernel.org/doc/Documentation/x86/zero-page.txt |
| [refer.instruction_decode]: |
| https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/x64/decode.cc;l=185;drc=f09260d405305bd46e76b6717ecd13b073e67fc6 |
| [refer.virtio]: |
| https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html |
| [refer.virtualization_get_started]: get_started.md |
| |
| [code.guest_config]: |
| https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl/fuchsia.virtualization/guest_config.fidl |
| [code.hardware_fidl]: |
| https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl/fuchsia.virtualization.hardware/device.fidl |
| [code.device_controller]: |
| https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/controller/ |
| [code.device]: |
| https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/device/ |
| [code.setup_acpi]: |
https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/x64/acpi.cc;l=114;drc=544ca58a71d461be5aa5afea58522234df6be33d
| [code.vmm]: |
| https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/ |
| [code.zircon_loader]: |
| https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl/zbi/kernel.fidl;l=9-67;drc=14ae4735d15e85d0e2016acb1355102aad4dada1 |
| |
| [image.mmio_bell]: images/mmio_bell_trap.png |
| [image.mmio_sync]: images/mmio_sync_trap.png |
| [image.overview]: images/virtualization_stack.png |
| [image.slat]: images/second_level_address_translation.png |