# Virtualization Overview
Fuchsia’s virtualization stack provides the ability to run guest operating
systems. Zircon implements a [Type-2 hypervisor][define.type-2-hypervisor] that
exposes syscalls to enable userspace components to create and configure CPU and
memory virtualization. The Virtual Machine Manager (`VMM`) component builds on
top of the hypervisor to assemble a virtual machine by defining a memory map,
setting up traps, and emulating various devices and peripherals. Guest manager
components then sit atop the `VMM` to provide guest-specific binaries and
configuration. Fuchsia supports three guest packages today: an unmodified
Debian guest, a Zircon guest, and a [Termina][define.termina]-based Linux guest.
Fuchsia virtualization is supported on Intel-based x64 devices that have VMX
enabled and most arm64 (ARMv8.0 and above) devices that can boot into EL2.
Notably, AMD SVM is not currently supported.
![Diagram showing virtualization components][image.overview]
## Hypervisor
The hypervisor exposes syscalls to allow creation of kernel objects to support
virtualization. Syscalls that create new hypervisor objects require that the
caller has access to the hypervisor resource so that a component’s ability to
create a virtual machine may be controlled by the [product][define.product]. In
other words, a Fuchsia component must be granted the capability to create a
guest operating system so products have the ability to limit which components
are capable of utilizing these features.
### CPU Virtualization
The `zx_vcpu_create` syscall creates a new virtual-CPU (VCPU) object and binds
that VCPU to the calling thread. The `VMM` can then use the
`zx_vcpu_{read|write}_state` syscalls to read and write the architectural
registers for that VCPU. The `zx_vcpu_enter` syscall is a blocking syscall used
to context switch into the guest and a return from `zx_vcpu_enter` represents a
context switch back to the host. In other words, if there are no threads
currently inside `zx_vcpu_enter` then there will be nothing executing within the
guest context. All of `zx_vcpu_read_state`, `zx_vcpu_write_state`, and
`zx_vcpu_enter` _must_ be called from the same thread that called
`zx_vcpu_create`.
The `zx_vcpu_kick` syscall allows the host to explicitly request that a VCPU
exit guest context, causing any in-flight call to `zx_vcpu_enter` to return.
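Putting these syscalls together, a VCPU run loop in the `VMM` looks roughly
like the following sketch. This is a minimal illustration with error handling
mostly elided and a helper name invented for this doc; consult the Zircon
syscall reference for the authoritative signatures.
```c++
#include <zircon/syscalls.h>
#include <zircon/syscalls/hypervisor.h>
#include <zircon/syscalls/port.h>

// Minimal sketch of a VCPU run loop. `guest` is a handle created with
// zx_guest_create, and `entry` is the guest-physical address at which the
// VCPU begins execution.
void RunVcpu(zx_handle_t guest, zx_vaddr_t entry) {
  zx_handle_t vcpu;
  // Binds the new VCPU to the calling thread; all further VCPU syscalls for
  // this VCPU must be made from this same thread.
  if (zx_vcpu_create(guest, /*options=*/0, entry, &vcpu) != ZX_OK) {
    return;
  }

  while (true) {
    zx_port_packet_t packet;
    // Blocks while the guest executes; returns when the VMM must intervene.
    zx_status_t status = zx_vcpu_enter(vcpu, &packet);
    if (status != ZX_OK) {
      break;  // e.g. the VCPU was kicked with zx_vcpu_kick, or a fatal error.
    }
    // Emulate the access described by `packet`, possibly using
    // zx_vcpu_read_state / zx_vcpu_write_state, then loop to re-enter.
  }
  zx_handle_close(vcpu);
}
```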
### Memory & IO Virtualization
The `zx_guest_create` syscall creates a new guest kernel object. Critically,
this syscall returns a [Virtual Memory Address Region][define.vmar] (`vmar`)
handle that represents the Guest’s Physical Address Space. The `VMM` is then
able to supply the guest ‘physical memory’ by mapping a [Virtual Memory
Object][define.vmo] (`vmo`) into this vmar. Since this `vmar` represents the
Guest-Physical Address space, offsets into this `vmar` will correspond to
Guest-Physical Addresses. For example, if the `VMM` wishes to expose 1GiB of
memory at Guest-Physical address range `[0x00000000 - 0x40000000)`, the `VMM`
would create a 1GiB `vmo` and map it into the Guest-Physical `vmar` at offset 0.
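A minimal sketch of that flow is shown below, assuming the caller already holds
the resource required to create a guest. The option flags are illustrative, and
a real `VMM` performs additional setup (for example, making the guest RAM
mapping executable so the guest can run code from it).
```c++
#include <zircon/syscalls.h>
#include <zircon/syscalls/hypervisor.h>

// Sketch: create a guest and give it 1 GiB of RAM at guest-physical address 0.
zx_status_t CreateGuestWithRam(zx_handle_t hypervisor_resource,
                               zx_handle_t* out_guest,
                               zx_handle_t* out_guest_vmar) {
  // The returned vmar represents the guest-physical address space.
  zx_status_t status = zx_guest_create(hypervisor_resource, /*options=*/0,
                                       out_guest, out_guest_vmar);
  if (status != ZX_OK) {
    return status;
  }

  constexpr uint64_t kRamSize = 1ull << 30;  // 1 GiB
  zx_handle_t ram_vmo;
  status = zx_vmo_create(kRamSize, /*options=*/0, &ram_vmo);
  if (status != ZX_OK) {
    return status;
  }

  // Mapping at offset 0 of the guest vmar places this memory at GPA 0, so the
  // guest sees RAM covering [0x00000000 - 0x40000000). A real VMM would also
  // make this mapping executable.
  zx_vaddr_t gpa;
  return zx_vmar_map(*out_guest_vmar,
                     ZX_VM_PERM_READ | ZX_VM_PERM_WRITE | ZX_VM_SPECIFIC,
                     /*vmar_offset=*/0, ram_vmo, /*vmo_offset=*/0, kRamSize,
                     &gpa);
}
```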
This Guest-Physical `vmar` is implemented using [Second Level Address
Translation][define.slat] (SLAT), which allows the hypervisor to define
translations from Guest-Physical Addresses (GPA) to Host-Physical Addresses
(HPA). The guest operating system is then able to install its own page tables
that handle translations from a Guest-Virtual Address (GVA) to a Guest-Physical
Address.
![Diagram Showing 2-Level Address Translation][image.slat]
The `zx_guest_set_trap` syscall allows for the `VMM` to install traps that are
used for device emulation. Guests can interface with hardware using
[Memory-Mapped I/O][define.mmio] (MMIO) which involves the guest reading and
writing the device using the same instructions that are used for memory
accesses. For MMIO, there will be no mapping present in the SLAT for the
device's GPA which causes the guest to trap into the hypervisor.
x86 provides an alternate way of addressing IO devices called [Port-Mapped
I/O][define.pio] (PIO). With PIO the guest will use alternate instructions to
access a device, but these instructions will still cause the guest to trap into
the hypervisor for handling.
The details of how a trap is handled are specific to the type of trap that was
created:
* `ZX_GUEST_TRAP_MEM` - Sets a synchronous trap for MMIO. Read or write
  operations to the address range in Guest-Physical Address Space associated
  with this trap will cause the `zx_vcpu_enter` syscall to return to the `VMM`,
  which is then responsible for emulating the access, updating the VCPU
  register state, and then calling `zx_vcpu_enter` again to return back to the
  guest.
* `ZX_GUEST_TRAP_IO` - Similar to `ZX_GUEST_TRAP_MEM`, except instead of
  setting the trap in Guest-Physical Address Space, the trap is installed into
  the IO space of the processor. This will fail if the architecture does not
  support PIO.
* `ZX_GUEST_TRAP_BELL` - Sets an async trap for MMIO. When a guest writes to
  the Guest-Physical Address range associated with this trap, instead of
  causing `zx_vcpu_enter` to return to the `VMM`, the hypervisor queues a
  message on the port associated with this trap and immediately resumes VCPU
  execution without returning to userspace. This can be used to emulate devices
  that are designed to work with this pattern. For example, `Virtio` devices
  allow the guest driver to notify the virtual device that there is work to be
  done by writing to a special page in Guest-Physical Memory.
Setting an async trap in `IO` space is not supported. Reads from a region with a
`ZX_GUEST_TRAP_BELL` set are not supported.
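Both kinds of MMIO traps are installed with `zx_guest_set_trap`; the trap kind
and, for bell traps, the destination port distinguish them. The sketch below is
illustrative only: the addresses, key values, and port plumbing are stand-ins
rather than a real Fuchsia guest memory map.
```c++
#include <zircon/syscalls.h>
#include <zircon/syscalls/hypervisor.h>

// Sketch: install one synchronous MMIO trap and one async bell trap. The
// addresses and keys are illustrative, not a real guest memory map.
zx_status_t InstallTraps(zx_handle_t guest, zx_handle_t bell_port) {
  // Synchronous trap: accesses in this GPA range cause zx_vcpu_enter to
  // return to the VMM with the trap details in a zx_port_packet_t.
  zx_status_t status = zx_guest_set_trap(
      guest, ZX_GUEST_TRAP_MEM, /*addr=*/0x808300000, /*size=*/0x1000,
      /*port_handle=*/ZX_HANDLE_INVALID, /*key=*/0);
  if (status != ZX_OK) {
    return status;
  }

  // Bell trap: writes in this GPA range queue a packet on `bell_port`, and
  // the VCPU keeps running without returning to userspace.
  return zx_guest_set_trap(guest, ZX_GUEST_TRAP_BELL, /*addr=*/0xe0000000,
                           /*size=*/0x1000, bell_port, /*key=*/1);
}
```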
### Trap Handling
A VCPU thread will typically spend most of its time blocked on `zx_vcpu_enter`,
meaning it’s executing within the guest context. A return from this syscall to
the `VMM` indicates either that an error has occurred or, more typically, that
the `VMM` needs to intervene to emulate some behavior.
To demonstrate, we consider a couple of specific examples of how traps can be
handled by the `VMM`.
#### MMIO Sync Trap Example
Consider the ARM PL011 serial port emulation. Note that while this is an
ARM-specific device, in practice the trap handling happens similarly on both
ARM and x86.
First, the `VMM` [registers a synchronous MMIO trap][example.register_sync_trap]
on the Guest-Physical Address range of `[0x808300000 - 0x808301000)`, which
tells the hypervisor that any access to this region must cause `zx_vcpu_enter`
to return control flow to the `VMM`.
Next, the `VMM` calls `zx_vcpu_enter` on one or more VCPUs to context switch
into the guest. At some point, the PL011 driver will attempt to read the
control register (`UARTCR`) of the `PL011` device. This register is located at
offset `0x30`, so it corresponds to Guest-Physical Address `0x808300030` in
this example.
Since a trap is registered for Guest-Physical Address `0x808300030`, this read
causes the guest to trap into the hypervisor for handling. The hypervisor can
observe that this access has an associated `ZX_GUEST_TRAP_MEM` and passes
control flow to the `VMM` by returning from `zx_vcpu_enter` with details about
the trap contained within the `zx_port_packet_t`. The `VMM` can then use the
Guest-Physical Address of the access to associate it with the [corresponding
virtual device logic][example.pl011_cr_handler]. In this situation, the device
is maintaining the register value in a member variable.
```c++
// `relative_addr` is relative to the base address of the trapped region.
zx_status_t Pl011::Read(uint64_t relative_addr, IoValue* value) {
switch (static_cast<Pl011Register>(relative_addr)) {
case Pl011Register::CR: {
std::lock_guard<std::mutex> lock(mutex_);
value->u16 = control_;
return ZX_OK;
}
// Handle other registers...
}
}
```
This returns a 16-bit value, but we still need to expose this result to the
guest. Since the guest has performed an MMIO read, it will expect the result to
be in whichever register was specified in the load instruction.
This is accomplished by using the `zx_vcpu_read_state` and `zx_vcpu_write_state`
syscalls to [update the value of the target register][example.update_registers]
with the result of the emulated MMIO.
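A hedged sketch of that register fix-up is shown below. It assumes the target
register index has already been extracted from the trap details delivered in
the `zx_port_packet_t` (on ARM the hardware-decoded data abort provides it; on
x86 it comes from instruction decoding, discussed below).
```c++
#include <zircon/syscalls.h>
#include <zircon/syscalls/hypervisor.h>

// Sketch: complete an emulated MMIO read by placing `value` in the guest
// general-purpose register `reg` before re-entering the guest. This uses the
// arm64 register layout (Xn registers); on x86 the equivalent would update
// the named register. Must be called from the thread that created the VCPU.
zx_status_t CompleteMmioRead(zx_handle_t vcpu, uint32_t reg, uint64_t value) {
  zx_vcpu_state_t state;
  zx_status_t status =
      zx_vcpu_read_state(vcpu, ZX_VCPU_STATE, &state, sizeof(state));
  if (status != ZX_OK) {
    return status;
  }

  state.x[reg] = value;  // arm64: write the result of the emulated load.

  return zx_vcpu_write_state(vcpu, ZX_VCPU_STATE, &state, sizeof(state));
}
```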
![Diagram showing a synchronous MMIO trap][image.mmio_sync]
#### Bell Trap Example
Next we demonstrate the operation of a Bell trap. In this situation we have a
`Virtio Device` implemented in a component outside of the main `VMM`. During
initialization, the `VMM` requests that the `Virtio Device` register its own
Bell traps so that the trap packets are delivered to the `Virtio Device`
component and not the `VMM`. Once the `Virtio Device` has finished setting up its
traps, the `VMM` begins executing VCPUs with `zx_vcpu_enter` and control flow is
transferred into the guest.
At some point a guest driver will issue an MMIO write to a Guest-Physical
Address that has been trapped by the `Virtio Device`. At this point the guest
will trap
out of guest context into the hypervisor, which will cause a notification to be
delivered to the `Virtio Device` using a `zx_port_packet_t`. Notably in this
situation `zx_vcpu_enter` never returns during the handling of this trap and the
hypervisor can quickly context switch back into the guest, minimizing the amount
of time the VCPU spends blocked.
Once the `Virtio Device` receives the `zx_port_packet_t`, it will take
device-specific steps to handle that trap. Typically this involves reading and
writing directly to Guest-Physical memory, but it can do this without blocking
VCPU execution. Once the device has completed the request it can notify the
driver in the guest by sending an interrupt using `zx_vcpu_interrupt`.
Since the vast majority of communication is done using shared memory rather
than synchronous traps, `Virtio` devices are much more efficient than devices
that rely heavily on synchronous traps.
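A simplified sketch of such a device loop follows. It assumes the component has
been handed the port that receives its bell packets as well as a way to inject
its interrupt; in the real implementation this plumbing is provided through the
`fuchsia.virtualization.hardware` FIDL library rather than as raw handles.
```c++
#include <zircon/syscalls.h>
#include <zircon/syscalls/port.h>

// Sketch: a virtio-style device loop driven entirely by bell packets.
// `bell_port` receives the packets for this device's notify region; `vcpu`
// and `vector` stand in for the interrupt plumbing, which in practice is
// provided over FIDL rather than as raw handles.
void DeviceLoop(zx_handle_t bell_port, zx_handle_t vcpu, uint32_t vector) {
  while (true) {
    zx_port_packet_t packet;
    // Blocks in the device component; the VCPUs keep running meanwhile.
    if (zx_port_wait(bell_port, ZX_TIME_INFINITE, &packet) != ZX_OK) {
      break;
    }
    // packet.guest_bell.addr is the guest-physical address the driver wrote
    // to. Process the shared structures in guest memory here...

    // ...then notify the guest driver that the work is complete.
    zx_vcpu_interrupt(vcpu, vector);
  }
}
```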
![Diagram showing an async MMIO trap][image.mmio_bell]
#### Architectural Differences in Trap Handling
While much of the trap handling is the same, there are some important
differences in what needs to be done in response to a trap depending on the
underlying hardware support. Most notably, on ARM, the underlying data abort
that is generated by the hardware provides some decoded information about the
access that we can forward to userspace (e.g. access size, read or write,
target register). On Intel this information is not provided, so the `VMM` needs
to do some [instruction decoding][refer.instruction_decode] to infer the same
information.
### Interrupt Virtualization
Fuchsia implements what some platforms call a ‘split irqchip’, with emulation of
the LAPIC/GICC done in the kernel and the I/OAPIC/GICD emulation occurring in
userspace. The userspace I/OAPIC and GICD forward interrupts to a target VCPU
using the `zx_vcpu_interrupt` syscall.
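For example, when the userspace I/OAPIC or GICD decides that a pending device
interrupt should be delivered to a particular VCPU, the forwarding step is
conceptually a single syscall. A sketch; the routing logic that selects the
VCPU and vector is elided.
```c++
#include <zircon/syscalls.h>

// Sketch: the userspace interrupt controller (I/OAPIC or GICD) raising an
// interrupt on a VCPU. Which VCPU and vector to use comes from the emulated
// routing state, which is elided here.
void ForwardInterrupt(zx_handle_t target_vcpu, uint32_t vector) {
  zx_vcpu_interrupt(target_vcpu, vector);
}
```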
## Virtual Machine Manager (`VMM`)
The [Virtual Machine Manager][code.vmm] (`VMM`) is the userspace component that
uses the hypervisor syscalls to build and manage a virtual machine and perform
device emulation. The `VMM` constructs the virtual machine using the
[GuestConfig][code.guest_config] FIDL structure provided to it, which contains
both configuration describing which devices should be provided to the virtual
machine and resources for the guest kernel, ramdisks, and block devices.
At a high level, the `VMM` assembles the virtual machine by using the hypervisor
syscalls to create the guest and VCPU kernel objects. It allocates guest RAM by
creating a `vmo` and mapping it into the Guest-Physical Memory `vmar`. It uses
`zx_guest_set_trap` to register MMIO and PIO handlers for virtual hardware
emulation. The `VMM` emulates a PCI bus and can connect devices to that bus. It
loads the guest kernel into memory and sets up boot data with various resources
needed by the guest kernel, such as device tree blobs or ACPI tables.
### Memory
The `VMM` allocates a `vmo` to use as guest-physical memory and maps this `vmo`
into the Guest-Physical Memory `vmar` (created by `zx_guest_create`). When
addressing memory in the guest-physical memory `vmar` we call these addresses
‘Guest-Physical Addresses’ (GPA). The `VMM` also maps the same `vmo` into its
own process address space so that it can directly access this memory. When
addressing memory in the `VMM`’s `vmar` we call these addresses ‘Host-Virtual
Addresses’ (HVA). The `VMM` is able to translate a GPA into an HVA since it
knows both the guest memory map and the address at which the guest memory is
mapped in its own `vmar`.
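Because the same `vmo` backs both mappings, this translation reduces to simple
offset arithmetic. A sketch, assuming a single contiguous RAM region (the real
`VMM` supports a more general guest memory map):
```c++
#include <cstdint>

// Sketch: translate a guest-physical address (GPA) to a host-virtual address
// (HVA) for a single contiguous RAM region. `guest_base` is the GPA at which
// the RAM vmo is mapped in the guest-physical vmar; `host_base` is the HVA at
// which the same vmo is mapped in the VMM's own vmar.
struct GuestRam {
  uint64_t guest_base;
  uintptr_t host_base;
  uint64_t size;

  // Returns nullptr for addresses outside guest RAM (e.g. device MMIO holes).
  void* GpaToHva(uint64_t gpa) const {
    if (gpa < guest_base || gpa - guest_base >= size) {
      return nullptr;
    }
    return reinterpret_cast<void*>(host_base + (gpa - guest_base));
  }
};
```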
### Virtio Devices & Components
Many devices are exposed to the guest using [Virtual I/O][refer.virtio]
(`Virtio`) over PCI. The `Virtio` specification defines a set of devices that
are designed to run efficiently in a virtualization context by relying heavily
on DMA accesses to Guest-Physical memory and minimizing the number of
synchronous IO traps. To increase security and isolation between devices, we run
each `Virtio` device in its own Zircon process and only route the capabilities
needed by that component. For example, a `Virtio` Block device is only provided
a handle to the specific file(s) or device that backs the virtual disk, and a
`Virtio` Console only has access to the `zx::socket` for the serial stream.
Communication between the `VMM` and devices is done using the
`fuchsia.virtualization.hardware` [FIDL library][code.hardware_fidl]. For each
device, there is a small piece of code that is linked into the `VMM`, called the
[controller][code.device_controller], that acts as the client to these FIDL
services and connects to the [component that implements the device][code.device]
during startup. There is one process per device instance, so if a virtual
machine has 3 `Virtio` Block devices, there will be 3 controller instances and 3
`Virtio` Block components in 3 Zircon processes.
`Virtio` devices operate on the concept of shared data structures that reside in
Guest-Physical memory. The guest driver will allocate and initialize these
structures at boot and provide the `VMM` with pointers to these structures in
Guest-Physical Memory. When the driver wants to notify the device that it has
published new work to these structures, it will write to a special
device-specific ‘notify’ page in Guest-Physical Memory and the device can infer
specific events based on the offset of the write into this ‘notify’ page. Each
device component will register a `ZX_GUEST_TRAP_BELL` for this region so that
the hypervisor can forward these events directly to the target component,
without needing to bounce through the `VMM`. The device components can then
read and write these structures directly through their HVA mappings.
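As an illustration, the sketch below shows how a device component might map the
trapped write address from a bell packet back to a virtqueue index. It assumes
a notify layout in which queue `N`’s notify address sits at a fixed multiple of
`N` past a base address, mirroring the Virtio PCI notification scheme; the
concrete base and multiplier come from the device’s configuration, not from
this doc.
```c++
#include <cstdint>

#include <zircon/syscalls/port.h>

// Sketch: map a bell packet back to a virtqueue index. Assumes queue N's
// notify address is notify_base + N * notify_multiplier, mirroring the Virtio
// PCI notification layout; the real values come from device configuration.
int64_t QueueFromBell(const zx_port_packet_t& packet, uint64_t notify_base,
                      uint64_t notify_multiplier) {
  const uint64_t addr = packet.guest_bell.addr;
  if (notify_multiplier == 0 || addr < notify_base) {
    return -1;  // Not a notify write this device recognizes.
  }
  return static_cast<int64_t>((addr - notify_base) / notify_multiplier);
}
```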
### Booting
The `VMM` does not provide any guest BIOS or firmware but instead loads the
guest resources into memory directly and configures the boot VCPU to jump
directly to the kernel entry point. The details of this vary depending on which
kernel is being loaded.
#### Linux Guests
For x64 Linux guests, the `VMM` loads a bootable kernel image (e.g. a bzImage)
into
Guest-Physical Memory in accordance with the Linux [boot
protocol][refer.linux_x64_boot_protocol] and updates the Real-Mode Kernel Header
and [Zero Page][refer.linux_zero_page] with other kernel resources (ramdisk,
kernel command-line). The `VMM` will also generate and load a set of [ACPI
Tables][code.setup_acpi] that describe the emulated hardware offered to the
guest.
Arm64 Linux guests behave similarly, except we follow the arm64 [boot
protocol][refer.linux_arm64_boot_protocol] and offer a device tree blob (DTB)
instead of ACPI tables.
#### Zircon Guests
The `VMM` also supports booting Zircon guests according to the Zircon boot
requirements. Some details of how Zircon boots can be found
[here][code.zircon_loader].
## Guest Managers
The role of the Guest Manager components is to package up the guest binaries
(kernel, ramdisk, disk images) with configuration (which devices to enable,
guest kernel configuration options) and provide these to a `VMM` at startup.
There are three Guest Managers available in-tree, two of which are fairly
simple and one of which is more advanced. The simple guest managers don’t have
any guest-specific code, only configuration and binaries that are passed along
to the `VMM`. These guests are then used over the virtual console or virtual
framebuffer.
Simple Guest Managers:

* `ZirconGuestManager`
* `DebianGuestManager`
The more advanced Guest Manager is `TerminaGuestManager`, which exposes
additional functionality using gRPC services running over `Virtio` Vsock. The
`TerminaGuestManager` connects to these services to provide features such as
running commands in the guest, mounting filesystems, and launching
applications.
For more information on how to launch and use virtualization on Fuchsia, see
[Getting Started with Fuchsia Virtualization][refer.virtualization_get_started].
[define.type-2-hypervisor]:
https://en.wikipedia.org/wiki/Hypervisor#Classification
[define.termina]:
https://chromium.googlesource.com/chromiumos/overlays/board-overlays/+/master/project-termina/
[define.product]: /docs/development/build/build_system/boards_and_products.md
[define.vmar]: /docs/reference/kernel_objects/vm_address_region.md
[define.vmo]: /docs/reference/kernel_objects/vm_object.md
[define.mmio]: https://en.wikipedia.org/wiki/Memory-mapped_I/O
[define.pio]: https://wiki.osdev.org/Port_IO
[define.slat]: https://en.wikipedia.org/wiki/Second_Level_Address_Translation
[example.pl011_cr_handler]: https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/arm64/pl011.cc;l=52;drc=9fcf7ef29730fa8ecc2f1cdbe025bb3ab9741a90
[example.register_sync_trap]: https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/arm64/pl011.cc;l=47
[example.update_registers]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/arm64/vcpu.cc;l=56;drc=8c5ab55a0467643618ef12e0d2b987f9f3d24acd
[refer.linux_arm64_boot_protocol]:
https://www.kernel.org/doc/Documentation/arm/Booting
[refer.linux_x64_boot_protocol]:
https://www.kernel.org/doc/Documentation/x86/boot.txt
[refer.linux_zero_page]:
https://www.kernel.org/doc/Documentation/x86/zero-page.txt
[refer.instruction_decode]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/x64/decode.cc;l=185;drc=f09260d405305bd46e76b6717ecd13b073e67fc6
[refer.virtio]:
https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virtio-v1.1-csprd01.html
[refer.virtualization_get_started]: get_started.md
[code.guest_config]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl/fuchsia.virtualization/guest_config.fidl
[code.hardware_fidl]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl/fuchsia.virtualization.hardware/device.fidl
[code.device_controller]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/controller/
[code.device]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/device/
[code.setup_acpi]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/arch/x64/acpi.cc;l=114;drc=544ca58a71d461be5aa5afea58522234df6be33d
[code.vmm]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:src/virtualization/bin/vmm/
[code.zircon_loader]:
https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl/zbi/kernel.fidl;l=9-67;drc=14ae4735d15e85d0e2016acb1355102aad4dada1
[image.mmio_bell]: images/mmio_bell_trap.png
[image.mmio_sync]: images/mmio_sync_trap.png
[image.overview]: images/virtualization_stack.png
[image.slat]: images/second_level_address_translation.png