{% set rfcid = "RFC-0201" %} {% include "docs/contribute/governance/rfcs/_common/_rfc_header.md" %}
{# Fuchsia RFCs use templates to display various fields from _rfcs.yaml. View the #} {# fully rendered RFCs at https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs #}
Allow the host to reclaim memory used by the guest.
Running a memory-hungry application in the guest and then starting a memory-hungry application on the host may cause an OOM even when the guest's memory is no longer in use.
The reasons for this are twofold.
Example user journey
When an OS boots, it asks the hardware (the hypervisor, in the case of a guest) how much physical memory there is. If the host told the guest it has 4GiB of RAM, the guest will know that it can allocate a million or so 4KiB pages.
The way to do it is to introduce a level of mapping between what the guest OS considers a physical address (called a guest physical address) and the actual physical address (called a host physical address). As a reminder, the first level of address translation maps a guest virtual address to a guest physical address. In other words, a guest virtual address is first translated to a guest physical address by the hardware, which does the translation using page tables managed by the guest OS, and then to a host physical address; the latter translation is controlled by page tables managed by the hypervisor.
The guest handles its own page faults, which occur when it accesses guest virtual addresses that have not been mapped to guest physical addresses. The hypervisor handles page faults for guest-physical to host-physical translations.
We use paged memory, which means that when the hypervisor gives the guest X GiB of RAM it does not allocate any memory up front, but only when pages are actually required, i.e. when the guest attempts to access them. From the hypervisor's point of view, allocated physical memory pages belong to the guest and cannot be used for host processes.
As a result, the guest may end up with a lot of allocated physical memory pages which are not being used while the host can be low on memory to run its own processes.
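The two-stage translation and lazy allocation described above can be sketched as follows. This is an illustrative simulation, not Fuchsia hypervisor code: the dicts stand in for page tables, and all names and addresses are invented for the example.

```python
# Sketch of two-stage address translation with demand allocation of
# host-physical pages. Page size is 4 KiB; dicts stand in for page tables.
PAGE_SIZE = 4096

class Hypervisor:
    def __init__(self):
        self.gpa_to_hpa = {}            # second-stage page tables
        self.next_free_hpa = 0x100000   # toy host-physical allocator

    def translate(self, gpa_page):
        # A miss here is the guest-physical to host-physical page fault:
        # the hypervisor allocates a real page only on first access.
        if gpa_page not in self.gpa_to_hpa:
            self.gpa_to_hpa[gpa_page] = self.next_free_hpa
            self.next_free_hpa += PAGE_SIZE
        return self.gpa_to_hpa[gpa_page]

class Guest:
    def __init__(self, hypervisor):
        self.hypervisor = hypervisor
        self.gva_to_gpa = {}            # first stage, managed by the guest OS

    def access(self, gva):
        gva_page, offset = divmod(gva, PAGE_SIZE)
        gpa_page = self.gva_to_gpa[gva_page]       # guest page tables
        hpa = self.hypervisor.translate(gpa_page)  # hypervisor page tables
        return hpa + offset

hv = Hypervisor()
guest = Guest(hv)
guest.gva_to_gpa[0] = 42       # guest maps gva page 0 -> gpa page 42
addr = guest.access(0x123)     # first access faults and allocates a host page
assert len(hv.gpa_to_hpa) == 1 # only touched pages consume host memory
```

Note that the guest could have been told it owns gigabytes of RAM, yet the hypervisor has allocated only the single page that was actually touched.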
Below we'll talk about two ways to make guest memory available to the host:
The host can tell the guest virtio-balloon driver to inflate the balloon to a certain size by configuring the target balloon size. Inflating the balloon means the guest reserves the required number of memory pages and reports the guest-physical addresses of the allocated pages back to the host. From this point the host can decommit the reported memory pages and reuse the physical memory that was backing them.
The guest can start using the memory again at any time if virtio-balloon negotiated VIRTIO_BALLOON_F_DEFLATE_ON_OOM. See the Virtio Spec, section 5.5.6.1. Currently we enable VIRTIO_BALLOON_F_DEFLATE_ON_OOM.
The host can allow the guest to reuse pages from the balloon by reducing the target size of the balloon. The guest may reuse pages previously given to the balloon if the configured balloon size is less than the actual number of pages in the balloon. If the guest wants to use the memory again, it deflates the balloon, letting the host know that the guest will use a range of guest physical addresses in the future. When, after the deflate, the guest accesses pages that have been removed from the balloon, it hits a guest-physical to host-physical page fault and the hypervisor allocates a new physical memory page for the guest to use.
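The inflate/deflate cycle can be sketched as a small simulation. This is not the real virtio-balloon driver or device; the class names, page-set model, and sizes are invented for illustration. Recommit on deflate is not modeled, since in reality it happens lazily via the page fault described above.

```python
# Sketch of the balloon protocol: the host sets a target size, the guest
# reserves free pages up to the target and reports their page frame numbers
# (PFNs), and the host decommits them. Deflate works in reverse.
class Host:
    def __init__(self):
        self.committed = set(range(8))  # host pages backing the guest

    def decommit(self, pfn):
        self.committed.discard(pfn)     # memory returns to the host

class Balloon:
    def __init__(self, guest_free_pfns, host):
        self.free_pfns = set(guest_free_pfns)  # pages the guest may give away
        self.ballooned = set()
        self.host = host

    def set_target(self, target):
        while len(self.ballooned) < target and self.free_pfns:
            pfn = self.free_pfns.pop()   # inflate: guest reserves the page...
            self.ballooned.add(pfn)
            self.host.decommit(pfn)      # ...and reports it to the host
        while len(self.ballooned) > target:
            pfn = self.ballooned.pop()   # deflate: guest takes the page back;
            self.free_pfns.add(pfn)      # the next access will fault and get
                                         # a fresh host page

host = Host()
balloon = Balloon(range(8), host)
balloon.set_target(6)                    # inflate: host reclaims 6 pages
assert len(host.committed) == 2
balloon.set_target(0)                    # deflate: guest may reuse the pages
assert len(balloon.free_pfns) == 8
```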
See the Virtio Balloon slides and the Virtio Balloon video for a detailed explanation of the virtio-balloon core functionality.
In 2020 virtio-balloon received a new feature called free page reporting.
Free page reporting adds a way for the guest to report free memory pages back to the host. The guest does so by adding 4MiB-sized free pages (a Linux implementation constant) to a free page report and sending the report to the host. The guest guarantees not to reuse any of the reported pages until the host acknowledges the report.
When the host receives a free page report, it decommits the reported memory pages, making them available for host applications, and acknowledges the report back to the guest. At this point the guest may reuse pages that were previously freed and acknowledged. If the guest decides to reuse such a page, the host detects a guest-physical to host-physical page fault and allocates a new physical page to fulfill the guest request.
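The report/acknowledge handshake can be sketched as follows. This is an illustrative model, not the actual virtio-balloon reporting virtqueue; the class and method names are invented. The key invariant is that reported pages are off limits to the guest until the host's acknowledgement arrives.

```python
# Sketch of the free page reporting handshake: the guest batches free pages
# into a report, the host decommits them and acknowledges, and only then may
# the guest hand those pages out again.
class ReportingHost:
    def __init__(self):
        self.committed = set()

    def handle_report(self, pfns):
        for pfn in pfns:
            self.committed.discard(pfn)  # decommit: usable by host apps now
        return True                      # ack back to the guest

class ReportingGuest:
    def __init__(self, host):
        self.host = host
        self.free = set()
        self.pending = set()             # reported, not yet acked: off limits

    def report_free_pages(self, pfns):
        self.pending |= set(pfns)
        self.free -= set(pfns)
        if self.host.handle_report(pfns):  # guest waits for the ack before
            self.free |= self.pending      # reusing any reported page
            self.pending.clear()

host = ReportingHost()
host.committed = set(range(16))
guest = ReportingGuest(host)
guest.free = set(range(16))
guest.report_free_pages(range(8))
assert host.committed == set(range(8, 16))  # host reclaimed pages 0-7
assert not guest.pending                    # acked: guest may reuse them
```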
See Free page reporting by Alexander Duyck (Intel), slide 10, for a detailed explanation.
Facilitator:
Reviewers:
Consulted:
Socialization:
This RFC went through a review with the Virtualization team. The approach was discussed with cwd@google.com, who is solving a similar, albeit larger, problem for ChromeOS.
We'll use the free page reporting feature to report and reclaim all free memory to the host.
"All" here means anything of order PAGE_REPORTING_MIN_ORDER or higher (defined in the Linux kernel to correspond to 4MiB).
Using free page reporting will reclaim most of the memory over the next 30 seconds. The reported free page size is 2MiB or 4MiB, so some amount of memory fragmentation is expected. On the guest side, free page reporting is staggered over time to minimize the performance impact.
Linux 5.15 will report most free memory over 30 seconds, in 2MiB and 4MiB blocks.
Free page reporting won't evict the Linux page cache, which could be a problem if the guest is running IO-intensive workloads. See Linux Page Cache Basics for information about the page cache in Linux.
This might change in the future when our Linux guest images start using MGLRU. See the "Drawbacks, alternatives, and unknowns" section on MGLRU.
Not trashing the page cache arbitrarily is a good thing; it exists for a reason. The host should take memory from the guest page cache only when the host actually needs it. This means we'll need to provide a way to reclaim guest memory being used for the page cache when the host is under memory pressure.
The second change is to use the memorypressure provider to inflate the balloon on WARNING and CRITICAL host memory pressure levels.
Inflating the balloon achieves two goals:
The inflation volume will be proportional to the available guest memory: the balloon will be inflated to 90% of the available guest memory on WARNING and CRITICAL host memory events. We have to inflate on both WARNING and CRITICAL events in case free memory sharply drops from NORMAL to CRITICAL. The balloon will be deflated to 0% when host memory pressure returns to NORMAL.
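The policy above can be captured in a few lines. This is a minimal sketch, not the actual implementation; the pressure level names mirror Fuchsia's memorypressure levels, but the helper function and the way the 90% factor is applied here are illustrative assumptions.

```python
# Sketch of the balloon sizing policy: inflate to 90% of available guest
# memory on WARNING or CRITICAL host pressure, deflate fully on NORMAL.
INFLATE_FRACTION = 0.9  # 90%, per the policy described in the RFC

def target_balloon_pages(pressure_level, guest_available_pages):
    if pressure_level in ("WARNING", "CRITICAL"):
        # Inflate on both levels: pressure can jump straight from
        # NORMAL to CRITICAL without passing through WARNING.
        return int(guest_available_pages * INFLATE_FRACTION)
    return 0  # NORMAL: deflate the balloon completely

assert target_balloon_pages("WARNING", 1000) == 900
assert target_balloon_pages("CRITICAL", 1000) == 900
assert target_balloon_pages("NORMAL", 1000) == 0
```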
We want to avoid constantly inflating and deflating the balloon when the host is under memory pressure. Balloon inflation has a performance cost for both the guest and the host. On top of that, constant balloon resizes cause many TLB shootdowns, according to Intel.
To prevent the balloon size from bouncing back and forth, we'll throttle balloon inflation to one inflate per X seconds, with X to be configured during teamfood testing; the initial value will be 1 minute. There is potential for more timeouts, such as a timeout to deflate the balloon if the host stays at the WARNING memory pressure level for too long. Additional timeouts can be added based on teamfood testing telemetry.
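The throttling rule can be sketched as a simple rate limiter. This is an illustrative assumption about how it might be structured, not the real code; the class name and clock injection are invented, and the 60-second interval stands in for the value to be tuned during teamfood testing.

```python
# Sketch of throttling inflate requests to one per interval. A clock is
# injected so the behavior can be demonstrated without real waiting.
class InflateThrottle:
    def __init__(self, min_interval_seconds, clock):
        self.min_interval = min_interval_seconds
        self.clock = clock            # callable returning seconds
        self.last_inflate = None

    def try_inflate(self):
        now = self.clock()
        if self.last_inflate is not None and now - self.last_inflate < self.min_interval:
            return False              # too soon: drop this inflate request
        self.last_inflate = now
        return True

fake_time = [0.0]
throttle = InflateThrottle(60, lambda: fake_time[0])
assert throttle.try_inflate() is True    # first inflate passes
fake_time[0] = 30.0
assert throttle.try_inflate() is False   # inside the 60s window: throttled
fake_time[0] = 61.0
assert throttle.try_inflate() is True    # window elapsed: allowed again
```

Dropped requests are safe to discard here because a later WARNING or CRITICAL event will retrigger inflation once the window elapses.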
Implementing memory reclaim would improve host memory performance when the user is running memory-hungry applications in the guest. The host would have more memory available to work with, instead of resorting to memory compression and other CPU-expensive ways to get memory while the guest has memory available to reclaim.
The free page reporting operation is staggered over 30 seconds in the Linux implementation to reduce the performance impact on the guest. We expect a 1%-2% performance impact in memory-intensive guest workloads. See Free page reporting benchmarks.
Inflating the balloon when the host is low on memory might add extra load on both the host and the guest.
We'll need to measure the number of guest "TLB shootdown" interrupts when the host is operating under memory pressure, with and without memory reclaim enabled.
Benchmarks:
Metrics to capture
Reclaimed free pages are zeroed as part of the decommit operation, the same as with balloon inflation. This prevents guest information from leaking to the host and to other guests.
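The property above can be demonstrated with a tiny model. This is an illustrative sketch only; the dict-of-bytes "memory" is a stand-in for real page frames, and the class is invented for the example.

```python
# Sketch: a decommitted page reads back as zeros when the guest touches it
# again, so previous contents never leak to the host or to other guests.
PAGE_SIZE = 4096

class PageStore:
    def __init__(self):
        self.pages = {}

    def write(self, pfn, data):
        self.pages[pfn] = data

    def decommit(self, pfn):
        self.pages.pop(pfn, None)    # backing memory returned to the host

    def read(self, pfn):
        # First access after decommit demand-allocates a zero-filled page.
        return self.pages.setdefault(pfn, bytes(PAGE_SIZE))

store = PageStore()
store.write(7, b"secret".ljust(PAGE_SIZE, b"\0"))
store.decommit(7)                         # page reclaimed (balloon or report)
assert store.read(7) == bytes(PAGE_SIZE)  # old contents are gone
```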
The bulk of the work will be covered by unit tests and two integration tests. One integration test will cover free page reporting memory reclaim, and the other will cover the interaction of the guest page cache and balloon inflation.
The user journey described in the motivation section will be tested manually. The user journey's reliance on launching a guest makes it non-hermetic. If we had an automated end-to-end virtualization test, it could be extended to cover this scenario, but we don't think it is practical to take a dependency on building one for this RFC.
The Multigenerational LRU Framework, aka MGLRU, is a memory management improvement for the Linux kernel. The current page reclaim in Linux is too expensive in terms of CPU usage, and it often makes poor choices about what to evict. MGLRU aims to make better choices than the current Linux kernel page reclaim code, and to do so more efficiently. Numbers reported by Google engineers: cold start times reduced by up to 16% while enjoying fewer low-memory kills; Chrome OS saw upwards of 59% fewer out-of-memory kills and 96% fewer low-memory tab discards in its browser; server results have been very promising too.
See PATCH v14 00/14 Multi-Gen LRU Framework for more details.
The Termina 5.15 kernel, which we currently use, does not have the MGLRU patches, and MGLRU is not yet available upstream. The Termina kernel developers are waiting for MGLRU to be accepted upstream before backporting it to Termina 5.15.
The MGLRU API could be used to drive the free page reporting logic. It's worth investigating using the MGLRU API to improve free page reporting performance in the Linux kernel; if successful, this could be proposed for merging upstream. This would be an optimization of the existing free page reporting solution in the Linux kernel.
There is no dependency on MGLRU in allowing the host to reclaim the guest memory.
This was the originally suggested way to perform memory reclaim, hence the name "memory daemon".
This is the approach currently used by ChromeOS. It has a number of drawbacks:
ChromeOS is currently working on the next iteration, Responsive virtio-balloon. The idea, in a nutshell, is to use a low memory killer in both the guest and the host to adjust the balloon size instead of killing applications. There are also plans to use MGLRU to guide the balloon size adjustment logic.
ChromeOS is solving a much harder and somewhat different problem:
Fuchsia virtualization doesn't have a large device fleet to collect statistics from. The problem we are trying to solve is much simpler. Fuchsia doesn't have an OOM killer to hook up to.
The Fuchsia host does allow running multiple guests (Termina, Debian, Fuchsia) simultaneously. Currently this is not a main use case; typically users and tests run a single guest. This might change once we start using more powerful hardware. The proposed solution would work for multiple oversubscribed guests as long as the guests do not use all the available memory, e.g. an idle Debian guest and an active Termina guest.
The problem space gets much bigger if we have to support multiple guests that do try to use all the available memory. Deciding which guest, or which guest application, is more important should be a product policy. We should focus on exposing the right tools to the product; at the platform level we do not want to be prescriptive about how low memory is handled.
We will go with the simpler and more predictable solution to solve the problem at hand, while adding data collection to analyze OOMs and see whether we need more complex heuristics.
DAX mapping allows the guest to directly access file contents from the host's caches, avoiding duplication between the guest and the host. Adding virtio-fs support with DAX, enabling page cache sharing, and adding page cache discard on memory pressure is an alternative to virtio-balloon inflation for clearing the guest page cache. Granular page cache control would be better than the blanket page cache eviction caused by balloon inflation: we could discard old page cache entries while keeping recent ones, alleviating memory pressure without affecting host/guest performance too much.