# Memory reclamation in Fuchsia
Most operating systems employ memory reclamation strategies to ensure that the
working set of running processes at any point in time can efficiently utilize
all available physical memory. The operating system has a fixed amount of
physical memory (RAM) to distribute amongst all running processes, and it might
not be possible to accommodate them all at the same time.
In its simplest form, memory reclamation is page replacement, where pages that
are not as important for the current user activity are replaced with pages that
might be more important. Most operating systems maintain a pool of free pages,
so that incoming memory allocations are quickly fulfilled, rather than blocked
while waiting for a page in use to be freed.
Fuchsia employs a similar strategy: the system tries to keep the
amount of free memory larger than a certain threshold. Fuchsia uses several
memory reclamation techniques, both within the kernel and in userspace. This
guide describes how these memory reclamation techniques work. Fuchsia also
provides a set of tools to analyze and dump memory usage (see
[Memory usage analysis tools](/docs/development/kernel/memory/memory.md#userspace_memory)).
## Pager-backed memory eviction
Userspace filesystems use
[pagers](/docs/reference/kernel_objects/pager.md)
to [page](https://en.wikipedia.org/wiki/Memory_paging) in files on demand from
an external source, like a disk. Filesystems represent files in memory using
VMOs, whose pages are populated by the pager service as and when they are
accessed.
On Fuchsia,
[blobfs](/docs/concepts/filesystems/blobfs.md) is
an immutable filesystem that hosts all executable files using a pager to
populate pages on demand. When the system comes under memory pressure, i.e. the
amount of available free memory starts running low, the kernel evicts pages
backed by blobfs in order to reclaim memory. Since these pages exist on disk,
they can be fetched back in when required.
The kernel tracks all pager-backed memory. When free memory is low, it finds
suitable candidates to evict. Pages are tracked in several LRU (least recently
used) page queues, which a kernel background thread rotates periodically, in
order to "age" pages. Another background thread evicts pages from the oldest
page queue under memory pressure.
As free memory dips lower, the kernel adjusts its aging and eviction policies
to be more aggressive, aging pages quicker in order to find more evictable
candidates. In order to prevent thrashing, pages in the MRU (most recently used)
page queue are never evicted. The length of this queue varies based on the
amount of churn on the system.
If the system is relatively quiet and the memory usage is stable, the kernel
ages pages slower and more pages accumulate in the MRU queue. On the other hand,
if the user is cycling through several activities, constantly switching the
working set, the kernel tries to keep up by aging pages more aggressively.
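The aging and eviction scheme described above can be sketched with a toy model. This is an illustrative simplification, not the kernel's implementation: the class name `ToyPageQueues`, its methods, and the use of plain sets are all assumptions made for this example. Queue 0 stands in for the MRU queue, rotation ages pages by one queue, and eviction only ever takes pages from the oldest queue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <utility>
#include <vector>

// Toy model of aging page queues. Queue 0 is the MRU queue; the last queue is
// the oldest. All names here are illustrative, not kernel identifiers.
class ToyPageQueues {
 public:
  explicit ToyPageQueues(size_t num_queues) : queues_(num_queues) {
    assert(num_queues >= 2);
  }

  // Accessing a page moves it (or adds it) to the MRU queue.
  void Access(uint64_t page) {
    for (auto& queue : queues_) {
      queue.erase(page);
    }
    queues_.front().insert(page);
  }

  // Ages every page by one queue. Pages already in the oldest queue stay there.
  void Rotate() {
    const size_t last = queues_.size() - 1;
    queues_[last].insert(queues_[last - 1].begin(), queues_[last - 1].end());
    for (size_t i = last - 1; i > 0; --i) {
      queues_[i] = std::move(queues_[i - 1]);
    }
    queues_.front().clear();
  }

  // Evicts up to max_pages pages, taking them only from the oldest queue.
  // Pages in the MRU queue are never touched.
  size_t EvictOldest(size_t max_pages) {
    auto& oldest = queues_.back();
    size_t evicted = 0;
    while (evicted < max_pages && !oldest.empty()) {
      oldest.erase(oldest.begin());
      ++evicted;
    }
    return evicted;
  }

  size_t QueueSize(size_t index) const { return queues_.at(index).size(); }

 private:
  std::vector<std::unordered_set<uint64_t>> queues_;
};
```

In this model, rotating more often makes pages reach the oldest queue sooner, which corresponds to the more aggressive aging the kernel applies as free memory drops.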
Userspace processes can also use
[eviction hints](/docs/contribute/governance/rfcs/0068_eviction_hints.md)
to influence the kernel eviction strategy. Processes can use the `DONT_NEED`
hint to indicate pages are no longer in use and would be good candidates for
eviction. They can also use `ALWAYS_NEED` to indicate pages are important and
should not be considered for eviction, thereby avoiding the cost of fetching
them back in when they're accessed again.
Learn more about eviction hints in the reference docs:
[`zx_vmo_op_range`](/docs/reference/syscalls/vmo_op_range.md)
and
[`zx_vmar_op_range`](/docs/reference/syscalls/vmar_op_range.md).
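As a rough illustration of how hints could influence candidate selection, the sketch below orders pages for eviction in a toy model. The `Hint` enum, `PageInfo` struct, and `EvictionOrder` function are hypothetical names invented for this example, and the policy shown (hinted `DONT_NEED` pages first, then oldest first, `ALWAYS_NEED` pages skipped) is an assumption that simplifies whatever the kernel actually does.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-page hint state for this toy model.
enum class Hint { kNone, kDontNeed, kAlwaysNeed };

struct PageInfo {
  uint64_t id;
  uint32_t age;  // Larger means less recently used.
  Hint hint;
};

// Returns page ids in the order this toy model would evict them.
std::vector<uint64_t> EvictionOrder(std::vector<PageInfo> pages) {
  // Pages hinted ALWAYS_NEED are not considered for eviction.
  pages.erase(std::remove_if(pages.begin(), pages.end(),
                             [](const PageInfo& p) {
                               return p.hint == Hint::kAlwaysNeed;
                             }),
              pages.end());
  // Pages hinted DONT_NEED go first; the rest are ordered oldest first.
  std::stable_sort(pages.begin(), pages.end(),
                   [](const PageInfo& a, const PageInfo& b) {
                     const bool a_first = a.hint == Hint::kDontNeed;
                     const bool b_first = b.hint == Hint::kDontNeed;
                     if (a_first != b_first) {
                       return a_first;
                     }
                     return a.age > b.age;
                   });
  std::vector<uint64_t> order;
  for (const PageInfo& p : pages) {
    order.push_back(p.id);
  }
  return order;
}
```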
## Zero page deduplication
Pages in anonymous VMOs (non-pager-backed) get populated / committed only on a
write. Reads are fulfilled by the kernel using a singleton zero page. Even after
pages have been committed on a write, the kernel tries to deduplicate pages that
are filled only with zeros back to the singleton zero page in order to save
memory. The kernel periodically scans physical pages in anonymous VMOs, looking
for opportunities to deduplicate zero pages.
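A minimal sketch of the deduplication idea follows, assuming a 4KB page size and modeling "backed by the singleton zero page" as a null slot. `Page`, `PageSlot`, `IsZeroPage`, and `DedupZeroPages` are illustrative names for this example only; the kernel's scanner works on physical pages and is considerably more involved.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

constexpr size_t kPageSize = 4096;  // Assumed page size for this sketch.
using Page = std::array<uint8_t, kPageSize>;

// A null slot models a page that is backed by the shared zero page.
using PageSlot = std::unique_ptr<Page>;

// Returns true if every byte in the page is zero.
bool IsZeroPage(const Page& page) {
  return std::all_of(page.begin(), page.end(),
                     [](uint8_t byte) { return byte == 0; });
}

// Scans committed pages and releases those that contain only zeros.
// Returns the number of pages freed.
size_t DedupZeroPages(std::vector<PageSlot>& slots) {
  size_t freed = 0;
  for (PageSlot& slot : slots) {
    if (slot && IsZeroPage(*slot)) {
      slot.reset();
      ++freed;
    }
  }
  return freed;
}
```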
## Page table reclamation
As explained in [Address spaces](/docs/concepts/memory/address_spaces.md),
the VMAR hierarchy helps the kernel track virtual to physical memory mappings.
When a virtual address is accessed for the first time,
the address space's VMAR tree is used to look up the underlying physical page.
The virtual-to-physical mapping is then stored in the hardware page tables,
which the MMU uses for future lookups.
Under memory pressure,
the kernel reclaims hardware page tables that haven't been accessed for a while.
When those mappings are needed again,
they can be reconstructed from the VMAR tree.
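The relationship between the two structures resembles a cache in front of an authoritative store. The toy model below illustrates that idea under stated assumptions: `ToyAddressSpace` and its members are hypothetical names, an ordered map stands in for the VMAR tree, and a hash map stands in for the hardware page tables.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <optional>
#include <unordered_map>

// Toy model: authoritative_ stands in for the VMAR tree, page_table_ for the
// hardware page tables. Addresses are page numbers for simplicity.
class ToyAddressSpace {
 public:
  void Map(uint64_t vpage, uint64_t ppage) { authoritative_[vpage] = ppage; }

  // Translates a virtual page. The first access takes the slow path through
  // the authoritative map and then populates the page table.
  std::optional<uint64_t> Translate(uint64_t vpage) {
    if (auto cached = page_table_.find(vpage); cached != page_table_.end()) {
      return cached->second;
    }
    auto it = authoritative_.find(vpage);
    if (it == authoritative_.end()) {
      return std::nullopt;
    }
    ++slow_lookups_;
    page_table_[vpage] = it->second;
    return it->second;
  }

  // Drops every cached translation. Nothing is lost, because each entry can
  // be rebuilt from the authoritative map on the next access.
  void ReclaimPageTables() { page_table_.clear(); }

  size_t cached_entries() const { return page_table_.size(); }
  size_t slow_lookups() const { return slow_lookups_; }

 private:
  std::map<uint64_t, uint64_t> authoritative_;
  std::unordered_map<uint64_t, uint64_t> page_table_;
  size_t slow_lookups_ = 0;
};
```

The cost of reclamation in this model is one extra slow lookup per mapping that is touched again, which is why only mappings that have not been accessed recently are good candidates.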
## Discardable VMOs
Userspace processes can create a special flavor of
[VMOs that are discardable](/docs/contribute/governance/rfcs/0012_zircon_discardable_memory.md).
Clients can
[lock and unlock](/docs/reference/syscalls/vmo_op_range.md) discardable
VMOs depending on whether or not they are being used. When the system is under
memory pressure, the kernel finds discardable VMOs that are unlocked and frees
them.
Sample code (modulo error handling):
```cpp
// Create a discardable VMO.
zx_handle_t vmo;
zx_vmo_create(vmo_size, ZX_VMO_DISCARDABLE, &vmo);
// Lock the VMO.
zx_vmo_lock_state_t lock_state = {};
zx_vmo_op_range(vmo, ZX_VMO_OP_LOCK, 0, vmo_size, &lock_state,
                sizeof(lock_state));
// Use the VMO as desired.
zx_vmo_read(vmo, buf, 0, sizeof(buf));
// Unlock the VMO. The kernel is free to discard it now.
zx_vmo_op_range(vmo, ZX_VMO_OP_UNLOCK, 0, vmo_size, nullptr, 0);
// Lock the VMO again before use.
zx_vmo_op_range(vmo, ZX_VMO_OP_LOCK, 0, vmo_size, &lock_state,
                sizeof(lock_state));
if (lock_state.discarded_size > 0) {
  // The kernel discarded the VMO. Re-initialize it if required.
  zx_vmo_write(vmo, data, 0, sizeof(data));
} else {
  // The kernel did not discard the VMO. Previous contents were preserved.
}
```
## Memory pressure signals
Fuchsia provides userspace processes the ability to directly control their
memory consumption in response to system-wide available memory. Clients can
register to receive
[memory pressure
signals](https://fuchsia.dev/reference/fidl/fuchsia.memorypressure.md)
and take actions depending on the observed memory pressure level. There are
[three memory pressure levels](https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl/fuchsia.memorypressure/memorypressure.fidl;l=8):
<table>
<tr><th>Name</th><th>Value</th><th>Description</th></tr>
<tr id="Level.NORMAL">
<td><h3 id="Level.NORMAL" class="add-link hide-from-toc">NORMAL</h3></td>
<td><code>1</code></td>
<td><p>The memory pressure level is healthy.</p>
<p>Registered clients are free to hold on to caches and allocate memory
unrestricted.</p>
<p>However, clients should take care to not proactively re-create caches on a
transition back to the NORMAL level, causing a memory spike that immediately
pushes the level over to WARNING again.</p>
</td>
</tr>
<tr id="Level.WARNING">
<td><h3 id="Level.WARNING" class="add-link hide-from-toc">WARNING</h3></td>
<td><code>2</code></td>
<td><p>The memory pressure level is somewhat constrained, and might cross over to
the critical pressure range if left unchecked.</p>
<p>Registered clients are expected to optimize their operation to limit memory
usage, rather than for best performance, for example, by reducing cache sizes
and non-essential memory allocations.</p>
<p>Clients must take care to regulate the amount of work they undertake in
order to reclaim memory, and ensure that it does not cause visible
performance degradation. There exists some memory pressure, but not enough
to justify trading off user responsiveness to reclaim memory.</p>
</td>
</tr>
<tr id="Level.CRITICAL">
<td><h3 id="Level.CRITICAL" class="add-link hide-from-toc">CRITICAL</h3></td>
<td><code>3</code></td>
<td><p>The memory pressure level is very constrained.</p>
<p>Registered clients are expected to drop all non-essential memory, and refrain
from allocating more memory. Failing to do so might result in the job
getting terminated, or the system being rebooted in the case of global
memory pressure.</p>
<p>Clients may undertake expensive work to reclaim memory if required, since
failing to do so might result in termination. The client might decide that a
performance hit is a fair tradeoff in this case.</p>
</td>
</tr>
</table>
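The expectations in the table can be sketched as a cache that adjusts itself when the level changes. This is only an illustration of one possible client policy: `PressureAwareCache`, its `OnLevelChanged` method, and the choice to halve capacity at WARNING are assumptions made for this example, and the FIDL plumbing needed to actually receive signals is omitted. The enum values match the table above.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <string>
#include <utility>

// Values match the levels listed in the table above.
enum class Level : uint32_t { NORMAL = 1, WARNING = 2, CRITICAL = 3 };

// Hypothetical cache that reacts to memory pressure level changes.
class PressureAwareCache {
 public:
  explicit PressureAwareCache(size_t normal_capacity)
      : normal_capacity_(normal_capacity), capacity_(normal_capacity) {}

  void OnLevelChanged(Level level) {
    switch (level) {
      case Level::NORMAL:
        // Restore the limit, but do not proactively refill the cache.
        capacity_ = normal_capacity_;
        break;
      case Level::WARNING:
        // Cheap reduction: shrink the cache without expensive work.
        capacity_ = normal_capacity_ / 2;
        break;
      case Level::CRITICAL:
        // Drop all non-essential memory and stop caching.
        capacity_ = 0;
        break;
    }
    Trim();
  }

  // Returns false if caching is currently disabled.
  bool Insert(std::string value) {
    if (capacity_ == 0) {
      return false;
    }
    entries_.push_back(std::move(value));
    Trim();
    return true;
  }

  size_t size() const { return entries_.size(); }

 private:
  // Drops the oldest entries until the cache fits the current capacity.
  void Trim() {
    while (entries_.size() > capacity_) {
      entries_.pop_front();
    }
  }

  size_t normal_capacity_;
  size_t capacity_;
  std::deque<std::string> entries_;
};
```

Note that the transition back to NORMAL only raises the limit. Entries return to the cache lazily as they are used, which avoids the memory spike the NORMAL row warns about.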
### Comparing memory pressure signals to discardable VMOs
Userspace clients can pick between memory pressure signals and discardable
VMOs, or use a combination of both reclamation mechanisms based on their
needs. These are some things to consider when making the choice:
- Memory pressure signals allow clients to do more than just trim caches.
For example, jobs can tear down non-essential processes in their job tree.
They can also stop certain memory-intensive activities, or hold off on
starting new ones until the pressure level is Normal.
- With discardable VMOs, the userspace client gives up control over when
the memory is freed to the kernel. The kernel decides when to free the
memory based on various factors: the amount of available memory, memory
that can be reclaimed by other means, etc. If the client wishes to finely
control the lifetimes of its caches, when to trim what, etc., memory
pressure signals might be more suitable.
- It is possible that discardable VMOs end up preserving their contents
for longer than if the process was tearing down the VMOs itself in response
to memory pressure signals. The kernel drives freeing of discardable VMOs,
and the kernel has more global context around the amount of free memory, so
it knows exactly how much to reclaim. The kernel also has other means of
reclaiming memory at its disposal, so it's possible that not all
discardable VMOs need to be freed up. On the other hand, if the userspace
client is responding to memory pressure itself, it will likely react in the
same manner every time, trimming all its caches.
- Discardable memory can also allow the kernel to reclaim memory more
quickly so that the system recovers faster. With memory pressure signals,
there can be some IPC and scheduling latency involved, between the kernel
signaling the pressure level change, and the userspace process responding to it.
## OOM (Out-of-memory) reboot
It is possible for all memory reclamation strategies to fail to free up enough
memory in the face of certain aggressive memory allocation patterns. When that
happens, the kernel opts to reboot after cleanly shutting down filesystems to
prevent data loss. When the free memory level falls below a preconfigured OOM
threshold, an OOM reboot is triggered.
## Tools to test memory pressure response
### Observing and testing kernel memory reclamation
Use the `k scanner` command to observe and test reclamation techniques the
kernel uses: pager-backed eviction, discardable VMO reclamation, zero page
deduplication, and page table reclamation. It can also be used to test the page
queue rotation / aging strategy used to drive eviction. Run `k scanner` on a
serial console to see all available options:
```posix-terminal
k scanner
usage:
scanner dump : dump scanner info
scanner push_disable : increase scanner disable count
scanner pop_disable : decrease scanner disable count
scanner reclaim_all : attempt to reclaim all possible memory
scanner rotate_queue : immediately rotate the page queues
scanner reclaim <MB> [only_old] : attempt to reclaim requested MB of memory.
scanner pt_reclaim [on|off] : turn unused page table reclamation on or off
scanner harvest_accessed : harvest all page accessed information
```
`k scanner dump` dumps the current state of the page queues and other relevant
memory counters the kernel uses for reclamation:
```posix-terminal
k scanner dump
[SCAN]: Scanner enabled. Triggering informational scan
[SCAN]: Found 4303 zero pages across all of memory
[SCAN]: Found 8995 user-pager backed pages in queue 0
[SCAN]: Found 3278 user-pager backed pages in queue 1
[SCAN]: Found 8947 user-pager backed pages in queue 2
[SCAN]: Found 10776 user-pager backed pages in queue 3
[SCAN]: Found 3981 user-pager backed pages in queue 4
[SCAN]: Found 0 user-pager backed pages in queue 5
[SCAN]: Found 0 user-pager backed pages in queue 6
[SCAN]: Found 0 user-pager backed pages in queue 7
[SCAN]: Found 1347 user-pager backed pages in DontNeed queue
[SCAN]: Found 40 zero forked pages
[SCAN]: Found 0 locked pages in discardable vmos
[SCAN]: Found 0 unlocked pages in discardable vmos
pq: MRU generation is 12 set 10.720698018s ago due to "Active ratio", LRU generation is 6
pq: Pager buckets [8995],[3278],8947,10776,3981,0,{0},0, evict first: 1347, live active/inactive totals: 12273/25051
```
Test reclaiming memory with `k scanner reclaim` or `k scanner reclaim_all`:
```posix-terminal
k scanner reclaim_all
[EVICT]: Free memory before eviction was 7161MB and after eviction is 7290MB
[EVICT]: Evicted 33004 user pager backed pages
[SCAN]: De-duped 25 pages that were recently forked from the zero page
```
Test page table reclamation with `k pmm drop_user_pt`:
```posix-terminal
k pmm
pmm drop_user_pt : drop all user hardware page tables
```
### Observing and generating memory pressure
Use the `k pmm mem_avail_state` command to generate memory pressure on the
system, by allocating memory to reach the specified memory pressure level. This
is useful for testing system-wide response to memory pressure:
```posix-terminal
k pmm mem_avail_state
pmm mem_avail_state info : dump memory availability state info
pmm mem_avail_state [step] <state> [<nsecs>] : allocate memory to go to memstate <state>, hold the state for <nsecs> (10s by default). Only works if going to <state> from current state requires allocating memory, can't free up pre-allocated memory. In optional [step] mode, allocation pauses for 1 second at each intermediate memory availability state until <state> is reached.
```
`k pmm mem_avail_state info` dumps the current memory pressure state.
```posix-terminal
k pmm mem_avail_state info
watermarks: [50M, 60M, 150M, 300M]
debounce: 1M
current state: 4
current bounds: [299M, 16.0E]
free memory: 7253.5M
```
The memory availability states are numbered starting from 0, and are a superset
of the levels mentioned previously for [memory pressure
signals](#memory_pressure_signals).
- `OOM` is state 0. This is the free memory level below which the kernel
decides to reboot the system.
- `Imminent-OOM` is state 1. This is a diagnostic-only memory level, set
at a small delta from the OOM level. Its sole purpose is to provide a means
to collect OOM diagnostic information safely, as it might be too late to
gather diagnostics at the OOM level. Learn more about this level in
[RFC-0091](/docs/contribute/governance/rfcs/0091_getevent_imminent_oom.md).
- `Critical` is state 2. This is the level that triggers the CRITICAL
memory pressure signal.
- `Warning` is state 3. This is the level that triggers the WARNING memory
pressure signal.
- `Normal` is state 4. This is the level that triggers the NORMAL memory
pressure signal.
In the example above, the `current state` is 4, i.e. Normal.
The `watermarks` show the memory thresholds that delineate the different memory
availability states. The output in the above example shows these memory
thresholds:
```none {:.devsite-disable-click-to-copy}
OOM: 50MB, Imminent-OOM: 60MB, Critical: 150MB, Warning: 300MB
```
The `debounce` is the slack or error margin used when computing memory state
boundaries. In this example, it is 1MB.
The `current bounds` shows the free memory bounds applicable to the current
memory state. Given the current state is `Normal`, referring to the
`watermarks`, `Normal` starts at the 300MB threshold. Using the 1MB debounce,
the lower limit is 299MB. There isn't an applicable upper limit for the `Normal`
level, which is set to `UINT64_MAX` here.
Lastly, the total `free memory` on the system is currently 7253.5MB.
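The reading of this output can be reproduced with a short sketch that uses the watermarks and debounce from the example. The function names are hypothetical, and two details are assumptions rather than facts from the output: that free memory exactly at a watermark belongs to the higher state, and that the upper bound of a state is the next watermark plus the debounce (the example only confirms the lower bound of `Normal`).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>

// Values taken from the example output above, in MB.
constexpr std::array<uint64_t, 4> kWatermarksMb = {50, 60, 150, 300};
constexpr uint64_t kDebounceMb = 1;

// State names indexed by memory availability state number.
constexpr std::array<std::string_view, 5> kStateNames = {
    "OOM", "Imminent-OOM", "Critical", "Warning", "Normal"};

// Returns the state for a given amount of free memory, ignoring debounce.
// Assumption: free memory exactly at a watermark counts as the higher state.
size_t StateForFreeMemory(uint64_t free_mb) {
  size_t state = 0;
  while (state < kWatermarksMb.size() && free_mb >= kWatermarksMb[state]) {
    ++state;
  }
  return state;
}

struct Bounds {
  uint64_t lower_mb;
  uint64_t upper_mb;
};

// Returns the free memory bounds within which the given state is held.
// Assumption: the debounce widens the range on both sides.
Bounds BoundsForState(size_t state) {
  assert(state <= kWatermarksMb.size());
  Bounds bounds;
  bounds.lower_mb = state == 0 ? 0 : kWatermarksMb[state - 1] - kDebounceMb;
  bounds.upper_mb = state == kWatermarksMb.size()
                        ? std::numeric_limits<uint64_t>::max()
                        : kWatermarksMb[state] + kDebounceMb;
  return bounds;
}
```

With roughly 7253MB free, `StateForFreeMemory` yields state 4 (`Normal`), and `BoundsForState(4)` yields a lower bound of 299MB with no effective upper bound, matching the `current bounds` line.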
Use the command `k pmm mem_avail_state X` to transition to memory availability
state `X`, where `X` is the numerical memory state as described above.
Optionally provide a duration for which the requested state is to be held. There
is also an option to "step" through intermediate states, pausing at each of
them.
For example, this triggers a transition to the `Critical` memory state:
```posix-terminal
k pmm mem_avail_state 2
memory-pressure: memory availability state - Critical
pq: MRU generation is 714 set 4.144414945s ago due to "Active ratio", LRU generation is 708
pq: Pager buckets [3482],[115],317,0,199,0,{6939},0, evict first: 0, live active/inactive totals: 3597/7455
memory-pressure: set target memory to evict 1MB (free memory is 149MB)
Leaked 1817528 pages
Sleeping for 10 seconds...
[EVICT]: Free memory before eviction was 147MB and after eviction is 151MB
[EVICT]: Evicted 986 user pager backed pages
Freed 1817528 pages
memory-pressure: memory availability state - Normal
pq: MRU generation is 717 set 1.213355379s ago due to "Timeout", LRU generation is 711
pq: Pager buckets [4351],[258],149,37,0,1,{5798},0, evict first: 0, live active/inactive totals: 4609/5985
```
Here the system transitioned to `Critical` by allocating 1817528 pages (the
page size is 4KB). Then there was a sleep for 10 seconds (default for holding
the state) during which the `Critical` pressure persisted. Finally, the 1817528
allocated pages were freed up, and the memory pressure dropped back to `Normal`.
The `Critical` state transition caused some pager-backed memory to be evicted as
well, as can be seen by the `[EVICT]` lines.
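As a sanity check on these numbers, page counts can be converted to megabytes using the 4KB page size stated above. The helper name `PagesToMb` is made up for this example, and the result is rounded down.

```cpp
#include <cstdint>

constexpr uint64_t kPageSizeBytes = 4096;  // 4KB pages, as stated above.
constexpr uint64_t kBytesPerMb = 1024 * 1024;

// Converts a page count to whole megabytes, rounding down.
constexpr uint64_t PagesToMb(uint64_t pages) {
  return pages * kPageSizeBytes / kBytesPerMb;
}

// 1817528 leaked pages is a little over 7099MB.
static_assert(PagesToMb(1817528) == 7099, "leaked pages");
// 986 evicted pages is a little under 4MB.
static_assert(PagesToMb(986) == 3, "evicted pages");
```

The 986 evicted pages come to just under 4MB, which is consistent with free memory moving from 147MB to 151MB in the `[EVICT]` lines.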
The `k pmm mem_avail_state` command is a useful tool to test memory pressure
response of the system as a whole. Since it works by allocating actual physical
memory, it exercises all the reclamation mechanisms the system has at its
disposal, both within the kernel and in userspace.
These are additional `k pmm oom` commands used to test system response
specifically at the OOM level.
```none {:.devsite-disable-click-to-copy}
pmm oom [<rate>] : leak memory until oom is triggered, optionally specify the rate at which to leak (in MB per second)
pmm oom hard : leak memory aggressively and keep on leaking
pmm oom signal : trigger oom signal without leaking memory
```
Sample output with `k pmm oom`:
```posix-terminal
k pmm oom
Disabling VM scanner
memory-pressure: free memory is 49MB, evicting pages to prevent OOM...
pq: MRU generation is 13 set 7.979442243s ago due to "Active ratio", LRU generation is 7
pq: Pager buckets [4538],[4517],3624,4606,13716,4976,{0},0, evict first: 1347, live active/inactive totals: 9055/28269
memory-pressure: found no pages to evict
memory-pressure: free memory after OOM eviction is 49MB
memory-pressure: pausing for 8s after OOM mem signal
[00028.317] 02811:03481> [fshost] INFO: [admin-server.cc(33)] received shutdown command over admin interface
[00028.317] 02811:03481> [fshost] INFO: [fs-manager.cc(281)] filesystem shutdown initiated
[00028.317] 02811:38032> [fshost] INFO: [fs-manager.cc(310)] Shutting down /data
[00028.318] 12900:12902> [minfs] INFO: [minfs.cc(1471)] Shutting down
[00028.340] 12900:12902> [minfs] WARNING: [src/storage/bin/minfs/main.cc(53)] Unmounted
[00028.341] 02811:03481> [fshost] INFO: [admin-server.cc(39)] shutdown complete
[00028.342] 02811:02813> [fshost] INFO: [main.cc(309)] terminating
[00028.342] 02687:02689> [driver_manager.cm] INFO: [suspend_handler.cc(205)] Successfully waited for VFS exit completion
memory-pressure: rebooting due to OOM
memory-pressure: stowing crashlog
ZIRCON REBOOT REASON (OOM)
Shutting down debuglog
platform_halt suggested_action 1 reason 3
Rebooting...
```
### Simulating memory pressure signals in userspace
Use the `fx mem --signal` command to simulate memory pressure signals in
userspace without actually leaking any memory. This is useful when the goal is
to test the response of a particular userspace process to memory pressure
signals without altering the memory state of the system.
```posix-terminal
fx mem --help
--signal=L Signal userspace clients with memory pressure level L
where L can be CRITICAL, WARNING or NORMAL. Clients can
use this command to test their response to memory pressure.
Does not affect the real memory pressure level on the system,
or trigger any kernel memory reclamation tasks.
```
For example, with `fx mem --signal=WARNING`, the following shows in the `fx
log` output:
```none {:.devsite-disable-click-to-copy}
[00213.059579][26701][26703][memory_monitor] INFO: [pressure_notifier.cc(106)] Simulating memory pressure level WARNING
```
Note that this command does not actually allocate any memory. It simply
simulates a one-time memory pressure signal for the requested level in
userspace, without affecting the kernel's memory availability state. As such, it
will not trigger any kernel memory reclamation, like eviction of pager-backed
memory.