| # Memory reclamation in Fuchsia |
| |
| Most operating systems employ memory reclamation strategies to ensure that the |
| working set of running processes at any point in time can efficiently utilize |
| all available physical memory. The operating system has a fixed amount of |
| physical memory (RAM) to distribute amongst all running processes, and it might |
| not be possible to accommodate them all at the same time. |
| |
| In its simplest form, memory reclamation is page replacement, where pages that |
| are not as important for the current user activity are replaced with pages that |
| might be more important. Most operating systems maintain a pool of free pages, |
| so that incoming memory allocations are quickly fulfilled, rather than blocked |
| while waiting for a page in use to be freed. |
| |
| Fuchsia also employs a similar strategy where the system tries to keep the |
| amount of free memory larger than a certain threshold. Fuchsia uses several |
| memory reclamation techniques, both within the kernel and in userspace. This |
| guide describes how these memory reclamation techniques work. Fuchsia also |
| provides a set of tools to analyze and dump memory usage (see |
| [Memory usage analysis tools](/docs/development/kernel/memory/memory.md#userspace_memory)). |
| |
| ## Pager-backed memory eviction |
| |
| Userspace filesystems use |
| [pagers](/docs/reference/kernel_objects/pager.md) |
| to [page](https://en.wikipedia.org/wiki/Memory_paging) in files on demand from |
| an external source, like a disk. Filesystems represent files in memory using |
| VMOs, whose pages are populated by the pager service as and when they are |
| accessed. |
| |
| On Fuchsia, |
| [blobfs](/docs/concepts/filesystems/blobfs.md) is |
| an immutable filesystem that hosts all executable files using a pager to |
| populate pages on demand. When the system comes under memory pressure, i.e. the |
| amount of available free memory starts running low, the kernel evicts pages |
| backed by blobfs in order to reclaim memory. Since these pages exist on disk, |
| they can be fetched back in when required. |
| |
| The kernel tracks all pager-backed memory. When free memory is low, it finds |
| suitable candidates to evict. Pages are tracked in several LRU (least recently |
| used) page queues, which a kernel background thread rotates periodically, in |
| order to "age" pages. Another background thread evicts pages from the oldest |
| page queue under memory pressure. |
| |
| As free memory dips lower, the kernel adjusts its aging and eviction policies |
| to be more aggressive, aging pages quicker in order to find more evictable |
| candidates. In order to prevent thrashing, pages in the MRU (most recently used) |
| page queue are never evicted. The length of this queue varies based on the |
| amount of churn on the system. |
| |
| If the system is relatively quiet and the memory usage is stable, the kernel |
| ages pages slower and more pages accumulate in the MRU queue. On the other hand, |
| if the user is cycling through several activities, constantly switching the |
| working set, the kernel tries to keep up by aging pages more aggressively. |
| |
| Userspace processes can also use |
| [eviction hints](/docs/contribute/governance/rfcs/0068_eviction_hints.md) |
| to influence the kernel eviction strategy. Processes can use the `DONT_NEED` |
| hint to indicate pages are no longer in use and would be good candidates for |
| eviction. They can also use `ALWAYS_NEED` to indicate pages are important and |
| should not be considered for eviction, thereby avoiding the cost of fetching |
| them back in when they're accessed again. |
| |
| Learn more about eviction hints in the reference docs: |
| [`zx_vmo_op_range`](/docs/reference/syscalls/vmo_op_range.md) |
| and |
| [`zx_vmar_op_range`](/docs/reference/syscalls/vmar_op_range.md). |
| |
| ## Zero page deduplication |
| |
| Pages in anonymous VMOs (non-pager-backed) get populated / committed only on a |
| write. Reads are fulfilled by the kernel using a singleton zero page. Even after |
| pages have been committed on a write, the kernel tries to deduplicate pages that |
| are filled only with zeros back to the singleton zero page in order to save |
| memory. The kernel periodically scans physical pages in anonymous VMOs, looking |
| for opportunities to deduplicate zero pages. |
| |
| ## Page table reclamation |
| |
| As explained in [Address spaces](/docs/concepts/memory/address_spaces.md), |
| the VMAR hierarchy helps the kernel track virtual to physical memory mappings. |
| When a virtual address is accessed for the first time, |
| the address space's VMAR tree is used to look up the underlying physical page. |
| The virtual-to-physical mapping is then stored in the hardware page tables, |
| which the MMU uses for future lookups. |
| Under memory pressure, |
| the kernel reclaims memory in hardware page tables that hasn't been accessed for a while. |
| When those mappings are needed again, |
| they can be reconstructed from the VMAR tree. |
| |
| ## Discardable VMOs |
| |
| Userspace processes can create a special flavor of |
| [VMOs that are discardable](/docs/contribute/governance/rfcs/0012_zircon_discardable_memory.md). |
| Clients can |
| [lock and unlock ](/docs/reference/syscalls/vmo_op_range.md)discardable |
| VMOs depending on whether or not they are being used. When the system is under |
| memory pressure, the kernel finds discardable VMOs that are unlocked and frees |
| them. |
| |
| Sample code (modulo error handling): |
| |
| ```cpp |
| // Create a discardable VMO. |
| zx_handle_t vmo_handle; |
| zx_vmo_create(vmo_size, ZX_VMO_DISCARDABLE, &vmo); |
| |
| // Lock the VMO. |
| zx_vmo_lock_state_t lock_state = {}; |
| zx_vmo_op_range(vmo, ZX_VMO_OP_LOCK, 0, vmo_size, &lock_state, |
| sizeof(lock_state))); |
| |
| // Use the VMO as desired. |
| zx_vmo_read(vmo, buf, 0, sizeof(buf)); |
| |
| // Unlock the VMO. The kernel is free to discard it now. |
| vmo_op_range(vmo, ZX_VMO_OP_UNLOCK, 0, vmo_size, nullptr, 0); |
| |
| // Lock the VMO again before use. |
| zx_vmo_op_range(vmo, ZX_VMO_OP_LOCK, 0, vmo_size, &lock_state, |
| sizeof(lock_state))); |
| |
| if (lock_state.discarded_size > 0) { |
| // The kernel discarded the VMO. Re-initialize it if required. |
| zx_vmo_write(vmo, data, 0, sizeof(data)); |
| } else { |
| // The kernel did not discard the VMO. Previous contents were preserved. |
| } |
| ``` |
| |
| ## Memory pressure signals |
| |
| Fuchsia provides userspace processes the ability to directly control their |
| memory consumption in response to system-wide available memory. Clients can |
| register to receive |
| [memory pressure |
| signals](https://fuchsia.dev/reference/fidl/fuchsia.memorypressure.md) |
| and take actions depending on the observed memory pressure level. There are |
| [three memory pressure levels](https://cs.opensource.google/fuchsia/fuchsia/+/main:sdk/fidl/fuchsia.memorypressure/memorypressure.fidl;l=8): |
| |
| <table> |
| <tr><th>Name</th><th>Value</th><th>Description</th></tr> |
| <tr id="Level.NORMAL"> |
| <td><h3 id="Level.NORMAL" class="add-link hide-from-toc">NORMAL</h3></td> |
| <td><code>1</code></td> |
| <td><p>The memory pressure level is healthy.</p> |
| <p>Registered clients are free to hold on to caches and allocate memory |
| unrestricted.</p> |
| <p>However, clients should take care to not proactively re-create caches on a |
| transition back to the NORMAL level, causing a memory spike that immediately |
| pushes the level over to WARNING again.</p> |
| </td> |
| </tr> |
| <tr id="Level.WARNING"> |
| <td><h3 id="Level.WARNING" class="add-link hide-from-toc">WARNING</h3></td> |
| <td><code>2</code></td> |
| <td><p>The memory pressure level is somewhat constrained, and might cross over to |
| the critical pressure range if left unchecked.</p> |
| <p>Registered clients are expected to optimize their operation to limit memory |
| usage, rather than for best performance, for example, by reducing cache sizes |
| and non-essential memory allocations.</p> |
| <p>Clients must take care to regulate the amount of work they undertake in |
| order to reclaim memory, and ensure that it does not cause visible |
| performance degradation. There exists some memory pressure, but not enough |
| to justify trading off user responsiveness to reclaim memory.</p> |
| </td> |
| </tr> |
| <tr id="Level.CRITICAL"> |
| <td><h3 id="Level.CRITICAL" class="add-link hide-from-toc">CRITICAL</h3></td> |
| <td><code>3</code></td> |
| <td><p>The memory pressure level is very constrained.</p> |
| <p>Registered clients are expected to drop all non-essential memory, and refrain |
| from allocating more memory. Failing to do so might result in the job |
| getting terminated, or the system being rebooted in the case of global |
| memory pressure.</p> |
| <p>Clients may undertake expensive work to reclaim memory if required, since |
| failing to do so might result in termination. The client might decide that a |
| performance hit is a fair tradeoff in this case.</p> |
| </td> |
| </tr> |
| </table> |
| |
| ### Comparing memory pressure signals to discardable VMOs |
| |
| Userspace clients can pick between memory pressure signals and discardable |
| VMOs, or use a combination of both reclamation mechanisms based on their |
| needs.These are some things to consider when making the choice: |
| |
| - Memory pressure signals allow clients to do more than just trim caches. |
| For example, jobs can tear down non-essential processes in their job tree. |
| They can also stop certain memory-intensive activities, or hold off on |
| starting new ones until the pressure level is Normal. |
| - With discardable VMOs, the userspace client gives up control over when |
| the memory is freed to the kernel. The kernel decides when to free the |
| memory based on various factors: the amount of available memory, memory |
| that can be reclaimed by other means, etc. If the client wishes to finely |
| control the lifetimes of its caches, when to trim what, etc., memory |
| pressure signals might be more suitable. |
| - It is possible that discardable VMOs end up preserving their contents |
| for longer than if the process was tearing down the VMOs itself in response |
| to memory pressure signals. The kernel drives freeing of discardable VMOs, |
| and the kernel has more global context around the amount of free memory, so |
| it knows exactly how much to reclaim. The kernel also has other means of |
| reclaiming memory at its disposal, so it's possible that not all |
| discardable VMOs need to be freed up. On the other hand, if the userspace |
| client is responding to memory pressure itself, it will likely react in the |
| same manner every time, trimming all its caches. |
| - Discardable memory can also allow the kernel to reclaim memory more |
| quickly so that the system recovers faster. With memory pressure signals, |
| there can be some IPC and scheduling latency involved, between the kernel |
| signaling the pressure level change, and the userspace process responding to it. |
| |
| ## OOM (Out-of-memory) reboot |
| |
| It is possible for all memory reclamation strategies to fail to free up enough |
| memory in the face of certain aggressive memory allocation patterns. When that |
| happens, the kernel opts to reboot after cleanly shutting down filesystems to |
| prevent data loss. When the free memory level falls below a preconfigured OOM |
| threshold, an OOM reboot is triggered. |
| |
| ## Tools to test memory pressure response |
| |
| ### Observing and testing kernel memory reclamation |
| |
| Use the `k scanner` command to observe and test reclamation techniques the |
| kernel uses: pager-backed eviction, discardable VMO reclamation, zero page |
| deduplication, and page table reclamation. It can also be used to test the page |
| queue rotation / aging strategy used to drive eviction. Run `k scanner` on a |
| serial console to see all available options: |
| |
| ```posix-terminal |
| k scanner |
| usage: |
| scanner dump : dump scanner info |
| scanner push_disable : increase scanner disable count |
| scanner pop_disable : decrease scanner disable count |
| scanner reclaim_all : attempt to reclaim all possible memory |
| scanner rotate_queue : immediately rotate the page queues |
| scanner reclaim <MB> [only_old] : attempt to reclaim requested MB of memory. |
| scanner pt_reclaim [on|off] : turn unused page table reclamation on or off |
| scanner harvest_accessed : harvest all page accessed information |
| ``` |
| |
| `k scanner dump` dumps the current state of the page queues and other relevant |
| memory counters the kernel uses for reclamation: |
| |
| ```posix-terminal |
| k scanner dump |
| [SCAN]: Scanner enabled. Triggering informational scan |
| [SCAN]: Found 4303 zero pages across all of memory |
| [SCAN]: Found 8995 user-pager backed pages in queue 0 |
| [SCAN]: Found 3278 user-pager backed pages in queue 1 |
| [SCAN]: Found 8947 user-pager backed pages in queue 2 |
| [SCAN]: Found 10776 user-pager backed pages in queue 3 |
| [SCAN]: Found 3981 user-pager backed pages in queue 4 |
| [SCAN]: Found 0 user-pager backed pages in queue 5 |
| [SCAN]: Found 0 user-pager backed pages in queue 6 |
| [SCAN]: Found 0 user-pager backed pages in queue 7 |
| [SCAN]: Found 1347 user-pager backed pages in DontNeed queue |
| [SCAN]: Found 40 zero forked pages |
| [SCAN]: Found 0 locked pages in discardable vmos |
| [SCAN]: Found 0 unlocked pages in discardable vmos |
| pq: MRU generation is 12 set 10.720698018s ago due to "Active ratio", LRU generation is 6 |
| pq: Pager buckets [8995],[3278],8947,10776,3981,0,{0},0, evict first: 1347, live active/inactive totals: 12273/25051 |
| ``` |
| |
| Test reclaiming memory with `k scanner reclaim` or `k scanner reclaim_all`: |
| |
| ```posix-terminal |
| k scanner reclaim_all |
| [EVICT]: Free memory before eviction was 7161MB and after eviction is 7290MB |
| [EVICT]: Evicted 33004 user pager backed pages |
| [SCAN]: De-duped 25 pages that were recently forked from the zero page |
| ``` |
| |
| Test page table reclamation with `k pmm drop_user_pt`: |
| |
| ```posix-terminal |
| k pmm |
| … |
| pmm drop_user_pt : drop all user hardware page tables |
| ``` |
| |
| ### Observing and generating memory pressure |
| |
| Use the `k pmm mem_avail_state` command to generate memory pressure on the |
| system, by allocating memory to reach the specified memory pressure level. This |
| is useful for testing system-wide response to memory pressure: |
| |
| ```posix-terminal |
| k pmm mem_avail_state |
| pmm mem_avail_state info : dump memory availability state info |
| pmm mem_avail_state [step] <state> [<nsecs>] : allocate memory to go to memstate <state>, hold the state for <nsecs> (10s by default). Only works if going to <state> from current state requires allocating memory, can't free up pre-allocated memory. In optional [step] mode, allocation pauses for 1 second at each intermediate memory availability state until <state> is reached. |
| ``` |
| |
| `k pmm mem_avail_state info` dumps the current memory pressure state. |
| |
| ```posix-terminal |
| k pmm mem_avail_state info |
| watermarks: [50M, 60M, 150M, 300M] |
| debounce: 1M |
| current state: 4 |
| current bounds: [299M, 16.0E] |
| free memory: 7253.5M |
| ``` |
| |
| The memory availability states are numbered starting from 0, and are a superset |
| of the levels mentioned previously for [memory pressure |
| signals](#memory_pressure_signals). |
| |
| - `OOM` is state 0. This is the free memory level below which the kernel |
| decides to reboot the system. |
| - `Imminent-OOM` is state 1. This is a diagnostic-only memory level, set |
| at a small delta from the OOM level. Its sole purpose is to provide a means |
| to collect OOM diagnostic information safely, as it might be too late to |
| gather diagnostics at the OOM level. Learn more about this level in |
| [RFC-0091](/docs/contribute/governance/rfcs/0091_getevent_imminent_oom.md). |
| - `Critical` is state 2. This is the level that triggers the CRITICAL |
| memory pressure signal. |
| - `Warning` is state 3. This is the level that triggers the WARNING memory |
| pressure signal. |
| - `Normal` is state 4. This is the level that triggers the NORMAL memory |
| pressure signal. |
| |
| In the example above, the `current state` is 4, i.e. Normal. |
| |
| The `watermarks` show the memory thresholds that delineate the different memory |
| availability states. The output in the above example shows these memory |
| thresholds: |
| |
| ```none {:.devsite-disable-click-to-copy} |
| OOM: 50MB, Imminent-OOM: 60MB, Critical: 150MB, Warning: 300MB |
| ``` |
| |
| The `debounce` is the slack or error margin used when computing memory state |
| boundaries. In this example, it is 1MB. |
| |
| The `current bounds` shows the free memory bounds applicable to the current |
| memory state. Given the current state is `Normal`, referring to the |
| `watermarks`, `Normal` starts at the 300MB threshold. Using the 1MB debounce, |
| the lower limit is 299MB. There isn't an applicable upper limit for the `Normal` |
| level, which is set to `UINT64_MAX` here. |
| |
| Lastly, the total `free memory` on the system is currently 7253.5MB. |
| |
| Use the command `k pmm mem_avail_state X` to transition to memory availability |
| state `X`, where `X` is the numerical memory state as described above. |
| Optionally provide a duration for which the requested state is to be held. There |
| is also an option to "step" through intermediate states, pausing at each of |
| them. |
| |
| For example, this triggers a transition to the `Critical` memory state: |
| |
| ```posix-terminal |
| k pmm mem_avail_state 2 |
| memory-pressure: memory availability state - Critical |
| pq: MRU generation is 714 set 4.144414945s ago due to "Active ratio", LRU generation is 708 |
| pq: Pager buckets [3482],[115],317,0,199,0,{6939},0, evict first: 0, live active/inactive totals: 3597/7455 |
| memory-pressure: set target memory to evict 1MB (free memory is 149MB) |
| Leaked 1817528 pages |
| Sleeping for 10 seconds... |
| [EVICT]: Free memory before eviction was 147MB and after eviction is 151MB |
| [EVICT]: Evicted 986 user pager backed pages |
| Freed 1817528 pages |
| memory-pressure: memory availability state - Normal |
| pq: MRU generation is 717 set 1.213355379s ago due to "Timeout", LRU generation is 711 |
| pq: Pager buckets [4351],[258],149,37,0,1,{5798},0, evict first: 0, live active/inactive totals: 4609/5985 |
| ``` |
| |
| Here the system transitioned to `Critical` by allocating 1817528 pages (the |
| page size is 4KB). Then there was a sleep for 10 seconds (default for holding |
| the state) during which the `Critical` pressure persisted. Finally, the 1817528 |
| allocated pages were freed up, and the memory pressure dropped back to `Normal`. |
| The `Critical` state transition caused some pager-backed memory to be evicted as |
| well, as can be seen by the `[EVICT]` lines. |
| |
| The `k pmm mem_avail_state` command is a useful tool to test memory pressure |
| response of the system as a whole. Since it works by allocating actual physical |
| memory, it exercises all the reclamation mechanisms the system has at its |
| disposal, both within the kernel and in userspace. |
| |
| These are additional `k pmm oom` commands used to test system response |
| specifically at the OOM level. |
| |
| ```none {:.devsite-disable-click-to-copy}l |
| pmm oom [<rate>] : leak memory until oom is triggered, optionally specify the rate at which to leak (in MB per second) |
| pmm oom hard : leak memory aggressively and keep on leaking |
| pmm oom signal : trigger oom signal without leaking memory |
| ``` |
| |
| Sample output with `k pmm oom`: |
| |
| ```posix-terminal |
| k pmm oom |
| Disabling VM scanner |
| memory-pressure: free memory is 49MB, evicting pages to prevent OOM... |
| pq: MRU generation is 13 set 7.979442243s ago due to "Active ratio", LRU generation is 7 |
| pq: Pager buckets [4538],[4517],3624,4606,13716,4976,{0},0, evict first: 1347, live active/inactive totals: 9055/28269 |
| memory-pressure: found no pages to evict |
| memory-pressure: free memory after OOM eviction is 49MB |
| … |
| memory-pressure: pausing for 8s after OOM mem signal |
| [00028.317] 02811:03481> [fshost] INFO: [admin-server.cc(33)] received shutdown command over admin interface |
| [00028.317] 02811:03481> [fshost] INFO: [fs-manager.cc(281)] filesystem shutdown initiated |
| [00028.317] 02811:38032> [fshost] INFO: [fs-manager.cc(310)] Shutting down /data |
| [00028.318] 12900:12902> [minfs] INFO: [minfs.cc(1471)] Shutting down |
| [00028.340] 12900:12902> [minfs] WARNING: [src/storage/bin/minfs/main.cc(53)] Unmounted |
| [00028.341] 02811:03481> [fshost] INFO: [admin-server.cc(39)] shutdown complete |
| [00028.342] 02811:02813> [fshost] INFO: [main.cc(309)] terminating |
| [00028.342] 02687:02689> [driver_manager.cm] INFO: [suspend_handler.cc(205)] Successfully waited for VFS exit completion |
| |
| memory-pressure: rebooting due to OOM |
| memory-pressure: stowing crashlog |
| ZIRCON REBOOT REASON (OOM) |
| Shutting down debuglog |
| platform_halt suggested_action 1 reason 3 |
| Rebooting... |
| ``` |
| |
| ### Simulating memory pressure signals in userspace |
| |
| Use the `fx mem --signal` command to simulate memory pressure signals in |
| userspace without actually leaking any memory. This is useful when the goal is |
| to test the response of a particular userspace process to memory pressure |
| signals without altering the memory state of the system. |
| |
| ```posix-terminal |
| fx mem --help |
| … |
| --signal=L Signal userspace clients with memory pressure level L |
| where L can be CRITICAL, WARNING or NORMAL. Clients can |
| use this command to test their response to memory pressure. |
| Does not affect the real memory pressure level on the system, |
| or trigger any kernel memory reclamation tasks. |
| ``` |
| |
| For example, with `fx mem --signal=WARNING`, the following shows in the `fx |
| log` output: |
| |
| ```none {:.devsite-disable-click-to-copy} |
| [00213.059579][26701][26703][memory_monitor] INFO: [pressure_notifier.cc(106)] Simulating memory pressure level WARNING |
| ``` |
| |
| Note that this command does not actually allocate any memory. It simply |
| simulates a one-time memory pressure signal for the requested level in |
| userspace, without affecting the kernel's memory availability state. As such, it |
| will not trigger any kernel memory reclamation, like eviction of pager-backed |
| memory. |