| <!-- mdformat off(templates not supported) --> |
| {% set rfcid = "RFC-0159" %} |
| {% include "docs/contribute/governance/rfcs/_common/_rfc_header.md" %} |
| # {{ rfc.name }}: {{ rfc.title }} |
| <!-- SET the `rfcid` VAR ABOVE. DO NOT EDIT ANYTHING ELSE ABOVE THIS LINE. --> |
| |
| <!-- mdformat on --> |
| |
| <!-- This should begin with an H2 element (for example, ## Summary).--> |
| |
| ## Summary |
| |
| This document proposes changes to kernel APIs to support binaries with |
| execute-only segments, by adding a new feature check in |
| `zx_system_get_features` and changing the `launchpad` and `process_builder` |
| loaders as well as the dynamic linker in Fuchsia's in-tree libc to support '--x' |
| segments. It lays out a plan for eventual kernel support for mapping |
| execute-only pages on hardware that supports it. |
| |
| We don't typically need to read executable memory after it has been loaded. |
| Enabling execute-only code by default increases security of Fuchsia’s userspace |
| processes and furthers the engineering best practice of least permissions. |
| |
| ## Motivation |
| |
| Support for execute-only pages was added to ARM MMUs in ARMv7m and allows pages |
| of memory to be mapped such that they are only executable and not readable or |
| writable. Writable code pages have long been considered a security threat, and |
| allowing code to remain readable has also been shown to expose applications to |
| needless risk. Specifically, reading code pages is often a first step in an |
| attack chain, and preventing code from being read hinders adversaries. See |
| [Readable Code Security](#readable-code-security). Moreover, supporting |
| execute-only pages fits well with Fuchsia’s permissions model and aligns more |
| strongly with the principle of least privilege: code often needs only to be |
| executed, not read. |
| |
| ## Stakeholders |
| |
| _Facilitator:_ |
| |
| - cpu@google.com |
| |
| _Reviewers:_ |
| |
| - phosek@google.com |
| - mvanotti@google.com |
| - maniscalco@google.com |
| - travisg@google.com |
| |
| ## Background |
| |
| ### Execute-only Memory |
| |
| Execute-only memory (XOM) describes memory pages that have neither read nor |
| write permissions and can only be executed. ARMv7m and above have native support |
| for XOM; however, there are some considerations on older ISAs, discussed further |
| in [XOM and PAN](#xom-and-pan). |
| |
| This doc focuses almost exclusively on AArch64; however, the implementation is |
| architecture agnostic. When hardware and toolchain support matures for other |
| architectures, they will be able to take advantage of execute-only support in |
| Fuchsia. |
| |
| ### Permissions of Code Pages |
| |
| Initially, computers supported direct memory access to physical memory without |
| any checks or protections. The introduction of MMUs provided a key abstraction, |
| in the form of virtual memory, by decoupling a program's view of memory from the |
| underlying physical resources. This facilitated a more flexible, safe, and |
| secure programming model by allowing OS implementers to provide strong isolation |
| between their programs via the process abstraction. Today's MMUs provide a |
| number of critical facilities, such as paged memory, fast address translation, |
| and permission checking. They also allow users significant control over how |
| memory regions can be accessed and used, via the page permissions that typically |
| control if memory pages can be read, written to, or executed. This is a key |
| property for program safety, fault isolation, and security, since it restricts a |
| program's ability to misuse system resources through hardware enforced |
| permission checks. |
| |
| Memory that is both writable and executable is particularly dangerous because it |
| provides an easy way for an adversary to achieve arbitrary code execution |
| through common vulnerabilities, like buffer overflows. For this reason, many OS |
| configurations explicitly disallow pages to be both writable and executable |
| (W^X). This has been the standard for over a decade: OpenBSD added support for |
| W^X in 2003 with OpenBSD 3.3 [openbsd-wxorx]. See also SELinux W^X policies |
| [selinux-wxorx]. Writable code can be useful for things like just-in-time (JIT) |
| compilation, which writes executable instructions to memory at runtime. Where |
| W|X pages are disallowed, JITs need to work around this. An easy way is to write |
| code to non-executable pages and later change the page protections, for example |
| through `mprotect` or `zx_vmar_protect`, to be executable but not writable |
| [example-fuchsia-test]. In nearly all cases, pages that are W|X are too |
| permissive. Similarly, executable pages rarely need to be read (see the |
| exceptions in [Readable Code](#readable-code)). Allowing read operations on |
| executable pages is generally unnecessary and should not be the default. |
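| |
The write-then-protect workaround for JITs can be sketched with the POSIX calls named in this section. This is a minimal illustration, not code from Fuchsia: `emit_code_page` is a hypothetical helper name, and the sketch assumes a POSIX system where anonymous mappings are available. On Fuchsia the same shape would use `zx_vmar_map` followed by `zx_vmar_protect`.

```c
#define _DEFAULT_SOURCE  // exposes MAP_ANONYMOUS on glibc under strict -std modes
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

// Hypothetical helper for this sketch: copies `len` bytes of generated code
// into a fresh mapping that is writable but not executable, then flips the
// mapping to executable but not writable. Returns NULL on failure.
void* emit_code_page(const void* code, size_t len) {
  if (len == 0) return NULL;
  void* page = mmap(NULL, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (page == MAP_FAILED) return NULL;
  memcpy(page, code, len);
  // From here on the page is never writable and executable at the same time.
  // PROT_READ is kept because many systems cannot map execute-only pages.
  if (mprotect(page, len, PROT_READ | PROT_EXEC) != 0) {
    munmap(page, len);
    return NULL;
  }
  return page;
}
```

A real JIT would also need to synchronize the instruction cache on architectures such as AArch64 before jumping into the page; that step is omitted here.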
| |
| ### Readable Code |
| |
| Because of ARM’s fixed instruction width, immediate values have size |
| constraints. To get around this, the pseudo-instruction `ldr Rd, =imm` emits |
| `imm` in a literal pool close to the code that loads it, and the load itself |
| uses PC-relative addressing. This is incompatible with XOM because it puts data |
| in the text section, which must then be readable. When searching the codebase |
| for uses of literal pools to ensure we don’t read executable segments, we found |
| some uses of `ldr Rd, =imm` in Zircon, but all have since been removed. Clang |
| does not use literal pools for aarch64; instead, it emits multiple instructions |
| to build a large immediate. Clang has a `-mexecute-only` flag and a |
| `-mpure-code` alias, but these are only meaningful on arm32 because the behavior |
| they select is inherent when targeting aarch64. |
| |
| #### Example: Large Immediates |
| |
| This example shows how Clang compiles this C code to assembly given different |
| targets [clang-example]. The first assembly listing shows aarch64, and the |
| second shows arm32: |
| |
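| ``` |
| #include <stdint.h> |
| |
| // Returns a constant too large to encode as a single immediate. |
| uint32_t a() { |
|   return 0x12345678u; |
| } |
| ``` |
| ``` |
| # -target aarch64: the constant is built from two immediates. |
| a: |
|         mov     w0, #22136 |
|         movk    w0, #4660, lsl #16 |
|         ret |
| ``` |
| ``` |
| # -target arm: the constant is loaded from a literal pool in the text section. |
| a: |
|         ldr     r0, .LCPI0_0 |
|         bx      lr |
| .LCPI0_0: |
|         .long   305419896 |
| ``` |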
| |
| ### XOM and PAN |
| |
| Privileged access never (PAN) is a security feature on ARM chips that prevents |
| normal memory access to user pages from kernel mode. It helps protect against |
| potential kernel vulnerabilities because the kernel cannot touch user memory |
| with normal load or store instructions. Instead, the OS would need to turn PAN |
| off or use the `ldtr` and `sttr` instructions to access those pages. PAN is |
| not currently enabled for Fuchsia, but there are already plans to support it in |
| Zircon [pan-fxb]. |
| |
| AArch64 page table entries have four relevant bits to control page permissions. |
| Two bits are used for user and privileged execute-never. The remaining two are |
| used to describe read and write page permissions for both access levels. An |
| execute-only mapping has both read and write access removed but allows user |
| execution. |
| |
| This table, taken from Table D5-34 (Stage 1) of the ARMv8 Reference Manual, |
| shows the possible memory protections using only the four available bits. EL0 is |
| the exception level for userspace. The first and third rows (AP[2:1] of 00 and |
| 10) show how to create userspace execute-only pages. |
| |
| | UXN | PXN | AP[2:1] | Access from a higher Exception level | Access from EL0 | |
| |-----|-----|---------|--------------------------------------|------------------| |
| | 0 | 1 | 00 | R, W | X | |
| | 0 | 1 | 01 | R, W | R, W, X | |
| | 0 | 1 | 10 | R | X | |
| | 0 | 1 | 11 | R | R, X | |
| |
| Unfortunately, PAN’s algorithm for deciding if a page should not be privileged |
| accessible checks if the page is user-readable. From the perspective of PAN, a |
| user-execute-only page looks like a privileged mapping. This allows the kernel |
| to access user memory where it otherwise should not, thereby bypassing PAN’s |
| intended purpose and making PAN and XOM incompatible [pan-issue]. This would |
| make any future use of PAN ineffective against attacks that exploit the kernel |
| touching user memory; however, it would still be useful for detecting kernel |
| bugs. |
| |
| This problem caused both Linux and Android to drop support for XOM. This was |
| particularly noticeable for Android, which dropped support indefinitely in |
| Android 11 after XOM had been added and made the default for all aarch64 |
| binaries in Android 10 [linux-revert][android-xom]. Android plans to re-enable |
| the feature as hardware that fixes the problem becomes more ubiquitous, but |
| there is no concrete time frame for when it will be re-added. |
| |
| ARM has since proposed a solution with “enhanced” PAN, or ePAN, which changes |
| PAN to check not just whether a page is user readable but also whether it is |
| user executable. Unfortunately, hardware with the feature may not be on any |
| Fuchsia-targeted devices for years. Linux has since re-added its implementation |
| of XOM after ePAN was introduced [linux-re-land]. Support for ePAN on devices is |
| out of our control, and the incompatibility between PAN and XOM should not block |
| the kernel’s implementation of PAN (see [Risks](#risks)). |
| |
| From the table above, there is no possible configuration where read permission |
| can be stripped from the kernel. The only exception is PAN, which can cause an |
| exception when the kernel tries to touch a user-readable page. For this reason, |
| it is not possible to create an execute-only mapping for the kernel, since the |
| kernel cannot mark a page executable at EL1 but not readable. Thus, it is only |
| possible to create an execute-only mapping for userspace processes. |
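| |
The interaction between the permission bits, PAN, and ePAN described in this section can be modeled in a few lines. This is a simplified sketch of the rules as stated in this document, not a description of the full architecture: the function names are hypothetical, only the `UXN` bit and `AP[2:1]` field from the table are modeled, and `ap` is the two-bit `AP[2:1]` value.

```c
#include <stdbool.h>

// In the table, EL0 has data access exactly when the low bit of AP[2:1] is set
// (rows 01 and 11).
bool el0_can_read(unsigned ap) { return (ap & 1u) != 0; }

// A userspace execute-only page: executable at EL0 (UXN clear), not readable.
bool el0_execute_only(bool uxn, unsigned ap) {
  return !uxn && !el0_can_read(ap);
}

// PAN as described above: privileged access is blocked only when the page is
// user-readable, so an execute-only page looks like a privileged mapping.
bool pan_blocks_kernel(unsigned ap) { return el0_can_read(ap); }

// ePAN as described above: the page also counts as a user page when it is
// user-executable.
bool epan_blocks_kernel(bool uxn, unsigned ap) {
  return el0_can_read(ap) || !uxn;
}
```

For the first table row (`UXN` = 0, `AP[2:1]` = 00), `pan_blocks_kernel` is false while `epan_blocks_kernel` is true, which is the gap that makes XOM safe to enable only on hardware with ePAN.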
| |
| ### Targeting XOM Hardware |
| |
| Segment permissions in ELF indicate what permissions the code requires to run |
| correctly. In other words, software doesn’t need to know at build time if the |
| hardware it will run on can support XOM or not. Instead, it should |
| unconditionally use XOM if it will not need to read code pages. It is up to the |
| OS and loaders to enforce those permissions to the greatest extent the system |
| allows [elf-segment-perm]. |
| |
| ### Virtual Memory Permissions |
| |
| POSIX specifies that `mmap` may permit read access to pages where `PROT_READ` |
| has not been explicitly set [posix-mmap]. Both Linux and macOS on x86, and macOS |
| on M1 chips, will not fail when requesting pages from mmap with just `PROT_EXEC` |
| and instead make the pages `PROT_READ | PROT_EXEC`. These implementations have |
| syscalls which are “best-effort” in their ability to honor a user's requests. |
| Fuchsia syscalls, on the other hand, are always explicit in what they can and |
| cannot honor. The `zx_vmar_*` syscalls do not silently escalate permissions of |
| pages as their POSIX counterparts are permitted to by the standard. Requesting |
| pages without `ZX_VM_PERM_READ` will currently always fail as the hardware and |
| the OS do not support mapping pages without read permissions. A graceful |
| transition to supporting binaries with execute-only segments and userspace |
| programs which allocate execute-only memory will require a way to check if the |
| OS can map execute-only pages prior to requesting them. |
| |
| ### Readable Code Security |
| |
| Many attacks rely on finding out information about the process through reading |
| code pages to find “gadgets”, or executable code of interest. Address space |
| layout randomization (ASLR) is a technique used by operating systems to load |
| binary segments at semi-random places in the process's address space. It is used |
| by Fuchsia and many other operating systems to hinder attacks which rely on |
| knowing where code or other data is in memory. Making code unreadable further |
| reduces the attack surface. |
| |
| Code reuse attacks, like “return-to-libc” [rtl-attack], redirect control flow, |
| for example a function's return, to a known address. libc is a logical choice to |
| return or jump into since it contains rich functionality useful to an attacker, |
| and |
| because it is extremely likely the process will link against libc. It has been |
| demonstrated that the available gadgets in a typical program are |
| Turing-complete, giving an adversary the ability to execute arbitrary code. |
| |
| In many cases an adversary's objective is to obtain a shell. ASLR makes these |
| kinds of attacks harder because the addresses of functions are different between |
| invocations of a program. However, ASLR isn’t a comprehensive mitigation, |
| because attackers can read code pages at runtime to find the addresses of |
| functions, which they could not otherwise learn just from the functions' |
| addresses in the binary. XOM makes it impossible for ASLR to be broken in this |
| way, and attackers will need another way to find out information about the |
| location of specific code pages. |
| |
| ### Common Notation |
| |
| #### ‘rwx/r-x/--x’ |
| |
| These represent permissions of ELF segments, which get mapped into the process's |
| address space with the corresponding permissions. This notation is used commonly |
| both when describing permissions of files, as well as ELF segments by tools like |
| `readelf`. r, w and x mean read, write and execute respectively and ‘-’ means |
| the permission is not granted. An execute-only segment will have ‘--x’ |
| permissions. |
| |
| #### R^X, W|X, etc… |
| |
| As above, R, W and X refer to read, write and execute. ‘^’ and ‘|’ are C-like |
| operators for xor and or. R^X is read as “read xor execute”. |
| |
| #### "ax" |
| |
| This is assembler syntax which marks a section as allocated and executable. |
| Currently linkers will put “ax” sections into segments that are ‘r-x’. The |
| `--execute-only` flag in lld will mark these segments as ‘--x’ instead. |
| |
| ## Design |
| |
| To increase security of our userspace programs by supporting XOM, both our |
| toolchain and loaders will need to be updated. The clang driver will need to |
| pass the ‘--execute-only’ flag to the linker to ensure “ax” sections which would |
| otherwise be mapped to ‘r-x’ segments are instead mapped to ‘--x’ segments. The |
| loaders will also need to change the sanity checks that all requested |
| permissions contain at least read, because this will no longer be true. |
| |
| As it will only be possible to use XOM on hardware that has ePAN, we will need |
| to gracefully support the transition. We have three options: |
| |
| 1. Change `zx_vmar_*` functions to be best effort like many `mmap` |
| implementations. |
| 1. Create a way to query the kernel if it supports execute-only mappings and |
| have the loader escalate permissions of a ‘--x’ segment to ‘r-x’ if XOM is not |
| available. |
| 1. Add a new `ZX_VM_PERM_READ_IF_XOM_UNSUPPORTED` flag for loaders to use with |
| ‘--x’ segments. |
| |
| In all cases, there will be a potential silent escalation of permissions. The |
| first option would be the easiest: the loaders would need no changes other than |
| removing their sanity checks. The second option is not significantly more |
| complex: it would just add a simple check in the loaders before deciding what |
| memory permissions to request from the OS. The third option is helpful because |
| it is less error prone in user code. |
| |
| The first option would end up breaking Fuchsia’s current strict contract with |
| userspace of always being explicit about what a syscall can and cannot honor. |
| The second and third options also end up with ambiguous handling of memory |
| permissions when loading ELF files. However, this fits within the ELF |
| specification. Segment permissions don’t specify 1:1 what permissions the memory |
| allocated for a segment will have, but rather which permissions the memory must |
| at least have for the program to operate correctly. ELF loaders are within their |
| rights to map a ‘--x’ segment into ‘r-x’ memory [elf-segment-perm]. |
| |
| The first option, breaking Fuchsia’s current contract of explicit syscall |
| handling, isn’t ideal. Both options 2 and 3 have value, and the implementation |
| proposed in this RFC will be based on both. |
| |
| ## Implementation |
| |
| ### System Call Additions |
| |
| A new flag, `ZX_VM_PERM_READ_IF_XOM_UNSUPPORTED`, will be added. It makes the |
| various `zx_vmar_*` syscalls that take permission flags in `options` implicitly |
| add read permission if XOM is not supported. |
| `ZX_VM_PERM_READ_IF_XOM_UNSUPPORTED` is logically only useful with |
| `ZX_VM_PERM_EXECUTE` and without `ZX_VM_PERM_READ`; however, the syscalls that |
| accept this flag will not treat this as an invariant. It is safe to combine |
| `ZX_VM_PERM_READ_IF_XOM_UNSUPPORTED` with any other combination of flags: it |
| will just be treated as `ZX_VM_PERM_READ` in contexts where the system |
| cannot map execute-only pages. |
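| |
The intended semantics of the new flag can be sketched as a pure function from the requested options to the permissions actually applied. This is an illustrative model only: the `SKETCH_*` constants are hypothetical stand-ins for the corresponding `ZX_VM_PERM_*` flags with placeholder bit values, and `effective_perms` is not a Zircon function.

```c
#include <stdbool.h>
#include <stdint.h>

// Placeholder bit values for this sketch; the real values live in Zircon's
// headers and may differ.
#define SKETCH_VM_PERM_READ (1u << 0)
#define SKETCH_VM_PERM_WRITE (1u << 1)
#define SKETCH_VM_PERM_EXECUTE (1u << 2)
#define SKETCH_VM_PERM_READ_IF_XOM_UNSUPPORTED (1u << 3)

// Models how a mapping syscall would resolve the requested permissions.
// `xom_supported` says whether the system can map execute-only pages.
uint32_t effective_perms(uint32_t options, bool xom_supported) {
  uint32_t perms = options & (SKETCH_VM_PERM_READ | SKETCH_VM_PERM_WRITE |
                              SKETCH_VM_PERM_EXECUTE);
  // The flag is valid with any combination of permissions; it only has an
  // effect when the system cannot map execute-only pages.
  if ((options & SKETCH_VM_PERM_READ_IF_XOM_UNSUPPORTED) && !xom_supported) {
    perms |= SKETCH_VM_PERM_READ;
  }
  return perms;
}
```

With `xom_supported` false, a request for execute plus the new flag resolves to read and execute; with it true, the same request resolves to execute only.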
| |
| A new `kind` value `ZX_FEATURE_KIND_VM` will be added for |
| `zx_system_get_features`, which will yield a bitset similar to |
| `ZX_FEATURE_KIND_CPU`. There will also be a new feature |
| `ZX_VM_FEATURE_CAN_MAP_XOM`. The current implementation will always keep this |
| bit false because XOM will not be enabled until later. The loaders will not use |
| this feature bit because ‘r-x’ memory permissions are valid for ‘--x’ segments, |
| but it is still important for userspace to be able to query for this |
| functionality. |
| |
| ### System Loader ABI Changes |
| |
| Current and future loaders will ensure '--x' segments can be loaded into memory |
| even if the target can't support XOM. The loaders will add |
| `ZX_VM_PERM_READ_IF_XOM_UNSUPPORTED` when mapping execute-only segments. |
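| |
A loader's translation from ELF segment flags to mapping options might look like the following. This is a sketch under stated assumptions rather than the elfload or process_builder code: `vm_options_for_segment` is a hypothetical name, the `SKETCH_*` constants are placeholder stand-ins for the `ZX_VM_PERM_*` flags, and the `ELF_PF_*` constants mirror the `PF_X`, `PF_W`, and `PF_R` values from the ELF specification.

```c
#include <stdint.h>

// ELF program header permission bits (same values as PF_X, PF_W, PF_R).
#define ELF_PF_X 0x1u
#define ELF_PF_W 0x2u
#define ELF_PF_R 0x4u

// Placeholder bit values for this sketch; the real values live in Zircon's
// headers and may differ.
#define SKETCH_VM_PERM_READ (1u << 0)
#define SKETCH_VM_PERM_WRITE (1u << 1)
#define SKETCH_VM_PERM_EXECUTE (1u << 2)
#define SKETCH_VM_PERM_READ_IF_XOM_UNSUPPORTED (1u << 3)

// Maps a segment's p_flags to the options a loader would pass when mapping it.
uint32_t vm_options_for_segment(uint32_t p_flags) {
  uint32_t options = 0;
  if (p_flags & ELF_PF_R) options |= SKETCH_VM_PERM_READ;
  if (p_flags & ELF_PF_W) options |= SKETCH_VM_PERM_WRITE;
  if (p_flags & ELF_PF_X) options |= SKETCH_VM_PERM_EXECUTE;
  // An executable segment without read permission still has to load on
  // systems that cannot map execute-only pages.
  if ((p_flags & ELF_PF_X) && !(p_flags & ELF_PF_R)) {
    options |= SKETCH_VM_PERM_READ_IF_XOM_UNSUPPORTED;
  }
  return options;
}
```

Under this sketch the loader never needs to query the kernel itself: a ‘--x’ segment always maps with execute plus the fallback flag, and the kernel decides whether read permission is added.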
| |
| ### Shipped Dynamic Linker ABI Changes |
| |
| Similarly, the dynamic linker in Fuchsia’s libc shipped with the SDK will also |
| escalate permissions where necessary when allocating memory for ‘--x’ segments |
| with `ZX_VM_PERM_READ_IF_XOM_UNSUPPORTED`. |
| |
| ### Compiler Toolchain Changes |
| |
| The clang driver will also be changed to always pass `--execute-only` to the |
| linker when targeting `aarch64-*-fuchsia`. Programs will also need an easy way |
| to opt out of this new default, most likely through a new ‘--no-execute-only’ |
| linker flag. |
| |
| ### Kernel XOM Implementation |
| |
| Once hardware arrives that supports ePAN, the kernel can service a request for |
| memory pages to have just `ZX_VM_PERM_EXECUTE`. The arm64 user-copy |
| implementation may need updates to ensure it's consistent with how user memory |
| access is constrained. `user_copy` should be updated to use the `ldtr` and |
| `sttr` instructions. This will ensure that users cannot trick the kernel into |
| reading unreadable pages for them. Moreover, the kernel makes assumptions about |
| mappings being readable in a couple of places, and these will need to be changed |
| where appropriate. This work will be done later. |
| |
| ### Unnecessary Changes |
| |
| `zx_process_read_memory` does not need to be changed, and debuggers should work |
| normally when debugging execute-only binaries. `zx_process_read_memory` ignores |
| the permissions of the pages it is reading from, and only checks that the |
| process handle has `ZX_RIGHT_READ` and `ZX_RIGHT_WRITE`. |
| |
| `zx_vmar_protect` will continue to work as it does currently. Most notably this |
| means that processes can protect their code pages with read permission in cases |
| where that is necessary. |
| |
| ## Performance |
| |
| No performance impact is expected. |
| |
| ## Security |
| |
| Until XOM is implemented in the kernel, a binary with ‘--x’ segments will be |
| just as secure as an equivalent binary using ‘r-x’ segments. Once XOM is |
| supported |
| both by hardware and the OS, programs which elect to use execute-only memory |
| will become more secure. See sections [Permissions of Code |
| Pages](#permissions-of-code-pages), [XOM and PAN](#xom-and-pan) and [Readable |
| Code Security](#readable-code-security). |
| |
| ## Privacy |
| |
| No extra considerations other than those mentioned in [Security](#security). |
| |
| ## Testing |
| |
| `zx_system_get_features` will have trivial testing: when XOM support is forced |
| in the kernel, we know at build time what we expect the syscall to return. |
| |
| The `ZX_VM_PERM_READ_IF_XOM_UNSUPPORTED` flag will be tested to confirm that it |
| makes a page readable when `zx_system_get_features` reports that the OS cannot |
| create execute-only pages. |
| |
| The elfload library doesn't have any real testing, save for fuzz tests |
| which don't test expected functionality. Instead its functionality is inherently |
| tested by other components that rely on it. Testing should be added here to |
| ensure '--x' segments are correctly mapped. The process_builder library does |
| have tests, and these will ensure it properly requests readable and executable |
| memory when XOM is not available. |
| |
| The changes to the current dynamic linker will not be tested directly. A new |
| dynamic linker is planned and it will have extensive testing, including testing |
| of ‘--x’ segments. |
| |
| The changes to the clang driver will have testing in upstream LLVM. |
| |
| We will also set up a testing configuration that enables XOM on test bots, even |
| if that hardware does not have ePAN and we would otherwise not enable XOM. This |
| will help us catch in-tree programs that read their code pages and need to opt |
| out of execute-only. |
| |
| ## Documentation |
| |
| The changes to `zx_system_get_features` will be documented, as well as the |
| motivation for why user space would want to query the |
| `ZX_VM_FEATURE_CAN_MAP_XOM` feature. Likewise, the new |
| `ZX_VM_PERM_READ_IF_XOM_UNSUPPORTED` flag will also be documented. Changes to |
| the various loaders and the clang driver defaults will not be documented outside |
| of this RFC. |
| |
| ## Drawbacks, Alternatives, Unknowns |
| |
| It is unknown how much current and future out-of-tree code relies on executable |
| code being readable. This could come from data constants placed in text by |
| handwritten assembly, code compiled with other toolchains, or program |
| introspection. Regardless, programs which need readable code pages will still |
| benefit because their shared library dependencies, including libc, will be |
| marked execute-only. Changing our clang toolchain to default to execute-only |
| segments will break programs which depend on readable code. There is no easy way |
| to check at build time if a program relies on this behavior. However, once it is |
| identified that a program needs ‘r-x’ segments, opting out of the default ‘--x’ |
| will be simple. |
| |
| Current tooling cannot easily support programs which need to be able to read |
| some of their code but not all. The `--execute-only` linker flag will strip read |
| permissions from any executable segment, and there is no way to mark a single |
| section as needing to be readable. Programs which want this behavior will need |
| to opt out of execute-only completely. |
| |
| ## Risks |
| |
| Once the clang driver defaults to using `--execute-only`, code that reads from a |
| ‘--x’ segment won’t break until hardware and kernel support for XOM lands. This |
| creates potential forward compatibility problems for software that didn’t |
| change. Testing will exist for in-tree software, but most likely not for |
| out-of-tree code. |
| |
| ## Prior Art and References |
| |
| Because of the ambiguous handling of `mmap` permission flags in many POSIX |
| implementations, they have no need for an analogue to checking |
| `ZX_VM_FEATURE_CAN_MAP_XOM` through |
| `zx_system_get_features(ZX_FEATURE_KIND_VM, &features)`. |
| |
| Darwin supports XOM on newer Apple chips, but their implementation is more |
| robust, using proprietary hardware features. Their chips have hardware support |
| for stripping individual permission bits from both kernel and user memory. It is |
| not enabled for userspace in macOS [apple-xom]. |
| |
| [example-fuchsia-test]: https://source.corp.google.com/fuchsia/zircon/system/utest/core/memory-mapping/memory-mapping.cc;l=126 |
| [openbsd-wxorx]: https://www.openbsd.org/33.html |
| [selinux-wxorx]: https://akkadia.org/drepper/selinux-mem.html |
| [clang-example]: https://godbolt.org/z/hGzr49qYs |
| [android-xom]: https://source.android.com/devices/tech/debug/execute-only-memory |
| [elf-segment-perm]: https://www.sco.com/developers/gabi/latest/ch5.pheader.html |
| [posix-mmap]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/mmap.html |
| [rtl-attack]: https://dl.acm.org/doi/10.1145/1315245.1315313 |
| [pan-fxb]: https://fxbug.dev/59284 |
| [pan-issue]: https://blog.siguza.net/PAN/ |
| [linux-revert]: https://github.com/torvalds/linux/commit/cab15ce604e550020bb7115b779013b91bcdbc21 |
| [linux-re-land]: https://github.com/torvalds/linux/commit/18107f8a2df6bf1c6cac8d0713f757f866d5af51 |
| [apple-xom]: https://i.blackhat.com/USA-19/Thursday/us-19-Krstic-Behind-The-Scenes-Of-IOS-And-Mas-Security.pdf |