|  | # Thread Local Storage # | 
|  |  | 
|  | The ELF Thread Local Storage ABI (TLS) is a storage model for variables that | 
|  | allows each thread to have a unique copy of a global variable. This model | 
|  | is used to implement C++'s `thread_local` storage class. On thread creation the | 
|  | variable will be given its initial value from the initial TLS image. TLS | 
|  | variables are, for instance, useful as buffers in thread-safe code or for | 
|  | per-thread bookkeeping. C-style error state like `errno` or `dlerror` can also be | 
|  | handled this way. | 
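
As a minimal sketch of the per-thread copy behavior described above, the following C++ snippet starts a few threads that each increment their own copy of a `thread_local` variable. The names `tls_counter` and `run_workers` are illustrative only and are not part of any ABI or library.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Every thread gets its own copy, initialized from the initial TLS image.
thread_local int tls_counter = 10;

// Starts `num_threads` workers. Each one bumps its own copy of `tls_counter`
// `increments` times and reports the value it ends up with.
std::vector<int> run_workers(int num_threads, int increments) {
    assert(num_threads >= 0 && increments >= 0);
    std::vector<int> results(num_threads, 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < num_threads; ++i) {
        workers.emplace_back([&results, i, increments] {
            for (int n = 0; n < increments; ++n) {
                ++tls_counter;  // touches only this worker's copy
            }
            results[i] = tls_counter;  // each worker writes a distinct slot
        });
    }
    for (std::thread& worker : workers) {
        worker.join();
    }
    return results;
}
```

Calling `run_workers(3, 5)` should report `15` for every worker, while the calling thread's own `tls_counter` keeps whatever value it had before the call, since no worker ever touches that copy.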
|  |  | 
|  | TLS variables are much like any other global/static variable. In the | 
|  | implementation, their initial data ends up in the `PT_TLS` segment. The `PT_TLS` | 
|  | segment sits inside a read-only `PT_LOAD` segment even though TLS variables are | 
|  | writable. For each thread, this segment is then copied to a unique writable | 
|  | location in the process. The location the `PT_TLS` segment is copied to is | 
|  | influenced by the segment's alignment to ensure that the alignment of TLS | 
|  | variables is respected. | 
|  |  | 
|  | ## ABI ## | 
|  |  | 
|  | The interface that the compiler, linker, and dynamic linker must adhere to is | 
|  | quite simple, even though the details of the implementation are more complex. | 
|  | The compiler and the linker must emit code and dynamic relocations that use one | 
|  | of the four access models (described in a following section). The dynamic | 
|  | linker and thread implementation must then set everything up so that this | 
|  | works. Different architectures have different ABIs, but they're similar enough | 
|  | in broad strokes that we can speak about most of them as if there were just one | 
|  | ABI. This document will assume that either x86-64 or AArch64 is being used and | 
|  | will point out differences when they occur. | 
|  |  | 
|  | The TLS ABI makes use of a few terms: | 
|  |  | 
|  | * Thread Pointer: This is a unique address in each thread, generally stored | 
|  | in a register. Thread local variables lie at offsets from the thread pointer. | 
|  | Thread Pointer will be abbreviated as `$tp` in this document. `$tp` is what | 
|  | `__builtin_thread_pointer()` returns on AArch64. On AArch64 `$tp` is held in a | 
|  | special register named `TPIDR_EL0` that can be read using | 
|  | `mrs <reg>, TPIDR_EL0`. On `x86_64` the `fs.base` segment base is used: memory | 
|  | relative to it is addressed with the `%fs:` prefix, and its value can be loaded | 
|  | from `%fs:0` or read with the `rdfsbase` instruction. | 
|  | * TLS Segment: This is the image of data in each module and specified by the | 
|  | `PT_TLS` program header in each module. Not every module has a `PT_TLS` | 
|  | program header and thus not every module has a TLS segment. Each module | 
|  | has at most one TLS segment and correspondingly at most one `PT_TLS` | 
|  | program header. | 
|  | * Static TLS set: This is the sum total of modules that are known to the | 
|  | dynamic linker at program start up time. It consists of the main executable | 
|  | and every library transitively mentioned by `DT_NEEDED`. Modules that | 
|  | require being in the Static TLS set have `DF_STATIC_TLS` set on their | 
|  | `DT_FLAGS` entry in their dynamic table (given by the `PT_DYNAMIC` segment). | 
|  | * TLS Region: This is a contiguous region of memory unique to each | 
|  | thread. `$tp` will point to some point in this region. It contains the | 
|  | TLS segment of every module in the Static TLS set as well as some | 
|  | implementation-private data, which is sometimes called the TCB (Thread | 
|  | Control Block). On AArch64 a 16-byte reserved space starting at `$tp` is | 
|  | also sometimes called the TCB. We will refer to this space as the "ABI TCB" | 
|  | in this doc. | 
|  | * TLS Block: This is an individual thread's copy of a TLS segment. There is | 
|  | one TLS block per TLS segment per thread. | 
|  | * Module ID: The module ID is not statically known, except for the main | 
|  | executable's module ID, which is always 1. Other modules' module IDs are | 
|  | chosen by the dynamic linker. It's just a unique non-zero ID for each | 
|  | module. In theory it could be any non-zero 64-bit value that is unique to | 
|  | the module, such as a hash. In practice it's just a simple counter | 
|  | that the dynamic linker maintains. | 
|  | * The main executable: This is the module that contains the start address. It | 
|  | is also treated in a special way in one of the access models. It always | 
|  | has a Module ID of 1. This is the only module that can use fixed offsets | 
|  | from `$tp` via the Local Exec model described below. | 
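
To make the terms "TLS segment" and "Module ID" more concrete, here is a hedged sketch that lists the `PT_TLS` program header of each loaded module. It assumes a Linux-style libc (glibc or musl) that provides `dl_iterate_phdr` and the `dlpi_tls_modid` field in `<link.h>`; nothing in this document states that this is how Fuchsia exposes the information. `TlsSegmentInfo`, `demo_tls_variable`, `collect_tls_segment`, and `list_tls_segments` are hypothetical names introduced for illustration.

```cpp
#include <link.h>  // dl_iterate_phdr, dl_phdr_info, ElfW (assumed glibc/musl)

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// What we record about one module's TLS segment.
struct TlsSegmentInfo {
    std::string module_name;  // empty for the main executable on glibc
    std::size_t module_id;    // as assigned by the dynamic linker
    std::size_t file_size;    // p_filesz: bytes of initialized data
    std::size_t mem_size;     // p_memsz: full size of each TLS block
    std::size_t align;        // p_align: required alignment
};

// Gives the module containing this code a TLS segment of its own.
thread_local int demo_tls_variable = 42;

// Callback for dl_iterate_phdr: records the module's PT_TLS header, if any.
static int collect_tls_segment(dl_phdr_info* info, std::size_t, void* data) {
    auto* out = static_cast<std::vector<TlsSegmentInfo>*>(data);
    for (std::size_t i = 0; i < info->dlpi_phnum; ++i) {
        const ElfW(Phdr)& phdr = info->dlpi_phdr[i];
        if (phdr.p_type != PT_TLS) {
            continue;
        }
        assert(phdr.p_memsz >= phdr.p_filesz);
        TlsSegmentInfo segment;
        segment.module_name = info->dlpi_name ? info->dlpi_name : "";
        segment.module_id = info->dlpi_tls_modid;
        segment.file_size = static_cast<std::size_t>(phdr.p_filesz);
        segment.mem_size = static_cast<std::size_t>(phdr.p_memsz);
        segment.align = static_cast<std::size_t>(phdr.p_align);
        out->push_back(segment);
        break;  // a module has at most one PT_TLS program header
    }
    return 0;  // zero tells dl_iterate_phdr to keep iterating
}

// Returns one entry per loaded module that has a TLS segment.
std::vector<TlsSegmentInfo> list_tls_segments() {
    std::vector<TlsSegmentInfo> segments;
    dl_iterate_phdr(collect_tls_segment, &segments);
    return segments;
}
```

Modules without a `PT_TLS` header are simply skipped, which matches the point above that not every module has a TLS segment.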
|  |  | 
|  | To comply with the ABI all access models must be supported. | 
|  |  | 
|  | #### Access Models #### | 
|  |  | 
|  | There are 4 access models specified by the ABI: | 
|  |  | 
|  | * `global-dynamic` | 
|  | * `local-dynamic` | 
|  | * `initial-exec` | 
|  | * `local-exec` | 
|  |  | 
|  | These are the values that can be used for `-ftls-model=...` and | 
|  | `__attribute__((tls_model("...")))`. | 
|  |  | 
|  | Which model can be used depends on: | 
|  |  | 
|  | 1. Which module is performing the access: | 
|  |     1. The main executable | 
|  |     2. A module in the static TLS set | 
|  |     3. A module that was loaded after startup, e.g. by `dlopen` | 
|  | 2. Which module the variable being accessed is defined in: | 
|  |     1. Within the same module (i.e. `local-*`) | 
|  |     2. In a different module (i.e. `global-*`) | 
|  |  | 
|  | * `global-dynamic` Can be used from anywhere, for any variable. | 
|  | * `local-dynamic` Can be used by any module, for any variable defined in that | 
|  | same module. | 
|  | * `initial-exec` Can be used by any module for any variable defined in the static | 
|  | TLS set. | 
|  | * `local-exec` Can be used by the main executable for variables defined in the | 
|  | main executable. | 
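
As a small illustration of requesting a model explicitly, the sketch below uses the `tls_model` attribute supported by GCC and Clang. The variable and function names are made up for this example. `local-exec` is deliberately left out because it is only valid when the code ends up in the main executable, and the linker remains free to relax any of these accesses to a cheaper model.

```cpp
#include <cassert>

// May be referenced from any module, so the most general model is requested.
thread_local int counter_gd __attribute__((tls_model("global-dynamic"))) = 1;

// Only referenced from this translation unit, hence from this module.
static thread_local int counter_ld
    __attribute__((tls_model("local-dynamic"))) = 2;

// Requires that the defining module is part of the static TLS set.
thread_local int counter_ie __attribute__((tls_model("initial-exec"))) = 3;

// Adds `delta` to the calling thread's copy of each counter and returns
// the sum of the three updated values.
int bump_counters(int delta) {
    counter_gd += delta;
    counter_ld += delta;
    counter_ie += delta;
    return counter_gd + counter_ld + counter_ie;
}
```

The same choice can be made for a whole translation unit with `-ftls-model=...`; the attribute is useful when only a few hot variables warrant a non-default model.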
|  |  | 
|  | ###### Global Dynamic ###### | 
|  |  | 
|  | Global dynamic is the most general access model. It is also the slowest. | 
|  | Any thread-local global variable can be accessed with this method. This | 
|  | access model *must* be used if a dynamic library accesses a symbol defined in | 
|  | another module (see the exception in the section on Initial Exec). Symbols | 
|  | defined within the main executable need not be accessed with this model, and | 
|  | the main executable itself can also avoid using it. This is the default access | 
|  | model when compiling with `-fPIC`, as is the norm for shared libraries. | 
|  |  | 
|  | This access model works by calling a function defined in the dynamic linker. | 
|  | There are two ways this function might be called: via TLSDESC, or via | 
|  | `__tls_get_addr`. | 
|  |  | 
|  | In the case of `__tls_get_addr`, the function is passed a pointer to the pair of | 
|  | `GOT` entries associated with the symbol. The pointer refers to the first entry, | 
|  | and the second entry comes right after it. For a given symbol `S`, the first | 
|  | entry, denoted `GOT_S[0]`, must contain the Module ID of the module in which | 
|  | `S` was defined. The second entry, denoted `GOT_S[1]`, must contain the offset | 
|  | into the TLS block, which is the same as the offset of the symbol in the | 
|  | `PT_TLS` segment of the associated module. The pointer to `S` is then computed | 
|  | using `__tls_get_addr(GOT_S)`. The implementation of `__tls_get_addr` will be | 
|  | discussed later. | 
|  |  | 
|  | TLSDESC is an alternative ABI for `global-dynamic` (and `local-dynamic`) access | 
|  | that uses a different pair of `GOT` slots. The first `GOT` slot contains a | 
|  | function pointer, and the second contains some dynamic-linker-defined auxiliary | 
|  | data. This gives the dynamic linker a choice over which function is called | 
|  | depending on circumstance. | 
|  |  | 
|  | In both cases the calls to these functions must be implemented by a specific | 
|  | code sequence and a specific set of relocs. This allows the linker to recognize | 
|  | these accesses and potentially relax them to a cheaper access model (for example | 
|  | `initial-exec` or `local-exec` when linking an executable). | 
|  |  | 
|  | (NOTE: The following paragraph contains details about how the compiler upholds | 
|  | its end of the ABI. Skip this paragraph if you don't care about that.) | 
|  |  | 
|  | For the compiler to emit code for this access model, it needs to emit a call to | 
|  | `__tls_get_addr` (defined by the dynamic linker) and a reference to the symbol | 
|  | name. Specifically, the compiler emits code for `__tls_get_addr(GOT_S)` (minding | 
|  | the additional relocation needed for the GOT itself). The linker then emits two | 
|  | dynamic relocations when generating the GOT. On `x86_64` these are | 
|  | `R_X86_64_DTPMOD64` and `R_X86_64_DTPOFF64`. On AArch64 these are | 
|  | `R_AARCH64_TLS_DTPMOD` and `R_AARCH64_TLS_DTPREL`. These relocations reference | 
|  | the symbol regardless of whether the module defines a symbol by that name. | 
|  |  | 
|  | ###### Local Dynamic ###### | 
|  |  | 
|  | Local dynamic is the same as Global Dynamic but for local symbols. It can be | 
|  | thought of as a single `global-dynamic` access to the TLS block of this module. | 
|  | Because every variable defined in the module is at a fixed offset from the | 
|  | start of that TLS block, the compiler can optimize multiple `global-dynamic` | 
|  | calls into one. | 
|  | The compiler will relax a `global-dynamic` access to a `local-dynamic` access | 
|  | whenever the variables are local/static or have hidden visibility. The linker | 
|  | may sometimes be able to relax some `global-dynamic` accesses to `local-dynamic` | 
|  | as well. | 
|  |  | 
|  | The following gives an example of how the compiler might emit code for this | 
|  | access model: | 
|  |  | 
|  | ``` | 
|  | static thread_local char buf[buf_cap]; | 
|  | static thread_local size_t buf_size = 0; | 
|  | while (*str && buf_size < buf_cap) { | 
|  |   buf[buf_size++] = *str++; | 
|  | } | 
|  | ``` | 
|  |  | 
|  | might be lowered to | 
|  |  | 
|  | ``` | 
|  | // GOT_module[0] is the module ID of this module | 
|  | // GOT_module[1] is just 0 | 
|  | // <X> denotes the offset of X in this module's TLS block | 
|  | tls = __tls_get_addr(GOT_module) | 
|  | while (*str && *(size_t*)(tls+<buf_size>) < buf_cap) { | 
|  |   ((char*)(tls+<buf>))[(*(size_t*)(tls+<buf_size>))++] = *str++; | 
|  | } | 
|  | ``` | 
|  |  | 
|  | If this code used global dynamic, it would have to make at least 2 calls: one to | 
|  | get the pointer for `buf` and the other to get the pointer for `buf_size`. | 
|  |  | 
|  | ###### Initial Exec ###### | 
|  |  | 
|  | This access model can be used anytime the compiler knows the module that the | 
|  | symbol being accessed is defined in will be loaded in the initial set of | 
|  | executables rather than opened using `dlopen`. This access model is generally | 
|  | only used when the main executable is accessing a global symbol with default | 
|  | visibility. This is because compiling an executable is the only time the | 
|  | compiler knows that any code generated will be in the initial executable set. If | 
|  | a DSO is compiled to make thread local accesses use this model then the DSO | 
|  | cannot be safely opened with `dlopen`. This is acceptable in performance | 
|  | critical applications and in cases where you know the binary will never be | 
|  | dlopen-ed such as in the case of libc. Modules compiled/linked this way have | 
|  | their `DF_STATIC_TLS` flag set. | 
|  |  | 
|  | Initial Exec is the default when compiling without `-fPIC`. | 
|  |  | 
|  | The compiler emits code for this access model without calling `__tls_get_addr` | 
|  | at all. Instead it uses a single GOT entry per symbol, which we'll denote | 
|  | `GOT[s]` for symbol `s`, and emits relocations so that the dynamic linker fills | 
|  | that entry with the offset of `s` from `$tp`. For example, | 
|  |  | 
|  | ``` | 
|  | extern thread_local int a; | 
|  | extern thread_local int b; | 
|  | int main() { | 
|  |   return a + b; | 
|  | } | 
|  | ``` | 
|  |  | 
|  | would be lowered to something like the following | 
|  |  | 
|  | ``` | 
|  | int main() { | 
|  |   return *(int*)($tp + GOT[a]) + *(int*)($tp + GOT[b]); | 
|  | } | 
|  | ``` | 
|  |  | 
|  | Note that on x86 architectures `GOT[s]` will actually resolve to a negative | 
|  | value. | 
|  |  | 
|  | ###### Local Exec ###### | 
|  |  | 
|  | This is the fastest access model and can only be used if the symbol is in the | 
|  | first TLS block, which is the TLS block of the main executable. In practice only | 
|  | the main executable can use this access model because a shared library can't | 
|  | (and normally wouldn't need to) know if it is accessing something from the main | 
|  | executable. The linker will relax `initial-exec` to `local-exec`. The compiler | 
|  | can't do this without explicit instructions via `-ftls-model` or | 
|  | `__attribute__((tls_model("...")))` because the compiler cannot know if the | 
|  | current translation unit is going to be linked into a main executable or a | 
|  | shared library. | 
|  |  | 
|  | The precise details of how the offset from `$tp` is computed change a bit | 
|  | from architecture to architecture. | 
|  |  | 
|  | Example code: | 
|  |  | 
|  | ``` | 
|  | static thread_local int a; | 
|  | static thread_local int b; | 
|  |  | 
|  | int main() { | 
|  |   return a + b; | 
|  | } | 
|  | ``` | 
|  |  | 
|  | would be lowered to | 
|  |  | 
|  | ``` | 
|  | int main() { | 
|  |   return *(int*)($tp+TPOFF_a) + *(int*)($tp+TPOFF_b); | 
|  | } | 
|  | ``` | 
|  |  | 
|  | On AArch64 `TPOFF_a == max(16, p_align) + <a>` where `p_align` is exactly the | 
|  | `p_align` field of the main executable's `PT_TLS` segment and `<a>` is the | 
|  | offset of `a` from the beginning of the main executable's TLS segment. | 
|  |  | 
|  | On `x86_64` `TPOFF_a == -<a>` where `<a>` is the offset of `a` from the *end* | 
|  | of the main executable's TLS segment. | 
|  |  | 
|  | The linker is aware of what `TPOFF_X` is for any given `X` and fills in this | 
|  | value. | 
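
The two formulas above can be written out as a short sketch. The function names `tpoff_aarch64` and `tpoff_x86_64` are hypothetical helpers for illustration, and the inputs are example values rather than anything read from a real binary.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// AArch64: variables sit above $tp, after the (possibly padded) ABI TCB.
// `offset_from_start` is the offset of the variable from the beginning of
// the main executable's TLS segment.
std::int64_t tpoff_aarch64(std::int64_t p_align,
                           std::int64_t offset_from_start) {
    assert(p_align >= 0 && offset_from_start >= 0);
    return std::max<std::int64_t>(16, p_align) + offset_from_start;
}

// x86_64: variables sit below $tp. `offset_from_end` is the distance of the
// variable from the *end* of the main executable's TLS segment.
std::int64_t tpoff_x86_64(std::int64_t offset_from_end) {
    assert(offset_from_end >= 0);
    return -offset_from_end;
}
```

For instance, with `p_align == 8` and a variable 4 bytes into the segment, `tpoff_aarch64(8, 4)` is `20`. With `p_align == 64` the same variable lands at `68`, and a variable 12 bytes before the end of the segment on `x86_64` gives `tpoff_x86_64(12) == -12`.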
|  |  | 
|  | ## Implementation ## | 
|  |  | 
|  | This section discusses the implementation as it exists on Fuchsia. That said, | 
|  | the broad strokes here are similar across different libc implementations, | 
|  | including musl and glibc. | 
|  |  | 
|  | The actual implementation of all of this introduces a few more details, most | 
|  | notably the so-called "DTV" (Dynamic Thread Vector, denoted `dtv` in this doc), | 
|  | which indexes TLS blocks by module ID. The following diagrams show what the | 
|  | initial executable set looks like. In Fuchsia's implementation we store a | 
|  | bunch of meta information in a thread descriptor struct along with the | 
|  | ABI TCB (denoted `tcb` below). In our implementation we use the first 8 bytes | 
|  | of this space to point to the DTV. At first `tcb` points to `dtv` as shown in | 
|  | the diagrams below, but after a `dlopen` this can change. | 
|  |  | 
|  | arm64: | 
|  |  | 
|  | ``` | 
|  | *------------------------------------------------------------------------------* | 
|  | | thread | tcb | X | tls1 | ... | tlsN | ... | tls_cnt | dtv[1] | ... | dtv[N] | | 
|  | *------------------------------------------------------------------------------* | 
|  | ^         ^         ^             ^            ^ | 
|  | td        tp      dtv[1]       dtv[n+1]       dtv | 
|  | ``` | 
|  |  | 
|  | Here `X` has size `max(16, tls_align) - 16` where `tls_align` is the maximum | 
|  | alignment of all loaded TLS segments from the static TLS set. This is set by | 
|  | the static linker since the static linker resolves `TPOFF_*` values. This | 
|  | padding is chosen so that if, as required, `$tp` is aligned to the main | 
|  | executable's `PT_TLS` segment's `p_align` value, then `tls1 - $tp` will be | 
|  | `max(16, p_align)`. This ensures that there is always at least a 16-byte space | 
|  | for the ABI TCB (denoted `tcb` in the diagram above). | 
|  |  | 
|  | x86: | 
|  |  | 
|  | ``` | 
|  | *-----------------------------------------------------------------------------* | 
|  | | tls_cnt | dtv[1] | ... | dtv[N] | ... | tlsN | ... | tls1 | tcb |  thread   | | 
|  | *-----------------------------------------------------------------------------* | 
|  | ^                                       ^             ^       ^ | 
|  | dtv                                  dtv[n+1]       dtv[1]  tp/td | 
|  | ``` | 
|  |  | 
|  | Here `td` denotes the "thread descriptor pointer". In both implementations this | 
|  | points to the thread descriptor. A subtle point not made apparent in these | 
|  | diagrams is that `tcb` is actually a member of the thread descriptor struct in | 
|  | both cases but on AArch64 it is the last member and on `x86_64` it is the first | 
|  | member. | 
|  |  | 
|  | #### dlopen #### | 
|  |  | 
|  | This picture explains what happens for the initial executables but it doesn't | 
|  | explain what happens in the `dlopen` case. When `__tls_get_addr` is called it | 
|  | first checks whether `tls_cnt` is such that the module ID (given by `GOT_s[0]`) | 
|  | is within the `dtv`. If it is, then it simply looks up `dtv[GOT_s[0]] + GOT_s[1]`, | 
|  | but if it isn't, something more complicated happens. See the implementation of | 
|  | `__tls_get_new` in [dynlink.c](/zircon/third_party/ulib/musl/ldso/dynlink.c). | 
|  |  | 
|  | In a nutshell, a sufficiently large space was already allocated for a larger | 
|  | `dtv` during the call to `dlopen`. It is an invariant of the system that | 
|  | sufficient space will always exist somewhere already allocated. The larger space | 
|  | is then set up to be a proper `dtv`, and `tcb` is set to point to this new, | 
|  | larger `dtv`. Future accesses will then use the simpler code path since | 
|  | `tls_cnt` will be large enough. | 
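
The fast and slow paths can be illustrated with a deliberately simplified model. This is not the Fuchsia or musl implementation. It assumes a plain vector in place of the preallocated `dtv` space, and it allocates every block lazily, whereas the real static TLS set is laid out at thread creation. All names (`TlsIndex`, `ModuleImage`, `ModuleTable`, `ThreadModel`, `model_tls_get_addr`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in for the GOT pair (GOT_s[0], GOT_s[1]).
struct TlsIndex {
    std::size_t module_id;  // non-zero; the main executable is 1
    std::size_t offset;     // offset of the symbol within the TLS block
};

// Stand-in for one module's PT_TLS segment.
struct ModuleImage {
    std::vector<char> init;  // initialized bytes (p_filesz of them)
    std::size_t mem_size;    // block size (p_memsz); the tail is zero-filled
};

// modules[i] describes module ID i + 1. Appending an entry models dlopen.
using ModuleTable = std::vector<ModuleImage>;

// Per-thread state: blocks[i] is this thread's TLS block for module ID i + 1,
// and blocks.size() plays the role of tls_cnt.
struct ThreadModel {
    std::vector<std::vector<char>> blocks;
};

char* model_tls_get_addr(const ModuleTable& modules, ThreadModel& thread,
                         TlsIndex index) {
    assert(index.module_id >= 1 && index.module_id <= modules.size());
    if (index.module_id > thread.blocks.size()) {
        // Slow path: this thread has not seen the module yet, so extend its
        // vector and give it a fresh copy of every missing TLS image.
        for (std::size_t id = thread.blocks.size() + 1; id <= modules.size();
             ++id) {
            const ModuleImage& image = modules[id - 1];
            assert(image.init.size() <= image.mem_size);
            std::vector<char> block(image.mem_size, 0);
            std::copy(image.init.begin(), image.init.end(), block.begin());
            thread.blocks.push_back(std::move(block));
        }
    }
    // Fast path: index by module ID, then add the symbol's offset.
    std::vector<char>& block = thread.blocks[index.module_id - 1];
    assert(index.offset < block.size());
    return block.data() + index.offset;
}
```

In this model, a thread that looks up a symbol from a newly appended module takes the slow path once. Later lookups for that module take the fast path, which mirrors how `tls_cnt` becomes large enough after the `dtv` is replaced.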