| # Thread Local Storage # |
| |
| The ELF Thread Local Storage (TLS) ABI is a storage model for variables that |
| allows each thread to have a unique copy of a global variable. This model |
| is used to implement C++'s `thread_local` storage class. On thread creation the |
| variable is given its initial value from the initial TLS image. TLS |
| variables are useful, for instance, as buffers in thread-safe code or for |
| per-thread bookkeeping. C-style error state such as `errno` or `dlerror` can |
| also be handled this way. |
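
As a minimal sketch of this behavior, assuming C11 `_Thread_local` and POSIX threads (neither is mandated by the text above; `counter`, `worker`, and `tls_copies_are_independent` are illustrative names), each thread below starts from the initial value and only modifies its own copy:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

// Every thread gets its own copy, initialized to 5 from the TLS image.
_Thread_local int counter = 5;

// Increments this thread's copy and reports the value it observed.
void *worker(void *arg) {
    int *out = arg;
    counter += 1;
    *out = counter;
    return NULL;
}

// Returns 1 if both workers saw 6 while the calling thread still sees 5.
int tls_copies_are_independent(void) {
    pthread_t t1, t2;
    int r1 = 0, r2 = 0;
    if (pthread_create(&t1, NULL, worker, &r1) != 0) return 0;
    if (pthread_create(&t2, NULL, worker, &r2) != 0) {
        pthread_join(t1, NULL);
        return 0;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return r1 == 6 && r2 == 6 && counter == 5;
}
```

If `counter` were an ordinary global, one of the workers would observe 7 instead.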
| |
| TLS variables are much like any other global/static variable. In the |
| implementation their initial data winds up in the `PT_TLS` segment. The `PT_TLS` |
| segment is inside of a read-only `PT_LOAD` segment despite TLS variables being |
| writable. This segment is then copied into the process for each thread in a unique |
| writable location. The location the `PT_TLS` segment is copied to is influenced |
| by the segment's alignment to ensure that the alignment of TLS variables is |
| respected. |
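
The per-thread copy can be sketched roughly as follows. This is an illustration under stated assumptions, not the dynamic linker's actual code: `struct tls_image`, `align_up`, and `init_tls_block` are hypothetical names, and the sketch relies on the ELF convention that a segment has `p_filesz` initialized bytes followed by zero fill up to `p_memsz`. It also ignores the `p_vaddr % p_align` adjustment that a real implementation has to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical description of one PT_TLS segment; the fields mirror the ELF
// program header fields p_filesz, p_memsz, and p_align.
struct tls_image {
    const void *data;  // initialized bytes in the read-only image
    size_t filesz;     // number of initialized bytes
    size_t memsz;      // total block size; the tail is zero-filled
    size_t align;      // required alignment, a power of two
};

// Round addr up to the next multiple of align (a power of two).
uintptr_t align_up(uintptr_t addr, size_t align) {
    return (addr + align - 1) & ~(uintptr_t)(align - 1);
}

// Set up one thread's writable TLS block inside raw storage and return it.
// The caller must provide at least memsz + align - 1 bytes of storage.
void *init_tls_block(void *storage, const struct tls_image *img) {
    void *block = (void *)align_up((uintptr_t)storage, img->align);
    memcpy(block, img->data, img->filesz);
    memset((char *)block + img->filesz, 0, img->memsz - img->filesz);
    return block;
}
```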
| |
| ## ABI ## |
| |
| The interface that the compiler, linker, and dynamic linker must adhere to is |
| quite simple, even though the details of the implementation are more |
| complex. The compiler and the linker must emit code and dynamic relocations that |
| use one of the 4 access models (described in a following section). The dynamic |
| linker and thread implementation must then set everything up so that this |
| actually works. Different architectures have different ABIs but they're similar |
| enough in broad strokes that we can speak about most of them as if there were |
| just one ABI. This document will assume that either x86-64 or AArch64 is being |
| used and will point out differences when they occur. |
| |
| The TLS ABI makes use of a few terms: |
| |
| * Thread Pointer: This is a unique address in each thread, generally stored |
| in a register. Thread local variables lie at offsets from the thread pointer. |
| Thread Pointer will be abbreviated and used as `$tp` in this document. `$tp` |
| is what `__builtin_thread_pointer()` returns on AArch64. On AArch64 `$tp` |
| is given by a special register named `TPIDR_EL0` that can be accessed using |
| `mrs <reg>, TPIDR_EL0`. On `x86_64` the `fs.base` segment base is used. Memory |
| relative to it can be accessed with the `%fs:` prefix, and its value can be |
| loaded from `%fs:0` or with the `rdfsbase` instruction. |
| * TLS Segment: This is the image of data in each module and specified by the |
| `PT_TLS` program header in each module. Not every module has a `PT_TLS` |
| program header and thus not every module has a TLS segment. Each module |
| has at most one TLS segment and correspondingly at most one `PT_TLS` |
| program header. |
| * Static TLS set: This is the sum total of modules that are known to the |
| dynamic linker at program start up time. It consists of the main executable |
| and every library transitively mentioned by `DT_NEEDED`. Modules that |
| require being in the Static TLS set have `DF_STATIC_TLS` set on their |
| `DT_FLAGS` entry in their dynamic table (given by the `PT_DYNAMIC` segment). |
| * TLS Region: This is a contiguous region of memory unique to each |
| thread. `$tp` will point to some point in this region. It contains the |
| TLS segment of every module in Static TLS set as well as some |
| implementation-private data, which is sometimes called the TCB (Thread |
| Control Block). On AArch64 a 16-byte reserved space starting at `$tp` is |
| also sometimes called the TCB. We will refer to this space as the "ABI TCB" |
| in this doc. |
| * TLS Block: This is an individual thread's copy of a TLS segment. There is |
| one TLS block per TLS segment per thread. |
| * Module ID: The module ID is not statically known except for the main |
| executable's module ID which is always 1. Other modules' module IDs are |
| chosen by the dynamic linker. It's just a unique non-zero ID for each |
| module. In theory it could be any non-zero 64-bit value that is unique to |
| the module, such as a hash. In practice it's just a simple counter |
| that the dynamic linker maintains. |
| * The main executable: This is the module that contains the start address. It |
| is also treated in a special way in one of the access models. It always |
| has a Module ID of 1. This is the only module that can use fixed offsets |
| from `$tp` via the Local Exec model described below. |
| |
| To comply with the ABI all access models must be supported. |
| |
| #### Access Models #### |
| |
| There are 4 access models specified by the ABI: |
| |
| * `global-dynamic` |
| * `local-dynamic` |
| * `initial-exec` |
| * `local-exec` |
| |
| These are the values that can be used for `-ftls-model=...` and |
| `__attribute__((tls_model("...")))`. |
| |
| Which model is used relates to: |
| |
| 1. Which module is performing the access: |
| 1. The main executable |
| 2. A module in the static TLS set |
| 3. A module that was loaded after startup, e.g. by `dlopen` |
| 2. Which module the variable being accessed is defined in: |
| 1. Within the same module (i.e. `local-*`) |
| 2. In a different module (i.e. `global-*`) |
| |
| * `global-dynamic` Can be used from anywhere, for any variable. |
| * `local-dynamic` Can be used by any module, for any variable defined in that |
| same module. |
| * `initial-exec` Can be used by any module for any variable defined in the static |
| TLS set. |
| * `local-exec` Can be used by the main executable for variables defined in the |
| main executable. |
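
The rules above can be condensed into a small decision function. This is only a sketch: `enum module_kind` and `cheapest_tls_model` are hypothetical names, and the function assumes the cost ordering `local-exec` < `initial-exec` < `local-dynamic` < `global-dynamic`. It describes which model is permissible, not what a particular compiler will pick by default.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

// Kind of module performing the access (hypothetical enum for illustration).
enum module_kind { MAIN_EXECUTABLE, STATIC_TLS_SET, DLOPENED };

// Returns the name of the cheapest access model the rules above permit.
//   accessor:              kind of module performing the access
//   same_module:           the variable is defined in the accessing module
//   definer_in_static_set: the defining module is in the static TLS set
const char *cheapest_tls_model(enum module_kind accessor, bool same_module,
                               bool definer_in_static_set) {
    if (accessor == MAIN_EXECUTABLE && same_module)
        return "local-exec";
    if (definer_in_static_set)
        return "initial-exec";
    if (same_module)
        return "local-dynamic";
    return "global-dynamic";
}
```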
| |
| ###### Global Dynamic ###### |
| |
| Global dynamic is the most general access format. It is also the slowest. |
| Any thread-local global variable should be accessible with this method. This |
| access model *must* be used if a dynamic library accesses a symbol defined in |
| another module (see exception in section on Initial Exec). Symbols defined |
| within the executable need not use this access model. The main executable can |
| also avoid using this access model. This is the default access model when |
| compiling with `-fPIC` as is the norm for shared libraries. |
| |
| This access model works by calling a function defined in the dynamic linker. |
| There are two ways functions might be called, via TLSDESC, or via |
| `__tls_get_addr`. |
| |
| In the case of `__tls_get_addr` it is passed the pair of `GOT` entries |
| associated with this symbol. Specifically it is passed a pointer to the first |
| entry, and the second entry comes right after it. For a given symbol `S`, the first |
| entry, denoted `GOT_S[0]`, must contain the Module ID of the module in which |
| `S` was defined. The second entry, denoted `GOT_S[1]`, must contain the offset into |
| the TLS block, which is the same as the offset of the symbol in the `PT_TLS` segment |
| of the associated module. The pointer to `S` is then computed using |
| `__tls_get_addr(GOT_S)`. The implementation of `__tls_get_addr` will be |
| discussed later. |
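
To make the role of the two entries concrete, here is a toy stand-in for the lookup. It is a sketch, not the real `__tls_get_addr`: `struct got_pair` and `toy_tls_get_addr` are hypothetical names, and the `dtv` array (a per-thread table of TLS block addresses indexed by module ID) anticipates the Implementation section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// The two consecutive GOT entries for one symbol S.
struct got_pair {
    uintptr_t module_id;  // GOT_S[0]: ID of the module defining S
    uintptr_t offset;     // GOT_S[1]: offset of S within that module's TLS block
};

// Toy lookup: dtv[id] is assumed to hold the address of the calling thread's
// TLS block for module id. Slot 0 is not a module and is never indexed here.
void *toy_tls_get_addr(void *const *dtv, const struct got_pair *got_s) {
    return (char *)dtv[got_s->module_id] + got_s->offset;
}
```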
| |
| TLSDESC is an alternative ABI for `global-dynamic` access (and `local-dynamic`) |
| where a different pair of `GOT` slots are used where the first `GOT` slot |
| contains a function pointer. The second contains some dynamic linker defined |
| auxiliary data. This allows the dynamic linker a choice over which function is |
| called depending on circumstance. |
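
A rough model of the descriptor idea follows. It is a simplification under stated assumptions: `struct tlsdesc`, `resolve_static`, and `tlsdesc_access` are hypothetical names, the resolver is modeled as returning an offset from `$tp`, and real resolvers are called with a special calling convention that plain C cannot express.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical layout of the two TLSDESC GOT slots.
struct tlsdesc {
    ptrdiff_t (*resolver)(const struct tlsdesc *);  // slot 0: function pointer
    uintptr_t arg;                                  // slot 1: auxiliary data
};

// Resolver the dynamic linker could install when the symbol lives in the
// static TLS set: arg already holds a fixed offset from $tp.
ptrdiff_t resolve_static(const struct tlsdesc *d) {
    return (ptrdiff_t)d->arg;
}

// Access through the descriptor: address = $tp + resolver(descriptor).
void *tlsdesc_access(char *tp, const struct tlsdesc *d) {
    return tp + d->resolver(d);
}
```

A different resolver could be installed for modules loaded later, which is the choice the paragraph above refers to.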
| |
| In both cases the calls to these functions must be implemented by a specific |
| code sequence and a specific set of relocs. This allows the linker to recognize |
| these accesses and potentially relax them to the `local-dynamic` access model. |
| |
| (NOTE: The following paragraph contains details about how the compiler upholds |
| its end of the ABI. Skip this paragraph if you don't care about that.) |
| |
| For the compiler to emit code for this access model a call needs to be emitted |
| against `__tls_get_addr` (defined by the dynamic linker) and a reference to the |
| symbol name. Specifically the compiler emits code for (minding the |
| additional relocation needed for the GOT itself) `__tls_get_addr(GOT_S)`. The |
| linker then emits two dynamic relocations when generating the GOT. On `x86_64` |
| these are `R_X86_64_DTPMOD64` and `R_X86_64_DTPOFF64`. On AArch64 these are |
| `R_AARCH64_TLS_DTPMOD` and `R_AARCH64_TLS_DTPREL`. These relocations reference the |
| symbol regardless of whether the module defines a symbol by that name. |
| |
| ###### Local Dynamic ###### |
| |
| Local dynamic is the same as Global Dynamic but for local symbols. It can be |
| thought of as a single `global-dynamic` access to the TLS block of this module. |
| Then because every variable defined in the module is at fixed offsets from the |
| TLS block the compiler can optimize multiple `global-dynamic` calls into one. |
| The compiler will relax a `global-dynamic` access to a `local-dynamic` access |
| whenever the variables are local/static or have hidden visibility. The linker |
| may sometimes be able to relax some `global-dynamic` accesses to `local-dynamic` |
| as well. |
| |
| The following gives an example of how the compiler might emit code for this |
| access model: |
| |
| ``` |
| static thread_local char buf[buf_cap]; |
| static thread_local size_t buf_size = 0; |
| while(*str && buf_size < buf_cap) { |
| buf[buf_size++] = *str++; |
| } |
| ``` |
| |
| might be lowered to |
| |
| ``` |
| // GOT_module[0] is the module ID of this module |
| // GOT_module[1] is just 0 |
| // <X> denotes the offset of X in this module's TLS block |
| tls = __tls_get_addr(GOT_module) |
| while(*str && *(size_t*)(tls+<buf_size>) < buf_cap) { |
| ((char*)(tls+<buf>))[(*(size_t*)(tls+<buf_size>))++] = *str++; |
| } |
| ``` |
| |
| If this code used global dynamic it would have to make at least 2 calls: one to |
| get the pointer for `buf` and the other to get the pointer for `buf_size`. |
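
For experimenting with this, the source fragment above can be turned into a complete translation unit. The following is a sketch: `BUF_CAP`, `append_to_buf`, and `buf_at` are illustrative names, and the `tls_model` attribute is the GCC/Clang extension mentioned earlier. When this is linked into an executable the linker may relax the accesses further, so inspecting the object file is the reliable way to see the `local-dynamic` sequence.

```c
#include <assert.h>
#include <stddef.h>

enum { BUF_CAP = 8 };

// Both variables are local to this module, so one lookup of the module's TLS
// block is enough to reach either of them.
static _Thread_local char buf[BUF_CAP]
    __attribute__((tls_model("local-dynamic")));
static _Thread_local size_t buf_size
    __attribute__((tls_model("local-dynamic"))) = 0;

// Appends str to the calling thread's buffer until the buffer is full.
// Returns the number of bytes now stored.
size_t append_to_buf(const char *str) {
    while (*str && buf_size < BUF_CAP) {
        buf[buf_size++] = *str++;
    }
    return buf_size;
}

// Reads back one stored byte (i must be less than BUF_CAP).
char buf_at(size_t i) {
    return buf[i];
}
```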
| |
| ###### Initial Exec ###### |
| |
| This access model can be used anytime the compiler knows the module that the |
| symbol being accessed is defined in will be loaded in the initial set of |
| executables rather than opened using `dlopen`. This access model is generally |
| only used when the main executable is accessing a global symbol with default |
| visibility. This is because compiling an executable is the only time the |
| compiler knows that any code generated will be in the initial executable set. If |
| a DSO is compiled to make thread local accesses use this model then the DSO |
| cannot be safely opened with `dlopen`. This is acceptable in performance-critical |
| applications and in cases where you know the binary will never be |
| opened with `dlopen`, such as in the case of libc. Modules compiled/linked this way have |
| their `DF_STATIC_TLS` flag set. |
| |
| Initial Exec is the default when compiling without `-fPIC`. |
| |
| The compiler emits code without even calling `__tls_get_addr` for this access |
| model. It does so using a single GOT entry, which we'll denote `GOT[s]` for symbol |
| `s`, for which the compiler emits relocations to ensure that the entry holds |
| the offset of `s` from `$tp`. For example, |
| |
| ``` |
| extern thread_local int a; |
| extern thread_local int b; |
| int main() { |
| return a + b; |
| } |
| ``` |
| |
| would be lowered to something like the following |
| |
| ``` |
| int main() { |
| return *(int*)($tp + GOT[a]) + *(int*)($tp + GOT[b]); |
| } |
| ``` |
| |
| Note that on x86 architectures `GOT[s]` will actually resolve to a negative |
| value. |
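
The lowered access can be imitated in plain C by treating `$tp` as an ordinary pointer and the GOT entries as signed offsets. This is a sketch only: `load_initial_exec` is a hypothetical helper, and in real code both the thread pointer and the GOT entry are supplied by the runtime and dynamic linker rather than passed as arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

// Loads an int located at tp + *got_entry, mirroring *(int*)($tp + GOT[s]).
// memcpy is used so the sketch does not depend on pointer-cast aliasing rules.
int load_initial_exec(const char *tp, const ptrdiff_t *got_entry) {
    int value;
    memcpy(&value, tp + *got_entry, sizeof value);
    return value;
}
```

With an x86-64 style layout, where the TLS data lies below `$tp`, the stored offsets are negative; the test values used with this sketch place `a` and `b` at `-2 * sizeof(int)` and `-1 * sizeof(int)`.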
| |
| ###### Local Exec ###### |
| |
| This is the fastest access model and can only be used if the symbol is in the |
| first TLS block, which is the TLS block of the main executable. In practice only |
| the main executable can use this access model because any shared library can't |
| (and normally wouldn't need to) know if it is accessing something from the main |
| executable. The linker will relax `initial-exec` to `local-exec`. The compiler |
| can't do this without explicit instructions via `-ftls-model` or |
| `__attribute__((tls_model("...")))` because the compiler cannot know if the |
| current translation unit is going to be linked into a main executable or a |
| shared library. |
| |
| The precise details of how this offset is computed change a bit |
| from architecture to architecture. |
| |
| Example code: |
| |
| ``` |
| static thread_local int a; |
| static thread_local int b; |
| |
| int main() { |
| return a + b; |
| } |
| ``` |
| |
| would be lowered to |
| |
| ``` |
| int main() { |
| return *(int*)($tp+TPOFF_a) + *(int*)($tp+TPOFF_b); |
| } |
| ``` |
| |
| On AArch64 `TPOFF_a == max(16, p_align) + <a>` where `p_align` is exactly the |
| `p_align` field of the main executable's `PT_TLS` segment and `<a>` is the |
| offset of `a` from the beginning of the main executable's TLS segment. |
| |
| On `x86_64` `TPOFF_a == -<a>` where `<a>` is the offset of `a` from the *end* |
| of the main executable's TLS segment. |
| |
| The linker is aware of what `TPOFF_X` is for any given `X` and fills in this |
| value. |
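
The two formulas can be written out as small helpers, which makes the sign difference between the architectures easy to check. The function names are illustrative, and the sketch takes the offsets `<a>` as inputs rather than deriving them from a real `PT_TLS` segment.

```c
#include <assert.h>
#include <stddef.h>

// AArch64: TPOFF = max(16, p_align) + offset from the START of the main
// executable's TLS segment. The first 16 bytes at $tp are the ABI TCB.
ptrdiff_t tpoff_aarch64(size_t p_align, size_t offset_from_start) {
    size_t tcb_space = p_align > 16 ? p_align : 16;
    return (ptrdiff_t)(tcb_space + offset_from_start);
}

// x86-64: TPOFF = -(offset from the END of the main executable's TLS
// segment), because the TLS data sits below $tp.
ptrdiff_t tpoff_x86_64(size_t offset_from_end) {
    return -(ptrdiff_t)offset_from_end;
}
```

For instance, with `p_align == 8` a variable at offset 4 gets `TPOFF == 20` on AArch64, while a variable 8 bytes before the end of the segment gets `TPOFF == -8` on `x86_64`.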
| |
| ## Implementation ## |
| |
| This section discusses the implementation used on Fuchsia. That said, the |
| broad strokes here are similar across different libc |
| implementations including musl and glibc. |
| |
| The actual implementation of all of this introduces a few more details, namely |
| the so-called "DTV" (Dynamic Thread Vector) (denoted `dtv` in this doc), which |
| indexes TLS blocks by module ID. The following diagram shows what the initial |
| executable set looks like. In Fuchsia's implementation we actually store a |
| bunch of meta information in a thread descriptor struct along with the |
| ABI TCB (denoted `tcb` below). In our implementation we use the first 8 bytes |
| of this space to point to the DTV. At first `tcb` points to `dtv` as shown in |
| the diagrams below, but after a `dlopen` this can change. |
| |
| arm64: |
| |
| ``` |
| *------------------------------------------------------------------------------* |
| | thread | tcb | X | tls1 | ... | tlsN | ... | tls_cnt | dtv[1] | ... | dtv[N] | |
| *------------------------------------------------------------------------------* |
| ^        ^         ^            ^            ^ |
| td       tp        dtv[1]       dtv[n+1]     dtv |
| ``` |
| |
| Here `X` has size `max(16, tls_align) - 16` where `tls_align` is the maximum |
| alignment of all loaded TLS segments from the static TLS set. This is set by |
| the static linker since the static linker resolves `TPOFF_*` values. This |
| padding is chosen so that if, as required, `$tp` is aligned to the main |
| executable's `PT_TLS` segment's `p_align` value then `tls1 - $tp` will be |
| `max(16, p_align)`. This ensures that there is always at least a 16-byte space |
| for the ABI TCB (denoted `tcb` in the diagram above). |
| |
| x86: |
| |
| ``` |
| *-----------------------------------------------------------------------------* |
| | tls_cnt | dtv[1] | ... | dtv[N] | ... | tlsN | ... | tls1 | tcb | thread | |
| *-----------------------------------------------------------------------------* |
| ^                                       ^            ^      ^ |
| dtv                                     dtv[n+1]     dtv[1] tp/td |
| ``` |
| |
| Here `td` denotes the "thread descriptor pointer". In both implementations this |
| points to the thread descriptor. A subtle point not made apparent in these |
| diagrams is that `tcb` is actually a member of the thread descriptor struct in |
| both cases but on AArch64 it is the last member and on `x86_64` it is the first |
| member. |
| |
| #### dlopen #### |
| |
| This picture explains what happens for the initial executables but it doesn't |
| explain what happens in the `dlopen` case. When `__tls_get_addr` is called it |
| first checks whether `tls_cnt` is such that the module ID (given by `GOT_s[0]`) |
| is within the `dtv`. If it is then it simply looks up `dtv[GOT_s[0]] + GOT_s[1]`, |
| but if it isn't something more complicated happens. See the implementation of |
| `__tls_get_new` in [dynlink.c](/zircon/third_party/ulib/musl/ldso/dynlink.c). |
| |
| In a nutshell, a sufficiently large space was already allocated for a larger `dtv` |
| on a call to `dlopen`. It is an invariant of the system that sufficient space |
| will always exist somewhere already allocated. The larger space is then set up to |
| be a proper `dtv`. `tcb` is then set to point to this new larger `dtv`. Future |
| accesses will then use the simpler code path since `tls_cnt` will be large |
| enough. |
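
The two code paths can be modeled with a toy structure. This is a sketch under several assumptions and is not the code in `dynlink.c`: `struct toy_thread`, its fields, and `toy_tls_get_addr_grow` are hypothetical, `dtv[0]` is assumed to hold `tls_cnt` as the diagrams suggest, and the pre-allocated space is passed in explicitly. A real implementation also has to initialize each new TLS block from its module's TLS image.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Toy per-thread state. dtv[0] holds tls_cnt; dtv[i] holds the address of
// this thread's TLS block for module i.
struct toy_thread {
    uintptr_t *dtv;        // what the ABI TCB points to
    uintptr_t *spare_dtv;  // larger array reserved during dlopen
    size_t spare_cnt;      // highest module ID that spare_dtv can index
    void **new_blocks;     // pre-allocated blocks, indexed by module ID
};

void *toy_tls_get_addr_grow(struct toy_thread *t, size_t module_id,
                            size_t offset) {
    size_t tls_cnt = t->dtv[0];
    if (module_id > tls_cnt) {
        // Slow path: turn the reserved space into a proper dtv and switch.
        uintptr_t *bigger = t->spare_dtv;
        for (size_t i = 1; i <= tls_cnt; i++)
            bigger[i] = t->dtv[i];
        for (size_t i = tls_cnt + 1; i <= t->spare_cnt; i++)
            bigger[i] = (uintptr_t)t->new_blocks[i];
        bigger[0] = t->spare_cnt;
        t->dtv = bigger;
    }
    // Fast path (also taken after growing): dtv[module] + offset.
    return (char *)t->dtv[module_id] + offset;
}
```

After the first access to a newly loaded module, later calls for that module skip the slow path because `dtv[0]` already covers its ID.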