Thread Local Storage

The ELF Thread Local Storage ABI (TLS) is a storage model for variables that allows each thread to have a unique copy of a global variable. This model is used to implement C++'s thread_local storage model. On thread creation the variable is given its initial value from the initial TLS image. TLS variables are useful, for instance, as buffers in thread-safe code or for per-thread bookkeeping. C-style error state such as errno or dlerror can also be handled this way.
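
The per-thread copy and its initial value can be seen in a small sketch. The names counter, bump, and bump_twice_on_new_thread are made up for this illustration and are not part of any ABI.

```cpp
#include <thread>

// Each thread gets its own copy of `counter`, initialized to 5
// from the initial TLS image.
thread_local int counter = 5;

// Increments the calling thread's copy and returns its new value.
int bump() { return ++counter; }

// Runs bump() twice on a fresh thread and returns what that thread saw.
// The new thread starts from the initial value, not from the caller's copy.
int bump_twice_on_new_thread() {
  int seen = 0;
  std::thread t([&seen] {
    bump();
    seen = bump();
  });
  t.join();
  return seen;
}
```

Whatever the calling thread has done to its own counter, the new thread starts again from the initial value, and the caller's copy is untouched afterwards.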

TLS variables are much like any other global/static variable. In the implementation, their initial data ends up in the PT_TLS segment. The PT_TLS segment lies inside a read-only PT_LOAD segment even though TLS variables are writable. For each thread, this segment is copied into the process at a unique writable location. The location the PT_TLS segment is copied to is influenced by the segment's alignment to ensure that the alignment of TLS variables is respected.

ABI

The interface that the compiler, linker, and dynamic linker must adhere to is quite simple, even though the details of the implementation are more complex. The compiler and the linker must emit code and dynamic relocations that use one of the four access models (described in a following section). The dynamic linker and thread implementation must then set everything up so that this actually works. Different architectures have different ABIs, but they are similar enough in broad strokes that we can speak about most of them as if there were just one ABI. This document assumes that either x86-64 or AArch64 is being used and points out differences where they occur.

The TLS ABI makes use of a few terms:

Thread Pointer: This is a unique address in each thread, generally stored in a register. Thread local variables lie at offsets from the thread pointer. Thread Pointer will be abbreviated as $tp in this document. $tp is what __builtin_thread_pointer() returns on AArch64. On AArch64 $tp is held in a special register named TPIDR_EL0 that can be read using mrs <reg>, TPIDR_EL0. On x86_64 the fs.base segment base is used; memory relative to it can be accessed with the %fs: prefix, and the value of $tp itself can be loaded from %fs:0 or read with the rdfsbase instruction.
TLS Segment: This is the image of data in each module and specified by the PT_TLS program header in each module. Not every module has a PT_TLS program header and thus not every module has a TLS segment. Each module has at most one TLS segment and correspondingly at most one PT_TLS program header.
Static TLS set: This is the sum total of modules that are known to the dynamic linker at program start up time. It consists of the main executable and every library transitively mentioned by DT_NEEDED. Modules that require being in the Static TLS set have DF_STATIC_TLS set on their DT_FLAGS entry in their dynamic table (given by the PT_DYNAMIC segment).
TLS Region: This is a contiguous region of memory unique to each thread. $tp will point to some point in this region. It contains the TLS segment of every module in the Static TLS set as well as some implementation-private data which is sometimes called the TCB (Thread Control Block). On AArch64 a 16-byte reserved space starting at $tp is also sometimes called the TCB. We will refer to this space as the "ABI TCB" in this doc.
TLS Block: This is an individual thread's copy of a TLS segment. There is one TLS block per TLS segment per thread.
Module ID: The module ID is not statically known except for the main executable's module ID which is always 1. Other modules' module IDs are chosen by the dynamic linker. It's just a unique non-zero ID for each module. In theory it could be any non-zero 64-bit value that is unique to the module like a hash or something. In practice it's just a simple counter that the dynamic linker maintains.
The main executable: This is the module that contains the start address. It is also treated in a special way in one of the access models. It always has a Module ID of 1. This is the only module that can use fixed offsets from $tp via the Local Exec model described below.
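
As a rough illustration of the Thread Pointer term, the following sketch reads $tp with inline assembly on the two architectures this document covers. The function names are hypothetical, and the preprocessor guards are an assumption of this sketch: on any other platform it simply returns 0.

```cpp
#include <cstdint>
#include <thread>

// Returns the calling thread's $tp, or 0 on platforms this sketch
// does not cover.
std::uintptr_t read_thread_pointer() {
  std::uintptr_t tp = 0;
#if defined(__aarch64__) && (defined(__linux__) || defined(__Fuchsia__))
  __asm__ volatile("mrs %0, tpidr_el0" : "=r"(tp));
#elif defined(__x86_64__) && (defined(__linux__) || defined(__Fuchsia__))
  // $tp can be loaded from %fs:0.
  __asm__ volatile("movq %%fs:0, %0" : "=r"(tp));
#endif
  return tp;
}

// Reads $tp from a freshly created thread.
std::uintptr_t thread_pointer_on_new_thread() {
  std::uintptr_t tp = 0;
  std::thread t([&tp] { tp = read_thread_pointer(); });
  t.join();
  return tp;
}
```

On a supported platform, two threads that are alive at the same time see different values, and repeated reads on one thread see the same value.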

To comply with the ABI all access models must be supported.

Access Models

There are 4 access models specified by the ABI:

global-dynamic
local-dynamic
initial-exec
local-exec

These are the values that can be used for -ftls-model=... and __attribute__((tls_model("..."))).
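
A minimal sketch of the attribute form, using the GCC/Clang attribute syntax. The variable and function names are invented for this example.

```cpp
// Explicitly request the initial-exec model for one variable.
thread_local int request_depth __attribute__((tls_model("initial-exec"))) = 0;

// No attribute: the compiler picks a model based on -ftls-model, -fPIC,
// and the visibility of the symbol.
thread_local int default_model_counter = 0;

int enter_request() { return ++request_depth; }
int leave_request() { return --request_depth; }
```

The attribute changes only how the access is compiled, not the behavior of the variable, so callers of enter_request and leave_request cannot tell the difference.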

Which model is used relates to:

Which module is performing the access:
The main executable
A module in the static TLS set
A module that was loaded after startup, e.g. by dlopen
Which module the variable being accessed is defined in:
Within the same module (i.e. local-*)
In a different module (i.e. global-*)

global-dynamic: Can be used from anywhere, for any variable.
local-dynamic: Can be used by any module, for any variable defined in that same module.
initial-exec: Can be used by any module, for any variable defined in the static TLS set.
local-exec: Can be used by the main executable, for variables defined in the main executable.
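
The four rules above can be written as a small decision function. This is only a sketch: the types and names are hypothetical, and the preference order (local-exec, then initial-exec, then local-dynamic, then global-dynamic) is an assumption of this example rather than something the ABI mandates.

```cpp
enum class TlsModel { GlobalDynamic, LocalDynamic, InitialExec, LocalExec };

// Facts about one access, as listed in the text above.
struct Access {
  bool accessor_is_main_executable;  // the accessing module is the main executable
  bool same_module;                  // variable is defined in the accessing module
  bool variable_in_static_set;       // defining module is in the static TLS set
};

// Picks the most specific model that the rules permit for this access.
TlsModel most_specific_model(const Access& a) {
  if (a.accessor_is_main_executable && a.same_module) return TlsModel::LocalExec;
  if (a.variable_in_static_set) return TlsModel::InitialExec;
  if (a.same_module) return TlsModel::LocalDynamic;
  return TlsModel::GlobalDynamic;
}
```

For example, a dlopen-ed module accessing its own variable gets local-dynamic, while the same module accessing a variable in another dlopen-ed module falls back to global-dynamic.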

Global Dynamic

Global dynamic is the most general access format. It is also the slowest. Any thread-local global variable should be accessible with this method. This access model must be used if a dynamic library accesses a symbol defined in another module (see exception in section on Initial Exec). Symbols defined within the executable need not use this access model. The main executable can also avoid using this access model. This is the default access model when compiling with -fPIC as is the norm for shared libraries.

This access model works by calling a function defined in the dynamic linker. There are two ways functions might be called, via TLSDESC, or via __tls_get_addr.

In the case of __tls_get_addr, the function is passed the pair of GOT entries associated with this symbol. Specifically, it is passed a pointer to the first entry, and the second entry comes right after it. For a given symbol S, the first entry, denoted GOT_S[0], must contain the Module ID of the module in which S was defined. The second entry, denoted GOT_S[1], must contain the offset into the TLS block, which is the same as the offset of the symbol in the PT_TLS segment of the associated module. The pointer to S is then computed using __tls_get_addr(GOT_S). The implementation of __tls_get_addr will be discussed later.
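
The contract of the GOT pair can be modeled in a few lines. This is a toy, not the dynamic linker's code: TlsIndex, ToyThread, and toy_tls_get_addr are hypothetical names, and the per-thread table of block addresses is passed explicitly instead of being found through $tp.

```cpp
#include <cstddef>
#include <vector>

// Mirrors the GOT pair: GOT_S[0] is the module ID, GOT_S[1] the offset.
struct TlsIndex {
  std::size_t module_id;
  std::size_t offset;
};

// Toy per-thread state: block_of_module[m] is the address of this thread's
// TLS block for module m. Index 0 is unused because module IDs start at 1.
struct ToyThread {
  std::vector<unsigned char*> block_of_module;
};

// Assumes the block for the requested module already exists.
void* toy_tls_get_addr(const ToyThread& self, const TlsIndex& got_s) {
  return self.block_of_module[got_s.module_id] + got_s.offset;
}
```

A symbol at offset 4 in module 2 therefore resolves to four bytes past the start of the calling thread's block for module 2.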

TLSDESC is an alternative ABI for global-dynamic (and local-dynamic) access in which a different pair of GOT slots is used. The first GOT slot contains a function pointer. The second contains some dynamic linker defined auxiliary data. This gives the dynamic linker a choice over which function is called depending on circumstance.
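
The shape of a TLSDESC slot pair can be sketched as follows. This is a toy under stated assumptions: the struct and function names are hypothetical, the resolver is modeled as returning an offset from $tp, and the real calling convention (which is architecture specific) is ignored.

```cpp
#include <cstddef>

// Toy TLSDESC slot pair: a resolver function plus linker-defined data.
struct ToyTlsDesc {
  std::ptrdiff_t (*resolver)(const ToyTlsDesc*);
  std::ptrdiff_t arg;
};

// A resolver the toy linker might install when the variable lives in the
// static TLS set: the auxiliary data already is the offset from $tp.
std::ptrdiff_t resolve_static(const ToyTlsDesc* desc) { return desc->arg; }

// Access through the descriptor: call the function in the first slot and
// add its result to $tp.
unsigned char* toy_tlsdesc_access(unsigned char* tp, const ToyTlsDesc& desc) {
  return tp + desc.resolver(&desc);
}
```

Because the function pointer lives in the slot, a different resolver can be installed per symbol without changing the code at the access site.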

In both cases the calls to these functions must be implemented by a specific code sequence and a specific set of relocs. This allows the linker to recognize these accesses and potentially relax them to the local-dynamic access model.

(NOTE: The following paragraph contains details about how the compiler upholds its end of the ABI. Skip this paragraph if you don't care about that.)

For the compiler to emit code for this access model, a call needs to be emitted against __tls_get_addr (defined by the dynamic linker) along with a reference to the symbol name. Specifically, the compiler emits code for __tls_get_addr(GOT_S) (minding the additional relocation needed for the GOT itself). The linker then emits two dynamic relocations when generating the GOT. On x86_64 these are R_X86_64_DTPMOD64 and R_X86_64_DTPOFF64. On AArch64 these are R_AARCH64_TLS_DTPMOD and R_AARCH64_TLS_DTPREL. These relocations reference the symbol regardless of whether the module defines a symbol by that name.

Local Dynamic

Local dynamic is the same as Global Dynamic but for local symbols. It can be thought of as a single global-dynamic access to the TLS block of this module. Because every variable defined in the module is at a fixed offset from the start of that TLS block, the compiler can then optimize multiple global-dynamic calls into one. The compiler will relax a global-dynamic access to a local-dynamic access whenever the variables are local/static or have hidden visibility. The linker may sometimes be able to relax some global-dynamic accesses to local-dynamic as well.

The following gives an example of how the compiler might emit code for this access model:


static thread_local char buf[buf_cap];
static thread_local size_t buf_size = 0;
while(*str && buf_size < buf_cap) {
  buf[buf_size++] = *str++;
}

might be lowered to


// GOT_module[0] is the module ID of this module
// GOT_module[1] is just 0
// <X> denotes the offset of X in this module's TLS block
tls = __tls_get_addr(GOT_module);
while(*str && *(size_t*)(tls+<buf_size>) < buf_cap) {
  ((char*)(tls+<buf>))[(*(size_t*)(tls+<buf_size>))++] = *str++;
}

If this code used global dynamic it would have to make at least 2 calls, one to get the pointer for buf and the other to get the pointer for buf_size.
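
The saving can be made visible with a runnable toy in which a counter stands in for the cost of calling __tls_get_addr. Everything here is hypothetical scaffolding: ModuleTls models this module's TLS block, and toy_get_module_block stands in for __tls_get_addr(GOT_module).

```cpp
#include <cstddef>

constexpr std::size_t buf_cap = 8;

// Toy TLS block for this module: both variables at fixed offsets.
struct ModuleTls {
  char buf[buf_cap];
  std::size_t buf_size;
};

int lookup_calls = 0;  // counts calls to the stand-in lookup function
thread_local ModuleTls module_tls = {};

// Stand-in for __tls_get_addr(GOT_module): returns the block's base.
ModuleTls* toy_get_module_block() {
  ++lookup_calls;
  return &module_tls;
}

// Local-dynamic style: one lookup, then fixed offsets from the base.
std::size_t append_local_dynamic(const char* str) {
  ModuleTls* tls = toy_get_module_block();
  while (*str && tls->buf_size < buf_cap) {
    tls->buf[tls->buf_size++] = *str++;
  }
  return tls->buf_size;
}
```

Each call to append_local_dynamic performs exactly one lookup no matter how many characters it copies or how many variables in the block it touches.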

Initial Exec

This access model can be used anytime the compiler knows the module that the symbol being accessed is defined in will be loaded in the initial set of executables rather than opened using dlopen. This access model is generally only used when the main executable is accessing a global symbol with default visibility. This is because compiling an executable is the only time the compiler knows that any code generated will be in the initial executable set. If a DSO is compiled to make thread local accesses use this model then the DSO cannot be safely opened with dlopen. This is acceptable in performance critical applications and in cases where you know the binary will never be dlopen-ed such as in the case of libc. Modules compiled/linked this way have their DF_STATIC_TLS flag set.

Initial Exec is the default when compiling without -fPIC.

The compiler emits code for this access model without calling __tls_get_addr at all. Instead it uses a single GOT entry per symbol, denoted GOT[s] for symbol s, and emits relocations so that this entry holds the offset of s from $tp. For example,


extern thread_local int a;
extern thread_local int b;
int main() {
  return a + b;
}

would be lowered to something like the following


int main() {
  return *(int*)($tp + GOT[a]) + *(int*)($tp + GOT[b]);
}

Note that on x86 architectures GOT[s] will actually resolve to a negative value.
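
The lowered access can be imitated with an ordinary byte array standing in for the TLS region. This is a sketch only: load_initial_exec is a hypothetical helper, and the GOT entries are passed in as plain signed offsets.

```cpp
#include <cstddef>
#include <cstring>

// Reads an int located at $tp + got_entry, as in the lowering above.
// got_entry is signed because it can be negative.
int load_initial_exec(const unsigned char* tp, std::ptrdiff_t got_entry) {
  int value;
  std::memcpy(&value, tp + got_entry, sizeof value);
  return value;
}
```

With tp pointing into the middle of a buffer and the two variables stored just below it, negative entries such as -8 and -4 reproduce the x86 arrangement described in the note above.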

Local Exec

This is the fastest access model and can only be used if the symbol is in the first TLS block, which is the TLS block of the main executable. In practice only the main executable can use this access model because a shared library can't (and normally wouldn't need to) know if it is accessing something from the main executable. The linker will relax initial-exec to local-exec. The compiler can't do this without explicit instructions via -ftls-model or __attribute__((tls_model("..."))) because the compiler cannot know if the current translation unit is going to be linked into a main executable or a shared library.

The precise details of how the offset from $tp is computed change a bit from architecture to architecture.

example code:

static thread_local int a;
static thread_local int b;

int main() {
  return a + b;
}

would be lowered to

int main() {
  return *(int*)($tp+TPOFF_a) + *(int*)($tp+TPOFF_b);
}

On AArch64 TPOFF_a == max(16, p_align) + <a> where p_align is exactly the p_align field of the main executable's PT_TLS segment and <a> is the offset of a from the beginning of the main executable's TLS segment.

On x86_64 TPOFF_a == -<a> where <a> is the offset of a from the end of the main executable's TLS segment.

The linker is aware of what TPOFF_X is for any given X and fills in this value.
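
The two formulas can be written out directly. The function names are hypothetical, and offsets are passed in as plain integers rather than being read from an ELF file.

```cpp
#include <algorithm>
#include <cstdint>

// AArch64: TPOFF = max(16, p_align) + offset from the start of the
// main executable's TLS segment.
constexpr std::int64_t tpoff_aarch64(std::int64_t p_align,
                                     std::int64_t offset_from_start) {
  return std::max<std::int64_t>(16, p_align) + offset_from_start;
}

// x86_64: TPOFF = -(offset measured back from the end of the
// main executable's TLS segment).
constexpr std::int64_t tpoff_x86_64(std::int64_t offset_from_end) {
  return -offset_from_end;
}
```

For instance, with p_align of 8 and a variable 4 bytes into the segment, the AArch64 offset is 20, while a variable 8 bytes before the end of the segment on x86_64 has offset -8.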

Implementation

This section discusses the implementation as it is done on Fuchsia. That said, the broad strokes here are similar across different libc implementations including musl and glibc.

The actual implementation of all of this introduces a few more details. Namely the so-called “DTV” (Dynamic Thread Vector) (denoted dtv in this doc) which indexes TLS blocks by module ID. The following diagram shows what the initial executable set looks like. In Fuchsia's implementation we actually store a bunch of meta information in a thread descriptor struct along with the ABI TCB (denoted tcb below). In our implementation we use the first 8 bytes of this space to point to the DTV. At first tcb points to dtv as shown in the below diagrams but after a dlopen this can change.

arm64:

*------------------------------------------------------------------------------*
| thread | tcb | X | tls1 | ... | tlsN | ... | tls_cnt | dtv[1] | ... | dtv[N] |
*------------------------------------------------------------------------------*
^         ^         ^             ^            ^
td        tp      dtv[1]       dtv[n+1]       dtv

Here X has size max(16, tls_align) - 16 where tls_align is the maximum alignment of all loaded TLS segments from the static TLS set. This is set by the static linker since the static linker resolves TPOFF_* values. This padding is chosen so that if, as required, $tp is aligned to the main executable's PT_TLS segment's p_align value then tls1 - $tp will be max(16, p_align). This ensures that there is always at least a 16 byte space for the ABI TCB (denoted tcb in the diagram above).
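
The padding follows from the requirement that tls1 - $tp equals max(16, p_align) together with the 16-byte ABI TCB that starts at $tp. A small sketch of that arithmetic, with hypothetical names and a single alignment value for simplicity:

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t kAbiTcbSize = 16;

// Required distance from $tp to tls1.
constexpr std::size_t tls1_offset_from_tp(std::size_t p_align) {
  return std::max<std::size_t>(kAbiTcbSize, p_align);
}

// Padding X between the end of the ABI TCB and tls1.
constexpr std::size_t padding_x(std::size_t p_align) {
  return tls1_offset_from_tp(p_align) - kAbiTcbSize;
}
```

For alignments of 16 or less no padding is needed, while an alignment of 64 requires 48 bytes of padding after the ABI TCB.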

x86:

*-----------------------------------------------------------------------------*
| tls_cnt | dtv[1] | ... | dtv[N] | ... | tlsN | ... | tls1 | tcb |  thread   |
*-----------------------------------------------------------------------------*
^                                       ^             ^       ^
dtv                                  dtv[n+1]       dtv[1]  tp/td

Here td denotes the “thread descriptor pointer”. In both implementations this points to the thread descriptor. A subtle point not made apparent in these diagrams is that tcb is actually a member of the thread descriptor struct in both cases but on AArch64 it is the last member and on x86_64 it is the first member.

dlopen

This picture explains what happens for the initial executables but it doesn't explain what happens in the dlopen case. When __tls_get_addr is called it first checks whether tls_cnt is large enough that the module ID (given by GOT_s[0]) is within the dtv. If it is, then it simply looks up dtv[GOT_s[0]] + GOT_s[1]. If it isn't, something more complicated happens; see the implementation of __tls_get_new in dynlink.c. In a nutshell, a sufficiently large space was already allocated for a larger dtv on a call to dlopen. It is an invariant of the system that sufficient space will always exist somewhere already allocated. The larger space is then set up to be a proper dtv. tcb is then set to point to this new larger dtv. Future accesses will then use the simpler code path since tls_cnt will be large enough.
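
The fast path and the switch to a larger dtv can be modeled with a toy. This is not the code in dynlink.c: all names are hypothetical, the larger dtv is assumed to have been prepared (including the new module's block) before the lookup, and a counter records how often the slow path runs.

```cpp
#include <cstddef>
#include <vector>

// Toy dtv: blocks[m] is the TLS block for module m; slot 0 is unused.
struct ToyDtv {
  std::size_t tls_cnt = 0;
  std::vector<unsigned char*> blocks;
};

struct ToyThreadState {
  ToyDtv* dtv = nullptr;    // what tcb points at
  ToyDtv* spare = nullptr;  // larger dtv, assumed preallocated during dlopen
  int slow_path_hits = 0;
};

unsigned char* toy_get_addr(ToyThreadState& self, std::size_t module_id,
                            std::size_t offset) {
  if (module_id > self.dtv->tls_cnt) {
    // Slow path: carry over existing entries, then switch to the larger dtv.
    for (std::size_t m = 1; m <= self.dtv->tls_cnt; ++m) {
      self.spare->blocks[m] = self.dtv->blocks[m];
    }
    self.dtv = self.spare;
    ++self.slow_path_hits;
  }
  // Fast path: dtv[module_id] + offset.
  return self.dtv->blocks[module_id] + offset;
}
```

The first access to a module loaded after startup takes the slow path once; later accesses to any module find tls_cnt large enough and go straight to the lookup.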