The ELF Thread Local Storage ABI (TLS) is a storage model for variables that allows each thread to have a unique copy of a global variable. This model is used to implement C++'s thread_local
storage model. On thread creation the variable will be given its initial value from the initial TLS image. TLS variables are for instance useful as buffers in thread safe code or for per thread book keeping. C style errors like errno or dlerror can also be handled this way.
TLS variables are much like any other global/static variable. In implementation their initial data winds up in the PT_TLS
segment. The PT_TLS
segment is inside of a read only PT_LOAD
segment despite TLS variables being writable. This segment is then copied into the process for each thread in a unique writable location. The location the PT_TLS
segment is copied to is influenced by the segment's alignment to ensure that the alignment of TLS variables is respected.
The actual interface that the compiler, linker, and dynamic linker must adhere to is actually quite simple despite the details of the implementation being more complex. The compiler and the linker must emit code and dynamic relocations that use one of the 4 access models (described in a following section). The dynamic linker and thread implementation must then set everything up so that this actually works. Different architectures have different ABIs but they're similar enough at broad strokes that we can speak about most of them as if there was just one ABI. This document will assume that either x86-64 or AArch64 is being used and will point out differences when they occur.
The TLS ABI makes use of a few terms:
$tp
in this document. $tp
is what __builtin_thread_pointer()
returns on AArch64. On AArch64 $tp
is given by a special register named TPIDR_EL0
that can be accessed using mrs <reg>, TPIDR_EL0
. On x86_64
the fs.base
segment base is used and can be accessed with %fs:
and can be loaded from %fs:0
or rdfsbase
instruction.PT_TLS
program header in each module. Not every module has a PT_TLS
program header and thus not every module has a TLS segment. Each module has at most one TLS segment and correspondingly at most one PT_TLS
program header.DT_NEEDED
. Modules that require being in the Static TLS set have DF_STATIC_TLS
set on their DT_FLAGS
entry in their dynamic table (given by the PT_DYNAMIC
segment).$tp
will point to some point in this region. It contains the TLS segment of every module in Static TLS set as well as some implementation-private data, which is sometimes called the TCB (Thread Control Block). On AArch64 a 16-byte reserved space starting at $tp
is also sometimes called the TCB. We will refer to this space as the “ABI TCB” in this doc.$tp
via the Local Exec model described below.To comply with the ABI all access models must be supported.
There are 4 access models specified by the ABI:
global-dynamic
local-dynamic
initial-exec
local-exec
These are the values that can be used for -ftls-model=...
and __attribute__((tls_model("...")))
Which model is used relates to:
dlopen
local-*
)global-*
)global-dynamic
Can be used from anywhere, for any variable.local-dynamic
Can be used by any module, for any variable defined in that same module.initial-exec
Can be used by any module for any variable defined in the static TLS set.local-exec
Can be used by the main executable for variables defined in the main executable.Global dynamic is the most general access format. It is also the slowest. Any thread-local global variable should be accessible with this method. This access model must be used if a dynamic library accesses a symbol defined in another module (see exception in section on Initial Exec). Symbols defined within the executable need not use this access model. The main executable can also avoid using this access model. This is the default access model when compiling with -fPIC
as is the norm for shared libraries.
This access model works by calling a function defined in the dynamic linker. There are two ways functions might be called, via TLSDESC, or via __tls_get_addr
.
In the case of __tls_get_addr
it is passed the pair of GOT
entries associated with this symbol. Specifically it is passed the pointer to the first and the second entry comes right after it. For a given symbol S
, the first entry, denoted GOT_S[0]
, must contain the Module ID of the module in which S
was defined. The second entry, denoted GOT_S[1]
, must contain offset into TLS Block, which is the same as the offset of the symbol in the PT_TLS
segment of the associated module. The pointer to S
is then computed using __tls_get_addr(GOT_S)
. The implementation of __tls_get_addr
will be discussed later.
TLSDESC is an alternative ABI for global-dynamic
access (and local-dynamic
) where a different pair of GOT
slots are used where the first GOT
slot contains a function pointer. The second contains some dynamic linker defined auxiliary data. This allows the dynamic linker a choice over which function is called depending on circumstance.
In both cases the calls to these functions must be implemented by a specific code sequence and a specific set of relocs. This allows the linker to recognize these accesses and potentially relax them to the local-dynamic
access model.
(NOTE: The following paragraph contains details about how the compiler upholds its end of the ABI. Skip this paragraph if you don't care about that.)
For the compiler to emit code for this access model a call needs to be emitted against __tls_get_addr
(defined by the dynamic linker) and a reference to the symbol name. Specifically the compiler the emits code for (minding the additional relocation needed for the GOT itself) __tls_get_addr(GOT_S)
. The linker then emits two dynamic relocations when generating the GOT. On x86_64
these are R_X86_64_DTPMOD
and R_X86_64_DTPOFF
. On AArch64 these are R_AARCH64_DTPMOD
and R_AARCH64_DTPOFF
. These relocations reference the symbol regardless of whether or not the module defines a symbol by that name or not.
Local dynamic the same as Global Dynamic but for local symbols. It can be thought of as a single global-dynamic
access to the TLS block of this module. Then because every variable defined in the module is at fixed offsets from the TLS block the compiler can optimize multiple global-dynamic
calls into one. The compiler will relax a global-dynamic
access to a local-dynamic
access whenever the variables are local/static or have hidden visibility. The linker may sometimes be able to relax some global-dynamic
accesses to local-dynamic
as well.
The following gives an example of how the compiler might emit code for this access model:
static thread_local char buf[buf_cap]; static thread_local size_t buf_size = 0; while(*str && buf_size < buf_cap) { buf[buf_size++] = *str++; }
might be lowered to
// GOT_module[0] is the module ID of this module // GOT_module[1] is just 0 // <X> denotes the offset of X in this module's TLS block tls = __tls_get_addr(GOT_module) while(*str && *(size_t*)(tls+<buf_size>) < buf_cap) { (char*)(tls+<buf>)[*(size_t*)(tls+<buf_size>)++] = *str++; }
If this code used global dynamic it would have to make at least 2 calls, one to get the pointer for buf and the other to get the pointer for buf_size
.
This access model can be used anytime the compiler knows the module that the symbol being accessed is defined in will be loaded in the initial set of executables rather than opened using dlopen
. This access model is generally only used when the main executable is accessing a global symbol with default visibility. This is because compiling an executable is the only time the compiler knows that any code generated will be in the initial executable set. If a DSO is compiled to make thread local accesses use this model then the DSO cannot be safely opened with dlopen
. This is acceptable in performance critical applications and in cases where you know the binary will never be dlopen-ed such as in the case of libc. Modules compiled/linked this way have their DF_STATIC_TLS
flag set.
Initial Exec is the default when compiling without -fPIC
.
The compiler emits code without even calling __tls_get_addr
for this access model. It does so using a single GOT entry, which we'll denote GOT_s
for symbol s
, for which the compiler emits relocations, to ensure that
extern thread_local int a; extern thread_local int b; int main() { return a + b; }
would be lowered to something like the following
int main() { return *(int*)($tp + GOT[a]) + *(int*)($tp + GOT[b]); }
Note that on x86 architectures GOT[s]
will actually resolve to a negative value.
This is the fastest access model and can only be used if the symbol is in the first TLS block, which is the TLS block of the main executable. In practice only the main executable can use this access mode because any shared library can‘t (and normally wouldn’t need to) know if it is accessing something from the main executable. The linker will relax initial-exec
to local-exec
. The compiler can't do this without explicit instructions via -ftls-model
or __attribute__((tls_model("...")))
because the compiler cannot know if the current translation unit is going to be linked into a main executable or a shared library.
The precise details of how this offset is computed changes a bit from architecture to architecture.
example code:
static thread_local int a; static thread_local int b; int main() { return a + b; }
would be lowered to
int main() { return (int*)($tp+TPOFF_a) + (int*)($tp+TPOFF_b)); }
On AArch64 TPOFF_a == max(16, p_align) + <a>
where p_align
is exactly the p_align
field of the main executable‘s PT_TLS
segment and <a>
is the offset of a
from the beginning of the main executable’s TLS segment.
On x86_64
TPOFF_a == -<a>
where <a>
is the offset of the a
from the end of the main executable's TLS segment.
The linker is aware of what TPOFF_X
is for any given X
and fills in this value.
This section discusses the implementation as it is implemented on Fuchsia. This said the broad strokes here are widely similar across different libc implementations including musl and glibc.
The actual implementation of all of this introduces a few more details. Namely the so-called “DTV” (Dynamic Thread Vector) (denoted dtv
in this doc), which indexes TLS blocks by module ID. The following diagram shows what the initial executable set looks like. In Fuchsia's implementation we actually store a bunch of meta information in a thread descriptor struct along with the ABI TCB (denoted tcb
below). In our implementation we use the first 8 bytes of this space to point to the DTV. At first tcb
points to dtv
as shown in the below diagrams but after a dlopen this can change.
arm64:
*------------------------------------------------------------------------------* | thread | tcb | X | tls1 | ... | tlsN | ... | tls_cnt | dtv[1] | ... | dtv[N] | *------------------------------------------------------------------------------* ^ ^ ^ ^ ^ td tp dtv[1] dtv[n+1] dtv
Here X
has size min(16, tls_align) - 16
where tls_align
is the maximum alignment of all loaded TLS segments from the static TLS set. This is set by the static linker since the static linker resolves TPOFF_*
values. This padding is set that so that if, as required, $tp
is aligned to main executable‘s PT_TLS
segment’s p_align
value then tls1 - $tp
will be max(16, p_align)
. This ensures that there is always at least a 16 byte space for the ABI TCB (denoted tcb
in the diagram above).
x86:
*-----------------------------------------------------------------------------* | tls_cnt | dtv[1] | ... | dtv[N] | ... | tlsN | ... | tls1 | tcb | thread | *-----------------------------------------------------------------------------* ^ ^ ^ ^ dtv dtv[n+1] dtv[1] tp/td
Here td
denotes the “thread descriptor pointer”. In both implementations this points to the thread descriptor. A subtle point not made apparent in these diagrams is that tcb
is actually a member of the thread descriptor struct in both cases but on AArch64 it is the last member and on x86_64
it is the first member.
This picture explains what happens for the initial executables but it doesn‘t explain what happens in the dlopen
case. When __tls_get_addr
is called it first checks to see if tls_cnt
is such that the module ID (given by GOT_s[0]
) is within the dtv
. If it is then it simply looks up dtv[GOT_s[0]] + GOT_s[1]
but if it isn’t something more complicated happens. See the implementation of __tls_get_new
in dynlink.c.
In a nutshell a sufficiently large space was already allocated for a larger dtv
on a call to dlopen
. It is an invariant of the system that sufficient space will always exist somewhere already allocated. The larger space is then setup to be a proper dtv
. tcb
is then set to point to this new larger dtv
. Future accesses will then use the simpler code path since tls_cnt
will be large enough.