Thread Local Storage
# What and Why
Thread Local Storage (TLS) provides per-thread global state without synchronization overhead. Each thread gets its own copy of a TLS variable, eliminating data races while maintaining the convenience of global-like access patterns.
Classic use cases include:
- errno: The quintessential example. System calls set errno, but in a multi-threaded program each thread needs its own errno to avoid races.
- Allocator caches: Thread-local free lists avoid lock contention in malloc implementations (tcmalloc, jemalloc).
- Random number generator state: Thread-local RNG state avoids locking on every random number generation.
- Scratch buffers: Thread-local temporary buffers avoid repeated allocation.
Thread Memory Layout: TLS gives each thread private storage

```
+------------------+
| Code (shared)    |   All threads execute same code
+------------------+
| Heap (shared)    |   Shared data (requires synchronization)
+------------------+
         |
         v
+-----------+-----------+-----------+
| Thread 1  | Thread 2  | Thread 3  |
+-----------+-----------+-----------+
|   Stack   |   Stack   |   Stack   |   Private (automatic)
| TLS Block | TLS Block | TLS Block |   Private (TLS)
|  - errno  |  - errno  |  - errno  |
|  - rng    |  - rng    |  - rng    |
+-----------+-----------+-----------+
```

Each thread's TLS block contains its own copy of TLS variables. No synchronization needed - threads can't see each other's TLS.
# C/C++ Basics
In C and C++, TLS variables are declared with storage class specifiers:
```c
// GCC extension (works in C and C++)
__thread int counter = 0;

// C11 standard
_Thread_local int counter = 0;

// C++11 standard
thread_local int counter = 0;
```
Initialization constraints differ:
- __thread (C/GCC): Only constant initializers allowed. No constructors, no function calls, no runtime computation.
- thread_local (C++11): Dynamic initialization permitted. Constructors run on first access by each thread. Destructors run at thread exit.
```c
// C with __thread: only constants allowed
__thread int x = 42;               // OK
__thread int y = some_function();  // ERROR: not a constant
__thread MyClass obj;              // ERROR: has constructor

// C++ with thread_local: dynamic init OK
thread_local int x = compute();    // OK: runs once per thread
thread_local MyClass obj(args);    // OK: constructed per thread
```
# ELF TLS Models and Linkage
How TLS variables are accessed depends on the TLS model chosen by the compiler and linker. This is where linkage significantly impacts TLS behavior and performance. There are four models, trading off generality against efficiency:
ELF TLS Models: Performance vs Flexibility Tradeoff

| Model | Access Cost | dlopen Compatible? | Use Case |
|---|---|---|---|
| Local-exec | 1 instruction | No | Main executable only |
| Initial-exec | 1-2 instructions | No | Libs loaded at startup |
| Local-dynamic | Function call (*) | Yes | Multiple TLS vars in lib |
| General-dynamic | Function call | Yes | Any TLS anywhere |

(*) Cached per function - one call gets base, then offsets added
Local-exec is the fastest model. The compiler knows the TLS variable is in the main executable, so it can compute the offset at link time and emit a single instruction accessing %fs:offset directly.
Initial-exec works for shared libraries loaded at program startup (not via dlopen). The offset isn't known until load time, so it's stored in the GOT and requires a GOT lookup, but still avoids function calls.
General-dynamic is the most flexible but slowest. It works for any TLS variable
anywhere, including in dlopen'd libraries. Each access calls __tls_get_addr() to
resolve the address at runtime.
Local-dynamic optimizes the case where a function accesses multiple TLS variables
from the same module. One __tls_get_addr() call gets the module's TLS base address,
then individual variables are accessed via offsets from that base.
Access Code Comparison (x86-64):

```
Local-exec (fastest):
    mov %fs:tls_var@TPOFF, %eax       # Single instruction, offset known at link time

Initial-exec:
    mov tls_var@GOTTPOFF(%rip), %rax  # Load offset from GOT
    mov %fs:(%rax), %eax              # Access via offset

General-dynamic (slowest):
    lea tls_var@TLSGD(%rip), %rdi     # Load TLS descriptor address
    call __tls_get_addr@PLT           # Function call!
    mov (%rax), %eax                  # Dereference result
```
How the compiler chooses: By default, compilers use general-dynamic for
shared libraries (safest) and initial-exec or local-exec for executables. You can override
with -ftls-model=:
```sh
# Force a specific TLS model
gcc -ftls-model=local-exec ...      # Only for main executable
gcc -ftls-model=initial-exec ...    # Won't work if dlopen'd
gcc -ftls-model=local-dynamic ...   # Optimizes multi-var access
gcc -ftls-model=global-dynamic ...  # Default for shared libs
```
# TLS Access Under the Hood
On x86-64 Linux, TLS works through the %fs segment register. Each thread has its %fs base pointing to its Thread Control Block (TCB), which contains or points to that thread's TLS data.
TLS Access via Segment Register (x86-64):

```
%fs register (per-thread, set by kernel on context switch)
   |
   v
+------+------+------+------+
| TCB  | exe  | lib1 | lib2 |   <-- Thread's TLS block
+------+------+------+------+
   |
   +-- offset ---> [errno]

Access: mov %fs:offset, %rax
```

The offset is either:
- Computed at link time (local-exec)
- Loaded from GOT at runtime (initial-exec)
- Returned by __tls_get_addr (general/local-dynamic)
# __tls_get_addr and the Dynamic Thread Vector
For general-dynamic and local-dynamic TLS, the runtime function __tls_get_addr
resolves TLS addresses. Understanding how it works explains why it's slower and why it
enables dlopen compatibility.
Dynamic Thread Vector (DTV): per-thread module index

Each thread has a DTV - an array of pointers to TLS blocks:

```
Thread Pointer (%fs)
   |
   v
+-----+-----+-----+-----+-----+
| TCB | gen | [1] | [2] | [3] | ...   <-- DTV
+-----+-----+-----+-----+-----+
         |     |     |     |
         |     |     |     +--> TLS block for libbar.so (or NULL)
         |     |     +--------> TLS block for libfoo.so
         |     +--------------> TLS block for executable (always module 1)
         +--------------------> Generation counter
```

Module IDs are assigned by the dynamic linker:
- Executable is always module 1
- Libraries get IDs as they're loaded
How __tls_get_addr works:
```c
// Compiler generates a call like:
void *addr = __tls_get_addr(&tls_index);

// tls_index contains:
struct tls_index {
    unsigned long module_id;  // Which module? (filled by dynamic linker)
    unsigned long offset;     // Offset within module's TLS block
};

// __tls_get_addr does:
//   1. Get this thread's DTV
//   2. Look up DTV[module_id] to find the TLS block pointer
//   3. If NULL, allocate the TLS block (lazy allocation in glibc)
//   4. Return TLS_block + offset
```
__tls_get_addr(module=3, offset=16) lookup:

```
Thread's DTV                     TLS Blocks
+-----+-----+-----+-----+
| gen | [1] | [2] | [3] |---------+
+-----+-----+-----+-----+         |
                                  v
                             +----------+
                             | TLS for  |
                             | module 3 |
                             +----------+
                             | var_a    |  offset 0
                             | var_b    |  offset 8
                             | var_c    |  offset 16  <-- returns this address
                             +----------+
```
The generation counter trick: When dlopen loads a new library with TLS,
the DTV may need to grow. Instead of updating every thread's DTV immediately (expensive!),
the loader increments a global generation counter. When __tls_get_addr is called,
it compares the thread's DTV generation to the global one. If stale, it reallocates and
updates the DTV. This lazy update avoids touching threads that never access the new library's TLS.
# Static TLS Exhaustion
The infamous error "cannot allocate memory in static TLS block" occurs when
dlopen'ing a library that uses initial-exec TLS model, but there's insufficient space
reserved in the static TLS block.
Static TLS Block Exhaustion Problem:

At program startup, the static TLS block is allocated with fixed size:

```
Static TLS Block (fixed at startup, e.g., ~4KB total)
<------------------- allocated once, cannot grow ------------------->
+--------+--------+-----------+---------------------------------+
|  exe   |  libc  | libpthread|              slack              |
|  256B  |  512B  |   1024B   |        ~2304B (surplus)         |
+--------+--------+-----------+---------------------------------+
                              ^
                              Available for dlopen'd libs
                              using initial-exec TLS

After dlopen("libfoo.so") with 1KB initial-exec TLS:
+--------+--------+-----------+--------+------------------------+
|  exe   |  libc  | libpthread| libfoo |         slack          |
|  256B  |  512B  |   1024B   |  1024B |        ~1280B          |
+--------+--------+-----------+--------+------------------------+
                                       ^
                                       Still ~1280B available

After dlopen("libbar.so") requesting 2KB initial-exec TLS:
+--------+--------+-----------+--------+------------------------+
|  exe   |  libc  | libpthread| libfoo |XXXXXXXXXXXXXXXXXXXXXXXX|
|  256B  |  512B  |   1024B   |  1024B |   Only 1280B slack!    |
+--------+--------+-----------+--------+------------------------+
                                       ^
                                       Need 2048B, have 1280B

ERROR: "cannot allocate memory in static TLS block"
```

The static block CANNOT be resized after program start. Libraries needing more space must use the general-dynamic model.
Typical surplus sizes:
- glibc: ~1536 bytes default surplus, tunable since glibc 2.33 via the glibc.rtld.optional_static_tls tunable (set through the GLIBC_TUNABLES environment variable).
- musl: No surplus at all. Libraries using initial-exec TLS cannot be dlopen'd on musl.
Mitigation strategies:
```c
// BAD: 4KB in static TLS per thread
__thread char buffer[4096];

// GOOD: Only 8 bytes in static TLS, allocate on demand
__thread char *buffer;

void ensure_buffer(void) {
    if (!buffer) {
        buffer = malloc(4096);  // Heap allocation, not TLS
    }
}
```
Diagnosing TLS issues:
```sh
# See TLS allocation during library loading
LD_DEBUG=files ./myprogram 2>&1 | grep -i tls

# Check a library's TLS model
readelf -d libfoo.so | grep -i tls
objdump -R libfoo.so | grep TLS
```
# glibc vs musl: Critical Differences
Code that works on glibc may fail on musl (used by Alpine Linux and many embedded systems). The differences stem from fundamentally different design philosophies:
| Aspect | glibc | musl |
|---|---|---|
| Static TLS surplus | ~1-2KB reserved | None |
| dlopen + initial-exec | May work (until exhausted) | Always fails |
| dlclose behavior | Unloads library | No-op (permanent) |
| TLS allocation for dlopen | Lazy (on first access) | Upfront (at dlopen) |
| __tls_get_addr | Not async-signal-safe | Async-signal-safe |
musl's philosophy: Pre-allocate everything so failures happen early (at dlopen) rather than late (random crash when TLS is first accessed). The tradeoff is stricter constraints on what can be dlopen'd, and libraries are permanent once loaded.
# Gotchas and Pitfalls
- Memory overhead: N threads × M bytes per TLS variable. A 1KB TLS buffer with 100 threads = 100KB. Use pointers and lazy allocation for large data.
- Destruction order (C++): thread_local destructors run at thread exit, but the order between different TLS variables is unspecified. If destructor A accesses TLS variable B, B might already be destroyed. Avoid cross-TLS-variable dependencies in destructors.
- fork() copies TLS: When you fork(), the child gets a copy of the parent's TLS values. This can break assumptions (e.g., TLS holding a thread ID or file descriptor that's now stale in the child).
- Signal handlers: Accessing TLS in signal handlers is generally safe (the signal runs on some thread and sees that thread's TLS). However, on glibc, __tls_get_addr is not async-signal-safe, so accessing TLS that requires dynamic resolution for the first time in a signal handler can deadlock.
- dlopen with RTLD_LOCAL vs RTLD_GLOBAL: Affects symbol visibility, which can impact TLS resolution if libraries have interdependencies.
- Cross-compilation: TLS model availability varies by platform. Code assuming initial-exec may fail on platforms with different TLS implementations.
# Rust Specifics
Rust provides TLS through two mechanisms with different tradeoffs:
thread_local! macro (stable):
```rust
use std::cell::Cell;

thread_local! {
    static COUNTER: Cell<u32> = Cell::new(0);
}

fn main() {
    // Access requires .with() closure - cannot get direct reference
    COUNTER.with(|c| {
        c.set(c.get() + 1);
        println!("Counter: {}", c.get());
    });
}
```
The .with() pattern exists because thread_local! uses lazy
initialization. The closure ensures the TLS is initialized before access and prevents
returning references that could outlive the thread.
thread_local! access via .with():

```
COUNTER.with(|c| ...)
   |
   v
LocalKey::with()
   |
   +--> Is this thread's slot initialized?
          |
          +--> No:  Run initializer (Cell::new(0))
          |        Store in TLS
          |        Then call closure with &Cell
          |
          +--> Yes: Call closure with &Cell directly
```
#[thread_local] attribute (unstable/nightly):
```rust
#![feature(thread_local)]
use std::cell::Cell;

#[thread_local]
static COUNTER: Cell<u32> = Cell::new(0);

fn main() {
    // Direct access - no closure needed
    COUNTER.set(COUNTER.get() + 1);
    println!("Counter: {}", COUNTER.get());
}
```
#[thread_local] maps directly to native TLS (like C's __thread),
giving single-instruction access on local-exec. The tradeoff: no lazy initialization,
only const initializers, and it's unstable.
Rust TLS gotchas:
- Destructors not guaranteed: If a thread panics or the process exits, TLS destructors may not run. Don't rely on them for critical cleanup.
- Performance difference: thread_local! adds function call overhead for the lazy-init check. For hot paths, this can matter. #[thread_local] avoids it but requires nightly.
- No direct references: thread_local! intentionally prevents getting &'static T to avoid lifetime issues. Use .with() or consider other patterns if you need persistent references.
# Debugging TLS
When TLS goes wrong, these tools help diagnose issues:
```sh
# GDB: Examine TLS variables
(gdb) info tls         # Show all TLS variables
(gdb) p my_tls_var     # Print a specific TLS variable
(gdb) p &my_tls_var    # Show its address

# Check TLS segment in binary
readelf -l ./mybinary | grep TLS

# See TLS relocations in a shared library
readelf -r libfoo.so | grep TLS

# Runtime TLS debugging
LD_DEBUG=files,bindings ./myprogram 2>&1 | grep -i tls

# Check which TLS model a library uses
objdump -d libfoo.so | grep -A2 '@tls'
```
Common symptoms and causes:
- "cannot allocate memory in static TLS block": dlopen'd library uses initial-exec TLS and the static block is exhausted. Recompile the library with -ftls-model=global-dynamic.
- TLS variable has wrong/stale value after fork: TLS was copied from the parent and contains the parent's state. Reinitialize in the child if needed.
- Crash in signal handler accessing TLS: First access triggered __tls_get_addr, which isn't async-signal-safe on glibc. Access the TLS variable once in normal code before installing the signal handler.
- TLS works in tests, fails in production: Tests link statically (local-exec), production uses dlopen (needs general-dynamic). Check TLS model consistency.