
Thread Local Storage

# What and Why

Thread Local Storage (TLS) provides per-thread global state without synchronization overhead. Each thread gets its own copy of a TLS variable, eliminating data races while maintaining the convenience of global-like access patterns.

Classic use cases include:

  • errno: The quintessential example. System calls set errno, but in a multi-threaded program each thread needs its own errno to avoid races.
  • Allocator caches: Thread-local free lists avoid lock contention in malloc implementations (tcmalloc, jemalloc).
  • Random number generator state: Thread-local RNG state avoids locking on every random number generation.
  • Scratch buffers: Thread-local temporary buffers avoid repeated allocation.

Thread Memory Layout: TLS gives each thread private storage

+------------------+
|   Code (shared)  |   All threads execute same code
+------------------+
|   Heap (shared)  |   Shared data (requires synchronization)
+------------------+
        |
        v
  +-----------+-----------+-----------+
  | Thread 1  | Thread 2  | Thread 3  |
  +-----------+-----------+-----------+
  | Stack     | Stack     | Stack     |   Private (automatic)
  | TLS Block | TLS Block | TLS Block |   Private (TLS)
  |  - errno  |  - errno  |  - errno  |
  |  - rng    |  - rng    |  - rng    |
  +-----------+-----------+-----------+

Each thread's TLS block contains its own copy of TLS variables.
No synchronization is needed because each thread only ever touches its own copy. (TLS is private by addressing, not by memory protection: a thread can still hand another thread a pointer to its TLS variable.)

# C/C++ Basics

In C and C++, TLS variables are declared with storage class specifiers:

// GCC extension (works in C and C++)
__thread int counter = 0;

// C11 standard
_Thread_local int counter = 0;

// C++11 standard
thread_local int counter = 0;

Initialization constraints differ:

  • __thread (C/GCC): Only constant initializers allowed. No constructors, no function calls, no runtime computation.
  • thread_local (C++11): Dynamic initialization permitted. Constructors run on first access by each thread. Destructors run at thread exit.

// C with __thread: only constants allowed
__thread int x = 42;              // OK
__thread int y = some_function(); // ERROR: not a constant
__thread MyClass obj;             // ERROR: has constructor

// C++ with thread_local: dynamic init OK
thread_local int x = compute();   // OK: runs once per thread
thread_local MyClass obj(args);   // OK: constructed per thread

# ELF TLS Models and Linkage

How TLS variables are accessed depends on the TLS model chosen by the compiler and linker. This is where linkage significantly impacts TLS behavior and performance. There are four models, trading off generality against efficiency:

ELF TLS Models: Performance vs Flexibility Tradeoff

Model             Access Cost        dlopen Compatible?   Use Case
---------------------------------------------------------------------------
Local-exec        1 instruction      No                   Main executable only
Initial-exec      1-2 instructions   No                   Libs loaded at startup
Local-dynamic     Function call*     Yes                  Multiple TLS vars in lib
General-dynamic   Function call      Yes                  Any TLS anywhere

* Cached per function - one call gets base, then offsets added

Local-exec is the fastest model. The compiler knows the TLS variable is in the main executable, so it can compute the offset at link time and emit a single instruction accessing %fs:offset directly.

Initial-exec works for shared libraries loaded at program startup (not via dlopen). The offset isn't known until load time, so it's stored in the GOT and requires a GOT lookup, but still avoids function calls.

General-dynamic is the most flexible but slowest. It works for any TLS variable anywhere, including in dlopen'd libraries. Each access calls __tls_get_addr() to resolve the address at runtime.

Local-dynamic optimizes the case where a function accesses multiple TLS variables from the same module. One __tls_get_addr() call gets the module's TLS base address, then individual variables are accessed via offsets from that base.

Access Code Comparison (x86-64):

Local-exec (fastest):
    mov   %fs:tls_var@TPOFF, %eax    # Single instruction, offset known at link time

Initial-exec:
    mov   tls_var@GOTTPOFF(%rip), %rax   # Load offset from GOT
    mov   %fs:(%rax), %eax               # Access via offset

General-dynamic (slowest):
    lea   tls_var@TLSGD(%rip), %rdi  # Load TLS descriptor address
    call  __tls_get_addr@PLT         # Function call!
    mov   (%rax), %eax               # Dereference result

How the compiler chooses: By default, compilers use general-dynamic for shared libraries (safest) and initial-exec or local-exec for executables. You can override with -ftls-model=:

# Force a specific TLS model
gcc -ftls-model=local-exec  ...   # Only for main executable
gcc -ftls-model=initial-exec ...  # Won't work if dlopen'd
gcc -ftls-model=local-dynamic ... # Optimizes multi-var access
gcc -ftls-model=global-dynamic ...# Default for shared libs

# TLS Access Under the Hood

On x86-64 Linux, TLS works through the %fs segment register. Each thread has its %fs base pointing to its Thread Control Block (TCB), which contains or points to that thread's TLS data.

TLS Access via Segment Register (x86-64):

  %fs register (per-thread, set by kernel on context switch)
       |
       v
+------+------+------+------+
| TCB  | exe  | lib1 | lib2 |  <-- Thread's TLS block
+------+------+------+------+
       |
       +-- offset --->  [errno]

Access: mov %fs:offset, %rax

The offset is either:
  - Computed at link time (local-exec)
  - Loaded from GOT at runtime (initial-exec)
  - Returned by __tls_get_addr (general/local-dynamic)

# __tls_get_addr and the Dynamic Thread Vector

For general-dynamic and local-dynamic TLS, the runtime function __tls_get_addr resolves TLS addresses. Understanding how it works explains why it's slower and why it enables dlopen compatibility.

Dynamic Thread Vector (DTV): Per-thread module index

Each thread has a DTV - an array of pointers to TLS blocks:

  Thread Pointer (%fs)
        |
        v
  +-----+-----+-----+-----+-----+
  | TCB | gen | [1] | [2] | [3] | ...   <-- DTV
  +-----+-----+-----+-----+-----+
          |     |     |     |
          |     |     |     +--> TLS block for libbar.so (or NULL)
          |     |     +------> TLS block for libfoo.so
          |     +------------> TLS block for executable (always module 1)
          +------------------> Generation counter

Module IDs are assigned by the dynamic linker:
  - Executable is always module 1
  - Libraries get IDs as they're loaded

How __tls_get_addr works:

// Compiler generates a call like:
void *addr = __tls_get_addr(&tls_index);

// tls_index contains:
struct tls_index {
    unsigned long module_id;  // Which module? (filled by dynamic linker)
    unsigned long offset;     // Offset within module's TLS block
};

// __tls_get_addr does:
1. Get this thread's DTV
2. Look up DTV[module_id] to find the TLS block pointer
3. If NULL, allocate the TLS block (lazy allocation in glibc)
4. Return TLS_block + offset

__tls_get_addr(module=3, offset=16) lookup:

   Thread's DTV                           TLS Blocks
  +-----+-----+-----+-----+
  | gen | [1] | [2] | [3] |---------+
  +-----+-----+-----+-----+         |
                                    v
                              +----------+
                              | TLS for  |
                              | module 3 |
                              +----------+
                              | var_a    | offset 0
                              | var_b    | offset 8
                              | var_c    | offset 16  <-- returns this address
                              +----------+

The generation counter trick: When dlopen loads a new library with TLS, the DTV may need to grow. Instead of updating every thread's DTV immediately (expensive!), the loader increments a global generation counter. When __tls_get_addr is called, it compares the thread's DTV generation to the global one. If stale, it reallocates and updates the DTV. This lazy update avoids touching threads that never access the new library's TLS.

# Static TLS Exhaustion

The infamous error "cannot allocate memory in static TLS block" occurs when dlopen'ing a library that uses initial-exec TLS model, but there's insufficient space reserved in the static TLS block.

Static TLS Block Exhaustion Problem:

At program startup, the static TLS block is allocated with fixed size:

  Static TLS Block (fixed at startup, e.g., ~4KB total)
  <------------------- allocated once, cannot grow ------------------->

  +--------+--------+-----------+---------------------------------+
  |  exe   |  libc  | libpthread|              slack              |
  | 256B   | 512B   |   1024B   |       ~2304B (surplus)          |
  +--------+--------+-----------+---------------------------------+
                                 ^
                                 Available for dlopen'd libs
                                 using initial-exec TLS

After dlopen("libfoo.so") with 1KB initial-exec TLS:

  +--------+--------+-----------+--------+------------------------+
  |  exe   |  libc  | libpthread| libfoo |         slack          |
  | 256B   | 512B   |   1024B   | 1024B  |       ~1280B           |
  +--------+--------+-----------+--------+------------------------+
                                          ^
                                          Still ~1280B available

After dlopen("libbar.so") requesting 2KB initial-exec TLS:

  +--------+--------+-----------+--------+------------------------+
  |  exe   |  libc  | libpthread| libfoo |XXXXXXXXXXXXXXXXXXXXXXXXX
  | 256B   | 512B   |   1024B   | 1024B  |   Only 1280B slack!
  +--------+--------+-----------+--------+------------------------+
                                          ^
                                          Need 2048B, have 1280B

  ERROR: "cannot allocate memory in static TLS block"

The static block CANNOT be resized after program start.
Libraries needing more space must use general-dynamic model.

Typical surplus sizes:

  • glibc: ~1536 bytes default surplus. Tunable since glibc 2.33 via the glibc.rtld.optional_static_tls tunable (set through the GLIBC_TUNABLES environment variable).
  • musl: No surplus at all. Libraries using initial-exec TLS cannot be dlopen'd on musl.

Mitigation strategies:

// BAD: 4KB in static TLS per thread
__thread char buffer[4096];

// GOOD: Only 8 bytes in static TLS, allocate on demand
__thread char *buffer;

void ensure_buffer(void) {
    if (!buffer) {
        buffer = malloc(4096);  // Heap allocation, not TLS
    }
}

Diagnosing TLS issues:

# See TLS allocation during library loading
LD_DEBUG=files ./myprogram 2>&1 | grep -i tls

# Check a library's TLS model
readelf -d libfoo.so | grep -i tls
objdump -R libfoo.so | grep TLS

# glibc vs musl: Critical Differences

Code that works on glibc may fail on musl (used by Alpine Linux and many embedded systems). The differences stem from fundamentally different design philosophies:

Aspect                      glibc                        musl
---------------------------------------------------------------------------
Static TLS surplus          ~1-2KB reserved              None
dlopen + initial-exec       May work (until exhausted)   Always fails
dlclose behavior            Unloads library              No-op (permanent)
TLS allocation for dlopen   Lazy (on first access)       Upfront (at dlopen)
__tls_get_addr              Not async-signal-safe        Async-signal-safe

musl's philosophy: Pre-allocate everything so failures happen early (at dlopen) rather than late (random crash when TLS is first accessed). The tradeoff is stricter constraints on what can be dlopen'd, and libraries are permanent once loaded.

# Gotchas and Pitfalls

  • Memory overhead: N threads × M bytes per TLS variable. A 1KB TLS buffer with 100 threads = 100KB. Use pointers and lazy allocation for large data.
  • Destruction order (C++): thread_local destructors run at thread exit, but the order between different TLS variables is unspecified. If destructor A accesses TLS variable B, B might already be destroyed. Avoid cross-TLS-variable dependencies in destructors.
  • fork() copies TLS: When you fork(), the child gets a copy of the parent's TLS values. This can break assumptions (e.g., TLS holding a thread ID or file descriptor that's now stale in the child).
  • Signal handlers: Accessing TLS in signal handlers is generally safe (the signal runs on some thread and sees that thread's TLS). However, on glibc, __tls_get_addr is not async-signal-safe, so accessing TLS that requires dynamic resolution for the first time in a signal handler can deadlock.
  • dlopen with RTLD_LOCAL vs RTLD_GLOBAL: Affects symbol visibility, which can impact TLS resolution if libraries have interdependencies.
  • Cross-compilation: TLS model availability varies by platform. Code assuming initial-exec may fail on platforms with different TLS implementations.

# Rust Specifics

Rust provides TLS through two mechanisms with different tradeoffs:

thread_local! macro (stable):

use std::cell::Cell;

thread_local! {
    static COUNTER: Cell<u32> = Cell::new(0);
}

fn main() {
    // Access requires .with() closure - cannot get direct reference
    COUNTER.with(|c| {
        c.set(c.get() + 1);
        println!("Counter: {}", c.get());
    });
}

The .with() pattern exists because thread_local! uses lazy initialization. The closure ensures the TLS is initialized before access and prevents returning references that could outlive the thread.

thread_local! access via .with():

  COUNTER.with(|c| ...)
        |
        v
  +--> LocalKey::with()
        +--> Is this thread's slot initialized?
              +--> No:  Run initializer (Cell::new(0))
              |        Store in TLS
              |        Then call closure with &Cell
              |
              +--> Yes: Call closure with &Cell directly

#[thread_local] attribute (unstable/nightly):

#![feature(thread_local)]

use std::cell::Cell;

#[thread_local]
static COUNTER: Cell<u32> = Cell::new(0);

fn main() {
    // Direct access - no closure needed
    COUNTER.set(COUNTER.get() + 1);
    println!("Counter: {}", COUNTER.get());
}

#[thread_local] maps directly to native TLS (like C's __thread), giving single-instruction access on local-exec. The tradeoff: no lazy initialization, only const initializers, and it's unstable.

Rust TLS gotchas:

  • Destructors not guaranteed: TLS destructors run during normal thread exit (including panic unwinding), but may not run if the process exits, the main thread returns while other threads are still alive, or a panic aborts the process. Don't rely on them for critical cleanup.
  • Performance difference: thread_local! adds function call overhead for the lazy init check. For hot paths, this can matter. #[thread_local] avoids this but requires nightly.
  • No direct references: thread_local! intentionally prevents getting &'static T to avoid lifetime issues. Use .with() or consider other patterns if you need persistent references.

# Debugging TLS

When TLS goes wrong, these tools help diagnose issues:

# GDB: Examine TLS variables (always shown for the currently selected thread)
(gdb) p my_tls_var                # Print the current thread's copy
(gdb) p &my_tls_var               # Show its address in this thread
(gdb) thread 2                    # Switch threads, then print again to compare

# Check TLS segment in binary
readelf -l ./mybinary | grep TLS

# See TLS relocations in a shared library
readelf -r libfoo.so | grep TLS

# Runtime TLS debugging
LD_DEBUG=files,bindings ./myprogram 2>&1 | grep -i tls

# Check which TLS model a library uses (relocation annotations in disassembly)
objdump -d libfoo.so | grep -E '@(tlsgd|tlsld|gottpoff|tpoff)'

Common symptoms and causes:

  • "cannot allocate memory in static TLS block": dlopen'd library uses initial-exec TLS, static block exhausted. Recompile library with -ftls-model=global-dynamic.
  • TLS variable has wrong/stale value after fork: TLS was copied from parent, contains parent's state. Reinitialize in child if needed.
  • Crash in signal handler accessing TLS: First access triggered __tls_get_addr which isn't async-signal-safe on glibc. Access the TLS variable once in normal code before installing the signal handler.
  • TLS works in tests, fails in production: Test links statically (local-exec), production uses dlopen (needs general-dynamic). Check TLS model consistency.