DNS in Datacenters
DNS (Domain Name System) translates human-readable names to IP addresses. In datacenters, DNS serves critical roles beyond simple lookups: service discovery, load distribution, and failover orchestration. Understanding DNS behavior during outages can mean the difference between graceful degradation and cascading failures.
# Internal vs External DNS
Datacenters typically run separate DNS infrastructure for internal and external queries.
External DNS: Resolves public-facing names (www.example.com) for internet clients. Hosted on public DNS servers (Route53, Cloudflare DNS, etc.). Returns public IP addresses (load balancers, CDN endpoints).
Internal DNS: Resolves internal service names (db.internal, cache.prod.dc1) for servers within the datacenter. Returns private IP addresses (10.x.x.x, 172.16.x.x). Not accessible from the internet.
# Split-Horizon DNS
Split-horizon (split-brain) DNS returns different answers depending on who's asking. Same hostname resolves differently for internal vs external clients.
Query for "api.example.com" From Internet: Client --> Public DNS --> Returns: 203.0.113.10 (load balancer) From Inside Datacenter: Server --> Internal DNS --> Returns: 10.0.1.50 (direct to app server) Same hostname, different IPs based on query source
Why use split-horizon:
- Avoid hairpinning: internal servers access services directly, not through external load balancers
- Reduce latency: skip NAT, load balancers, and public internet path for internal traffic
- Security: hide internal topology from external queries
Configuration: DNS servers check query source IP. If from internal network (10.0.0.0/8), return internal answer. If from internet, return public answer.
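A minimal sketch of that source check, using hypothetical zone data; real servers such as BIND express the same logic as view/ACL configuration rather than application code:

```python
# Minimal sketch of source-based answer selection (hypothetical zone table).
import ipaddress

INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8"),
                 ipaddress.ip_network("172.16.0.0/12")]

ZONE = {
    "api.example.com": {
        "internal": "10.0.1.50",     # direct to app server
        "external": "203.0.113.10",  # public load balancer
    },
}

def resolve(name: str, client_ip: str) -> str:
    """Return the internal or external answer based on the query source IP."""
    source = ipaddress.ip_address(client_ip)
    view = "internal" if any(source in net for net in INTERNAL_NETS) else "external"
    return ZONE[name][view]

print(resolve("api.example.com", "10.0.2.7"))      # 10.0.1.50
print(resolve("api.example.com", "198.51.100.4"))  # 203.0.113.10
```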
# Service Discovery Patterns
In modern microservice architectures, service discovery finds instances of a service dynamically. DNS is one approach, but not the only one.
DNS-Based Service Discovery
Query: web-service.prod.internal
Response: 10.0.1.10, 10.0.1.11, 10.0.1.12 (A records)

Client picks one IP (round-robin or random)
Connects directly to chosen instance
Advantages: Simple, works with any client, no special libraries needed.
Disadvantages: DNS caching means slow updates (TTL-dependent). Client has no health info—might connect to failed instance. No load balancing smarts (just round-robin).
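A minimal sketch of the pattern, assuming a hypothetical internal name that returns multiple A records; it relies only on the standard resolver, which is exactly why it inherits the caching delays and health-blindness noted above:

```python
# Minimal sketch of DNS-based service discovery: resolve all A records for a
# (hypothetical) internal name and pick one at random.
import random
import socket

def discover(name: str, port: int) -> str:
    infos = socket.getaddrinfo(name, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})  # unique IPs from A records
    return random.choice(addresses)

# ip = discover("web-service.prod.internal", 8080)  # may return a failed instance
```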
Dedicated Service Discovery Systems
Systems like Consul, etcd, ZooKeeper, or Kubernetes Services provide dynamic service discovery with health checks and real-time updates.
Service Registry (Consul/etcd/K8s)
|
+-- web-service instances: [10.0.1.10 (healthy),
| 10.0.1.11 (healthy),
| 10.0.1.12 (unhealthy, excluded)]
|
Client queries registry API
Receives list of healthy instances only
Connects to one (with client-side load balancing)
Advantages: Fast updates (no TTL delays), health-aware (excludes failed instances), richer metadata (version tags, weights).
Disadvantages: Requires client libraries, adds dependency on registry service, more complex than DNS.
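For illustration, a minimal sketch against Consul's HTTP health API (the agent address and service name are assumptions); the `?passing` filter is what excludes the unhealthy 10.0.1.12 above:

```python
# Minimal sketch of registry-based discovery via Consul's health endpoint.
import json
import random
import urllib.request

def healthy_instances(consul_addr: str, service: str) -> list:
    # "?passing" limits the response to instances whose health checks pass.
    url = f"{consul_addr}/v1/health/service/{service}?passing"
    with urllib.request.urlopen(url, timeout=2) as resp:
        entries = json.load(resp)
    return [(entry["Service"]["Address"] or entry["Node"]["Address"],
             entry["Service"]["Port"])
            for entry in entries]

# instances = healthy_instances("http://127.0.0.1:8500", "web-service")
# addr, port = random.choice(instances)   # client-side load balancing
```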
# TTL and Caching Behavior
DNS responses include a TTL (Time to Live) that tells clients/resolvers how long to cache the answer. TTL is critical for failover speed and load balancing.
Query: api.example.com
Response: 203.0.113.10, TTL=60 seconds

Client caches this answer for 60 seconds
Won't re-query DNS until TTL expires
If IP changes during TTL period, client still uses old IP
Low TTL (5-60 seconds):
- Pros: Fast failover, quick DNS-based load balancing changes
- Cons: Higher DNS query load, more latency (frequent lookups)
High TTL (300-3600 seconds):
- Pros: Lower DNS load, less latency (fewer lookups)
- Cons: Slow failover, clients may hit stale IPs for hours
Best practice: Use low TTL (60s) for services that need fast failover. Use high TTL (300-600s) for stable infrastructure. Plan failovers accounting for "longest TTL" delay.
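A minimal sketch of the caching behavior itself (the resolve function is a stand-in for a real lookup), showing why a client keeps using the old IP until the TTL runs out:

```python
# Minimal sketch of a TTL-honouring client cache. resolve_fn stands in for a
# real DNS lookup and must return (ip, ttl_seconds).
import time

class TtlCache:
    def __init__(self, resolve_fn):
        self._resolve = resolve_fn
        self._cache = {}          # name -> (ip, expiry timestamp)

    def lookup(self, name: str) -> str:
        entry = self._cache.get(name)
        if entry and time.monotonic() < entry[1]:
            return entry[0]       # cached answer, possibly stale, until TTL expires
        ip, ttl = self._resolve(name)
        self._cache[name] = (ip, time.monotonic() + ttl)
        return ip

# With TTL=60, an IP change at T+0 is invisible to this client until T+60.
```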
# DNS During Outages and Failover
DNS is often involved in failover scenarios, but it's not instant. Understanding DNS propagation delays prevents false confidence in failover plans.
Scenario: Datacenter Failover
Time   Action                             Impact
-----  ---------------------------------  --------------------------
T+0    Primary datacenter fails           Services down
T+1    Monitoring detects failure         Alert sent
T+2    DNS updated: api.example.com now   DNS servers have new IP
       points to backup datacenter IP
T+2    Clients with fresh DNS work        Some traffic recovers
T+62   All clients' TTL expired           Full traffic at backup DC
       (assuming 60s TTL)

Actual recovery: 60+ seconds, not "instant"
Problem: Clients that cached the old IP (before failover) wait for the TTL to expire. During this time, they attempt to connect to the failed datacenter.
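Put as arithmetic (values mirror the timeline above and are purely illustrative), the recovery floor is detection plus DNS update plus the longest cached TTL:

```python
# Back-of-envelope recovery estimate for DNS-only failover; all values in
# seconds and purely illustrative, mirroring the timeline above.
detect = 1    # monitoring notices the failure (T+0 -> T+1)
update = 1    # new record published           (T+1 -> T+2)
ttl = 60      # longest TTL a client may still be caching
print(f"worst-case client recovery: {detect + update + ttl}s")  # 62s, not instant
```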
Mitigation strategies:
- Use anycast IP addresses (same IP advertised from multiple datacenters via BGP)
- Use load balancers that do health checks and remove failed backends
- Combine DNS failover with application-level retries to alternate endpoints
- Pre-position clients with multiple IP addresses (A records) and client-side failover (see the sketch after this list)
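A minimal sketch of that last mitigation, assuming the client already holds several published A records (placeholder addresses): try each with a short connect timeout so a cached-but-dead IP doesn't strand it for the full TTL window.

```python
# Minimal sketch of client-side failover across multiple A records.
# Addresses and port are placeholders.
import socket

def connect_with_failover(addresses, port, timeout=2.0):
    """Return a connected socket to the first address that accepts the connection."""
    last_error = None
    for ip in addresses:
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError as err:
            last_error = err      # unreachable (e.g. failed datacenter); try the next IP
    raise last_error if last_error else OSError("no addresses to try")

# sock = connect_with_failover(["203.0.113.10", "198.51.100.20"], 443)
```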
DNS as Dependency in Outages
Lesson from Facebook 2021: When Facebook's BGP routes were withdrawn, their DNS servers also became unreachable. Even if DNS records were "correct," clients couldn't query DNS because there was no route to DNS servers.
Implication: DNS depends on underlying network routing (BGP). If routing fails, DNS fails too. Don't assume DNS is independent of network infrastructure—it's built on top of it.
# Key Takeaways
- Split-horizon DNS returns different answers for internal vs external queries
- DNS-based service discovery is simple but slow to update; dedicated systems (Consul, K8s) offer health-aware, real-time discovery
- TTL determines failover speed—low TTL (60s) enables faster recovery but increases DNS load
- DNS failover isn't instant; plan for "longest TTL" delay in recovery time objectives
- DNS depends on network routing (BGP)—if network fails, DNS may be unreachable even if correct
- Use anycast or load balancers with health checks for faster failover than DNS alone provides