DNS in Datacenters
DNS (Domain Name System) translates human-readable names to IP addresses. In datacenters, DNS serves critical roles beyond simple lookups: service discovery, load distribution, and failover orchestration. Understanding DNS behavior during outages can mean the difference between graceful degradation and cascading failures.
# Internal vs External DNS
Datacenters typically run separate DNS infrastructure for internal and external queries.
External DNS: Resolves public-facing names (www.example.com) for internet clients. Hosted on public DNS servers (Route53, Cloudflare DNS, etc.). Returns public IP addresses (load balancers, CDN endpoints).
Internal DNS: Resolves internal service names (db.internal, cache.prod.dc1) for servers within the datacenter. Returns private IP addresses (10.x.x.x, 172.16.x.x). Not accessible from the internet.
# Split-Horizon DNS
Split-horizon (split-brain) DNS returns different answers depending on who's asking. Same hostname resolves differently for internal vs external clients.
Query for "api.example.com" From Internet: Client --> Public DNS --> Returns: 203.0.113.10 (load balancer) From Inside Datacenter: Server --> Internal DNS --> Returns: 10.0.1.50 (direct to app server) Same hostname, different IPs based on query source
Why use split-horizon:
- Avoid hairpinning: internal servers access services directly, not through external load balancers
- Reduce latency: skip NAT, load balancers, and public internet path for internal traffic
- Security: hide internal topology from external queries
Configuration: DNS servers check query source IP. If from internal network (10.0.0.0/8), return internal answer. If from internet, return public answer.
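A minimal sketch of that source check, using hypothetical zone data; real servers such as BIND express the same logic as view/ACL configuration rather than application code:

```python
# Minimal sketch of source-based answer selection (hypothetical zone table).
import ipaddress

INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8"),
                 ipaddress.ip_network("172.16.0.0/12")]

ZONE = {
    "api.example.com": {
        "internal": "10.0.1.50",     # direct to app server
        "external": "203.0.113.10",  # public load balancer
    },
}

def resolve(name: str, client_ip: str) -> str:
    """Return the internal or external answer based on the query source IP."""
    source = ipaddress.ip_address(client_ip)
    view = "internal" if any(source in net for net in INTERNAL_NETS) else "external"
    return ZONE[name][view]

print(resolve("api.example.com", "10.0.2.7"))      # 10.0.1.50
print(resolve("api.example.com", "198.51.100.4"))  # 203.0.113.10
```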
# Service Discovery Patterns
In modern microservice architectures, service discovery finds instances of a service dynamically. DNS is one approach, but not the only one.
DNS-Based Service Discovery
Query: web-service.prod.internal
Response: 10.0.1.10, 10.0.1.11, 10.0.1.12 (A records)

Client picks one IP (round-robin or random)
Connects directly to chosen instance
Advantages: Simple, works with any client, no special libraries needed.
Disadvantages: DNS caching means slow updates (TTL-dependent). Client has no health info—might connect to failed instance. No load balancing smarts (just round-robin).
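A minimal sketch of the pattern, assuming a hypothetical internal name that returns multiple A records; it relies only on the standard resolver, which is exactly why it inherits the caching delays and health-blindness noted above:

```python
# Minimal sketch of DNS-based service discovery: resolve all A records for a
# (hypothetical) internal name and pick one at random.
import random
import socket

def discover(name: str, port: int) -> str:
    infos = socket.getaddrinfo(name, port, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})  # unique IPs from A records
    return random.choice(addresses)

# ip = discover("web-service.prod.internal", 8080)  # may return a failed instance
```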
Dedicated Service Discovery Systems
Systems like Consul, etcd, ZooKeeper, or Kubernetes Services provide dynamic service discovery with health checks and real-time updates.
Service Registry (Consul/etcd/K8s)
|
+-- web-service instances: [10.0.1.10 (healthy),
| 10.0.1.11 (healthy),
| 10.0.1.12 (unhealthy, excluded)]
|
Client queries registry API
Receives list of healthy instances only
Connects to one (with client-side load balancing)
Advantages: Fast updates (no TTL delays), health-aware (excludes failed instances), richer metadata (version tags, weights).
Disadvantages: Requires client libraries, adds dependency on registry service, more complex than DNS.
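For illustration, a minimal sketch against Consul's HTTP health API (the agent address and service name are assumptions); the `?passing` filter is what excludes the unhealthy 10.0.1.12 above:

```python
# Minimal sketch of registry-based discovery via Consul's health endpoint.
import json
import random
import urllib.request

def healthy_instances(consul_addr: str, service: str) -> list:
    # "?passing" limits the response to instances whose health checks pass.
    url = f"{consul_addr}/v1/health/service/{service}?passing"
    with urllib.request.urlopen(url, timeout=2) as resp:
        entries = json.load(resp)
    return [(entry["Service"]["Address"] or entry["Node"]["Address"],
             entry["Service"]["Port"])
            for entry in entries]

# instances = healthy_instances("http://127.0.0.1:8500", "web-service")
# addr, port = random.choice(instances)   # client-side load balancing
```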
# TTL and Caching Behavior
DNS responses include a TTL (Time to Live) that tells clients/resolvers how long to cache the answer. TTL is critical for failover speed and load balancing.
Query: api.example.com
Response: 203.0.113.10, TTL=60 seconds

Client caches this answer for 60 seconds
Won't re-query DNS until TTL expires
If IP changes during TTL period, client still uses old IP
Low TTL (5-60 seconds):
- Pros: Fast failover, quick DNS-based load balancing changes
- Cons: Higher DNS query load, more latency (frequent lookups)
High TTL (300-3600 seconds):
- Pros: Lower DNS load, less latency (fewer lookups)
- Cons: Slow failover, clients may hit stale IPs for hours
Best practice: Use low TTL (60s) for services that need fast failover. Use high TTL (300-600s) for stable infrastructure. Plan failovers accounting for "longest TTL" delay.
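A minimal sketch of the caching behavior itself (the resolve function is a stand-in for a real lookup), showing why a client keeps using the old IP until the TTL runs out:

```python
# Minimal sketch of a TTL-honouring client cache. resolve_fn stands in for a
# real DNS lookup and must return (ip, ttl_seconds).
import time

class TtlCache:
    def __init__(self, resolve_fn):
        self._resolve = resolve_fn
        self._cache = {}          # name -> (ip, expiry timestamp)

    def lookup(self, name: str) -> str:
        entry = self._cache.get(name)
        if entry and time.monotonic() < entry[1]:
            return entry[0]       # cached answer, possibly stale, until TTL expires
        ip, ttl = self._resolve(name)
        self._cache[name] = (ip, time.monotonic() + ttl)
        return ip

# With TTL=60, an IP change at T+0 is invisible to this client until T+60.
```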
# DNS During Outages and Failover
DNS is often involved in failover scenarios, but it's not instant. Understanding DNS propagation delays prevents false confidence in failover plans.
Scenario: Datacenter Failover
Time   Action                             Impact
-----  ---------------------------------  --------------------------
T+0    Primary datacenter fails           Services down
T+1    Monitoring detects failure         Alert sent
T+2    DNS updated: api.example.com now   DNS servers have new IP
       points to backup datacenter IP
T+2    Clients with fresh DNS work        Some traffic recovers
T+62   All clients' TTL expired           Full traffic at backup DC
       (assuming 60s TTL)

Actual recovery: 60+ seconds, not "instant"
Problem: Clients that cached the old IP (before failover) wait for the TTL to expire. During this time, they attempt to connect to the failed datacenter.
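Put as arithmetic (values mirror the timeline above and are purely illustrative), the recovery floor is detection plus DNS update plus the longest cached TTL:

```python
# Back-of-envelope recovery estimate for DNS-only failover; all values in
# seconds and purely illustrative, mirroring the timeline above.
detect = 1    # monitoring notices the failure (T+0 -> T+1)
update = 1    # new record published           (T+1 -> T+2)
ttl = 60      # longest TTL a client may still be caching
print(f"worst-case client recovery: {detect + update + ttl}s")  # 62s, not instant
```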
Mitigation strategies:
- Use anycast IP addresses (same IP advertised from multiple datacenters via BGP)
- Use load balancers that do health checks and remove failed backends
- Combine DNS failover with application-level retries to alternate endpoints
- Pre-position clients with multiple IP addresses (A records) and client-side failover (see the sketch after this list)
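A minimal sketch of that last mitigation, assuming the client already holds several published A records (placeholder addresses): try each with a short connect timeout so a cached-but-dead IP doesn't strand it for the full TTL window.

```python
# Minimal sketch of client-side failover across multiple A records.
# Addresses and port are placeholders.
import socket

def connect_with_failover(addresses, port, timeout=2.0):
    """Return a connected socket to the first address that accepts the connection."""
    last_error = None
    for ip in addresses:
        try:
            return socket.create_connection((ip, port), timeout=timeout)
        except OSError as err:
            last_error = err      # unreachable (e.g. failed datacenter); try the next IP
    raise last_error if last_error else OSError("no addresses to try")

# sock = connect_with_failover(["203.0.113.10", "198.51.100.20"], 443)
```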
DNS as Dependency in Outages
Lesson from Facebook 2021: When Facebook's BGP routes were withdrawn, their DNS servers also became unreachable. Even if DNS records were "correct," clients couldn't query DNS because there was no route to DNS servers.
Implication: DNS depends on underlying network routing (BGP). If routing fails, DNS fails too. Don't assume DNS is independent of network infrastructure—it's built on top of it.
# Key Takeaways
- Split-horizon DNS returns different answers for internal vs external queries
- DNS-based service discovery is simple but slow to update; dedicated systems (Consul, K8s) offer health-aware, real-time discovery
- TTL determines failover speed—low TTL (60s) enables faster recovery but increases DNS load
- DNS failover isn't instant; plan for "longest TTL" delay in recovery time objectives
- DNS depends on network routing (BGP)—if network fails, DNS may be unreachable even if correct
- Use anycast or load balancers with health checks for faster failover than DNS alone provides