99.99% uptime means your application can be down for no more than 52 minutes per year. That sounds achievable until you realise that a single bad deployment, a database failover, or a misconfigured load balancer can eat that entire budget in one incident. Here is the infrastructure playbook we use to hit four nines for production systems.
## What 99.99% Actually Means
Before building for it, understand what you are targeting.
| SLA | Downtime per year | Downtime per month |
|-----|-------------------|--------------------|
| 99% | 87.6 hours | 7.3 hours |
| 99.9% | 8.7 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.4 minutes |
| 99.999% | 5.3 minutes | 26.3 seconds |
The jump from 99.9% to 99.99% is significant. It requires eliminating most planned downtime (deployments, maintenance) and dramatically reducing the blast radius of incidents.
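The budgets in the table above are just arithmetic on the length of a year; a quick sketch:

```typescript
// Convert an SLA percentage into an allowed-downtime budget.
const MINUTES_PER_YEAR = 365.25 * 24 * 60; // ≈ 525,960

function downtimeBudgetMinutes(slaPercent: number): number {
  return MINUTES_PER_YEAR * (1 - slaPercent / 100);
}

console.log(downtimeBudgetMinutes(99.99).toFixed(1)); // ≈ 52.6 minutes/year
```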
## Eliminate Planned Downtime First
Most teams focus on unplanned outages, but planned downtime — deployments, database migrations, certificate renewals — is often the bigger contributor to downtime at early stages.
### Blue-Green Deployments
Blue-green deployments run two identical production environments. Traffic routes to the active environment (blue). You deploy to the idle environment (green), run smoke tests, then shift traffic. If something goes wrong, you shift traffic back in seconds.
```
Load Balancer
├── Blue (active, v1.2.3) ← 100% traffic
└── Green (idle, v1.2.4) ← 0% traffic (deploying)
After validation:
├── Blue (idle, v1.2.3) ← 0% traffic
└── Green (active, v1.2.4) ← 100% traffic
```
Rollback is a traffic shift, not a redeployment. This is the single most impactful change most teams can make.
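The cutover logic itself is small; here is a sketch with an illustrative router abstraction (the class and `smokeTestPassed` hook are assumptions, not a real load-balancer API):

```typescript
// Minimal blue-green router sketch. Traffic flips between environments
// only when the idle environment passes validation; rollback is just
// another flip.
type Env = "blue" | "green";

class BlueGreenRouter {
  constructor(private active: Env = "blue") {}

  get activeEnv(): Env {
    return this.active;
  }

  // Cut over only if the idle environment passed its smoke tests;
  // otherwise leave traffic where it is.
  cutOver(smokeTestPassed: boolean): Env {
    if (smokeTestPassed) {
      this.active = this.active === "blue" ? "green" : "blue";
    }
    return this.active;
  }
}
```

Rolling back after a bad release is the same operation in the other direction, which is why it takes seconds rather than a redeployment.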
### Canary Releases
For higher-risk changes, canary releases send a small percentage of traffic to the new version before full rollout.
- 1% of traffic → new version
- Monitor error rates, latency, and business metrics for 15 minutes
- If metrics look good, ramp to 10%, then 50%, then 100%
- If metrics degrade, shift back to 0% immediately
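The ramp above can be sketched as a loop over traffic steps; the `healthy` callback stands in for whatever metric checks you run (error rates, latency, business KPIs) and is an assumption here:

```typescript
// Canary ramp sketch: step traffic up while metrics stay healthy,
// shift back to 0% the moment they degrade.
const RAMP_STEPS = [1, 10, 50, 100]; // percent of traffic to the new version

function runCanary(healthy: (percent: number) => boolean): number {
  let current = 0;
  for (const step of RAMP_STEPS) {
    current = step;
    if (!healthy(current)) return 0; // degraded → immediate rollback
  }
  return current; // reached 100% — full rollout
}
```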
### Zero-Downtime Database Migrations
Database migrations are the most common source of deployment-related downtime. The pattern that works:
1. Deploy code that supports both old and new schema
2. Run the migration (additive changes only — add columns, never remove)
3. Deploy code that uses only the new schema
4. Clean up old columns in a separate migration after the old code is gone
Never remove columns or rename them in the same deployment that changes the code. The old code will still be running during the migration.
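The "additive only" rule can be enforced mechanically in CI. A minimal guardrail sketch, with an illustrative (and deliberately incomplete) pattern list; extend it for your SQL dialect:

```typescript
// Guardrail sketch: reject destructive DDL during the expand phase.
// The regex list is illustrative, not exhaustive.
const DESTRUCTIVE = [
  /\bDROP\s+COLUMN\b/i,
  /\bRENAME\s+COLUMN\b/i,
  /\bDROP\s+TABLE\b/i,
];

function isAdditive(sql: string): boolean {
  return !DESTRUCTIVE.some((pattern) => pattern.test(sql));
}
```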
## Multi-AZ Architecture
A single availability zone failure should not take down your application. Every critical component needs to span at least two AZs.
### Compute
ECS services and Kubernetes deployments should run replicas spread across AZs. On ECS this is configured with a `spread` placement *strategy* (placement constraints are for membership rules such as `distinctInstance`); Fargate tasks are spread across the subnets you provide automatically.
```hcl
# ECS service (EC2 launch type) — spread tasks across AZs
ordered_placement_strategy {
  type  = "spread"
  field = "attribute:ecs.availability-zone"
}
```
### Databases
RDS Multi-AZ creates a synchronous standby replica in a second AZ. Failover is automatic and typically completes in 60–120 seconds. For most applications, this is acceptable. For stricter requirements, Aurora typically fails over in under 30 seconds, and Aurora Global Database adds cross-region disaster recovery with a recovery point objective measured in seconds.
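One practical consequence of the failover window: application code should reconnect with backoff rather than crash when the database briefly disappears. A minimal sketch, assuming your driver exposes some `connect` function (the helper name and parameters are illustrative):

```typescript
// Retry-with-backoff sketch for riding out a Multi-AZ failover window.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  maxAttempts = 8,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Exponential backoff: 0.5s, 1s, 2s, ... keeps the total wait
      // within the typical 60–120 second failover window.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```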
### Load Balancers
Application Load Balancers are inherently multi-AZ. Ensure your target groups have healthy instances in at least two AZs at all times.
## Health Checks That Actually Work
Bad health checks are a silent killer. If your health check passes when the application is degraded, your load balancer will keep sending traffic to broken instances.
### What a Good Health Check Looks Like
A health check endpoint should verify that the application can actually serve requests — not just that the process is running.
```typescript
// Bad: just returns 200
app.get('/health', (req, res) => res.json({ status: 'ok' }));

// Good: verifies critical dependencies
app.get('/health', async (req, res) => {
  const dbOk = await checkDatabaseConnection();
  const cacheOk = await checkRedisConnection();
  if (!dbOk || !cacheOk) {
    return res.status(503).json({
      status: 'degraded',
      db: dbOk,
      cache: cacheOk,
    });
  }
  res.json({ status: 'ok' });
});
```
Configure your load balancer to remove instances from rotation when the health check returns 503. This prevents traffic from reaching broken instances.
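One caveat worth encoding: a dependency check that hangs is as bad as one that fails, because the health endpoint then times out instead of returning a clean 503. Bound every check with a timeout; a sketch (the `withTimeout` helper is illustrative):

```typescript
// Bound a dependency check: resolve false if it throws or does not
// answer within `ms` milliseconds, so the health endpoint stays fast.
async function withTimeout(
  check: () => Promise<boolean>,
  ms: number,
): Promise<boolean> {
  const timeout = new Promise<boolean>((resolve) =>
    setTimeout(() => resolve(false), ms),
  );
  try {
    return await Promise.race([check(), timeout]);
  } catch {
    return false; // a throwing check counts as unhealthy, not a 500
  }
}
```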
## Monitoring and Alerting
You cannot fix what you cannot see. Comprehensive observability is a prerequisite for high uptime.
### The Four Golden Signals
Monitor these four metrics for every service:
1. **Latency** — How long requests take. Track p50, p95, and p99.
2. **Traffic** — Requests per second. Sudden drops are often more alarming than spikes.
3. **Errors** — Error rate as a percentage of total requests.
4. **Saturation** — How full your resources are (CPU, memory, connection pool).
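A sketch of computing the first three signals from raw request samples, using simple nearest-rank percentiles (function names are illustrative; real systems compute these in the metrics backend):

```typescript
// Nearest-rank percentile over a sorted array — simple and good
// enough for dashboards.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

function goldenSignals(latenciesMs: number[], errors: number) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    errorRate: latenciesMs.length ? errors / latenciesMs.length : 0,
  };
}
```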
### Synthetic Monitoring
Real user monitoring tells you about problems after they happen. Synthetic monitors check your application proactively, every minute, from multiple regions.
Configure synthetic monitors for:

- Your homepage (basic availability)
- Your login flow (authentication stack)
- Your most critical user journey (checkout, core feature)
- Your API health endpoint

Alert when any synthetic check fails from two or more regions simultaneously. Single-region failures are often network issues, not application problems.
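The two-region rule can be encoded directly in the alerting path; a minimal sketch (the region names and `shouldAlert` helper are illustrative):

```typescript
// Alert only when a synthetic check fails from >= `quorum` regions
// at once — a single-region failure is usually network noise.
function shouldAlert(results: Record<string, boolean>, quorum = 2): boolean {
  const failures = Object.values(results).filter((ok) => !ok).length;
  return failures >= quorum;
}
```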
### Alert Fatigue
Too many alerts is as dangerous as too few. Engineers who receive dozens of alerts per day start ignoring them. Every alert should be:
- **Actionable** — There is something a human can do right now
- **Urgent** — It requires attention within the SLA window
- **Accurate** — It fires when there is a real problem, not on transient noise
## Incident Response
Even with the best infrastructure, incidents happen. How you respond determines whether a five-minute blip becomes a 30-minute outage.
### Runbooks
Every alert should have a corresponding runbook — a documented procedure for diagnosing and resolving the issue. Runbooks should be:
- Accessible without logging into anything (they are needed during outages)
- Step-by-step, not conceptual
- Tested by someone other than the person who wrote them
### Incident Severity Levels
Define severity levels before you need them:
| Level | Definition | Response time |
|-------|------------|---------------|
| P1 | Complete outage or data loss | 5 minutes |
| P2 | Significant degradation affecting >10% of users | 15 minutes |
| P3 | Minor degradation, workaround available | 1 hour |
| P4 | Non-urgent issue | Next business day |
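Encoding the table in code keeps paging logic and dashboards in agreement about what each level means; a sketch (the 24-hour value for P4 approximates "next business day"):

```typescript
// Severity levels and response-time SLAs from the table above.
type Severity = "P1" | "P2" | "P3" | "P4";

const RESPONSE_TIME_MINUTES: Record<Severity, number> = {
  P1: 5,       // complete outage or data loss
  P2: 15,      // significant degradation, >10% of users
  P3: 60,      // minor degradation, workaround available
  P4: 24 * 60, // non-urgent — next business day, approximated as 24h
};
```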
### Post-Mortems
Every P1 and P2 incident should have a blameless post-mortem within 48 hours. The goal is to understand what happened and prevent recurrence — not to assign blame.
A good post-mortem answers:
- What happened and when?
- What was the user impact?
- What caused it?
- What did we do to resolve it?
- What are we changing to prevent it happening again?
## Chaos Engineering
The only way to know your system is resilient is to test it. Chaos engineering deliberately introduces failures to verify that your recovery mechanisms work.
Start small:

- Terminate a random ECS task and verify the service recovers
- Simulate an AZ failure by stopping all instances in one AZ
- Introduce artificial latency on your database connection and verify circuit breakers trip

Run chaos experiments in staging first. Graduate to production only when you have high confidence in your recovery mechanisms.
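For the latency experiment, the thing under test is the circuit breaker itself. A minimal consecutive-failure breaker, as a sketch only (real implementations add timeouts and half-open probing):

```typescript
// Minimal circuit-breaker sketch: after `threshold` consecutive
// failures the breaker opens and callers fail fast instead of
// piling up slow requests.
class CircuitBreaker {
  private consecutiveFailures = 0;

  constructor(private threshold = 3) {}

  get open(): boolean {
    return this.consecutiveFailures >= this.threshold;
  }

  record(success: boolean): void {
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
  }
}
```

The chaos experiment then becomes an assertion: inject latency, treat slow calls as failures, and verify the breaker opens before the request queue saturates.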
## The Uptime Stack
The infrastructure components that together enable 99.99% uptime:

- **Blue-green deployments** — eliminate deployment downtime
- **Multi-AZ compute and databases** — survive AZ failures
- **Meaningful health checks** — remove broken instances automatically
- **Synthetic monitoring** — detect problems before users do
- **Defined runbooks** — resolve incidents faster
- **Regular chaos testing** — verify resilience before incidents do

None of these are exotic. They are engineering fundamentals applied consistently. The teams that hit four nines are not doing anything magical — they are just doing the basics very well.