99.99% uptime means your application can be down for no more than 52 minutes per year. That sounds achievable until you realise that a single bad deployment, a database failover, or a misconfigured load balancer can eat that entire budget in one incident. Here is the infrastructure playbook we use to hit four nines for production systems.
## What 99.99% Actually Means
Before building for it, understand what you are targeting.
| SLA | Downtime per year | Downtime per month |
|-----|-------------------|--------------------|
| 99% | 87.6 hours | 7.3 hours |
| 99.9% | 8.7 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.4 minutes |
| 99.999% | 5.3 minutes | 26.3 seconds |
The jump from 99.9% to 99.99% is significant. It requires eliminating most planned downtime (deployments, maintenance) and dramatically reducing the blast radius of incidents.
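The budgets in the table above are just arithmetic on the length of a year; a quick sketch:

```typescript
// Convert an SLA percentage into an allowed-downtime budget.
const MINUTES_PER_YEAR = 365.25 * 24 * 60; // ≈ 525,960

function downtimeBudgetMinutes(slaPercent: number): number {
  return MINUTES_PER_YEAR * (1 - slaPercent / 100);
}

console.log(downtimeBudgetMinutes(99.99).toFixed(1)); // ≈ 52.6 minutes/year
```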
## Eliminate Planned Downtime First
Most teams focus on unplanned outages, but planned downtime — deployments, database migrations, certificate renewals — is often the bigger contributor to downtime at early stages.
### Blue-Green Deployments
Blue-green deployments run two identical production environments. Traffic routes to the active environment (blue). You deploy to the idle environment (green), run smoke tests, then shift traffic. If something goes wrong, you shift traffic back in seconds.
```
Load Balancer
├── Blue (active, v1.2.3) ← 100% traffic
└── Green (idle, v1.2.4) ← 0% traffic (deploying)
After validation:
├── Blue (idle, v1.2.3) ← 0% traffic
└── Green (active, v1.2.4) ← 100% traffic
```
Rollback is a traffic shift, not a redeployment. This is the single most impactful change most teams can make.
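The cutover logic itself is small; here is a sketch with an illustrative router abstraction (the class and `smokeTestPassed` hook are assumptions, not a real load-balancer API):

```typescript
// Minimal blue-green router sketch. Traffic flips between environments
// only when the idle environment passes validation; rollback is just
// another flip.
type Env = "blue" | "green";

class BlueGreenRouter {
  constructor(private active: Env = "blue") {}

  get activeEnv(): Env {
    return this.active;
  }

  // Cut over only if the idle environment passed its smoke tests;
  // otherwise leave traffic where it is.
  cutOver(smokeTestPassed: boolean): Env {
    if (smokeTestPassed) {
      this.active = this.active === "blue" ? "green" : "blue";
    }
    return this.active;
  }
}
```

Rolling back after a bad release is the same operation in the other direction, which is why it takes seconds rather than a redeployment.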
### Canary Releases
For higher-risk changes, canary releases send a small percentage of traffic to the new version before full rollout.
- 1% of traffic → new version
- Monitor error rates, latency, and business metrics for 15 minutes
- If metrics look good, ramp to 10%, then 50%, then 100%
- If metrics degrade, shift back to 0% immediately
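The ramp above can be sketched as a loop over traffic steps; the `healthy` callback stands in for whatever metric checks you run (error rates, latency, business KPIs) and is an assumption here:

```typescript
// Canary ramp sketch: step traffic up while metrics stay healthy,
// shift back to 0% the moment they degrade.
const RAMP_STEPS = [1, 10, 50, 100]; // percent of traffic to the new version

function runCanary(healthy: (percent: number) => boolean): number {
  let current = 0;
  for (const step of RAMP_STEPS) {
    current = step;
    if (!healthy(current)) return 0; // degraded → immediate rollback
  }
  return current; // reached 100% — full rollout
}
```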
### Zero-Downtime Database Migrations
Database migrations are the most common source of deployment-related downtime. The pattern that works:
1. Deploy code that supports both old and new schema
2. Run the migration (additive changes only — add columns, never remove)
3. Deploy code that uses only the new schema
4. Clean up old columns in a separate migration after the old code is gone
Never remove columns or rename them in the same deployment that changes the code. The old code will still be running during the migration.
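The "additive only" rule can be enforced mechanically in CI. A minimal guardrail sketch, with an illustrative (and deliberately incomplete) pattern list; extend it for your SQL dialect:

```typescript
// Guardrail sketch: reject destructive DDL during the expand phase.
// The regex list is illustrative, not exhaustive.
const DESTRUCTIVE = [
  /\bDROP\s+COLUMN\b/i,
  /\bRENAME\s+COLUMN\b/i,
  /\bDROP\s+TABLE\b/i,
];

function isAdditive(sql: string): boolean {
  return !DESTRUCTIVE.some((pattern) => pattern.test(sql));
}
```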
## Multi-AZ Architecture
A single availability zone failure should not take down your application. Every critical component needs to span at least two AZs.
### Compute
ECS services and Kubernetes deployments should run replicas spread across AZs. On ECS this is configured with a `spread` placement *strategy* (placement constraints are for membership rules such as `distinctInstance`); Fargate tasks are spread across the subnets you provide automatically.
```hcl
# ECS service (EC2 launch type) — spread tasks across AZs
ordered_placement_strategy {
  type  = "spread"
  field = "attribute:ecs.availability-zone"
}
```
### Databases
RDS Multi-AZ creates a synchronous standby replica in a second AZ. Failover is automatic and typically completes in 60–120 seconds. For most applications, this is acceptable. For stricter requirements, Aurora typically fails over in under 30 seconds, and Aurora Global Database adds cross-region disaster recovery with a recovery point objective measured in seconds.
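One practical consequence of the failover window: application code should reconnect with backoff rather than crash when the database briefly disappears. A minimal sketch, assuming your driver exposes some `connect` function (the helper name and parameters are illustrative):

```typescript
// Retry-with-backoff sketch for riding out a Multi-AZ failover window.
async function connectWithRetry<T>(
  connect: () => Promise<T>,
  maxAttempts = 8,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // Exponential backoff: 0.5s, 1s, 2s, ... keeps the total wait
      // within the typical 60–120 second failover window.
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
}
```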
### Load Balancers
Application Load Balancers are inherently multi-AZ. Ensure your target groups have healthy instances in at least two AZs at all times.
## Health Checks That Actually Work
Bad health checks are a silent killer. If your health check passes when the application is degraded, your load balancer will keep sending traffic to broken instances.
### What a Good Health Check Looks Like
A health check endpoint should verify that the application can actually serve requests — not just that the process is running.
```typescript
// Bad: just returns 200
app.get('/health', (req, res) => res.json({ status: 'ok' }));

// Good: verifies critical dependencies
app.get('/health', async (req, res) => {
  const dbOk = await checkDatabaseConnection();
  const cacheOk = await checkRedisConnection();
  if (!dbOk || !cacheOk) {
    return res.status(503).json({
      status: 'degraded',
      db: dbOk,
      cache: cacheOk,
    });
  }
  res.json({ status: 'ok' });
});
```
Configure your load balancer to remove instances from rotation when the health check returns 503. This prevents traffic from reaching broken instances.
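One caveat worth encoding: a dependency check that hangs is as bad as one that fails, because the health endpoint then times out instead of returning a clean 503. Bound every check with a timeout; a sketch (the `withTimeout` helper is illustrative):

```typescript
// Bound a dependency check: resolve false if it throws or does not
// answer within `ms` milliseconds, so the health endpoint stays fast.
async function withTimeout(
  check: () => Promise<boolean>,
  ms: number,
): Promise<boolean> {
  const timeout = new Promise<boolean>((resolve) =>
    setTimeout(() => resolve(false), ms),
  );
  try {
    return await Promise.race([check(), timeout]);
  } catch {
    return false; // a throwing check counts as unhealthy, not a 500
  }
}
```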
## Monitoring and Alerting
You cannot fix what you cannot see. Comprehensive observability is a prerequisite for high uptime.
### The Four Golden Signals
Monitor these four metrics for every service:
1. **Latency** — How long requests take. Track p50, p95, and p99.
2. **Traffic** — Requests per second. Sudden drops are often more alarming than spikes.
3. **Errors** — Error rate as a percentage of total requests.
4. **Saturation** — How full your resources are (CPU, memory, connection pool).
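A sketch of computing the first three signals from raw request samples, using simple nearest-rank percentiles (function names are illustrative; real systems compute these in the metrics backend):

```typescript
// Nearest-rank percentile over a sorted array — simple and good
// enough for dashboards.
function percentile(sorted: number[], p: number): number {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

function goldenSignals(latenciesMs: number[], errors: number) {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
    errorRate: latenciesMs.length ? errors / latenciesMs.length : 0,
  };
}
```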
### Synthetic Monitoring
Real user monitoring tells you about problems after they happen. Synthetic monitors check your application proactively, every minute, from multiple regions.
Configure synthetic monitors for:

- Your homepage (basic availability)
- Your login flow (authentication stack)
- Your most critical user journey (checkout, core feature)
- Your API health endpoint

Alert when any synthetic check fails from two or more regions simultaneously. Single-region failures are often network issues, not application problems.
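The two-region rule can be encoded directly in the alerting path; a minimal sketch (the region names and `shouldAlert` helper are illustrative):

```typescript
// Alert only when a synthetic check fails from >= `quorum` regions
// at once — a single-region failure is usually network noise.
function shouldAlert(results: Record<string, boolean>, quorum = 2): boolean {
  const failures = Object.values(results).filter((ok) => !ok).length;
  return failures >= quorum;
}
```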
### Alert Fatigue
Too many alerts is as dangerous as too few. Engineers who receive dozens of alerts per day start ignoring them. Every alert should be:
- **Actionable** — There is something a human can do right now
- **Urgent** — It requires attention within the SLA window
- **Accurate** — It fires when there is a real problem, not on transient noise
## Incident Response
Even with the best infrastructure, incidents happen. How you respond determines whether a five-minute blip becomes a 30-minute outage.
### Runbooks
Every alert should have a corresponding runbook — a documented procedure for diagnosing and resolving the issue. Runbooks should be:
- Accessible without logging into anything (they are needed during outages)
- Step-by-step, not conceptual
- Tested by someone other than the person who wrote them
### Incident Severity Levels
Define severity levels before you need them:
| Level | Definition | Response time |
|-------|------------|---------------|
| P1 | Complete outage or data loss | 5 minutes |
| P2 | Significant degradation affecting >10% of users | 15 minutes |
| P3 | Minor degradation, workaround available | 1 hour |
| P4 | Non-urgent issue | Next business day |
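Encoding the table in code keeps paging logic and dashboards in agreement about what each level means; a sketch (the 24-hour value for P4 approximates "next business day"):

```typescript
// Severity levels and response-time SLAs from the table above.
type Severity = "P1" | "P2" | "P3" | "P4";

const RESPONSE_TIME_MINUTES: Record<Severity, number> = {
  P1: 5,       // complete outage or data loss
  P2: 15,      // significant degradation, >10% of users
  P3: 60,      // minor degradation, workaround available
  P4: 24 * 60, // non-urgent — next business day, approximated as 24h
};
```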
### Post-Mortems
Every P1 and P2 incident should have a blameless post-mortem within 48 hours. The goal is to understand what happened and prevent recurrence — not to assign blame.
A good post-mortem answers:
- What happened and when?
- What was the user impact?
- What caused it?
- What did we do to resolve it?
- What are we changing to prevent it happening again?
## Chaos Engineering
The only way to know your system is resilient is to test it. Chaos engineering deliberately introduces failures to verify that your recovery mechanisms work.
Start small:

- Terminate a random ECS task and verify the service recovers
- Simulate an AZ failure by stopping all instances in one AZ
- Introduce artificial latency on your database connection and verify circuit breakers trip

Run chaos experiments in staging first. Graduate to production only when you have high confidence in your recovery mechanisms.
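For the latency experiment, the thing under test is the circuit breaker itself. A minimal consecutive-failure breaker, as a sketch only (real implementations add timeouts and half-open probing):

```typescript
// Minimal circuit-breaker sketch: after `threshold` consecutive
// failures the breaker opens and callers fail fast instead of
// piling up slow requests.
class CircuitBreaker {
  private consecutiveFailures = 0;

  constructor(private threshold = 3) {}

  get open(): boolean {
    return this.consecutiveFailures >= this.threshold;
  }

  record(success: boolean): void {
    this.consecutiveFailures = success ? 0 : this.consecutiveFailures + 1;
  }
}
```

The chaos experiment then becomes an assertion: inject latency, treat slow calls as failures, and verify the breaker opens before the request queue saturates.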
## The Uptime Stack
The infrastructure components that together enable 99.99% uptime:

- **Blue-green deployments** — eliminate deployment downtime
- **Multi-AZ compute and databases** — survive AZ failures
- **Meaningful health checks** — remove broken instances automatically
- **Synthetic monitoring** — detect problems before users do
- **Defined runbooks** — resolve incidents faster
- **Regular chaos testing** — verify resilience before incidents do

None of these are exotic. They are engineering fundamentals applied consistently. The teams that hit four nines are not doing anything magical — they are just doing the basics very well.