Site Reliability Engineering (SRE) practices have become essential for operating reliable cloud-native services at scale. After implementing SRE principles across production systems, I’ve learned how to balance reliability with velocity and reduce operational toil.
Service Level Objectives
SLOs define the target reliability for a service. Start by identifying what matters to users:
Availability SLO:
slo:
  name: api-availability
  objective: 99.9%   # ~43 minutes of downtime per 30-day window
  window: 30d
  sli: |
    sum(rate(http_requests_total{code!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
Latency SLO:
slo:
  name: api-latency
  objective: 99%   # 99% of requests under the threshold
  threshold: 500ms
  window: 30d
  sli: |
    histogram_quantile(0.99,
      rate(http_request_duration_seconds_bucket[5m]))
    < 0.5
Error Budgets
Error budgets quantify acceptable unreliability:
type ErrorBudget struct {
    SLO    float64       // 99.9% = 0.999
    Window time.Duration // e.g. 30 days
}

func (eb *ErrorBudget) AllowedDowntime() time.Duration {
    return time.Duration(float64(eb.Window) * (1 - eb.SLO))
}

// Allowed downtime for a 99.9% SLO over 30 days: 43.2 minutes
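As a quick sanity check, here is a usage fragment for the type above (illustrative only; assumes fmt and time are imported):

// 99.9% over 30 days leaves 0.1% of 2,592,000 seconds ≈ 43 minutes.
eb := &ErrorBudget{SLO: 0.999, Window: 30 * 24 * time.Hour}
fmt.Println(eb.AllowedDowntime()) // 43m12s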
Alert on Symptoms, Not Causes
Alert when SLOs are at risk:
groups:
  - name: slo-alerts
    rules:
      - alert: ErrorBudgetBurnFast
        expr: |
          (1 - (
            sum(rate(http_requests_total{code!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) > 0.002
        for: 5m
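The 0.002 threshold is an error rate twice what a 99.9% SLO can tolerate, i.e. a burn rate of 2. A tiny helper makes that relationship explicit (a sketch; the function is illustrative, not from any library):

// burnRate reports how quickly the error budget is being consumed:
// 1 means exactly on budget; higher values mean burning faster than allowed.
func burnRate(errorRate, slo float64) float64 {
    return errorRate / (1 - slo)
}

// burnRate(0.002, 0.999) ≈ 2: sustained, this exhausts a 30-day budget in ~15 days.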
Reducing Toil
Automate repetitive tasks: replace manual processes with automated health checks and remediation. The Toil Reduction Strategies section below walks through a Kubernetes-based approach; a simpler starting point is sketched next.
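A periodic health check with automatic restart can replace a manual "SSH in and bounce the service" routine. This is a minimal sketch, assuming a systemd-managed service and an HTTP health endpoint (the endpoint URL and service name are placeholders):

package main

import (
    "log"
    "net/http"
    "os/exec"
    "time"
)

// checkAndRemediate probes a health endpoint and restarts the service if the
// check fails. The endpoint URL and service name are placeholders.
func checkAndRemediate(url, service string) {
    client := &http.Client{Timeout: 5 * time.Second}
    resp, err := client.Get(url)
    if resp != nil {
        defer resp.Body.Close()
    }
    if err == nil && resp.StatusCode == http.StatusOK {
        return // healthy, nothing to do
    }
    log.Printf("health check failed for %s; restarting %s", url, service)
    if out, err := exec.Command("systemctl", "restart", service).CombinedOutput(); err != nil {
        log.Printf("restart failed: %v (%s)", err, out)
    }
}

func main() {
    for range time.Tick(30 * time.Second) {
        checkAndRemediate("http://localhost:8080/healthz", "api")
    }
}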
On-Call Best Practices
Make on-call sustainable:
- Actionable Alerts: Include runbook links
- Escalation Policies: Clear escalation paths
- Post-Incident Reviews: Blameless postmortems
- Capacity Planning: Prevent incidents proactively
SRE Metrics
Track effectiveness:
# Toil ratio (should be < 50%)
sum(time_spent_on_toil_hours) / sum(total_engineering_hours)
# Mean time to recovery
avg(incident_resolution_duration_seconds)
Toil Reduction Strategies
Toil is repetitive, manual work that doesn’t provide enduring value. SRE teams should spend less than 50% of their time on toil:
// Automate common operations
type AutoHealer struct {
    k8s    kubernetes.Interface
    alerts AlertManager
}

func (ah *AutoHealer) HandlePodCrashLoop(pod *corev1.Pod) error {
    // Check crash loop history (guard against pods with no container status yet)
    if len(pod.Status.ContainerStatuses) == 0 {
        return nil
    }
    restartCount := pod.Status.ContainerStatuses[0].RestartCount
    if restartCount > 5 {
        // Collect diagnostics
        logs := ah.getPodLogs(pod)
        events := ah.getPodEvents(pod)

        // Create incident
        incident := ah.alerts.CreateIncident(&Incident{
            Title:    fmt.Sprintf("Pod %s in crash loop", pod.Name),
            Severity: "high",
            Logs:     logs,
            Events:   events,
            Runbook:  "https://runbooks.example.com/crash-loop",
        })

        // Attempt automatic remediation
        if ah.canAutoRemediate(pod) {
            return ah.remediate(pod)
        }

        // Escalate to on-call
        return ah.alerts.Escalate(incident)
    }
    return nil
}

func (ah *AutoHealer) remediate(pod *corev1.Pod) error {
    // Common remediation: scale the owning Deployment down, wait, then back up
    deployment := ah.getDeployment(pod)

    // Scale down by one replica
    *deployment.Spec.Replicas = *deployment.Spec.Replicas - 1
    updated, err := ah.k8s.AppsV1().Deployments(pod.Namespace).Update(
        context.TODO(),
        deployment,
        metav1.UpdateOptions{},
    )
    if err != nil {
        return err
    }

    time.Sleep(30 * time.Second)

    // Scale back up, reusing the object returned by the first update so the
    // second update carries the current resourceVersion
    *updated.Spec.Replicas = *updated.Spec.Replicas + 1
    _, err = ah.k8s.AppsV1().Deployments(pod.Namespace).Update(
        context.TODO(),
        updated,
        metav1.UpdateOptions{},
    )
    return err
}
Capacity Planning
Proactive capacity management prevents incidents:
type CapacityPlanner struct {
    prometheus PrometheusClient
}

func (cp *CapacityPlanner) PredictCapacity() (*CapacityForecast, error) {
    // Get historical CPU usage
    query := `
        avg_over_time(
            node_cpu_usage_percent[7d]
        )
    `
    result, err := cp.prometheus.Query(query)
    if err != nil {
        return nil, err
    }

    // Calculate growth rate
    growthRate := cp.calculateGrowthRate(result)

    // Project future needs
    currentCapacity := cp.getCurrentCapacity()
    daysUntilLimit := cp.calculateDaysUntilLimit(
        currentCapacity,
        growthRate,
        0.80, // Alert at 80% capacity
    )

    return &CapacityForecast{
        CurrentUsage:      result.Average,
        GrowthRate:        growthRate,
        DaysUntilLimit:    daysUntilLimit,
        RecommendedAction: cp.getRecommendation(daysUntilLimit),
    }, nil
}

func (cp *CapacityPlanner) getRecommendation(daysUntilLimit int) string {
    switch {
    case daysUntilLimit < 7:
        return "URGENT: Add capacity immediately"
    case daysUntilLimit < 30:
        return "Plan capacity addition within 2 weeks"
    case daysUntilLimit < 90:
        return "Schedule capacity review"
    default:
        return "Capacity adequate"
    }
}
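The projection itself is simple: under a linear-growth assumption, days-until-limit is the remaining headroom divided by the daily growth. A sketch of that calculation (a standalone function with my own parameter conventions, expressed as fractions of capacity; not necessarily the method signature used above):

// daysUntilLimit projects how many days remain before usage crosses the limit,
// assuming growth stays linear. currentUsage and limit are fractions of total
// capacity (e.g. 0.62 and 0.80); growthPerDay is the daily increase in that fraction.
// A negative result means usage is flat or shrinking, so no limit is projected.
func daysUntilLimit(currentUsage, growthPerDay, limit float64) int {
    if growthPerDay <= 0 {
        return -1
    }
    if currentUsage >= limit {
        return 0
    }
    return int((limit - currentUsage) / growthPerDay)
}

// Example: 62% used, growing 0.5 points per day, 80% alert threshold -> 36 days.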
Postmortem Culture
Blameless postmortems drive continuous improvement:
# Incident Postmortem Template
## Summary
Brief description of what happened and impact
## Timeline
- 14:32 UTC: Alert fired for elevated error rate
- 14:35 UTC: On-call engineer paged
- 14:40 UTC: Investigation started
- 15:10 UTC: Root cause identified
- 15:25 UTC: Fix deployed
- 15:30 UTC: Service recovered
## Root Cause
Database connection pool exhausted due to slow queries
## Impact
- Duration: 58 minutes
- Error rate: 12% of requests failed
- Users affected: ~5,000
- Revenue impact: $2,500 estimated
## What Went Well
- Automated alerts fired promptly
- Clear runbooks accelerated diagnosis
- Rollback procedure worked smoothly
## What Went Wrong
- Database query performance not monitored
- Connection pool limits not tuned for load
- No circuit breaker to protect downstream services
## Action Items
1. Add slow query monitoring [Owner: @alice, Due: 2019-08-01]
2. Implement connection pool auto-tuning [Owner: @bob, Due: 2019-08-15]
3. Deploy circuit breaker pattern [Owner: @carol, Due: 2019-08-30]
4. Update runbook with new findings [Owner: @dave, Due: 2019-07-26]
## Lessons Learned
- Synthetic monitoring didn't catch this scenario
- Need end-to-end performance testing under load
- Database observability has gaps
On-Call Best Practices
Make on-call sustainable and effective:
Runbook Automation
# PagerDuty runbook integration
runbooks:
  high_error_rate:
    title: "High Error Rate Investigation"
    steps:
      - name: "Check service health"
        command: "kubectl get pods -n production"
      - name: "View recent logs"
        command: "kubectl logs -n production deployment/api --tail=100"
      - name: "Check dependencies"
        command: "curl -s https://status.example.com/api/health"
      - name: "View metrics dashboard"
        url: "https://grafana.example.com/d/api-overview"
      - name: "Escalation"
        instructions: |
          If error rate > 10% for > 10 minutes:
          1. Notify #incidents channel
          2. Page secondary on-call
          3. Consider rolling back recent deployments
Alert Fatigue Prevention
# Good: Alert on user-impacting issues
alert: HighErrorRate
expr: |
  sum(rate(http_requests_total{code=~"5.."}[5m]))
  /
  sum(rate(http_requests_total[5m]))
  > 0.05
for: 5m
labels:
  severity: critical
annotations:
  summary: "Error rate is {{ $value | humanizePercentage }}"
  runbook: "https://runbooks.example.com/high-error-rate"

# Bad: Alert on non-actionable metrics
# alert: HighCPU
# expr: node_cpu_usage > 0.80
# This fires frequently but may not indicate problems
On-Call Rotation
Implement fair, sustainable rotations:
# On-call schedule
rotations:
  primary:
    schedule: "weekly"
    handoff: "Monday 10:00 AM"
    members:
      - alice
      - bob
      - carol
      - dave
  secondary:
    schedule: "weekly"
    handoff: "Monday 10:00 AM"
    members:
      - eve
      - frank
      - grace
      - henry

escalation_policy:
  - level: 1
    type: "primary"
    timeout: "5m"
  - level: 2
    type: "secondary"
    timeout: "10m"
  - level: 3
    type: "manager"
    timeout: "15m"
SRE Metrics and KPIs
Track team effectiveness:
# Availability
100 * (
sum(rate(http_requests_total{code!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
)
# Mean Time to Detect (MTTD)
avg(alert_timestamp - incident_start_timestamp)
# Mean Time to Resolve (MTTR)
avg(incident_end_timestamp - incident_start_timestamp)
# Change Failure Rate
sum(failed_deployments) / sum(total_deployments)
# Deployment Frequency
count(deployments_total) / days
# Toil Ratio
sum(toil_hours) / sum(total_engineering_hours)
Track against industry benchmarks:
- Elite performers: Deploy multiple times per day, <1 hour MTTR, <15% change failure rate
- High performers: Deploy weekly, <1 day MTTR, <15% change failure rate
- Medium performers: Deploy monthly, <1 week MTTR, <30% change failure rate
Service Level Indicators (SLIs)
Choose SLIs that matter to users:
Availability SLI
# Request success rate
sum(rate(http_requests_total{code!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
Latency SLI
# 99th percentile latency
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
)
Correctness SLI
# Data pipeline success rate
sum(rate(pipeline_tasks_total{status="success"}[5m]))
/
sum(rate(pipeline_tasks_total[5m]))
Freshness SLI
# Data age (for batch systems)
time() - max(data_last_updated_timestamp)
Error Budget Policy
Define what happens when error budget is exhausted:
error_budget_policy:
  slo: 99.9%
  window: 30d
  actions:
    - threshold: "100%"   # Budget exhausted
      actions:
        - "Freeze all feature launches"
        - "Focus entirely on reliability"
        - "Daily leadership updates required"
    - threshold: "50%"    # Half of budget consumed
      actions:
        - "Increase testing requirements"
        - "Mandatory blameless postmortems"
        - "Weekly reliability review"
    - threshold: "25%"    # Quarter of budget consumed
      actions:
        - "Heightened change review"
        - "Monitor error budget daily"
Conclusion
Effective Site Reliability Engineering requires:
- Clear SLOs that reflect user experience
- Error budgets that balance reliability and velocity
- Symptom-based alerting that reduces noise
- Toil automation to free up engineering time
- Sustainable on-call practices with clear runbooks
- Blameless postmortems that drive learning
- Capacity planning to prevent incidents
- Meaningful metrics that track team effectiveness
Start by defining SLOs for your most critical service, measure your error budget, and use it to drive prioritization decisions. Build a culture of reliability where incidents are learning opportunities, toil is actively eliminated, and engineering time is invested in improvements that prevent future incidents.
SRE is not just about tools and metrics—it’s about cultural change that values reliability as a feature, treats operational work as engineering problems, and continuously improves system resilience through thoughtful automation and observation.