Traditional monitoring approaches fail in cloud-native environments. The shift from monoliths to microservices, from static infrastructure to dynamic containers, and from known failure modes to emergent behaviors requires a fundamentally different approach. This post explores how monitoring has evolved into observability and what that means for running production systems.
The Limitations of Traditional Monitoring
Traditional monitoring relied on predefined dashboards and threshold-based alerts:
# Traditional monitoring configuration
alerts:
- name: HighCPU
condition: cpu_usage > 80%
duration: 5m
action: page_oncall
- name: DiskFull
condition: disk_usage > 90%
duration: 1m
action: alert_team
This works when:
- You know what metrics matter
- Systems behave predictably
- Failures are well-understood
- Infrastructure is relatively static
In cloud-native environments, these assumptions break down:
- Hundreds of services generate millions of metrics
- Containers come and go constantly
- Failures emerge from complex interactions
- You can’t predict every problem
The Three Pillars of Observability
Modern observability rests on three pillars:
1. Metrics
Time-series numerical data about system state:
import (
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"method", "endpoint"},
)
activeConnections = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
)
func metricsMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
activeConnections.Inc()
defer activeConnections.Dec()
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
next.ServeHTTP(wrapped, r)
duration := time.Since(start).Seconds()
httpRequestsTotal.WithLabelValues(
r.Method,
r.URL.Path,
fmt.Sprintf("%d", wrapped.statusCode),
).Inc()
httpRequestDuration.WithLabelValues(
r.Method,
r.URL.Path,
).Observe(duration)
})
}
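The middleware wraps the ResponseWriter so it can capture the status code the handler writes. The responseWriter type isn't shown above; a minimal version of that standard wrapper looks like this:
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

// WriteHeader records the status code before delegating to the wrapped writer.
func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}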
2. Logs
Discrete events with context:
import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)
var log = logrus.New()
func init() {
log.SetFormatter(&logrus.JSONFormatter{})
log.SetLevel(logrus.InfoLevel)
}
func (s *Service) ProcessOrder(ctx context.Context, order *Order) error {
logger := log.WithFields(logrus.Fields{
"order_id": order.ID,
"user_id": order.UserID,
"total": order.Total,
"trace_id": getTraceID(ctx),
"request_id": getRequestID(ctx),
})
logger.Info("Processing order")
if err := s.validateOrder(order); err != nil {
logger.WithError(err).Warn("Order validation failed")
return err
}
if err := s.chargePayment(ctx, order); err != nil {
logger.WithError(err).Error("Payment processing failed")
return err
}
logger.WithFields(logrus.Fields{
"duration_ms": time.Since(order.CreatedAt).Milliseconds(),
}).Info("Order processed successfully")
return nil
}
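The getTraceID and getRequestID helpers aren't shown above; a minimal sketch, assuming OpenTelemetry spans and a request ID stored in the context by upstream middleware (the "request_id" key is illustrative):
import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// getTraceID returns the current OpenTelemetry trace ID, if any.
func getTraceID(ctx context.Context) string {
	if sc := trace.SpanFromContext(ctx).SpanContext(); sc.IsValid() {
		return sc.TraceID().String()
	}
	return ""
}

// getRequestID returns whatever request ID the HTTP middleware stored in the context.
func getRequestID(ctx context.Context) string {
	if id, ok := ctx.Value("request_id").(string); ok {
		return id
	}
	return ""
}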
3. Distributed Traces
Request flow across services:
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)
var tracer = otel.Tracer("order-service")
func (s *Service) ProcessOrder(ctx context.Context, order *Order) error {
ctx, span := tracer.Start(ctx, "ProcessOrder")
defer span.End()
span.SetAttributes(
attribute.String("order.id", order.ID),
attribute.String("user.id", order.UserID),
attribute.Float64("order.total", order.Total),
)
if err := s.validateOrder(ctx, order); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "validation failed")
return err
}
if err := s.chargePayment(ctx, order); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "payment failed")
return err
}
span.SetStatus(codes.Ok, "order processed")
return nil
}
func (s *Service) chargePayment(ctx context.Context, order *Order) error {
ctx, span := tracer.Start(ctx, "ChargePayment")
defer span.End()
// Make external API call
resp, err := s.paymentClient.Charge(ctx, &PaymentRequest{
Amount: order.Total,
Method: order.PaymentMethod,
})
if err != nil {
span.RecordError(err)
return err
}
span.SetAttributes(
attribute.String("transaction.id", resp.TransactionID),
attribute.String("status", resp.Status),
)
return nil
}
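Spans only connect across services if the trace context travels with outgoing requests. Instrumented clients from the OpenTelemetry contrib packages do this automatically; for a plain HTTP call, a minimal sketch using the W3C Trace Context propagator (callDownstream is illustrative):
import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Register the W3C traceparent/tracestate propagator globally.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// callDownstream makes an HTTP request that carries the current trace context.
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// Inject the active span context into the outgoing headers so the
	// downstream service can continue the same trace.
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}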
Setting Up Prometheus for Metrics
Deploy Prometheus in Kubernetes:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
Application pod annotations:
apiVersion: v1
kind: Pod
metadata:
name: myapp
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: app
image: myapp:latest
ports:
- containerPort: 8080
name: metrics
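For those annotations to be useful, the container has to actually serve metrics on port 8080 at /metrics. With the Go client this is just the promhttp handler, since the promauto constructors register with the default registry; a minimal sketch:
import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	// Serves all metrics registered via promauto / the default registry.
	mux.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", mux))
}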
Structured Logging with EFK Stack
Deploy Elasticsearch, Fluentd, and Kibana:
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: logging
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_key time
time_format %Y-%m-%dT%H:%M:%S.%NZ
keep_time_key true
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
</filter>
<filter kubernetes.**>
@type parser
key_name log
reserve_data true
<parse>
@type json
</parse>
</filter>
<match kubernetes.**>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
logstash_format true
logstash_prefix kubernetes
include_tag_key true
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes.system.buffer
flush_mode interval
retry_type exponential_backoff
flush_interval 5s
retry_forever true
retry_max_interval 30
</buffer>
</match>
Application logging best practices:
type Logger struct {
logger *logrus.Logger
}
func NewLogger() *Logger {
logger := logrus.New()
logger.SetFormatter(&logrus.JSONFormatter{
TimestampFormat: time.RFC3339Nano,
FieldMap: logrus.FieldMap{
logrus.FieldKeyTime: "timestamp",
logrus.FieldKeyLevel: "severity",
logrus.FieldKeyMsg: "message",
},
})
return &Logger{logger: logger}
}
func (l *Logger) WithContext(ctx context.Context) *logrus.Entry {
entry := logrus.NewEntry(l.logger)
// Add trace context
if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
entry = entry.WithFields(logrus.Fields{
"trace_id": span.SpanContext().TraceID().String(),
"span_id": span.SpanContext().SpanID().String(),
})
}
// Add request context
if reqID := ctx.Value("request_id"); reqID != nil {
entry = entry.WithField("request_id", reqID)
}
if userID := ctx.Value("user_id"); userID != nil {
entry = entry.WithField("user_id", userID)
}
return entry
}
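Handlers then log through the context-aware entry so every line automatically carries trace and request identifiers; for example (OrderHandler is a hypothetical type wired with the Logger above):
// OrderHandler is illustrative; it just needs access to the Logger.
type OrderHandler struct {
	log *Logger
}

func (h *OrderHandler) CreateOrder(ctx context.Context, orderID string) {
	// trace_id, span_id, request_id, and user_id are added automatically
	// when present in the context.
	h.log.WithContext(ctx).WithField("order_id", orderID).Info("Creating order")
}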
Distributed Tracing with Jaeger
Deploy Jaeger:
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: observability
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:latest
env:
- name: COLLECTOR_ZIPKIN_HTTP_PORT
value: "9411"
ports:
- containerPort: 5775
protocol: UDP
- containerPort: 6831
protocol: UDP
- containerPort: 6832
protocol: UDP
- containerPort: 5778
protocol: TCP
- containerPort: 16686
protocol: TCP
- containerPort: 14268
protocol: TCP
- containerPort: 9411
protocol: TCP
Initialize tracing in application:
import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
jaeger.WithEndpoint("http://jaeger.observability:14268/api/traces"),
))
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String(serviceName),
attribute.String("environment", "production"),
)),
sdktrace.WithSampler(sdktrace.AlwaysSample()),
)
otel.SetTracerProvider(tp)
return tp, nil
}
func main() {
tp, err := initTracer("order-service")
if err != nil {
log.Fatal(err)
}
defer func() {
if err := tp.Shutdown(context.Background()); err != nil {
log.Printf("Error shutting down tracer provider: %v", err)
}
}()
// Application code
}
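Registering the tracer provider doesn't by itself create spans for incoming requests. One way to do that for an HTTP service is the contrib otelhttp wrapper, which extracts any incoming trace context and starts a server span per request; a minimal sketch (the server setup is illustrative):
import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func newServer(mux *http.ServeMux) *http.Server {
	return &http.Server{
		Addr: ":8080",
		// Wraps every request in a server span named after the service.
		Handler: otelhttp.NewHandler(mux, "order-service"),
	}
}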
Correlating Metrics, Logs, and Traces
The real power comes from connecting all three:
func (s *Service) ProcessRequest(ctx context.Context, req *Request) (*Response, error) {
// Start trace span
ctx, span := tracer.Start(ctx, "ProcessRequest")
defer span.End()
// Extract trace context for logging
traceID := span.SpanContext().TraceID().String()
spanID := span.SpanContext().SpanID().String()
// Create logger with trace context
logger := log.WithFields(logrus.Fields{
"trace_id": traceID,
"span_id": spanID,
"request_id": req.ID,
})
// Record metric
start := time.Now()
defer func() {
duration := time.Since(start).Seconds()
requestDuration.WithLabelValues(
req.Method,
req.Path,
).Observe(duration)
}()
logger.Info("Processing request")
// Business logic
result, err := s.doWork(ctx, req)
if err != nil {
logger.WithError(err).Error("Request processing failed")
span.RecordError(err)
requestErrors.WithLabelValues(req.Method, req.Path).Inc()
return nil, err
}
	logger.WithFields(logrus.Fields{
		"duration_ms": time.Since(start).Milliseconds(),
	}).Info("Request processed successfully")
return result, nil
}
In Grafana, you can now:
- See a spike in error metrics
- Click to view logs filtered by time range
- Find the trace_id in logs
- Jump to Jaeger to see the full trace
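One concrete way to wire up the metrics-to-traces jump is exemplars, which attach a trace ID to individual histogram observations. A sketch with the Prometheus Go client (observeWithExemplar is illustrative; exemplars also require serving the OpenMetrics format and enabling exemplar storage in Prometheus):
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

// observeWithExemplar records a latency observation and, when a span is
// active, tags it with the trace ID so Grafana can link straight to the trace.
func observeWithExemplar(ctx context.Context, obs prometheus.Observer, start time.Time) {
	d := time.Since(start).Seconds()
	if sc := trace.SpanFromContext(ctx).SpanContext(); sc.IsValid() {
		if eo, ok := obs.(prometheus.ExemplarObserver); ok {
			eo.ObserveWithExemplar(d, prometheus.Labels{"trace_id": sc.TraceID().String()})
			return
		}
	}
	obs.Observe(d)
}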
Service Level Objectives (SLOs)
Define what “good” means:
# SLO configuration
slos:
- name: api-availability
objective: 99.9
window: 30d
sli:
query: |
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
- name: api-latency
objective: 99
threshold: 500ms
window: 30d
sli:
query: |
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m]))
Calculate error budget:
type SLO struct {
Name string
Objective float64 // 99.9% = 0.999
Window time.Duration
}
func (slo *SLO) ErrorBudget() time.Duration {
allowedDowntime := slo.Window.Seconds() * (1 - slo.Objective)
return time.Duration(allowedDowntime) * time.Second
}
func (slo *SLO) RemainingBudget(actualUptime float64) time.Duration {
actualDowntime := slo.Window.Seconds() * (1 - actualUptime)
budget := slo.ErrorBudget().Seconds()
remaining := budget - actualDowntime
if remaining < 0 {
return 0
}
return time.Duration(remaining) * time.Second
}
// Example: 99.9% over 30 days
slo := &SLO{
Name: "API Availability",
Objective: 0.999,
Window: 30 * 24 * time.Hour,
}
// Error budget: 43.2 minutes per month
fmt.Printf("Error budget: %v\n", slo.ErrorBudget())
// If actual uptime is 99.95%
remaining := slo.RemainingBudget(0.9995)
fmt.Printf("Remaining budget: %v\n", remaining)
Alerting on SLOs
Alert when burning error budget too fast:
groups:
- name: slo-alerts
interval: 30s
rules:
# Fast burn: 2% of the 30-day error budget consumed in 1 hour (a 14.4x burn rate for a 99.9% SLO)
- alert: ErrorBudgetBurnRateFast
expr: |
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
) > 0.0144
for: 5m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error rate over the last hour is {{ $value | humanizePercentage }}"
# Slow burn: 10% of the 30-day error budget consumed in 6 hours (a 12x burn rate for a 99.9% SLO)
- alert: ErrorBudgetBurnRateSlow
expr: |
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
)
) > 0.012
for: 30m
labels:
severity: warning
annotations:
summary: "Slow error budget burn detected"
Observability-Driven Development
Build observability in from the start:
type UserService struct {
db *sql.DB
cache Cache
logger *Logger
tracer trace.Tracer
}
func (s *UserService) GetUser(ctx context.Context, id string) (*User, error) {
// Create span
ctx, span := s.tracer.Start(ctx, "GetUser",
trace.WithAttributes(attribute.String("user.id", id)),
)
defer span.End()
// Create logger
logger := s.logger.WithContext(ctx).WithField("user_id", id)
// Try cache first
cacheStart := time.Now()
if user, err := s.cache.Get(ctx, id); err == nil {
cacheHits.Inc()
span.AddEvent("cache_hit")
logger.Debug("User found in cache")
return user, nil
}
cacheMisses.Inc()
span.AddEvent("cache_miss")
cacheDuration.Observe(time.Since(cacheStart).Seconds())
// Query database
dbStart := time.Now()
user, err := s.queryUser(ctx, id)
dbDuration.Observe(time.Since(dbStart).Seconds())
if err != nil {
dbErrors.Inc()
span.RecordError(err)
logger.WithError(err).Error("Failed to query user from database")
return nil, err
}
// Update cache asynchronously
go func() {
if err := s.cache.Set(context.Background(), id, user); err != nil {
logger.WithError(err).Warn("Failed to update cache")
}
}()
logger.Info("User retrieved from database")
return user, nil
}
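The example assumes the cache and database metrics (cacheHits, cacheMisses, cacheDuration, dbDuration, dbErrors) are registered elsewhere; a minimal sketch in the same promauto style as earlier (metric names are illustrative):
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	cacheHits = promauto.NewCounter(prometheus.CounterOpts{
		Name: "user_cache_hits_total",
		Help: "User lookups served from cache",
	})
	cacheMisses = promauto.NewCounter(prometheus.CounterOpts{
		Name: "user_cache_misses_total",
		Help: "User lookups that missed the cache",
	})
	cacheDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name: "user_cache_lookup_duration_seconds",
		Help: "Cache lookup latency",
	})
	dbDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name: "user_db_query_duration_seconds",
		Help: "Database query latency",
	})
	dbErrors = promauto.NewCounter(prometheus.CounterOpts{
		Name: "user_db_errors_total",
		Help: "Database query errors",
	})
)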
Cost-Effective Observability
Observability can get expensive. Strategies to control costs:
Sampling:
// Sample 10% of traces
sampler := sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1),
)
tp := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sampler),
)
Metric Aggregation:
# Prometheus recording rules
groups:
- name: aggregations
interval: 30s
rules:
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
- record: job:http_request_duration:p99
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
Log Sampling:
func (l *Logger) SampleDebug(rate float64, msg string, fields logrus.Fields) {
if rand.Float64() < rate {
l.logger.WithFields(fields).Debug(msg)
}
}
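Used, for example, to keep only a small fraction of high-volume debug lines (lookupFromCache and the 1% rate are illustrative):
func (s *UserService) lookupFromCache(ctx context.Context, id string) (*User, bool) {
	user, err := s.cache.Get(ctx, id)
	if err != nil {
		// Cache misses are frequent; keep roughly 1% of these debug lines.
		s.logger.SampleDebug(0.01, "cache miss", logrus.Fields{"user_id": id})
		return nil, false
	}
	return user, true
}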
Conclusion
Cloud-native observability is about understanding system behavior through:
- Metrics: What’s happening at scale
- Logs: Why specific events occurred
- Traces: How requests flow through the system
The goal isn’t to collect all possible data—it’s to have the right data to answer questions you haven’t thought of yet. Start with:
1. Instrument code with metrics, logs, and traces
2. Deploy Prometheus, EFK/Loki, and Jaeger
3. Correlate the three pillars via trace context
4. Define SLOs for your services
5. Alert on SLO violations, not arbitrary thresholds
6. Iterate based on incidents
Observability is not a destination—it’s a continuous practice of improving your ability to understand and debug complex systems.