Traditional monitoring approaches fail in cloud-native environments. The shift from monoliths to microservices, from static infrastructure to dynamic containers, and from known failure modes to emergent behaviors requires a fundamentally different approach. This post explores how monitoring has evolved into observability and what that means for running production systems.
The Limitations of Traditional Monitoring
Traditional monitoring relied on predefined dashboards and threshold-based alerts:
# Traditional monitoring configuration
alerts:
- name: HighCPU
condition: cpu_usage > 80%
duration: 5m
action: page_oncall
- name: DiskFull
condition: disk_usage > 90%
duration: 1m
action: alert_team
This works when:
- You know what metrics matter
- Systems behave predictably
- Failures are well-understood
- Infrastructure is relatively static
In cloud-native environments, these assumptions break down:
- Hundreds of services generate millions of metrics
- Containers come and go constantly
- Failures emerge from complex interactions
- You can’t predict every problem
The Three Pillars of Observability
Modern observability rests on three pillars:
1. Metrics
Time-series numerical data about system state:
import (
	"fmt"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)
var (
httpRequestsTotal = promauto.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
httpRequestDuration = promauto.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration",
Buckets: []float64{.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10},
},
[]string{"method", "endpoint"},
)
activeConnections = promauto.NewGauge(
prometheus.GaugeOpts{
Name: "active_connections",
Help: "Number of active connections",
},
)
)
func metricsMiddleware(next http.Handler) http.Handler {
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
start := time.Now()
activeConnections.Inc()
defer activeConnections.Dec()
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
next.ServeHTTP(wrapped, r)
duration := time.Since(start).Seconds()
httpRequestsTotal.WithLabelValues(
r.Method,
r.URL.Path,
fmt.Sprintf("%d", wrapped.statusCode),
).Inc()
httpRequestDuration.WithLabelValues(
r.Method,
r.URL.Path,
).Observe(duration)
})
}
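The middleware wraps the ResponseWriter so it can capture the status code the handler writes. The responseWriter type isn't shown above; a minimal version of that standard wrapper looks like this:
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

// WriteHeader records the status code before delegating to the wrapped writer.
func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}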
2. Logs
Discrete events with context:
import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)
var log = logrus.New()
func init() {
log.SetFormatter(&logrus.JSONFormatter{})
log.SetLevel(logrus.InfoLevel)
}
func (s *Service) ProcessOrder(ctx context.Context, order *Order) error {
logger := log.WithFields(logrus.Fields{
"order_id": order.ID,
"user_id": order.UserID,
"total": order.Total,
"trace_id": getTraceID(ctx),
"request_id": getRequestID(ctx),
})
logger.Info("Processing order")
if err := s.validateOrder(order); err != nil {
logger.WithError(err).Warn("Order validation failed")
return err
}
if err := s.chargePayment(ctx, order); err != nil {
logger.WithError(err).Error("Payment processing failed")
return err
}
logger.WithFields(logrus.Fields{
"duration_ms": time.Since(order.CreatedAt).Milliseconds(),
}).Info("Order processed successfully")
return nil
}
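The getTraceID and getRequestID helpers aren't shown above; a minimal sketch, assuming OpenTelemetry spans and a request ID stored in the context by upstream middleware (the "request_id" key is illustrative):
import (
	"context"

	"go.opentelemetry.io/otel/trace"
)

// getTraceID returns the current OpenTelemetry trace ID, if any.
func getTraceID(ctx context.Context) string {
	if sc := trace.SpanFromContext(ctx).SpanContext(); sc.IsValid() {
		return sc.TraceID().String()
	}
	return ""
}

// getRequestID returns whatever request ID the HTTP middleware stored in the context.
func getRequestID(ctx context.Context) string {
	if id, ok := ctx.Value("request_id").(string); ok {
		return id
	}
	return ""
}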
3. Distributed Traces
Request flow across services:
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)
var tracer = otel.Tracer("order-service")
func (s *Service) ProcessOrder(ctx context.Context, order *Order) error {
ctx, span := tracer.Start(ctx, "ProcessOrder")
defer span.End()
span.SetAttributes(
attribute.String("order.id", order.ID),
attribute.String("user.id", order.UserID),
attribute.Float64("order.total", order.Total),
)
if err := s.validateOrder(ctx, order); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "validation failed")
return err
}
if err := s.chargePayment(ctx, order); err != nil {
span.RecordError(err)
span.SetStatus(codes.Error, "payment failed")
return err
}
span.SetStatus(codes.Ok, "order processed")
return nil
}
func (s *Service) chargePayment(ctx context.Context, order *Order) error {
ctx, span := tracer.Start(ctx, "ChargePayment")
defer span.End()
// Make external API call
resp, err := s.paymentClient.Charge(ctx, &PaymentRequest{
Amount: order.Total,
Method: order.PaymentMethod,
})
if err != nil {
span.RecordError(err)
return err
}
span.SetAttributes(
attribute.String("transaction.id", resp.TransactionID),
attribute.String("status", resp.Status),
)
return nil
}
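Spans only connect across services if the trace context travels with outgoing requests. Instrumented clients from the OpenTelemetry contrib packages do this automatically; for a plain HTTP call, a minimal sketch using the W3C Trace Context propagator (callDownstream is illustrative):
import (
	"context"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
)

func init() {
	// Register the W3C traceparent/tracestate propagator globally.
	otel.SetTextMapPropagator(propagation.TraceContext{})
}

// callDownstream makes an HTTP request that carries the current trace context.
func callDownstream(ctx context.Context, url string) (*http.Response, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// Inject the active span context into the outgoing headers so the
	// downstream service can continue the same trace.
	otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
	return http.DefaultClient.Do(req)
}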
Setting Up Prometheus for Metrics
Deploy Prometheus in Kubernetes:
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
Application pod annotations:
apiVersion: v1
kind: Pod
metadata:
name: myapp
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: app
image: myapp:latest
ports:
- containerPort: 8080
name: metrics
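For those annotations to be useful, the container has to actually serve metrics on port 8080 at /metrics. With the Go client this is just the promhttp handler, since the promauto constructors register with the default registry; a minimal sketch:
import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	mux := http.NewServeMux()
	// Serves all metrics registered via promauto / the default registry.
	mux.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", mux))
}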
Structured Logging with EFK Stack
Deploy Elasticsearch, Fluentd, and Kibana:
apiVersion: v1
kind: ConfigMap
metadata:
name: fluentd-config
namespace: logging
data:
fluent.conf: |
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd-containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_key time
time_format %Y-%m-%dT%H:%M:%S.%NZ
keep_time_key true
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
</filter>
<filter kubernetes.**>
@type parser
key_name log
reserve_data true
<parse>
@type json
</parse>
</filter>
<match kubernetes.**>
@type elasticsearch
host elasticsearch.logging.svc.cluster.local
port 9200
logstash_format true
logstash_prefix kubernetes
include_tag_key true
<buffer>
@type file
path /var/log/fluentd-buffers/kubernetes.system.buffer
flush_mode interval
retry_type exponential_backoff
flush_interval 5s
retry_forever true
retry_max_interval 30
</buffer>
</match>
Application logging best practices:
type Logger struct {
logger *logrus.Logger
}
func NewLogger() *Logger {
logger := logrus.New()
logger.SetFormatter(&logrus.JSONFormatter{
TimestampFormat: time.RFC3339Nano,
FieldMap: logrus.FieldMap{
logrus.FieldKeyTime: "timestamp",
logrus.FieldKeyLevel: "severity",
logrus.FieldKeyMsg: "message",
},
})
return &Logger{logger: logger}
}
func (l *Logger) WithContext(ctx context.Context) *logrus.Entry {
entry := logrus.NewEntry(l.logger)
// Add trace context
if span := trace.SpanFromContext(ctx); span.SpanContext().IsValid() {
entry = entry.WithFields(logrus.Fields{
"trace_id": span.SpanContext().TraceID().String(),
"span_id": span.SpanContext().SpanID().String(),
})
}
// Add request context
if reqID := ctx.Value("request_id"); reqID != nil {
entry = entry.WithField("request_id", reqID)
}
if userID := ctx.Value("user_id"); userID != nil {
entry = entry.WithField("user_id", userID)
}
return entry
}
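Handlers then log through the context-aware entry so every line automatically carries trace and request identifiers; for example (OrderHandler is a hypothetical type wired with the Logger above):
// OrderHandler is illustrative; it just needs access to the Logger.
type OrderHandler struct {
	log *Logger
}

func (h *OrderHandler) CreateOrder(ctx context.Context, orderID string) {
	// trace_id, span_id, request_id, and user_id are added automatically
	// when present in the context.
	h.log.WithContext(ctx).WithField("order_id", orderID).Info("Creating order")
}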
Distributed Tracing with Jaeger
Deploy Jaeger:
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: observability
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:latest
env:
- name: COLLECTOR_ZIPKIN_HTTP_PORT
value: "9411"
ports:
- containerPort: 5775
protocol: UDP
- containerPort: 6831
protocol: UDP
- containerPort: 6832
protocol: UDP
- containerPort: 5778
protocol: TCP
- containerPort: 16686
protocol: TCP
- containerPort: 14268
protocol: TCP
- containerPort: 9411
protocol: TCP
Initialize tracing in application:
import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
func initTracer(serviceName string) (*sdktrace.TracerProvider, error) {
exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
jaeger.WithEndpoint("http://jaeger.observability:14268/api/traces"),
))
if err != nil {
return nil, err
}
tp := sdktrace.NewTracerProvider(
sdktrace.WithBatcher(exporter),
sdktrace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String(serviceName),
attribute.String("environment", "production"),
)),
sdktrace.WithSampler(sdktrace.AlwaysSample()),
)
otel.SetTracerProvider(tp)
return tp, nil
}
func main() {
tp, err := initTracer("order-service")
if err != nil {
log.Fatal(err)
}
defer func() {
if err := tp.Shutdown(context.Background()); err != nil {
log.Printf("Error shutting down tracer provider: %v", err)
}
}()
// Application code
}
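Registering the tracer provider doesn't by itself create spans for incoming requests. One way to do that for an HTTP service is the contrib otelhttp wrapper, which extracts any incoming trace context and starts a server span per request; a minimal sketch (the server setup is illustrative):
import (
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func newServer(mux *http.ServeMux) *http.Server {
	return &http.Server{
		Addr: ":8080",
		// Wraps every request in a server span named after the service.
		Handler: otelhttp.NewHandler(mux, "order-service"),
	}
}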
Correlating Metrics, Logs, and Traces
The real power comes from connecting all three:
func (s *Service) ProcessRequest(ctx context.Context, req *Request) (*Response, error) {
// Start trace span
ctx, span := tracer.Start(ctx, "ProcessRequest")
defer span.End()
// Extract trace context for logging
traceID := span.SpanContext().TraceID().String()
spanID := span.SpanContext().SpanID().String()
// Create logger with trace context
logger := log.WithFields(logrus.Fields{
"trace_id": traceID,
"span_id": spanID,
"request_id": req.ID,
})
// Record metric
start := time.Now()
defer func() {
duration := time.Since(start).Seconds()
requestDuration.WithLabelValues(
req.Method,
req.Path,
).Observe(duration)
}()
logger.Info("Processing request")
// Business logic
result, err := s.doWork(ctx, req)
if err != nil {
logger.WithError(err).Error("Request processing failed")
span.RecordError(err)
requestErrors.WithLabelValues(req.Method, req.Path).Inc()
return nil, err
}
	logger.WithFields(logrus.Fields{
		"duration_ms": time.Since(start).Milliseconds(),
	}).Info("Request processed successfully")
return result, nil
}
In Grafana, you can now:
- See a spike in error metrics
- Click to view logs filtered by time range
- Find the trace_id in logs
- Jump to Jaeger to see the full trace
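One concrete way to wire up the metrics-to-traces jump is exemplars, which attach a trace ID to individual histogram observations. A sketch with the Prometheus Go client (observeWithExemplar is illustrative; exemplars also require serving the OpenMetrics format and enabling exemplar storage in Prometheus):
import (
	"context"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"go.opentelemetry.io/otel/trace"
)

// observeWithExemplar records a latency observation and, when a span is
// active, tags it with the trace ID so Grafana can link straight to the trace.
func observeWithExemplar(ctx context.Context, obs prometheus.Observer, start time.Time) {
	d := time.Since(start).Seconds()
	if sc := trace.SpanFromContext(ctx).SpanContext(); sc.IsValid() {
		if eo, ok := obs.(prometheus.ExemplarObserver); ok {
			eo.ObserveWithExemplar(d, prometheus.Labels{"trace_id": sc.TraceID().String()})
			return
		}
	}
	obs.Observe(d)
}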
Service Level Objectives (SLOs)
Define what “good” means:
# SLO configuration
slos:
- name: api-availability
objective: 99.9
window: 30d
sli:
query: |
sum(rate(http_requests_total{status!~"5.."}[5m])) /
sum(rate(http_requests_total[5m]))
- name: api-latency
objective: 99
threshold: 500ms
window: 30d
sli:
query: |
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m]))
Calculate error budget:
type SLO struct {
Name string
Objective float64 // 99.9% = 0.999
Window time.Duration
}
func (slo *SLO) ErrorBudget() time.Duration {
allowedDowntime := slo.Window.Seconds() * (1 - slo.Objective)
return time.Duration(allowedDowntime) * time.Second
}
func (slo *SLO) RemainingBudget(actualUptime float64) time.Duration {
actualDowntime := slo.Window.Seconds() * (1 - actualUptime)
budget := slo.ErrorBudget().Seconds()
remaining := budget - actualDowntime
if remaining < 0 {
return 0
}
return time.Duration(remaining) * time.Second
}
// Example: 99.9% over 30 days
slo := &SLO{
Name: "API Availability",
Objective: 0.999,
Window: 30 * 24 * time.Hour,
}
// Error budget: 43.2 minutes per month
fmt.Printf("Error budget: %v\n", slo.ErrorBudget())
// If actual uptime is 99.95%
remaining := slo.RemainingBudget(0.9995)
fmt.Printf("Remaining budget: %v\n", remaining)
Alerting on SLOs
Alert when burning error budget too fast:
groups:
- name: slo-alerts
interval: 30s
rules:
# Fast burn: 2% of the 30-day error budget consumed in 1 hour (a 14.4x burn rate for a 99.9% SLO)
- alert: ErrorBudgetBurnRateFast
expr: |
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
) > 0.0144
for: 5m
labels:
severity: critical
annotations:
summary: "Fast error budget burn detected"
description: "Error rate over the last hour is {{ $value | humanizePercentage }}"
# Slow burn: 10% of the 30-day error budget consumed in 6 hours (a 12x burn rate for a 99.9% SLO)
- alert: ErrorBudgetBurnRateSlow
expr: |
(
1 - (
sum(rate(http_requests_total{status!~"5.."}[6h]))
/
sum(rate(http_requests_total[6h]))
)
) > 0.012
for: 30m
labels:
severity: warning
annotations:
summary: "Slow error budget burn detected"
Observability-Driven Development
Build observability in from the start:
type UserService struct {
db *sql.DB
cache Cache
logger *Logger
tracer trace.Tracer
}
func (s *UserService) GetUser(ctx context.Context, id string) (*User, error) {
// Create span
ctx, span := s.tracer.Start(ctx, "GetUser",
trace.WithAttributes(attribute.String("user.id", id)),
)
defer span.End()
// Create logger
logger := s.logger.WithContext(ctx).WithField("user_id", id)
// Try cache first
cacheStart := time.Now()
if user, err := s.cache.Get(ctx, id); err == nil {
cacheHits.Inc()
span.AddEvent("cache_hit")
logger.Debug("User found in cache")
return user, nil
}
cacheMisses.Inc()
span.AddEvent("cache_miss")
cacheDuration.Observe(time.Since(cacheStart).Seconds())
// Query database
dbStart := time.Now()
user, err := s.queryUser(ctx, id)
dbDuration.Observe(time.Since(dbStart).Seconds())
if err != nil {
dbErrors.Inc()
span.RecordError(err)
logger.WithError(err).Error("Failed to query user from database")
return nil, err
}
// Update cache asynchronously
go func() {
if err := s.cache.Set(context.Background(), id, user); err != nil {
logger.WithError(err).Warn("Failed to update cache")
}
}()
logger.Info("User retrieved from database")
return user, nil
}
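The example assumes the cache and database metrics (cacheHits, cacheMisses, cacheDuration, dbDuration, dbErrors) are registered elsewhere; a minimal sketch in the same promauto style as earlier (metric names are illustrative):
import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	cacheHits = promauto.NewCounter(prometheus.CounterOpts{
		Name: "user_cache_hits_total",
		Help: "User lookups served from cache",
	})
	cacheMisses = promauto.NewCounter(prometheus.CounterOpts{
		Name: "user_cache_misses_total",
		Help: "User lookups that missed the cache",
	})
	cacheDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name: "user_cache_lookup_duration_seconds",
		Help: "Cache lookup latency",
	})
	dbDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name: "user_db_query_duration_seconds",
		Help: "Database query latency",
	})
	dbErrors = promauto.NewCounter(prometheus.CounterOpts{
		Name: "user_db_errors_total",
		Help: "Database query errors",
	})
)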
Cost-Effective Observability
Observability can get expensive. Strategies to control costs:
Sampling:
// Sample 10% of traces
sampler := sdktrace.ParentBased(
sdktrace.TraceIDRatioBased(0.1),
)
tp := sdktrace.NewTracerProvider(
sdktrace.WithSampler(sampler),
)
Metric Aggregation:
# Prometheus recording rules
groups:
- name: aggregations
interval: 30s
rules:
- record: job:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (job)
- record: job:http_request_duration:p99
expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (job, le))
Log Sampling:
func (l *Logger) SampleDebug(rate float64, msg string, fields logrus.Fields) {
if rand.Float64() < rate {
l.logger.WithFields(fields).Debug(msg)
}
}
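Used, for example, to keep only a small fraction of high-volume debug lines (lookupFromCache and the 1% rate are illustrative):
func (s *UserService) lookupFromCache(ctx context.Context, id string) (*User, bool) {
	user, err := s.cache.Get(ctx, id)
	if err != nil {
		// Cache misses are frequent; keep roughly 1% of these debug lines.
		s.logger.SampleDebug(0.01, "cache miss", logrus.Fields{"user_id": id})
		return nil, false
	}
	return user, true
}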
Conclusion
Cloud-native observability is about understanding system behavior through:
- Metrics: What’s happening at scale
- Logs: Why specific events occurred
- Traces: How requests flow through the system
The goal isn’t to collect all possible data—it’s to have the right data to answer questions you haven’t thought of yet. Start with:
1. Instrument code with metrics, logs, and traces
2. Deploy Prometheus, EFK/Loki, and Jaeger
3. Correlate the three pillars via trace context
4. Define SLOs for your services
5. Alert on SLO violations, not arbitrary thresholds
6. Iterate based on incidents
Observability is not a destination—it’s a continuous practice of improving your ability to understand and debug complex systems.