Service mesh provides deep, uniform observability into service-to-service communication without requiring application code changes. After instrumenting production systems with service mesh observability, I’ve learned how to extract maximum value from the metrics, traces, and visualizations these systems provide.

The Observability Promise

Service mesh proxies (like Envoy) sit in the request path and automatically collect:

  • Golden Signals: Request rate, error rate, latency, saturation
  • Service Topology: Who talks to whom
  • Distributed Traces: Request flow across services
  • Traffic Patterns: Protocol types, response sizes, retry behavior

All without modifying application code.
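
For these metrics to land in Prometheus, each sidecar exposes a merged metrics endpoint that a standard annotation-based scrape job can discover. A minimal sketch, assuming Istio's default metrics merging is enabled and your Prometheus honors the usual prometheus.io annotations (the workload name and image are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: frontend                             # illustrative workload
  labels:
    app: frontend
  annotations:
    # Istio's sidecar injector normally adds these when metrics merging is enabled;
    # shown explicitly here so the scrape path is clear.
    prometheus.io/scrape: "true"
    prometheus.io/port: "15020"              # merged application + Envoy metrics
    prometheus.io/path: "/stats/prometheus"
spec:
  containers:
    - name: app
      image: frontend:latest                 # illustrative image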

Metrics That Matter

Service mesh generates thousands of metrics. Focus on these:

Request Metrics

# Request rate by service
sum(rate(istio_requests_total[5m])) by (destination_service_name)

# Error rate
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/
sum(rate(istio_requests_total[5m])) by (destination_service_name)

# Latency percentiles
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket[5m]))
    by (destination_service_name, le)
)
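
These expressions end up repeated across dashboards and alerts, so it can help to precompute them as Prometheus recording rules. A minimal sketch; the rule names are my own convention, not anything Istio ships:

groups:
  - name: service-mesh-recording-rules
    interval: 30s
    rules:
      # Per-service request rate
      - record: service:istio_requests:rate5m
        expr: sum(rate(istio_requests_total[5m])) by (destination_service_name)
      # Per-service 5xx ratio
      - record: service:istio_request_errors:ratio5m
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
          /
          sum(rate(istio_requests_total[5m])) by (destination_service_name)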

Traffic Flow Metrics

# Traffic between specific services
sum(rate(istio_requests_total{
  source_workload="frontend",
  destination_service_name="api-service"
}[5m]))

# Cross-namespace traffic
sum(rate(istio_requests_total[5m]))
  by (source_workload_namespace, destination_service_namespace)

Connection Pool Metrics

# Active connections
envoy_cluster_upstream_cx_active

# Connection failures
rate(envoy_cluster_upstream_cx_connect_fail[5m])

# Connection timeouts
rate(envoy_cluster_upstream_cx_connect_timeout[5m])
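
These connection gauges are easiest to interpret alongside the limits you have actually configured. A hedged sketch of a DestinationRule that sets the connection pool and outlier detection those Envoy stats track (the host and numbers are illustrative):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service
spec:
  host: api-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # upper bound for upstream_cx_active
      http:
        http1MaxPendingRequests: 64  # upper bound for upstream_rq_pending_active
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s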

Grafana Dashboards

Build comprehensive dashboards:

{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "sum(rate(istio_requests_total[5m])) by (destination_service_name)"
        }],
        "type": "graph"
      },
      {
        "title": "Error Rate by Service",
        "targets": [{
          "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) by (destination_service_name)"
        }],
        "type": "graph"
      },
      {
        "title": "P99 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (destination_service_name, le))"
        }],
        "type": "graph"
      },
      {
        "title": "Success Rate",
        "targets": [{
          "expr": "sum(rate(istio_requests_total{response_code!~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m]))"
        }],
        "type": "singlestat"
      }
    ]
  }
}
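
Rather than importing this JSON by hand, it can be loaded through Grafana's file-based dashboard provisioning. A minimal sketch; the folder name and path are assumptions about your Grafana deployment:

# /etc/grafana/provisioning/dashboards/service-mesh.yaml
apiVersion: 1
providers:
  - name: service-mesh
    folder: Service Mesh
    type: file
    disableDeletion: true
    options:
      path: /var/lib/grafana/dashboards   # directory containing the JSON above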

Distributed Tracing

Service mesh sidecars generate and export trace spans automatically; you only configure where they are sent:

# Istio tracing configuration
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0  # 100% sampling; this field is a percentage (0.0–100.0)
        zipkin:
          address: jaeger-collector.observability:9411
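
On recent Istio releases the sampling rate can also be adjusted through the Telemetry API instead of the operator config. A hedged sketch that samples 10% of requests mesh-wide (the percentage is an example, not a recommendation):

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-tracing
  namespace: istio-system
spec:
  tracing:
    # In practice, fold this into your single mesh-wide Telemetry resource
    # (for example the access-logging one shown later in this post).
    - randomSamplingPercentage: 10.0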

The application’s only job is to forward the trace headers on its outbound requests:

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // Trace headers the application must forward so Envoy can stitch spans together.
    traceHeaders := []string{
        "x-request-id",
        "x-b3-traceid",
        "x-b3-spanid",
        "x-b3-parentspanid",
        "x-b3-sampled",
        "x-b3-flags",
    }

    // Build the downstream request on the incoming request's context.
    req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://downstream-service", nil)
    if err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }

    // Copy the trace headers from the incoming request to the outgoing one.
    for _, name := range traceHeaders {
        if val := r.Header.Get(name); val != "" {
            req.Header.Set(name, val)
        }
    }

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        http.Error(w, err.Error(), http.StatusBadGateway)
        return
    }
    defer resp.Body.Close()

    // Relay the downstream response to the caller.
    w.WriteHeader(resp.StatusCode)
    io.Copy(w, resp.Body) // requires the "io" and "net/http" imports
}

Service Graph Visualization

Kiali provides service topology visualization:

# Kiali configuration (the Kiali deployment itself is installed separately)
apiVersion: v1
kind: ConfigMap
metadata:
  name: kiali
  namespace: istio-system
data:
  config.yaml: |
    auth:
      strategy: anonymous
    deployment:
      accessible_namespaces: ["**"]
    external_services:
      prometheus:
        url: http://prometheus:9090
      tracing:
        url: http://jaeger-query:16686
      grafana:
        url: http://grafana:3000

Alerting Rules

Define meaningful alerts:

groups:
  - name: service-mesh-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m]))
            by (destination_service_name)
          /
          sum(rate(istio_requests_total[5m]))
            by (destination_service_name)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.destination_service_name }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(istio_request_duration_milliseconds_bucket[5m]))
              by (destination_service_name, le)
          ) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.destination_service_name }}"
          description: "P99 latency is {{ $value }}ms"

      - alert: CircuitBreakerTripped
        expr: |
          rate(envoy_cluster_upstream_cx_overflow[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker tripped for {{ $labels.envoy_cluster_name }}"

Access Logging

Configure detailed access logs:

apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  accessLogging:
    - providers:
        - name: envoy
      match:
        mode: CLIENT_AND_SERVER
      filter:
        expression: response.code >= 400

Custom log format:

{
  "start_time": "%START_TIME%",
  "method": "%REQ(:METHOD)%",
  "path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
  "protocol": "%PROTOCOL%",
  "response_code": "%RESPONSE_CODE%",
  "duration": "%DURATION%",
  "upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
  "upstream_cluster": "%UPSTREAM_CLUSTER%",
  "upstream_host": "%UPSTREAM_HOST%",
  "user_agent": "%REQ(USER-AGENT)%",
  "request_id": "%REQ(X-REQUEST-ID)%",
  "authority": "%REQ(:AUTHORITY)%",
  "bytes_received": "%BYTES_RECEIVED%",
  "bytes_sent": "%BYTES_SENT%"
}
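
The JSON shape above is applied through the mesh config rather than the Telemetry resource. A minimal sketch, with the key list trimmed; in practice you would paste in the full structure shown above:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    accessLogFile: /dev/stdout
    accessLogEncoding: JSON
    # Trimmed to a few keys for brevity; use the full format above.
    accessLogFormat: |
      {
        "start_time": "%START_TIME%",
        "method": "%REQ(:METHOD)%",
        "response_code": "%RESPONSE_CODE%",
        "duration": "%DURATION%"
      }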

Performance Analysis

Use metrics to identify bottlenecks:

# Services with highest latency
topk(5,
  histogram_quantile(0.99,
    sum(rate(istio_request_duration_milliseconds_bucket[5m]))
      by (destination_service_name, le)
  )
)

# Services with most retries
topk(5,
  sum(rate(envoy_cluster_upstream_rq_retry[5m]))
    by (envoy_cluster_name)
)

# Services with most timeouts
topk(5,
  sum(rate(envoy_cluster_upstream_rq_timeout[5m]))
    by (envoy_cluster_name)
)

Capacity Planning

Track resource utilization:

# Connection pool saturation (assumes circuit-breaker "remaining" stats are tracked)
envoy_cluster_upstream_cx_active
/
(envoy_cluster_upstream_cx_active + envoy_cluster_circuit_breakers_default_remaining_cx)
* 100

# Request queue depth
envoy_cluster_upstream_rq_pending_active

# Memory usage by proxy
container_memory_usage_bytes{container="istio-proxy"}
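
Once the typical proxy footprint is known, the sidecars can be sized to match. A hedged sketch using Istio's per-pod sidecar resource annotations (values and names are illustrative, not recommendations):

apiVersion: v1
kind: Pod
metadata:
  name: api-service                          # illustrative workload
  annotations:
    sidecar.istio.io/proxyCPU: "100m"
    sidecar.istio.io/proxyMemory: "128Mi"
    sidecar.istio.io/proxyCPULimit: "500m"
    sidecar.istio.io/proxyMemoryLimit: "256Mi"
spec:
  containers:
    - name: app
      image: api-service:latest              # illustrative image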

Security Observability

Monitor security posture:

# mTLS success rate
sum(rate(istio_requests_total{
  connection_security_policy="mutual_tls",
  response_code!~"4..|5.."
}[5m]))
/
sum(rate(istio_requests_total{connection_security_policy="mutual_tls"}[5m]))

# Unauthorized requests
sum(rate(istio_requests_total{response_code="403"}[5m]))
  by (source_workload, destination_service_name)

# Authentication failures
sum(rate(istio_requests_total{response_code="401"}[5m]))
  by (destination_service_name)
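
These queries only tell a clear story if the policy they measure is explicit. A hedged sketch of a mesh-wide PeerAuthentication enforcing strict mTLS, which is the posture the mTLS success-rate query above assumes:

apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT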

Debugging Workflows

Use observability to debug issues:

  1. Alert fires → Check Grafana dashboard for affected service
  2. Identify spike → Jump to Kiali for service dependencies
  3. Trace issues → Click through to Jaeger with trace ID
  4. Root cause → Correlate with logs using trace ID
  5. Fix and verify → Monitor metrics to confirm resolution

Best Practices

  1. Start with golden signals: Request rate, errors, latency
  2. Build service-specific dashboards: Not just cluster-wide
  3. Alert on symptoms, not causes: High error rate, not pod restarts
  4. Correlate signals: Link metrics, logs, and traces
  5. Sample intelligently: 100% for errors, sample for success
  6. Archive traces selectively: Keep errors longer than successes
  7. Document runbooks: Link alerts to investigation procedures

Conclusion

Service mesh observability transforms how we understand distributed systems. The automatic collection of metrics, traces, and logs without code changes is powerful. Focus on:

  • Golden signals for each service
  • Service topology and dependencies
  • Distributed tracing for debugging
  • Performance and capacity planning
  • Security monitoring

The key is not collecting all possible data, but having the right data to answer questions quickly when issues arise. Start with core metrics, add tracing, then expand to custom dashboards and alerts tailored to your services.