A service mesh provides deep, uniform observability into microservice communication without requiring application code changes. After instrumenting production systems with service mesh observability, I’ve learned how to extract the most value from the metrics, traces, and visualizations these systems provide.
The Observability Promise
Service mesh proxies (like Envoy) sit in the request path and automatically collect:
- Golden Signals: Request rate, error rate, latency, saturation
- Service Topology: Who talks to whom
- Distributed Traces: Request flow across services
- Traffic Patterns: Protocol types, response sizes, retry behavior
All without modifying application code (distributed tracing is the one partial exception: applications must forward trace headers, covered below).
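This works because a sidecar proxy is injected next to each workload and transparently intercepts its traffic. With Istio, for example, enabling automatic injection is just a namespace label (namespace name illustrative):

```sh
# Enable automatic sidecar injection for a namespace
kubectl label namespace shop istio-injection=enabled

# New pods in the namespace now carry an istio-proxy container
kubectl get pods -n shop -o jsonpath='{.items[*].spec.containers[*].name}'
```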
Metrics That Matter
Service mesh generates thousands of metrics. Focus on these:
Request Metrics
```promql
# Request rate by service
sum(rate(istio_requests_total[5m])) by (destination_service_name)

# Error rate
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
  /
sum(rate(istio_requests_total[5m])) by (destination_service_name)

# Latency percentiles
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket[5m]))
  by (destination_service_name, le)
)
```
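If these queries back dashboards and alerts, it is worth precomputing them as Prometheus recording rules so panels load fast; a minimal sketch (the rule names are my convention, not Istio's):

```yaml
groups:
  - name: istio-golden-signals
    interval: 30s
    rules:
      - record: service:request_rate:5m
        expr: sum(rate(istio_requests_total[5m])) by (destination_service_name)
      - record: service:error_ratio:5m
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
          /
          sum(rate(istio_requests_total[5m])) by (destination_service_name)
```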
Traffic Flow Metrics
```promql
# Traffic between specific services
sum(rate(istio_requests_total{
  source_workload="frontend",
  destination_service_name="api-service"
}[5m]))

# Cross-namespace traffic
sum(rate(istio_requests_total[5m]))
  by (source_workload_namespace, destination_service_namespace)
```
Connection Pool Metrics
```promql
# Active connections
envoy_cluster_upstream_cx_active

# Connection failures
rate(envoy_cluster_upstream_cx_connect_fail[5m])

# Connection timeouts
rate(envoy_cluster_upstream_cx_connect_timeout[5m])
```
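These metrics are measured against limits you set in the mesh. A sketch of the Istio DestinationRule that defines the pool being monitored (service name and limits are illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-service
spec:
  host: api-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100          # ceiling behind upstream_cx_active
      http:
        http1MaxPendingRequests: 50  # ceiling behind upstream_rq_pending_active
```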
Grafana Dashboards
Build comprehensive dashboards:
```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [
          { "expr": "sum(rate(istio_requests_total[5m])) by (destination_service_name)" }
        ],
        "type": "graph"
      },
      {
        "title": "Error Rate by Service",
        "targets": [
          { "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) by (destination_service_name)" }
        ],
        "type": "graph"
      },
      {
        "title": "P99 Latency",
        "targets": [
          { "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (destination_service_name, le))" }
        ],
        "type": "graph"
      },
      {
        "title": "Success Rate",
        "targets": [
          { "expr": "sum(rate(istio_requests_total{response_code!~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m]))" }
        ],
        "type": "singlestat"
      }
    ]
  }
}
```
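Keep dashboard JSON like this in version control and load it through the Grafana HTTP API (or file-based provisioning); assuming the JSON above is saved as dashboard.json and an API token is at hand:

```sh
curl -s -X POST http://grafana:3000/api/dashboards/db \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $GRAFANA_API_TOKEN" \
  -d @dashboard.json
```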
Distributed Tracing
Service mesh proxies generate and export spans automatically:
```yaml
# Istio tracing configuration
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # percentage (0.0-100.0); 100.0 traces every request, lower this in production
        zipkin:
          address: jaeger-collector.observability:9411
```
The application's only job is to forward the trace headers it receives, so the mesh can stitch the inbound and outbound spans into one trace:

```go
package main

import (
	"io"
	"net/http"
)

type Handler struct{}

func (h *Handler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	// Headers Envoy uses to correlate spans across services (B3 propagation).
	traceHeaders := []string{
		"x-request-id",
		"x-b3-traceid",
		"x-b3-spanid",
		"x-b3-parentspanid",
		"x-b3-sampled",
		"x-b3-flags",
	}

	req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://downstream-service", nil)
	if err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	// Copy each trace header onto the downstream request.
	for _, name := range traceHeaders {
		if v := r.Header.Get(name); v != "" {
			req.Header.Set(name, v)
		}
	}

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()

	// Relay the downstream response to the caller.
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}
```
Service Graph Visualization
Kiali provides service topology visualization:
```yaml
# Kiali configuration (the ConfigMap consumed by the Kiali deployment)
apiVersion: v1
kind: ConfigMap
metadata:
  name: kiali
  namespace: istio-system
data:
  config.yaml: |
    auth:
      strategy: anonymous
    deployment:
      accessible_namespaces: ["**"]
    external_services:
      prometheus:
        url: http://prometheus:9090
      tracing:
        url: http://jaeger-query:16686
      grafana:
        url: http://grafana:3000
```
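With the config in place, the quickest way to reach the UI during an investigation is istioctl's built-in port-forward:

```sh
istioctl dashboard kiali
```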
Alerting Rules
Define meaningful alerts:
```yaml
groups:
  - name: service-mesh-alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(istio_requests_total{response_code=~"5.."}[5m]))
            by (destination_service_name)
          /
          sum(rate(istio_requests_total[5m]))
            by (destination_service_name)
          > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.destination_service_name }}"
          description: "Error rate is {{ $value | humanizePercentage }}"
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (destination_service_name, le)
          ) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.destination_service_name }}"
          description: "P99 latency is {{ $value }}ms"
      - alert: CircuitBreakerTripped
        expr: |
          rate(envoy_cluster_upstream_cx_overflow[5m]) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Circuit breaker tripped for {{ $labels.envoy_cluster_name }}"
```
Access Logging
Configure detailed access logs:
```yaml
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  accessLogging:
    - providers:
        - name: envoy
      match:
        mode: CLIENT_AND_SERVER
      filter:
        expression: "response.code >= 400"
```
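The resulting logs are emitted by the sidecar, not the application container, which trips people up when hunting for them (deployment name illustrative):

```sh
kubectl logs deploy/frontend -c istio-proxy --tail=20
```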
Custom log format:
```json
{
  "start_time": "%START_TIME%",
  "method": "%REQ(:METHOD)%",
  "path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%",
  "protocol": "%PROTOCOL%",
  "response_code": "%RESPONSE_CODE%",
  "duration": "%DURATION%",
  "upstream_service_time": "%RESP(X-ENVOY-UPSTREAM-SERVICE-TIME)%",
  "upstream_cluster": "%UPSTREAM_CLUSTER%",
  "upstream_host": "%UPSTREAM_HOST%",
  "user_agent": "%REQ(USER-AGENT)%",
  "request_id": "%REQ(X-REQUEST-ID)%",
  "authority": "%REQ(:AUTHORITY)%",
  "bytes_received": "%BYTES_RECEIVED%",
  "bytes_sent": "%BYTES_SENT%"
}
```
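One way to apply a format like this mesh-wide is through meshConfig in the IstioOperator; a sketch with the format abbreviated to a few fields:

```yaml
spec:
  meshConfig:
    accessLogFile: /dev/stdout
    accessLogEncoding: JSON
    accessLogFormat: |
      {"start_time":"%START_TIME%","method":"%REQ(:METHOD)%","response_code":"%RESPONSE_CODE%","duration":"%DURATION%"}
```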
Performance Analysis
Use metrics to identify bottlenecks:
```promql
# Services with highest latency
topk(5,
  histogram_quantile(0.99,
    sum(rate(istio_request_duration_milliseconds_bucket[5m]))
    by (destination_service_name, le)
  )
)

# Services with most retries
topk(5,
  sum(rate(envoy_cluster_upstream_rq_retry[5m]))
  by (envoy_cluster_name)
)

# Services with most timeouts
topk(5,
  sum(rate(envoy_cluster_upstream_rq_timeout[5m]))
  by (envoy_cluster_name)
)
```
Capacity Planning
Track resource utilization:
```promql
# Connection pool saturation: active connections as a share of the
# circuit breaker budget (remaining_* gauges require track_remaining
# in the cluster's circuit breaker config)
envoy_cluster_upstream_cx_active
  /
(envoy_cluster_upstream_cx_active + envoy_cluster_circuit_breakers_default_remaining_cx)
  * 100

# Request queue depth
envoy_cluster_upstream_rq_pending_active

# Memory usage by proxy
container_memory_usage_bytes{container="istio-proxy"}
```
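Sidecar CPU is worth tracking alongside memory, assuming cAdvisor metrics are scraped:

```promql
# CPU used by sidecar proxies, in cores, per pod
sum(rate(container_cpu_usage_seconds_total{container="istio-proxy"}[5m])) by (pod)
```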
Security Observability
Monitor security posture:
```promql
# mTLS success rate
sum(rate(istio_requests_total{
  connection_security_policy="mutual_tls",
  response_code!~"4..|5.."
}[5m]))
  /
sum(rate(istio_requests_total{connection_security_policy="mutual_tls"}[5m]))

# Unauthorized requests
sum(rate(istio_requests_total{response_code="403"}[5m]))
  by (source_workload, destination_service_name)

# Authentication failures
sum(rate(istio_requests_total{response_code="401"}[5m]))
  by (destination_service_name)
```
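These queries are most meaningful once policy defines what "expected" looks like; for example, mesh-wide strict mTLS, so any plaintext traffic surfaces as an anomaly:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```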
Debugging Workflows
Use observability to debug issues:
1. Alert fires → check the Grafana dashboard for the affected service
2. Identify the spike → jump to Kiali for service dependencies
3. Trace issues → click through to Jaeger with the trace ID
4. Root cause → correlate with logs using the trace ID (see the sketch after this list)
5. Fix and verify → monitor metrics to confirm resolution
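Step 4 only works if application logs carry the trace ID; a minimal Go sketch of middleware that tags every log line with the B3 trace ID the mesh already supplies:

```go
package main

import (
	"log"
	"net/http"
)

// WithTraceID logs the B3 trace ID on every request so log lines
// can be joined with the matching Jaeger trace.
func WithTraceID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		log.Printf("trace_id=%s method=%s path=%s",
			r.Header.Get("x-b3-traceid"), r.Method, r.URL.Path)
		next.ServeHTTP(w, r)
	})
}
```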
Best Practices
- Start with golden signals: Request rate, errors, latency
- Build service-specific dashboards: Not just cluster-wide
- Alert on symptoms, not causes: High error rate, not pod restarts
- Correlate signals: Link metrics, logs, and traces
- Sample intelligently: 100% for errors, sample for success
- Archive traces selectively: Keep errors longer than successes
- Document runbooks: Link alerts to investigation procedures
Conclusion
Service mesh observability transforms how we understand distributed systems. The automatic collection of metrics, traces, and logs without code changes is powerful. Focus on:
- Golden signals for each service
- Service topology and dependencies
- Distributed tracing for debugging
- Performance and capacity planning
- Security monitoring
The key is not collecting all possible data, but having the right data to answer questions quickly when issues arise. Start with core metrics, add tracing, then expand to custom dashboards and alerts tailored to your services.