Last August, I wrote about our early experiments with Kubernetes. Six months later, we’ve moved significant portions of our CipherTrust platform to Kubernetes in production. The journey from “this looks promising” to “we’re betting our business on this” has been educational. Let me share what we’ve learned running security-critical microservices on Kubernetes at scale.
The Decision to Go Production
Moving to production Kubernetes wasn’t a decision we took lightly. We spent months evaluating four questions:
Stability: Could we trust Kubernetes to reliably run services that protect our customers’ most sensitive data?
Security: Did Kubernetes provide sufficient isolation and security controls for multi-tenant key management?
Operability: Could our team operate Kubernetes clusters effectively, handling incidents and maintenance?
Performance: Would Kubernetes overhead impact the low-latency requirements of cryptographic operations?
After extensive testing in staging environments, we concluded that the answer to all four questions was yes, with caveats and careful implementation.
Cluster Architecture
We run multiple Kubernetes clusters:
Per-environment clusters: Separate clusters for development, staging, and production. This isolates the blast radius: issues in dev don’t affect production.
Per-region clusters: Production clusters in each AWS and Azure region we support. Provides low-latency access and disaster recovery.
Per-tenant clusters: For our largest customers with stringent security requirements, dedicated clusters provide the strongest isolation.
Each cluster follows a standard architecture:
- 3 master nodes across availability zones for control plane high availability
- 6+ worker nodes sized based on workload requirements
- Separate etcd cluster for improved reliability and performance
- Network policies enforcing microsegmentation between services
- Pod security policies restricting what containers can do
Secrets Management in Production
The Kubernetes Secrets API is convenient, but Secrets are only base64-encoded by default, which isn’t secure enough for our HSM credentials and other sensitive data. Our production approach:
External secrets store: Secrets stored in HashiCorp Vault, not in Kubernetes.
Init containers: Before the main containers start, init containers fetch secrets from Vault and write them to shared volumes (a sketch follows at the end of this section).
Short-lived credentials: Secrets retrieved at pod startup are time-limited, expiring after 24 hours. Pods are restarted daily to refresh credentials.
Encryption at rest: etcd (where Kubernetes stores cluster state including Secrets) is encrypted at rest using KMS keys we control.
This is more complex than using Kubernetes Secrets directly, but provides the security guarantees we need for cryptographic services.
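To make the pattern concrete, here’s a minimal sketch of the init-container approach. The image names, Vault address, role, and secret path are placeholders rather than our actual configuration; it assumes Vault’s Kubernetes auth method and a KV v2 secrets engine, with the fetched credential landing on an in-memory volume.

```yaml
# Sketch of the Vault init-container pattern (placeholder names throughout).
apiVersion: v1
kind: Pod
metadata:
  name: key-lifecycle
spec:
  volumes:
    - name: secrets
      emptyDir:
        medium: Memory                              # tmpfs; credentials never touch disk
  initContainers:
    - name: fetch-secrets
      image: registry.internal/vault-fetch:latest   # placeholder; needs curl and jq
      env:
        - name: VAULT_ADDR
          value: "https://vault.internal:8200"      # placeholder address
      command: ["/bin/sh", "-c"]
      args:
        - |
          # Authenticate with the pod's service account token (Kubernetes auth
          # method), then read a credential from a KV v2 secrets engine.
          JWT=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
          TOKEN=$(curl -s --request POST \
            --data "{\"role\": \"key-lifecycle\", \"jwt\": \"$JWT\"}" \
            "$VAULT_ADDR/v1/auth/kubernetes/login" | jq -r .auth.client_token)
          curl -s --header "X-Vault-Token: $TOKEN" \
            "$VAULT_ADDR/v1/secret/data/hsm/proxy" \
            | jq -r .data.data.credentials > /secrets/hsm-credentials
      volumeMounts:
        - name: secrets
          mountPath: /secrets
  containers:
    - name: key-lifecycle
      image: registry.internal/key-lifecycle:1.0    # placeholder
      volumeMounts:
        - name: secrets
          mountPath: /secrets
          readOnly: true
```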
Network Policies and Microsegmentation
We implement zero-trust networking within Kubernetes using network policies:
Default deny: All pod-to-pod traffic is denied by default.
Explicit allow: Services explicitly declare what they need to communicate with.
Namespace isolation: Pods in different namespaces can’t communicate unless specifically allowed.
Example policy: The key lifecycle service can communicate with the HSM proxy service and audit service, but nothing else (a sketch follows the list below). If the key lifecycle service is compromised, the attacker can’t pivot to other services.
Implementation uses Calico network policies (more expressive than the default Kubernetes NetworkPolicy API) with:
- Layer 4 rules (IP, port, protocol)
- Layer 7 rules (HTTP method, path)
- Identity-based rules (service account, not just IP)
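To illustrate the key lifecycle example, here’s roughly what the policies look like using the standard NetworkPolicy API, which Calico enforces; the namespace, labels, and ports are illustrative, and Calico’s identity-based and layer 7 rules use its own resources rather than this one.

```yaml
# Default deny for the namespace, plus an explicit egress allowance for the
# key lifecycle service. Names, labels, and ports are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
  namespace: key-management
spec:
  podSelector: {}                     # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: key-lifecycle-egress
  namespace: key-management
spec:
  podSelector:
    matchLabels:
      app: key-lifecycle
  policyTypes: ["Egress"]
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: hsm-proxy
        - podSelector:
            matchLabels:
              app: audit
      ports:
        - protocol: TCP
          port: 8443
    # Once default deny is in place, DNS egress (port 53 to kube-dns) also
    # needs an explicit allowance.
```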
Resource Management and QoS
Every pod has resource requests and limits:
Requests: The resources reserved for a pod. Kubernetes won’t schedule a pod onto a node unless the node has enough unreserved CPU and memory to satisfy its requests.
Limits: The maximum a container may consume. The container is throttled (CPU) or OOM-killed (memory) if it exceeds its limits.
We categorize services into QoS classes:
Guaranteed (requests = limits): Critical services like key lifecycle and HSM proxy. Always get their resources.
Burstable (requests < limits): Services that usually need minimal resources but occasionally burst higher.
BestEffort (no requests or limits): Only monitoring and logging services. Can be evicted if the cluster is under pressure.
This ensures critical cryptographic operations always have resources, even under heavy load.
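For example, a Guaranteed-class container simply sets requests equal to limits; the numbers below are illustrative rather than our real sizing.

```yaml
# Container resources for Guaranteed QoS: requests equal limits for both CPU
# and memory, so the scheduler reserves exactly what the container may use.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 4Gi
```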
Monitoring and Observability
Kubernetes-native monitoring stack:
Prometheus: Collects metrics from Kubernetes and all our services. Stores time-series data for querying and alerting.
Grafana: Visualizes Prometheus metrics. Dashboards for cluster health, service performance, and business metrics.
Jaeger: Distributed tracing across microservices. When a key operation touches multiple services, Jaeger shows the complete path and latency breakdown.
Elasticsearch/Kibana: Centralized logging (as I’ve written about before). All pod logs flow here.
Kubernetes Dashboard: Web UI for cluster management and troubleshooting.
We instrument services to export custom metrics:
- Key operations per second by type
- Cryptographic operation latency percentiles
- HSM connection pool utilization
- Policy evaluation latency
- Cache hit rates
These metrics feed Prometheus, which triggers PagerDuty alerts when thresholds are exceeded.
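As a sketch of how that wiring works, a Prometheus rule along these lines would page on high p99 latency; the metric name, threshold, and labels are illustrative, not our actual rules.

```yaml
# Illustrative Prometheus alerting rule (standard rule-file format).
groups:
  - name: crypto-latency
    rules:
      - alert: KeyOperationLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum(rate(key_operation_duration_seconds_bucket[5m])) by (le, operation)
          ) > 0.25
        for: 10m
        labels:
          severity: page             # Alertmanager routes this severity to PagerDuty
        annotations:
          summary: "p99 latency for {{ $labels.operation }} is above 250ms"
```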
Deployment Strategies
We use several deployment patterns in production:
Rolling updates: The default strategy. Gradually replace old pods with new ones, ensuring some instances are always available (settings sketched at the end of this section).
Blue-green: Deploy the complete new version alongside the old one. Switch traffic over once it’s validated. Quick rollback if issues appear.
Canary: Deploy the new version to a small percentage of traffic first. Monitor metrics. Gradually increase the share if the metrics look good.
Different services use different strategies based on risk tolerance and rollback requirements. Critical services use canary deployments with extensive validation before full rollout.
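The rolling update knobs live on the Deployment itself; here’s a sketch with illustrative values (blue-green and canary involve traffic switching outside a single manifest, so they aren’t shown).

```yaml
# Rolling update settings: never drop below the desired replica count, and
# add at most one extra pod during the rollout. Values are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: policy-service                # placeholder name
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  # selector and pod template omitted for brevity
```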
High Availability Patterns
Kubernetes helps with HA but doesn’t solve it automatically:
Pod replicas: Every service runs at least three replicas, so it can tolerate one or two instance failures.
Pod anti-affinity: Replicas are spread across different nodes (and ideally different availability zones).
Readiness probes: Unhealthy pods are removed from service load balancing immediately.
Liveness probes: Crashed pods are restarted automatically.
PodDisruptionBudgets: Ensure a minimum number of pods remain available during maintenance operations like node draining.
This provides good availability within a cluster. For cross-cluster failover, we use DNS-based routing between regions.
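A sketch of the anti-affinity and disruption budget pieces; the labels, counts, and zone topology key are illustrative.

```yaml
# Keep at least two hsm-proxy pods available during voluntary disruptions
# such as node drains.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: hsm-proxy-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: hsm-proxy
---
# In the corresponding Deployment's pod template: never co-locate two
# replicas in the same availability zone.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: hsm-proxy
        topologyKey: failure-domain.beta.kubernetes.io/zone
```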
Stateful Services Challenges
Most of our services are stateless, but some components require persistence:
Databases: These initially ran outside Kubernetes. We recently migrated them to StatefulSets with persistent volumes and are still validating reliability.
HSM connection state: Some HSM client libraries maintain connection state, so their pods can’t be killed and restarted arbitrarily. We implement graceful shutdown with connection draining (sketched at the end of this section).
Caches: Use Redis StatefulSets with persistent storage. Master-replica configuration for high availability.
StatefulSets are improving but still less mature than Deployments. We’re conservative about what we run as StatefulSets.
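Here’s a sketch of the graceful shutdown piece for pods holding HSM connections; the image and drain endpoint are hypothetical.

```yaml
# Pod spec fragment: ask the process to drain its HSM connections before
# termination, and give it time to finish before the kubelet sends SIGKILL.
spec:
  terminationGracePeriodSeconds: 120
  containers:
    - name: hsm-proxy
      image: registry.internal/hsm-proxy:1.0        # placeholder
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - "curl -s -XPOST localhost:8080/drain && sleep 30"  # hypothetical drain endpoint
```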
Security Hardening
Production Kubernetes clusters are hardened:
Pod security policies: Restrict what containers can do. No privileged containers, no host network access, and a read-only root filesystem where possible (a sample policy appears at the end of this section).
RBAC: Fine-grained role-based access control. Different teams have different permissions. Developers can view logs but not delete production pods.
Admission controllers: Validate and potentially modify resource definitions before they’re created. We reject pods that don’t meet security standards.
Image scanning: All container images are scanned for vulnerabilities before deployment. Critical vulnerabilities block deployment.
Network encryption: All pod-to-pod traffic uses mTLS through a service mesh (evaluating Istio).
Audit logging: All Kubernetes API calls are logged for security monitoring and compliance.
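A restricted PodSecurityPolicy along these lines might look like the sketch below; the exact rules are illustrative, and a PSP only takes effect once RBAC allows the relevant service accounts to use it.

```yaml
# Illustrative restricted PodSecurityPolicy (Kubernetes 1.12-era API).
apiVersion: policy/v1beta1
kind: PodSecurityPolicy
metadata:
  name: restricted
spec:
  privileged: false
  allowPrivilegeEscalation: false
  hostNetwork: false
  hostPID: false
  hostIPC: false
  readOnlyRootFilesystem: true
  runAsUser:
    rule: MustRunAsNonRoot
  seLinux:
    rule: RunAsAny
  supplementalGroups:
    rule: MustRunAs
    ranges: [{min: 1, max: 65535}]
  fsGroup:
    rule: MustRunAs
    ranges: [{min: 1, max: 65535}]
  volumes:
    - configMap
    - secret
    - emptyDir
    - persistentVolumeClaim
```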
Disaster Recovery
We test disaster recovery regularly:
Cluster failure: Verify that we can route traffic to a different cluster in another region.
Node failure: Verify that pods are rescheduled and services remain available.
AZ failure: Simulate an entire availability zone going down.
etcd failure: Restore etcd from backup and verify cluster state is consistent.
Data corruption: Restore persistent volumes from snapshots.
Monthly disaster recovery drills ensure our procedures work and teams stay practiced.
Operational Challenges
Running Kubernetes in production isn’t all smooth:
Cluster upgrades: Kubernetes moves fast. Staying current requires careful planning and testing. We’re currently on Kubernetes 1.12, upgrading every 2-3 months.
Certificate rotation: Kubernetes uses certificates for component authentication, and these expire. We had an incident where expired certificates broke the cluster. Now we monitor certificate expiration closely (an alerting sketch follows at the end of this section).
etcd performance: As clusters scale, etcd can become a bottleneck. We’ve had to tune etcd and eventually moved to a dedicated etcd cluster separate from the master nodes.
Resource exhaustion: Misbehaving pods can consume all resources on a node. Pod resource limits help but don’t prevent all issues.
Debugging: Debugging issues in a distributed system across many pods is harder than debugging a monolith. Good logging and tracing are essential.
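Returning to certificate rotation: a Prometheus rule can warn well before anything expires. This sketch assumes a hypothetical exporter that publishes a cert_expiry_timestamp_seconds gauge per certificate; it is not a built-in Kubernetes metric.

```yaml
# Warn 30 days before any tracked certificate expires. The metric comes from
# a hypothetical certificate-expiry exporter, not from Kubernetes itself.
groups:
  - name: certificates
    rules:
      - alert: CertificateExpiringSoon
        expr: cert_expiry_timestamp_seconds - time() < 30 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate {{ $labels.cert_name }} expires in under 30 days"
```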
Cost Management
Kubernetes can be cost-effective but requires attention:
Right-sizing: Ensure resource requests match actual usage. Over-requesting wastes money on idle resources.
Node autoscaling: Cluster Autoscaler adjusts node count based on pod resource requests. Scales up when pods can’t be scheduled, scales down when nodes are underutilized.
Spot instances: For non-critical workloads, use AWS spot instances or Azure spot VMs for significant cost savings (see the sketch at the end of this section).
Resource efficiency: Kubernetes bin-packing efficiently schedules pods on nodes, improving utilization over manual VM management.
We provide teams with cost dashboards showing their Kubernetes spend and recommendations for optimization.
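One way to wire up the spot instance piece is to taint the spot node groups so that only workloads which explicitly tolerate the taint land there. The workload-class label and taint below are an illustrative convention, not Kubernetes built-ins.

```yaml
# Pod spec fragment for a non-critical workload allowed onto spot nodes.
# The node group is labeled workload-class=spot and tainted, e.g.:
#   kubectl taint nodes <node-name> workload-class=spot:NoSchedule
spec:
  nodeSelector:
    workload-class: spot
  tolerations:
    - key: workload-class
      operator: Equal
      value: spot
      effect: NoSchedule
```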
Team Evolution
Moving to Kubernetes changed how our team works:
DevOps culture: Developers take more responsibility for operations. They write Kubernetes manifests, monitor their services, respond to alerts.
Platform team: We formed a platform team that manages Kubernetes clusters and provides tools/documentation for service teams.
Skillset changes: Everyone needed to learn Kubernetes concepts, kubectl commands, YAML debugging, etc.
Documentation: Extensive documentation and runbooks for common operations and troubleshooting.
The learning curve was steep but teams are now productive and self-sufficient.
Looking Forward
Kubernetes continues evolving rapidly. We’re watching:
Service mesh: Evaluating Istio for improved observability, security, and traffic management.
Operators: Kubernetes operators for managing stateful applications. Could simplify database management.
Multi-cluster management: As we run more clusters, managing them consistently becomes challenging.
GitOps: Storing all Kubernetes config in Git and deploying via CI/CD rather than kubectl commands.
Security enhancements: Pod security standards (replacing pod security policies), improved secrets management.
Key Takeaways
For teams considering Kubernetes for production:
- Start small and learn before going all-in
- Invest heavily in monitoring and observability from day one
- Implement proper secrets management, not just Kubernetes Secrets
- Use network policies for zero-trust microsegmentation
- Set resource requests and limits on all pods
- Practice disaster recovery regularly
- Build a platform team to support service teams
- Expect a significant learning curve for the organization
- Budget time for cluster upgrades and maintenance
- Start with stateless services; be cautious with stateful workloads
Kubernetes in production has been a success for us. The operational efficiency, deployment speed, and resource utilization improvements justify the complexity. Our microservices architecture runs reliably on Kubernetes, scaling to handle increasing load while maintaining the security guarantees our customers require.
That said, Kubernetes isn’t for everyone. The operational complexity is real. Smaller teams or simpler applications might be better served with PaaS offerings. But for our distributed, microservices-based key management platform, Kubernetes provides the foundation we need to scale globally while maintaining security and reliability.