Kubernetes just hit version 1.0, and there’s a lot of excitement in the container orchestration space. After running Docker containers in production for a few months with custom orchestration scripts, we’re hitting limitations. Kubernetes promises to solve many of these challenges, so I’ve been spending time evaluating whether it’s ready for production security workloads. Here’s what I’m learning.
What Kubernetes Solves
Running a handful of Docker containers is straightforward. Running hundreds of containers across dozens of hosts is a different problem. You need:
Scheduling: Deciding which containers run on which hosts based on resource requirements and availability.
Service discovery: Containers need to find each other. IP addresses change as containers are created and destroyed.
Load balancing: Distributing traffic across multiple container instances.
Health checking: Detecting failed containers and replacing them automatically.
Rolling updates: Updating services without downtime.
Secrets management: Getting sensitive credentials into containers securely (this is particularly critical for us).
Kubernetes provides primitives for all of these. The question is whether they’re mature enough for production security workloads.
Core Concepts
Kubernetes has several core concepts that took me a while to internalize:
Pods: The smallest deployable unit, typically containing one or more tightly coupled containers. For us, a pod might contain our key management service container plus a sidecar container that handles logging.
ReplicaSets: Ensure a specified number of pod replicas are running. If a pod dies, the ReplicaSet creates a replacement.
Services: Provide stable network endpoints for pods. Even as pods are created and destroyed, the Service maintains a consistent IP and DNS name.
Deployments: Manage rolling updates to services, gradually replacing old pods with new versions.
Namespaces: Logical isolation between different environments or tenants within a cluster.
The abstraction feels right. Services and Deployments map naturally to our microservices architecture.
Setting Up a Test Cluster
I started by setting up a Kubernetes cluster on AWS EC2 instances. The process is more complex than I expected. Kubernetes has many components:
- etcd: Distributed key-value store for cluster state
- API server: Central management component
- Scheduler: Assigns pods to nodes
- Controller manager: Runs controllers for replication, services, etc.
- Kubelet: Agent on each node that manages containers
- Kube-proxy: Handles network routing to services
Getting all these components running and communicating securely took some time. Kubernetes provides setup scripts (kube-up.sh) that help, but you need to understand what they’re doing for production use.
Deploying Key Management Services
I’ve containerized our HSM proxy service and deployed it to Kubernetes. Here’s a simplified version of the deployment manifest (names, image tags, and ports are illustrative, not our production values):
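```yaml
# Simplified sketch of the hsm-proxy Deployment. Deployments are still
# a pre-GA API, so the exact apiVersion may differ by cluster version.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: hsm-proxy
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: hsm-proxy
    spec:
      containers:
      - name: hsm-proxy
        image: registry.internal/hsm-proxy:1.4.2   # placeholder image
        ports:
        - containerPort: 8443
```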
The deployment specifies that we want 3 replicas of the HSM proxy service. Kubernetes schedules these pods across available nodes and maintains exactly 3 replicas, creating replacements if any crash.
A Service provides a stable endpoint for the HSM proxy. Other services can connect to this endpoint without knowing which specific pods are handling requests.
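A minimal Service manifest for this looks roughly like the following (port numbers are illustrative):

```yaml
# Gives the hsm-proxy pods a stable cluster IP and DNS name; backends
# are selected by the app=hsm-proxy label from the Deployment above.
apiVersion: v1
kind: Service
metadata:
  name: hsm-proxy
spec:
  selector:
    app: hsm-proxy
  ports:
  - port: 443
    targetPort: 8443
```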
Secrets Management in Kubernetes
Kubernetes has a Secrets API for handling sensitive data. You create a Secret object containing credentials, and pods can mount it as a file or environment variable.
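As a sketch (the value below is a base64-encoded placeholder, not a real credential):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hsm-credentials
type: Opaque
data:
  api-token: cGxhY2Vob2xkZXI=   # base64("placeholder") - encoding, not encryption
---
# A pod mounts the Secret as read-only files under /etc/hsm-credentials.
apiVersion: v1
kind: Pod
metadata:
  name: hsm-proxy-example
spec:
  containers:
  - name: hsm-proxy
    image: registry.internal/hsm-proxy:1.4.2
    volumeMounts:
    - name: credentials
      mountPath: /etc/hsm-credentials
      readOnly: true
  volumes:
  - name: credentials
    secret:
      secretName: hsm-credentials
```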
This is better than baking secrets into container images, but I’m not yet comfortable with it for highly sensitive credentials. Secrets are only base64 encoded, not encrypted, in etcd by default. The API server can gate access to the Secrets API, but anyone who gains direct access to etcd can read every secret in the cluster.
For our HSM authentication credentials, we’re exploring an alternative approach: using Kubernetes init containers to fetch credentials from our secure vault before the main container starts. The credentials are never stored in Kubernetes itself.
This adds complexity but provides stronger security guarantees for our most sensitive credentials.
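Here’s a rough sketch of the pattern we’re prototyping. Init-container support is still settling upstream, so treat the field names as provisional; the vault-fetch image, vault endpoint, and paths are all hypothetical:

```yaml
# An init-style container fetches credentials from our vault into a
# shared in-memory volume before the main container starts, so nothing
# sensitive is ever stored in Kubernetes or written to node disk.
apiVersion: v1
kind: Pod
metadata:
  name: hsm-proxy-vault-example
spec:
  initContainers:                  # field name per the in-progress design
  - name: fetch-credentials
    image: registry.internal/vault-fetch:0.1    # hypothetical image
    env:
    - name: VAULT_ADDR
      value: https://vault.internal:8200        # hypothetical endpoint
    volumeMounts:
    - name: credentials
      mountPath: /credentials
  containers:
  - name: hsm-proxy
    image: registry.internal/hsm-proxy:1.4.2
    volumeMounts:
    - name: credentials
      mountPath: /etc/hsm-credentials
      readOnly: true
  volumes:
  - name: credentials
    emptyDir:
      medium: Memory   # tmpfs: credentials never touch disk
```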
Networking Considerations
Kubernetes networking is sophisticated but complex. Each pod gets its own IP address, and pods can communicate with each other directly without NAT. This is cleaner than Docker’s default networking, but it requires underlying network infrastructure to support it.
We’re using flannel, a common overlay network choice, for testing; it creates a virtual network spanning all cluster nodes. Traffic between pods on different nodes is encapsulated and routed through this overlay.
For our HSM connections, we need pods to reach HSMs that live outside the Kubernetes cluster. Kubernetes supports this through Services without pod selectors: you define the Service as usual, then manually create an Endpoints object pointing at the external addresses, and in-cluster clients resolve the Service name exactly as they would for any other Service.
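Concretely, the selector-less Service is paired with an Endpoints object we manage ourselves (the address and port below are placeholders):

```yaml
# No selector, so Kubernetes won't manage the Endpoints; we point them
# at the HSM's address outside the cluster.
apiVersion: v1
kind: Service
metadata:
  name: hsm-east
spec:
  ports:
  - port: 1792
---
apiVersion: v1
kind: Endpoints
metadata:
  name: hsm-east        # must match the Service name
subsets:
- addresses:
  - ip: 10.20.30.40     # placeholder for the external HSM address
  ports:
  - port: 1792
```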
Service Discovery and Load Balancing
Kubernetes Services provide built-in load balancing across pods. When you make a request to a Service, kube-proxy distributes it to one of the backing pods using round-robin (by default).
Service discovery happens through DNS. Kubernetes runs a DNS server that creates records for each Service. Pods can simply use the Service name as a hostname, and DNS resolves it to the Service’s cluster IP.
This is much cleaner than our previous approach where services hard-coded the addresses of other services or used environment variables.
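For example, a client pod can address the HSM proxy purely by Service name (the client image here is hypothetical):

```yaml
# Cluster DNS resolves hsm-proxy to the Service's stable cluster IP,
# so the client needs no knowledge of individual pod addresses.
apiVersion: v1
kind: Pod
metadata:
  name: signing-client-example
spec:
  containers:
  - name: signing-client
    image: registry.internal/signing-client:0.9
    env:
    - name: HSM_PROXY_URL
      value: https://hsm-proxy:443   # or hsm-proxy.default.svc.cluster.local
```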
Health Checking and Self-Healing
Kubernetes supports two types of health checks:
Liveness probes: Check if a container is alive. If it fails, Kubernetes restarts the container.
Readiness probes: Check if a container is ready to serve traffic. If it fails, Kubernetes removes the pod from Service load balancing but doesn’t restart it.
For our services, we implement HTTP health endpoints that Kubernetes probes. The health check verifies that the service can connect to its backend HSM before reporting healthy.
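Wired into the pod template, the probes look roughly like this (paths and timings are illustrative):

```yaml
# Excerpt from the hsm-proxy pod template. Both probes hit the same
# health endpoint; the readiness probe reacts quickly so a brief HSM
# blip just removes the pod from load balancing, while the more
# lenient liveness probe restarts pods that stay unhealthy.
containers:
- name: hsm-proxy
  image: registry.internal/hsm-proxy:1.4.2
  readinessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 5
    timeoutSeconds: 2
  livenessProbe:
    httpGet:
      path: /healthz
      port: 8080
    initialDelaySeconds: 30
    timeoutSeconds: 5
```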
This self-healing is powerful. When an HSM connection breaks, pods automatically become unhealthy, get removed from load balancing, and eventually get restarted. The system recovers without manual intervention.
Rolling Updates
Kubernetes Deployments support rolling updates. When you update the container image, Kubernetes gradually replaces old pods with new ones, ensuring some instances remain available throughout the update.
You can control the update strategy: how many pods to update at once, how long to wait between updates, and what conditions must be met before an update is considered successful.
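On a Deployment, that configuration looks roughly like this (values are illustrative):

```yaml
# Excerpt from the hsm-proxy Deployment: replace one pod at a time,
# never dropping below the desired replica count.
spec:
  replicas: 3
  minReadySeconds: 10     # a new pod must stay ready this long before we proceed
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep all 3 replicas serving throughout
      maxSurge: 1         # allow one extra pod while the new version rolls in
```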
For security services, this is critical. We can deploy updates without downtime, and if an update introduces problems, we can roll back quickly.
Monitoring and Logging
Kubernetes doesn’t include comprehensive monitoring out of the box, but it provides building blocks:
Resource metrics: CPU and memory usage for each pod.
Logs: stdout/stderr from containers are collected and queryable.
Events: Kubernetes generates events for state changes (pod started, container crashed, etc.).
We’re integrating Kubernetes with our Elasticsearch-based logging infrastructure. Logs from all pods flow to Elasticsearch, tagged with pod name, service name, and namespace.
For metrics, we’re evaluating options. Heapster is the most common choice for collecting and aggregating Kubernetes metrics, but it’s still early days.
Persistent Storage Challenges
Most of our services are stateless, but some components (like our audit log database) need persistent storage. Kubernetes has support for persistent volumes, but it’s one of the less mature features.
The abstraction is good: you claim a volume with specific requirements (size, performance), and Kubernetes binds it to your pod. But the underlying implementation depends on the storage provider (AWS EBS, GCE Persistent Disks, NFS, etc.), and each has quirks.
For production, we’re planning to run stateful components (databases) outside Kubernetes initially, with only stateless services in the cluster. As Kubernetes persistent storage matures, we’ll revisit this.
Resource Management
Kubernetes allows specifying resource requests and limits for each container:
Requests: Minimum resources guaranteed to the container. Kubernetes only schedules the pod on nodes with sufficient resources.
Limits: Maximum resources the container can use. Kubernetes enforces these limits.
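In the container spec, that looks like this (values are illustrative):

```yaml
# Excerpt from a container spec: the scheduler guarantees the request;
# the node enforces the limit.
resources:
  requests:
    cpu: 250m        # a quarter of a core guaranteed
    memory: 256Mi
  limits:
    cpu: "1"
    memory: 512Mi    # exceeding this gets the container killed, not the node
```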
This prevents resource contention. A misbehaving service can’t consume all CPU on a node and starve other services.
For our cryptographic services, we’ve found that memory limits are particularly important. Some operations are memory-intensive, and we need out-of-memory conditions to kill the offending container rather than destabilize the entire node.
Multi-Tenancy and Isolation
Kubernetes namespaces provide logical isolation. We’re using namespaces to separate:
- Different environments (dev, staging, production)
- Different customers in our multi-tenant platform
- Different service tiers (production services vs. monitoring services)
Resource quotas can be applied to namespaces, limiting total resource consumption. This prevents one tenant from consuming all cluster resources.
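A minimal quota for a staging namespace might look like this (values are illustrative):

```yaml
# Caps the total resources all pods in the staging namespace can claim.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: staging-quota
  namespace: staging
spec:
  hard:
    pods: "20"
    cpu: "10"
    memory: 20Gi
```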
However, namespaces are a soft isolation boundary. They’re not a security boundary equivalent to separate clusters. For highly sensitive workloads, we’re still evaluating whether namespace isolation is sufficient.
Production Readiness Assessment
After a few weeks of testing, here’s my assessment of Kubernetes for production security workloads:
Ready:
- Basic container orchestration (scheduling, replication, health checking)
- Service discovery and load balancing
- Rolling updates and rollbacks
- Resource management
Needs work:
- Secrets management (not secure enough for highly sensitive credentials)
- Persistent storage (works, but much less mature than the stateless-workload story)
- Monitoring and observability (requires external tools)
- Security isolation (namespaces provide logical separation, not strong isolation)
Unknown:
- Operational complexity at scale
- Performance under high load
- Edge cases and failure modes
Next Steps
We’re planning a phased rollout:
Phase 1 (next month): Deploy stateless microservices to Kubernetes in a test environment. Gain operational experience.
Phase 2 (Q4): Deploy non-critical production services. Learn about monitoring, debugging, and incident response with Kubernetes.
Phase 3 (Q1 2016): Consider deploying critical security services after building confidence in stability and security.
We’re also contributing back to Kubernetes. Security workloads have requirements that push Kubernetes in new directions, and the community is responsive to feedback.
Key Takeaways
For teams evaluating Kubernetes for production:
- Start with stateless services - they’re the easiest to orchestrate
- Invest in monitoring and logging early - debugging container orchestration requires good visibility
- Understand Kubernetes networking thoroughly - it’s more complex than it first appears
- Test failure modes extensively - how does your application behave when pods crash, nodes fail, or the network partitions?
- Don’t rush to production - Kubernetes is powerful but complex, with many moving parts
- Secrets management needs extra attention for security-sensitive workloads
- The community is active and helpful - engage early and often
Kubernetes is impressive for a 1.0 release. It solves real problems we’ve been struggling with. But it’s also complex, and production deployment requires understanding many components and their interactions.
I’m optimistic about Kubernetes for security workloads, but cautiously so. We’ll be validating thoroughly before trusting it with our most critical services. The next few months will be interesting as we gain production experience.
The container orchestration space is moving fast, and Kubernetes seems to be emerging as the leader. For teams building microservices, it’s worth serious consideration. Just make sure you understand what you’re getting into - it’s not a simple tool, but it is a powerful one.