The Kubernetes ecosystem has rapidly evolved from a container orchestration platform into a comprehensive application management system. One of the most powerful extensibility mechanisms that has emerged is the Operator pattern, which allows developers to encode operational knowledge directly into software. In this post, I’ll explore how custom controllers and operators work, and when you should consider building your own.
Understanding the Controller Pattern
At its core, Kubernetes operates on a reconciliation loop model. Controllers continuously observe the desired state (defined in resources) and the actual state (running in the cluster), then take actions to reconcile any differences. This declarative approach is what makes Kubernetes so powerful for managing distributed systems.
The built-in controllers handle standard resources like Deployments, Services, and StatefulSets. But what happens when you need to manage complex, stateful applications that have domain-specific operational requirements? This is where custom controllers come in.
The Operator Pattern Explained
An Operator is a method of packaging, deploying, and managing a Kubernetes application. More specifically, it’s a custom controller that uses Custom Resource Definitions (CRDs) to manage applications and their components.
The key insight is that operators encode human operational knowledge into software. Instead of having a human operator understand how to back up a database, scale it, or handle failover, you write code that does this automatically.
Building a Custom Controller
Let’s walk through the fundamental components of building a custom controller in Go, which is the dominant language for Kubernetes ecosystem development.
First, you define a Custom Resource Definition (CRD):
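apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                version:
                  type: string
                replicas:
                  type: integer
                  minimum: 1
                storage:
                  type: string
            # Declared so the status written by the controller is not pruned.
            status:
              type: object
              x-kubernetes-preserve-unknown-fields: true
      # Enables the /status subresource used for status updates.
      subresources:
        status: {}
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database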
This CRD allows users to declare database instances using a simple YAML manifest:
apiVersion: example.com/v1
kind: Database
metadata:
  name: production-db
spec:
  version: "13.0"
  replicas: 3
  storage: "100Gi"
The Controller Implementation
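The controller implementation follows a standard pattern: informers feed object keys into a rate-limited work queue, and a pool of worker goroutines drains that queue by calling the reconcile function for each key: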
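import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
	appslisters "k8s.io/client-go/listers/apps/v1"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/record"
	"k8s.io/client-go/util/workqueue"
	"k8s.io/klog/v2"

	// Generated by code-generator; these module paths are placeholders.
	clientset "example.com/database-operator/pkg/generated/clientset/versioned"
	listers "example.com/database-operator/pkg/generated/listers/example/v1"
)

type DatabaseController struct {
	kubeclientset      kubernetes.Interface
	dbclientset        clientset.Interface
	dbLister           listers.DatabaseLister
	dbInformerSynced   cache.InformerSynced
	statefulSetLister  appslisters.StatefulSetLister
	statefulSetsSynced cache.InformerSynced
	workqueue          workqueue.RateLimitingInterface
	recorder           record.EventRecorder
}

func (c *DatabaseController) Run(threadiness int, stopCh <-chan struct{}) error {
	defer runtime.HandleCrash()
	defer c.workqueue.ShutDown()

	klog.Info("Starting Database controller")

	// Wait for both informer caches to sync before starting workers.
	if ok := cache.WaitForCacheSync(stopCh, c.dbInformerSynced, c.statefulSetsSynced); !ok {
		return fmt.Errorf("failed to wait for caches to sync")
	}

	// Launch workers; each one drains the queue until it is shut down.
	for i := 0; i < threadiness; i++ {
		go wait.Until(c.runWorker, time.Second, stopCh)
	}

	<-stopCh
	return nil
}

func (c *DatabaseController) runWorker() {
	for c.processNextWorkItem() {
	}
}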
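func (c *DatabaseController) processNextWorkItem() bool {
	obj, shutdown := c.workqueue.Get()
	if shutdown {
		return false
	}

	err := func(obj interface{}) error {
		// Done must always be called so the queue knows the item is finished.
		defer c.workqueue.Done(obj)

		key, ok := obj.(string)
		if !ok {
			// Malformed item: forget it so it is not retried forever.
			c.workqueue.Forget(obj)
			return fmt.Errorf("expected string in workqueue but got %#v", obj)
		}

		if err := c.syncHandler(key); err != nil {
			// Requeue with backoff so transient errors are retried.
			c.workqueue.AddRateLimited(key)
			return fmt.Errorf("error syncing '%s': %s, requeuing", key, err.Error())
		}

		// Success: reset the backoff counter for this key.
		c.workqueue.Forget(obj)
		return nil
	}(obj)
	if err != nil {
		runtime.HandleError(err)
	}

	// Keep the worker running whether or not this item failed.
	return true
}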
The Reconciliation Logic
The heart of any controller is the reconciliation function. This is where you implement the business logic:
func (c *DatabaseController) syncHandler(key string) error {
	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}

	// Get the Database resource from the informer cache.
	db, err := c.dbLister.Databases(namespace).Get(name)
	if err != nil {
		if errors.IsNotFound(err) {
			// Resource deleted, clean up anything it owned.
			return c.cleanupDatabase(namespace, name)
		}
		return err
	}

	// Get the existing StatefulSet, creating it if it does not exist yet.
	statefulSet, err := c.statefulSetLister.StatefulSets(namespace).Get(name)
	if errors.IsNotFound(err) {
		statefulSet, err = c.kubeclientset.AppsV1().StatefulSets(namespace).Create(
			context.TODO(),
			c.newStatefulSet(db),
			metav1.CreateOptions{},
		)
	}
	// Covers lookup errors other than NotFound as well as a failed Create.
	if err != nil {
		return err
	}

	// Update if the replica count has drifted from the spec.
	if statefulSet.Spec.Replicas == nil || *statefulSet.Spec.Replicas != db.Spec.Replicas {
		// Objects returned by a lister are shared with the cache; copy before mutating.
		updated := statefulSet.DeepCopy()
		replicas := db.Spec.Replicas
		updated.Spec.Replicas = &replicas
		statefulSet, err = c.kubeclientset.AppsV1().StatefulSets(namespace).Update(
			context.TODO(),
			updated,
			metav1.UpdateOptions{},
		)
		if err != nil {
			return err
		}
	}

	return c.updateDatabaseStatus(db, statefulSet)
}
Design Principles for Operators
When building operators, several design principles are critical:
Idempotency: Your reconciliation logic must be idempotent. It will be called many times for the same resource, often with nothing changed in between, and every call should converge on the same end state without side effects such as duplicate child resources.
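Edge-driven vs Level-driven: Kubernetes controllers are level-driven, meaning they react to the current state, not just changes. An event only triggers a reconcile; the decision itself is made from what the cluster looks like now, so a missed or coalesced event is corrected on the next pass. This is more robust than edge-driven systems that only react to events.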
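Error Handling: Failed reconciliation attempts should be retried with exponential backoff. Use rate-limiting work queues so that a persistently failing resource does not hot-loop, and so that many resources failing at once do not turn into a thundering herd against the API server.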
Status Reporting: Always update the status subresource to reflect the current state. This provides visibility to users and other controllers.
When to Build an Operator
Not every application needs an operator. Consider building one when:
- Your application has complex operational requirements (backup, restore, scaling, upgrades)
- You need to manage multiple interdependent resources as a single unit
- Domain-specific knowledge is required to operate the application
- You want to provide a simplified API for complex operations
For simpler applications, Helm charts or basic Kubernetes manifests may be sufficient.
Code Generation and Frameworks
Writing all the boilerplate code for a controller is tedious. Several tools can help:
kubebuilder: A framework for building Kubernetes APIs using CRDs. It generates scaffolding and provides libraries for common patterns.
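operator-sdk: Uses kubebuilder's scaffolding for Go projects and adds tooling specific to operators, including support for Helm-based and Ansible-based operators and integration with the Operator Lifecycle Manager (OLM).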
code-generator: The low-level code generation tools used by Kubernetes itself to generate clientsets, informers, and listers.
Testing Strategies
Testing operators requires multiple levels:
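Unit Tests: Test your reconciliation logic against fake clients rather than a live cluster. client-go ships a fake clientset (k8s.io/client-go/kubernetes/fake), and controller-runtime provides a fake client for controllers built on it.
Integration Tests: Use envtest (from controller-runtime) to run tests against a real API server and etcd started locally. There is no kubelet and none of the built-in controllers run, so objects such as StatefulSets are stored but no Pods are ever created for them.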
End-to-End Tests: Deploy your operator to a real cluster and test the full lifecycle of your custom resources.
Observability
Controllers should be instrumented with:
- Prometheus metrics for reconciliation duration, error rates, and queue depth
- Structured logging using contextual loggers
- Event recording to provide audit trails visible via kubectl describe
Conclusion
Kubernetes operators represent a powerful pattern for automating complex application management. By encoding operational knowledge into software, they enable teams to manage sophisticated stateful applications with the same declarative approach used for stateless workloads.
The key is understanding when the complexity of building an operator is justified. For applications with significant operational overhead, operators can dramatically reduce toil and increase reliability. As the ecosystem matures, we’re seeing operators become the standard way to run complex infrastructure components on Kubernetes.
Whether you’re managing databases, message queues, or custom business applications, the operator pattern provides a robust foundation for automated operations at scale.