The Kubernetes ecosystem has rapidly evolved from a container orchestration platform into a comprehensive application management system. One of the most powerful extensibility mechanisms that has emerged is the Operator pattern, which allows developers to encode operational knowledge directly into software. In this post, I’ll explore how custom controllers and operators work, and when you should consider building your own.

Understanding the Controller Pattern

At its core, Kubernetes operates on a reconciliation loop model. Controllers continuously observe the desired state (defined in resources) and the actual state (running in the cluster), then take actions to reconcile any differences. This declarative approach is what makes Kubernetes so powerful for managing distributed systems.
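To make the loop concrete, here is a minimal, framework-free sketch in Go. The types and helper names are illustrative only, not part of any Kubernetes library; a real controller would read desired state from the API server and issue API calls instead of returning action strings:

```go
package main

import "fmt"

// DesiredState and ActualState stand in for what a real controller reads
// from the API server and observes in the cluster.
type DesiredState struct{ Replicas int }
type ActualState struct{ Replicas int }

// reconcile compares desired and actual state and returns the actions
// needed to converge them.
func reconcile(desired DesiredState, actual ActualState) []string {
	var actions []string
	for actual.Replicas < desired.Replicas {
		actions = append(actions, "create replica")
		actual.Replicas++
	}
	for actual.Replicas > desired.Replicas {
		actions = append(actions, "delete replica")
		actual.Replicas--
	}
	return actions
}

func main() {
	// Desired: 3 replicas; actual: 1. Reconciliation yields two create actions.
	fmt.Println(reconcile(DesiredState{Replicas: 3}, ActualState{Replicas: 1}))
}
```

The key property is that the function looks only at the two states, never at the history of events that produced them.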

The built-in controllers handle standard resources like Deployments, Services, and StatefulSets. But what happens when you need to manage complex, stateful applications that have domain-specific operational requirements? This is where custom controllers come in.

The Operator Pattern Explained

An Operator is a method of packaging, deploying, and managing a Kubernetes application. More specifically, it’s a custom controller that uses Custom Resource Definitions (CRDs) to manage applications and their components.

The key insight is that operators encode human operational knowledge into software. Instead of having a human operator understand how to back up a database, scale it, or handle failover, you write code that does this automatically.

Building a Custom Controller

Let’s walk through the fundamental components of building a custom controller in Go, which is the dominant language for Kubernetes ecosystem development.

First, you define a Custom Resource Definition (CRD):

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: databases.example.com
spec:
  group: example.com
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                version:
                  type: string
                replicas:
                  type: integer
                  minimum: 1
                storage:
                  type: string
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: Database

This CRD allows users to declare database instances using a simple YAML manifest:

apiVersion: example.com/v1
kind: Database
metadata:
  name: production-db
spec:
  version: "13.0"
  replicas: 3
  storage: "100Gi"

The Controller Implementation

The controller implementation follows a standard pattern:

type DatabaseController struct {
    kubeclientset     kubernetes.Interface
    dbclientset       clientset.Interface
    dbLister          listers.DatabaseLister
    statefulSetLister appslisters.StatefulSetLister // used in syncHandler below
    dbInformerSynced  cache.InformerSynced          // used in Run below
    workqueue         workqueue.RateLimitingInterface
    recorder          record.EventRecorder
}

func (c *DatabaseController) Run(threadiness int, stopCh <-chan struct{}) error {
    defer runtime.HandleCrash()
    defer c.workqueue.ShutDown()

    klog.Info("Starting Database controller")

    // Wait for cache sync
    if ok := cache.WaitForCacheSync(stopCh, c.dbInformerSynced); !ok {
        return fmt.Errorf("failed to wait for caches to sync")
    }

    // Launch workers
    for i := 0; i < threadiness; i++ {
        go wait.Until(c.runWorker, time.Second, stopCh)
    }

    <-stopCh
    return nil
}

func (c *DatabaseController) runWorker() {
    for c.processNextWorkItem() {
    }
}

func (c *DatabaseController) processNextWorkItem() bool {
    obj, shutdown := c.workqueue.Get()
    if shutdown {
        return false
    }

    err := func(obj interface{}) error {
        defer c.workqueue.Done(obj)

        key, ok := obj.(string)
        if !ok {
            c.workqueue.Forget(obj)
            return fmt.Errorf("expected string in workqueue but got %#v", obj)
        }

        if err := c.syncHandler(key); err != nil {
            c.workqueue.AddRateLimited(key)
            return fmt.Errorf("error syncing '%s': %s, requeuing", key, err.Error())
        }

        c.workqueue.Forget(obj)
        return nil
    }(obj)

    if err != nil {
        runtime.HandleError(err)
        return true
    }

    return true
}

The Reconciliation Logic

The heart of any controller is the reconciliation function. This is where you implement the business logic:

func (c *DatabaseController) syncHandler(key string) error {
    namespace, name, err := cache.SplitMetaNamespaceKey(key)
    if err != nil {
        return err
    }

    // Get the Database resource
    db, err := c.dbLister.Databases(namespace).Get(name)
    if err != nil {
        if errors.IsNotFound(err) {
            // Resource deleted, cleanup
            return c.cleanupDatabase(namespace, name)
        }
        return err
    }

    // Get the existing StatefulSet, creating it if it doesn't exist yet
    statefulSet, err := c.statefulSetLister.StatefulSets(namespace).Get(name)
    if errors.IsNotFound(err) {
        statefulSet, err = c.kubeclientset.AppsV1().StatefulSets(namespace).Create(
            context.TODO(),
            c.newStatefulSet(db),
            metav1.CreateOptions{},
        )
    }
    if err != nil {
        return err
    }

    // Update if the declared replica count has drifted
    if statefulSet.Spec.Replicas == nil || *statefulSet.Spec.Replicas != db.Spec.Replicas {
        // Never mutate objects from the lister cache directly; copy first
        statefulSet = statefulSet.DeepCopy()
        statefulSet.Spec.Replicas = &db.Spec.Replicas
        statefulSet, err = c.kubeclientset.AppsV1().StatefulSets(namespace).Update(
            context.TODO(),
            statefulSet,
            metav1.UpdateOptions{},
        )
        if err != nil {
            return err
        }
    }

    return c.updateDatabaseStatus(db, statefulSet)
}

Design Principles for Operators

When building operators, several design principles are critical:

Idempotency: Your reconciliation logic must be idempotent. It will be called multiple times for the same resource, and it should always produce the same result.
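A quick way to sanity-check idempotency is to run the same reconcile step twice and assert that the second pass is a no-op. A toy illustration, not real client code:

```go
package main

import "fmt"

// converge moves actual toward desired and reports whether it changed anything.
// Calling it again with the same desired state must do nothing.
func converge(desired int, actual *int) bool {
	if *actual == desired {
		return false // already reconciled: nothing to do
	}
	*actual = desired
	return true
}

func main() {
	replicas := 1
	fmt.Println(converge(3, &replicas)) // true: first pass does work
	fmt.Println(converge(3, &replicas)) // false: second pass is a no-op
}
```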

Level-driven, not edge-driven: Kubernetes controllers are level-driven, meaning they reconcile against the full observed state each time, not just against individual change events. This makes them more robust than edge-driven systems, which can miss or misorder events.

Error Handling: Failed reconciliation attempts should be retried with exponential backoff. Use rate-limiting work queues to prevent thundering herd problems.

Status Reporting: Always update the status subresource to reflect the current state. This provides visibility to users and other controllers.

When to Build an Operator

Not every application needs an operator. Consider building one when:

  • Your application has complex operational requirements (backup, restore, scaling, upgrades)
  • You need to manage multiple interdependent resources as a single unit
  • Domain-specific knowledge is required to operate the application
  • You want to provide a simplified API for complex operations

For simpler applications, Helm charts or basic Kubernetes manifests may be sufficient.

Code Generation and Frameworks

Writing all the boilerplate code for a controller is tedious. Several tools can help:

kubebuilder: A framework for building Kubernetes APIs using CRDs. It generates scaffolding and provides libraries for common patterns.

operator-sdk: Built on kubebuilder, it provides additional tooling specifically for operator development.

code-generator: The low-level code generation tools used by Kubernetes itself to generate clientsets, informers, and listers.

Testing Strategies

Testing operators requires multiple levels:

Unit Tests: Test your reconciliation logic with mock clients. The controller-runtime library provides excellent testing utilities.

Integration Tests: Use envtest to run tests against a real API server and etcd.

End-to-End Tests: Deploy your operator to a real cluster and test the full lifecycle of your custom resources.

Observability

Controllers should be instrumented with:

  • Prometheus metrics for reconciliation duration, error rates, and queue depth
  • Structured logging using contextual loggers
  • Event recording to provide audit trails visible via kubectl describe

Conclusion

Kubernetes operators represent a powerful pattern for automating complex application management. By encoding operational knowledge into software, they enable teams to manage sophisticated stateful applications with the same declarative approach used for stateless workloads.

The key is understanding when the complexity of building an operator is justified. For applications with significant operational overhead, operators can dramatically reduce toil and increase reliability. As the ecosystem matures, we’re seeing operators become the standard way to run complex infrastructure components on Kubernetes.

Whether you’re managing databases, message queues, or custom business applications, the operator pattern provides a robust foundation for automated operations at scale.