Write a Kubernetes controller (operator) with operator-sdk

TL;DR:

unless you need really low-level control or are writing a specialized controller, use one of the helper libraries like the operator-sdk or kubebuilder to avoid writing a lot of boilerplate code,
handling events and objects in the controller is about synchronizing the state between API object and the system, not about reacting to events,
try to be stateless, as keeping state is your problem,
when you write a Kubernetes controller, be sure you know what your state is and how to make it durable,
remember your controller can be restarted any time,
use a Custom Resource Definition to create any API object you need,
DeepCopy() API objects if you’re changing them,
if you’re creating new API objects, always set their OwnerReference.

Introduction

Some time ago I posted an entry about how to write a Kubernetes controller. It was a very minimalistic example of a controller, which aimed only at putting the main pieces in place. Still, I wanted to explore the topic further and write something more realistic and useful. As you can learn in the previous blog entry, despite the simple idea of how a controller operates, writing one requires a lot of boilerplate code. Fortunately, there are already libraries/frameworks that are trying to remove this boilerplate stuff and get you on track much faster. I had a look at two of them: the operator-sdk and kubebuilder. As I found the operator-sdk first, I started with it. In this blog entry, I want to show you how to write a Kubernetes controller with operator-sdk.

My idea was pretty simple. I wanted to write an operator that is useful but still very simple, so it can serve as tutorial material showing how to build operators. This turned into a Netperf Operator, a Kubernetes tool that allows you to run the good old network performance benchmarking tool called netperf (yes, click this URL and admire pure HTML at its best!). I wanted to be able to run netperf, a client-server application, between 2 Kubernetes pods, preferably running on different cluster nodes. Such tests are pretty valuable as they are the only way you can check your real pod-to-pod network performance, including the impact of all the networking layers, like overlay networks and such. Normally, running netperf requires you to start a server at one place and a netperf client at the other end of the tested connection. I wanted to make this a one-step process, managed by an operator. And of course, learn and show you how to write a Kubernetes controller. You can check the resulting project on my github page: netperf-operator.

Oh, but why do I call it “operator”, not “controller”? These concepts are very close: people tend to name Kubernetes controllers specialized in running and configuring a single application “an operator”. There are already many awesome operators. Prometheus-operator is just one of them, but it really shows the power of an idea of running an application integrated with your kubernetes cluster and managed “the cluster way”. You can find other operators on the awesome-operators github page – be sure to check them.

How does a Kubernetes controller/operator work and run?

Well, this is mainly a reminder, but still a crucial one, so let’s say it once again. A very simplified controller’s life looks like this:

while true {
  receiveInfoAboutAPIObjects()
  synchronizeRealStateToMatchFetchedInfo()
}

There are a few consequences coming from the simple pseudocode above. The first one is that you need to receive notifications about the state of the Kubernetes objects you’re interested in, and about its changes. Thankfully, the operator-sdk comes to the rescue here. It takes care of the synchronization with the Kubernetes API server and the notification loop. It also allows you to decide what type of API objects you want to observe.

Another important property is that the synchronization loop is not really event-driven, but state-driven. It means that your controller won’t receive events only when something changes, for example, when a Pod dies or is created. You will get them then, but you will also get periodic refreshers: updates that show you the complete required state of the Kubernetes object, no matter if it changed or not since the last update. This is a very important property and you have to get it right. It took me some time to get used to it, so let me rephrase it. The notification loop doesn’t tell you what change you need to perform in a system, but what the system should look like. It declares and describes the desired state, not the change that’s needed to produce it. Figuring out how to get to the desired state is a task for the controller.

Let me give you an example here. Let’s suppose we’re writing a ReplicaSet controller. The ReplicaSet controller takes a pod configuration template and a desired number of pods to create. Its task is to ensure that the declared number of pods is always running in the system. Let’s now assume we have created a ReplicaSet object with a Pod template, we set the pod count to 5, and 5 pods are already running in the cluster. Now, if any of the pods dies, the ReplicaSet controller gets a pod status update. But its reaction is not just “create a new pod”. Remember, a controller doesn’t react directly to the event, but checks for a difference between the configured and the real state. So, the controller sees that there are currently 4 pods running in the system and the required count is 5, so the solution is to start 1 new pod. But the same synchronization logic is also run periodically, even if there are no pod events in the system. Imagine now that a whole node dies in the cluster and this node was running 2 out of 5 of our pods. The cluster is not solving the problem of “how to notify the controller that 2 pods have died”. Instead, the controller will run the synchronization loop and see that there are now only 3 pods in the system and 5 are required, so the solution to reconcile the state is to start 2 new pods. So, we’re checking the desired state and making it a reality, not simply reacting to events.
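The ReplicaSet reasoning above can be sketched in a few lines of Go. This is only an illustration of the state-driven idea, not how the real ReplicaSet controller is implemented: the reconcile() function and the pod names are made up for this example. Note that the function never asks which event triggered it; it only compares the desired count with what is actually running.

```go
package main

import "fmt"

// reconcile compares the desired pod count with the pods that are actually
// running and returns how many pods to start or stop. It never asks which
// event triggered the call; only the two states matter.
// (Hypothetical helper, written for this illustration only.)
func reconcile(desired int, running []string) (toStart, toStop int) {
	diff := desired - len(running)
	if diff > 0 {
		return diff, 0
	}
	return 0, -diff
}

func main() {
	// A single pod died: 4 of 5 are left.
	fmt.Println(reconcile(5, []string{"p1", "p2", "p3", "p4"}))
	// A node died and took 2 pods with it: 3 of 5 are left.
	fmt.Println(reconcile(5, []string{"p1", "p2", "p3"}))
	// Periodic resync with nothing to do.
	fmt.Println(reconcile(5, []string{"p1", "p2", "p3", "p4", "p5"}))
}
```

The same function handles the single pod failure, the node failure and the periodic resync, which is exactly why a controller doesn’t need a separate code path for each kind of event.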

One more important thing is that you should expect your controller/operator process to be terminated at any time. Most probably your controller will run as just another pod in the cluster, and pods are mortal and should be easy to restart and recreate. This means that if you need to keep state, you basically have to keep it outside of your controller, so that after a controller pod restart you can resume the controller’s operations.

How to write a Kubernetes controller with operator-sdk – bootstrapping

After the general concepts presented above, we’re ready to write a Kubernetes controller and bootstrap a new operator project. This topic is nicely covered on the project’s github page. Definitely, have a look there for a reference. I bootstrapped my project with

operator-sdk new netperf-operator --api-version=app.example.com/v1alpha1 --kind=Netperf

Custom Resources and Custom Resource Definitions

OK, but let’s pause for a second. You might be wondering how useful the controller pattern really is, apart from the controllers that are already built into Kubernetes, like the Deployment or StatefulSet controller. After all, not everything is about pods and services, which are native objects in Kubernetes. But can you write a controller that handles some other objects? The great thing is that the Kubernetes API and object system is easily extensible. The same way the cluster gets information for standard controllers from your description of Pods, Deployments or Services, it can also handle any Custom Resource. It just needs to be configured using a Custom Resource Definition. Using this mechanism, you can introduce any object you need into your cluster and configure it using YAML files and “kubectl” or API calls. You can find the CRD for netperf-operator here. The whole declaration is really in these few lines:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: netperfs.app.example.com
spec:
  group: app.example.com
  names:
    kind: Netperf
    listKind: NetperfList
    plural: netperfs
    singular: netperf
  scope: Namespaced
  version: v1alpha1

It just defines the name of your Custom Resource (singular and plural), API object kind (Netperf) and that your CRD is scoped to a single namespace. And that’s it, you’re ready to create your own new “Netperf” objects. As you can see, we’re not giving here any data schema that we expect from the Custom Resource object. We handle that in the controller code; for CRD definition the object schema is irrelevant. Still, there’s a new feature in kubernetes 1.11 that allows for validating a Custom Resource by embedding Custom Resource validation schema in CRD definition, but it’s a new thing and just beta in 1.11.

So, the next step is to define your Custom Resource schema in code. Basic stuff is already generated by the “operator-sdk new” command used to bootstrap the project (check this directory). The important part is that we have to add the Spec and Status parts of the Custom Resource (CR) object. Spec is the specification, so basically, an input describing the object. Status shows, well, the status of the object. Here’s what the full definition looks like, but the most important part is:

type Netperf struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata"`
	Spec              NetperfSpec   `json:"spec"`
	Status            NetperfStatus `json:"status,omitempty"`
}

type NetperfSpec struct {
	ServerNode string `json:"serverNode"`
	ClientNode string `json:"clientNode"`
}

type NetperfStatus struct {
	Status          string  `json:"status"`
	ServerPod       string  `json:"serverPod"`
	ClientPod       string  `json:"clientPod"`
	SpeedBitsPerSec float64 `json:"speedBitsPerSec"`
}

The Netperf struct itself (the first six lines above) was already generated. We’re just providing the NetperfSpec and NetperfStatus definitions. In Spec, I’m expecting just 2 input variables: the names of the Kubernetes nodes where the controller should run the netperf server and client pods. With a Spec like that, we can define a netperf test between any 2 cluster nodes we want. The Status shows the names of ServerPod and ClientPod, the overall Status of the current CR (a single netperf test) and the final test result in bits per second.

When your definition is done, you have to run the command:

operator-sdk generate k8s

This will generate some helper functions for your CR, like deep copying the object instances. Now your definition of the Custom Resource is done and you can already use it!

The controller loop – checking what needs to be done

Do you remember how much code it took to start a control loop without any controller library? You might be surprised when you check the main.go file. Basically, the bootstrap code is this:

resource := "app.example.com/v1alpha1"
kind := "Netperf"
namespace, err := k8sutil.GetWatchNamespace()
if err != nil {
	logrus.Fatalf("Failed to get watch namespace: %v", err)
}
resyncPeriod := 5
sdk.Watch(resource, kind, namespace, resyncPeriod)
sdk.Watch("v1", "Pod", namespace, resyncPeriod)
sdk.Handle(stub.NewHandler(operator.NewNetperf(realkube.NewRealProvider())))
sdk.Run(context.TODO())

We’re declaring API resource type and kind, how frequently we want to get object status update from the API server and then in lines 8 and 9 we just start to watch changes on our “Netperf” and “Pod” objects. Why “Pod”? Because our Netperf Operator creates new pods with netperf client and server and it needs to react to changes and control their lifetime. After that, we register the Handler (line 10, the stub was generated during the bootstrap step as well) and then you Run the whole thing in line 11. The generated handler stub is just a generic handler of any type of objects. These generic objects are switched into domain-specific object types as soon as possible and handled accordingly with (full source):

func (h *Handler) Handle(ctx context.Context, event sdk.Event) error {
	// Dispatch on the concrete type of the object carried by the event.
	switch obj := event.Object.(type) {
	case *v1alpha1.Netperf:
		return h.operator.HandleNetperf(obj, event.Deleted)
	case *v1.Pod:
		return h.operator.HandlePod(obj, event.Deleted)
	default:
		// %v, not %s: event is a struct, not a string.
		logrus.Warnf("unknown event received: %v", event)
	}
	return nil
}

OK, now we have everything needed to write a custom Kubernetes controller in place: Custom Resource Definition (CRD), Netperf Custom Resource (CR) and the control loop, where we can easily react to the incoming information about the system state we have to create. Now the “only” thing left is to add our business logic that does that.

Business logic, a.k.a. how to run Netperf on Kubernetes using API calls

I won’t describe the whole code line by line here, you can check the source here. But I want to give an overview that will hopefully make the code easier to understand.

Handling state

The main concept is that we need to react to events in a different way, depending on what we have already completed and the state the Netperf object is in. It’s basically a finite state machine: our action depends on the state we’re in and the event we receive. We assume a single Netperf object can be in one of the following states:

NetperfPhaseInitial: resource was created, but no actions were taken yet;
NetperfPhaseServer: server pod is being created, but the client is not yet started;
NetperfPhaseTest: the client pod is created as well and the test should be in progress;
NetperfPhaseDone: the client pod has stopped, we collect the result and stop both server and client pods; the test is complete;
NetperfPhaseError: something bad happened and the controller was not able to complete the test; we can’t proceed and finish with an error.

Check the following image for visualization of states (circles), events (black font) and actions (violet font) we have to handle.

[Figure: the state diagram of a Netperf object]
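The states listed above can also be written down as a small transition function. This is a simplified sketch, not the code from the netperf-operator repository: the Phase string values, the Observation struct and nextPhase() are assumptions made for illustration. The point is that the next state depends only on the current state and on what we observe in the cluster, so running the function again with the same input changes nothing.

```go
package main

import "fmt"

// Phase names follow the list above; the string values are assumptions.
type Phase string

const (
	PhaseInitial Phase = "Initial"
	PhaseServer  Phase = "Server"
	PhaseTest    Phase = "Test"
	PhaseDone    Phase = "Done"
	PhaseError   Phase = "Error"
)

// Observation is a hypothetical summary of what the controller currently
// sees in the cluster for one Netperf object.
type Observation struct {
	ServerPodExists bool
	ServerPodReady  bool
	ClientPodExists bool
	ClientPodDone   bool
	Failed          bool
}

// nextPhase returns the phase the Netperf object should move to, given its
// current phase and the observed state. Done and Error are terminal.
func nextPhase(current Phase, o Observation) Phase {
	if current == PhaseDone || current == PhaseError {
		return current
	}
	if o.Failed {
		return PhaseError
	}
	switch current {
	case PhaseInitial:
		if o.ServerPodExists {
			return PhaseServer
		}
	case PhaseServer:
		if o.ServerPodReady && o.ClientPodExists {
			return PhaseTest
		}
	case PhaseTest:
		if o.ClientPodDone {
			return PhaseDone
		}
	}
	return current
}

func main() {
	phase := PhaseInitial
	steps := []Observation{
		{ServerPodExists: true},
		{ServerPodExists: true}, // periodic resync, server not ready yet
		{ServerPodExists: true, ServerPodReady: true, ClientPodExists: true},
		{ServerPodExists: true, ServerPodReady: true, ClientPodExists: true, ClientPodDone: true},
	}
	for _, o := range steps {
		phase = nextPhase(phase, o)
		fmt.Println(phase)
	}
}
```

The second step in main() is a repeated notification with no real change, and the phase stays where it was. That is the behaviour we need from a state-driven loop.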

As you can see, to make everything work, we have to keep state: our current action depends on what has already been applied to a Netperf object. That’s a major issue. We can’t just keep the state in the controller’s memory, as we have to deal with failures. No matter if we run our controller on Kubernetes itself, as a Pod, or outside the cluster, as a standalone process, we have to handle restarts. So, the solution is to offload the problem of keeping state to some external durable storage. If our state were big, we could use an external key-value store, like Redis or etcd. Fortunately, our whole state is just a single variable with the state’s name. In this case, we can just keep the state within the Netperf object itself, so the Kubernetes API server becomes our durable storage. At the same time, we give better feedback to our users, who can check the state of a Netperf test. When you write a Kubernetes controller, make sure you know what your state is and how to make it durable.

Still, the bottom line is: your process can terminate at any moment. Write a kubernetes controller assuming it can be terminated and restarted at any time.

Oh, one more thing: in your code, you should never change the object that you received in the control loop. Always make a copy of the object first. That’s why you have this generated DeepCopy() function. Then, save the copy with an API server call.
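Here is a minimal sketch of that copy-then-save pattern. Everything in it is a stand-in: the Netperf type is cut down to three fields, DeepCopy() is written by hand (in the real project it is generated by “operator-sdk generate k8s”), and fakeAPI only imitates an API server so the example runs on its own.

```go
package main

import "fmt"

// A cut-down Netperf object; the real one has Spec, Status and metadata.
type Netperf struct {
	Name   string
	Labels map[string]string
	Status string
}

// DeepCopy is hand-written for this sketch. Maps are reference types, so
// they must be copied entry by entry.
func (n *Netperf) DeepCopy() *Netperf {
	if n == nil {
		return nil
	}
	out := *n
	if n.Labels != nil {
		out.Labels = make(map[string]string, len(n.Labels))
		for k, v := range n.Labels {
			out.Labels[k] = v
		}
	}
	return &out
}

// fakeAPI imitates the API server as durable storage (illustration only).
type fakeAPI struct {
	stored map[string]Netperf
}

func (a *fakeAPI) Update(n *Netperf) {
	a.stored[n.Name] = *n.DeepCopy()
}

// setPhase never touches the object received from the control loop: it
// copies it, changes the copy and saves the copy.
func setPhase(api *fakeAPI, received *Netperf, phase string) *Netperf {
	updated := received.DeepCopy()
	updated.Status = phase
	api.Update(updated)
	return updated
}

func main() {
	api := &fakeAPI{stored: map[string]Netperf{}}
	received := &Netperf{Name: "example", Labels: map[string]string{"app": "netperf"}, Status: "Initial"}

	updated := setPhase(api, received, "Server")
	fmt.Println(received.Status, updated.Status, api.stored["example"].Status)
}
```

The object received from the loop still says “Initial”, while the copy and the stored version say “Server”. Since the phase now lives in the stored object, a restarted controller can read it back and continue from there.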

Control flow overview

Let’s briefly go over what Netperf operator does for events in particular states:

We start with the HandleNetperf() method, which just checks if the event is about a Netperf object being deleted or created/updated (since we synchronize the state to match the API object, there’s actually no need to tell a create from an update event).
- If it was a create/update, we go to handleNetperfUpdateEvent(), where we make sure that the Server pod exists and is started. Remember, we can receive this event multiple times, not just once or only when a new Netperf object is created! So, we check the current Netperf object state. If it is “Initial” or “Server”, we run startServerPod(), which makes sure the pod either already exists or is created (it doesn’t blindly create the pod). In other states, we ignore the update request, as the server pod must already have been created to get to these states.
- If the delete flag was set, we call deleteNetperfPods(), which basically does… nothing! Our netperf pods will be automatically deleted by the API server when the Netperf object that created them is deleted. This is possible because we correctly set the OwnerReference for our pods.
The second entry point into our business logic is the HandlePod() method. Keep in mind that with this approach the method is called for every pod in the same namespace. So, we start by checking if the owner of the pod is an existing Netperf object. If not, our controller doesn’t need to care about it. If it is a related pod event, in handlePodUpdateEvent() we check if it is about a Server or a Client pod. Then, we call the event handler function for the specific kind of pod.
- In handleServerPodEvent() we check if the Server pod is already up and running. If not, we wait. If it’s ready, we check if the client pod is already created for this Netperf object. Again, we’re not blindly creating a client pod, as it might have been already created. Instead, we check the current state and create the Client pod only if it doesn’t yet exist.
- In handleClientPodEvent() we check the Client pod status. If it has already completed, we can get the output log from it, parse it and get the netperf speed result value. We also clean up both the Client and Server pods. Finally, we do one last update of the Netperf object, to include the test result in the Status part of the object definition.
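Two ideas from the flow above are worth a tiny sketch: creating a pod only if it doesn’t exist yet, and relying on ownership for cleanup. The code below uses an in-memory fakeCluster instead of the Kubernetes API, and a plain Owner string instead of a real OwnerReference. All names are made up for illustration and the real operator code differs.

```go
package main

import "fmt"

// Pod is a stand-in for a Kubernetes pod; Owner plays the role of the
// OwnerReference pointing at a Netperf object.
type Pod struct {
	Name  string
	Owner string
}

// fakeCluster imitates the API server's view of pods (illustration only).
type fakeCluster struct {
	pods    map[string]Pod
	creates int
}

// ensurePod creates the pod only if it is missing, so it is safe to call
// on every notification. It reports whether a pod was actually created.
func (c *fakeCluster) ensurePod(name, owner string) bool {
	if _, exists := c.pods[name]; exists {
		return false
	}
	c.pods[name] = Pod{Name: name, Owner: owner}
	c.creates++
	return true
}

// ownedByKnownNetperf is the filter a pod handler needs, because it is
// called for every pod in the namespace, not only for ours.
func ownedByKnownNetperf(p Pod, netperfs map[string]bool) bool {
	return p.Owner != "" && netperfs[p.Owner]
}

// deleteOwner imitates the garbage collection done by the cluster: when
// the owner goes away, all pods that reference it are removed.
func (c *fakeCluster) deleteOwner(owner string) {
	for name, p := range c.pods {
		if p.Owner == owner {
			delete(c.pods, name)
		}
	}
}

func main() {
	c := &fakeCluster{pods: map[string]Pod{"unrelated": {Name: "unrelated"}}}
	netperfs := map[string]bool{"example": true}

	// The same update arrives three times; the pod is created once.
	for i := 0; i < 3; i++ {
		c.ensurePod("netperf-server-example", "example")
	}
	fmt.Println("create calls:", c.creates)

	fmt.Println(ownedByKnownNetperf(c.pods["netperf-server-example"], netperfs))
	fmt.Println(ownedByKnownNetperf(c.pods["unrelated"], netperfs))

	// Deleting the Netperf object removes its pods, nothing else.
	c.deleteOwner("example")
	fmt.Println("pods left:", len(c.pods))
}
```

In the real cluster the last step is not something the controller does itself. It is the reason deleteNetperfPods() can stay empty, as long as the owner is set on every pod the controller creates.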

Building and running the project

Once you have written a Kubernetes controller, you naturally want to build and run it. Project building is described with a fairly complex workflow on the operator-sdk github page. This workflow includes building a Docker image, then pushing it to the registry and after that deploying to a test cluster. This is nice when you want to test the full deployment cycle, but it’s terrible for development, as it takes a long time to build and deploy. On the github page of Netperf Operator, I described another approach for the development cycle, where you can just build and debug a local Go binary, without even touching Docker or deploying to a cluster. Be sure to check it out. You can also find information about how to run the Netperf Operator there.

How to write a Kubernetes controller – a short summary

OK, this entry is a lengthy one, not to mention the linked code. Still, I really think that using a library like operator-sdk makes the whole thing much easier and faster. And creating a custom controller using a Custom Resource Definition really opens up plenty of possibilities for your needs. After all, Kubernetes is “just a new operating system”: at some point you have to write your own application… or rather write a Kubernetes controller!
