In the previous post we covered the technical issues that developers face when writing an operator, and the problems we can solve at the framework level. In this post we describe a feature recently introduced in Java Operator SDK called Event Sources. This is a major rework of the core of the framework, which now supports various optimizations and extensibility mechanisms. We will see how and why to use them, and what the related best practices are.
The Definition of the Problem
When creating an operator we extend the Kubernetes API with our own custom types. The intention behind these custom resources is to hide the complexity of resource management processes and just give a nice API, where the user executes mostly CRUD operations on custom resources. All the complex logic and workflow is then implemented in a controller. However, inside a controller we usually just manage other resources: we create pods, deployments, persistent volumes or other Kubernetes or non-Kubernetes resources. Let’s call these resources “dependent resources”, since they depend on our custom resource. In other words, we manage these dependent resources through a custom resource.
The usual workflow is that when a custom resource is created, we also create the dependent resources. If all the dependent resources are in the desired state, the controller updates the status sub-resource of our custom resource, so it’s also visible to the user. There are two problems we usually face in this process:
- When we create a dependent resource, it can take a long time until it is successfully created and reaches its target state (think of provisioning a database, for example). So we have two options: wait synchronously in the controller, blocking the thread, until the resource is created, or react asynchronously to changes of the dependent resource, that is, execute the controller again when the dependent resource changes and continue the workflow.
- Consider a state where all the dependent resources are created, but suddenly one of them is destroyed; say, for the sake of simplicity, a pod crashes. Until now we had no way to react to such an event before the next controller execution (or it was quite cumbersome to hack it into our operators), because we usually executed the controller only when the custom resource changed. We could add a timer (which we also support now) to periodically execute the controller and poll the state of the dependent resource. This works, but it’s not ideal: if we have hundreds of custom resource instances, polling all the related APIs is not efficient. What we want is a way to get notified, in other words to trigger the controller when there are changes in the resources that we manage.
So to cover such scenarios, and to make it possible to elegantly watch and react to state changes of dependent resources, we introduced the concept of event sources.
Event sources are a relatively simple yet powerful and extensible concept for triggering controller executions, usually based on changes of dependent resources. To solve the problems mentioned above, we in effect watch the resources we manage for changes and reconcile the state when a resource changes. Note that the resources we watch can be Kubernetes and also non-Kubernetes objects. Typically, in the case of non-Kubernetes objects or services, we can extend our operator to handle webhooks or websockets, or to react to any event coming from a service we interact with.
When we create a dependent resource, we also register an Event Source that propagates events about changes of that resource. This way we avoid the need for polling and can implement controllers very efficiently.
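As a rough illustration of this idea, here is a minimal sketch in plain Java. The names `TriggerEvent`, `SimpleEventSource` and `DependentResourceEventSource` are made up for this example and are not the SDK's actual types. The sketch only assumes that an event carries the UID of the custom resource it relates to, and that something external (a watch, a webhook, a websocket message) calls `onDependentChanged`.

```java
import java.util.function.Consumer;

// Hypothetical event: it only identifies which custom resource should be reconciled.
record TriggerEvent(String customResourceUid, String origin) {}

// Hypothetical event source contract: something that pushes events to a handler.
interface SimpleEventSource {
    void setHandler(Consumer<TriggerEvent> handler);
}

// Illustrative source for one dependent resource owned by one custom resource.
class DependentResourceEventSource implements SimpleEventSource {
    private final String ownerUid;
    private Consumer<TriggerEvent> handler = event -> { };

    DependentResourceEventSource(String ownerUid) {
        this.ownerUid = ownerUid;
    }

    @Override
    public void setHandler(Consumer<TriggerEvent> handler) {
        this.handler = handler;
    }

    // Called by whatever notifies us about the dependent resource:
    // a Kubernetes watch, a webhook endpoint, a websocket message, and so on.
    void onDependentChanged(String description) {
        handler.accept(new TriggerEvent(ownerUid, description));
    }
}
```

In this sketch the handler stands in for the framework, which would schedule a controller execution for the custom resource named in the event.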
There are a few interesting points here:
- The CustomResourceEventSource is a special event source that sends events about changes of our custom resource. It is always registered for every controller by default.
- An event is always related to a custom resource, so our API did not change: UpdateControl<R> createOrUpdateResource(R resource, Context<R> context);
However, we do receive the event(s) that triggered the controller execution in the context object.
- Concurrency is still handled for you, thus we still guarantee that there is no concurrent execution of the controller for the same custom resource (there is parallel execution if an event is related to another custom resource instance).
Note that if we receive multiple events while a controller is being executed, we buffer those events and execute the controller again once the previous execution has finished.
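The concurrency guarantee and the buffering described above can be sketched as follows. This is not the SDK's implementation, only one possible way to get the same behavior; `PerResourceDispatcher` and its methods are hypothetical names, events are plain strings, and the controller is represented by a callback that receives the resource UID together with the events that triggered the execution.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Executor;
import java.util.function.BiConsumer;

// Hypothetical dispatcher: at most one controller execution per custom resource at a time.
class PerResourceDispatcher {
    private final Executor executor;
    // Stand-in for the controller: gets the resource UID and the triggering events.
    private final BiConsumer<String, List<String>> controller;
    private final Map<String, List<String>> buffered = new HashMap<>();
    private final Set<String> running = new HashSet<>();

    PerResourceDispatcher(Executor executor, BiConsumer<String, List<String>> controller) {
        this.executor = executor;
        this.controller = controller;
    }

    // Buffer the event; start an execution only if none is running for this resource.
    synchronized void handleEvent(String uid, String event) {
        buffered.computeIfAbsent(uid, k -> new ArrayList<>()).add(event);
        if (running.add(uid)) {
            submit(uid);
        }
    }

    // Hand all events buffered so far to a single controller execution.
    private synchronized void submit(String uid) {
        List<String> events = buffered.remove(uid);
        executor.execute(() -> {
            try {
                controller.accept(uid, events);
            } finally {
                executionFinished(uid);
            }
        });
    }

    // Run again if events arrived in the meantime; otherwise mark the resource as idle.
    private synchronized void executionFinished(String uid) {
        if (buffered.containsKey(uid)) {
            submit(uid);
        } else {
            running.remove(uid);
        }
    }
}
```

With a thread pool as the executor, executions for different custom resources can run in parallel, while events for the same resource are collected and delivered to the next execution.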
As you might see now, the core of the framework is an operator-specific event system. Also note that within this system it’s easy to support (as we do) timed or periodic reconciliation: we just have to register a TimerEventSource.
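To show why periodic reconciliation fits naturally into such an event system, here is a small sketch of a timer-based trigger built on `ScheduledExecutorService`. It is not the SDK's TimerEventSource; the class `PeriodicTrigger` and its methods are assumptions made for this example, and the handler simply receives the UID of the custom resource to reconcile.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical timer-based source: emits a trigger for a custom resource at a fixed rate.
class PeriodicTrigger implements AutoCloseable {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();
    private final Map<String, ScheduledFuture<?>> tasks = new ConcurrentHashMap<>();
    private final Consumer<String> handler;

    PeriodicTrigger(Consumer<String> handler) {
        this.handler = handler;
    }

    // Start periodic triggers for the resource, unless they are already scheduled.
    void schedule(String uid, long periodMillis) {
        tasks.computeIfAbsent(uid, k -> scheduler.scheduleAtFixedRate(
                () -> handler.accept(k), periodMillis, periodMillis, TimeUnit.MILLISECONDS));
    }

    // Stop triggering, for example when the custom resource is deleted.
    void cancel(String uid) {
        ScheduledFuture<?> task = tasks.remove(uid);
        if (task != null) {
            task.cancel(false);
        }
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

From the controller's point of view, a trigger produced this way looks like any other event: it only says that it is time to reconcile a given custom resource.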
During the development and usage of our first event sources, we discovered some patterns related to their usage. Here I will describe some best practices as we see them now.
Using Events Only as Triggers
When a controller executes, we receive the events that triggered the execution. Based on these events we can see which system changed since the last execution. One pattern we have found beneficial: although it might be tempting, we strongly advise not using these events at all in the controller, that is, not implementing reconciliation logic based on them. Instead, always check the state of the dependent resources and the spec and status of the custom resource, and reconcile based on those. Note that reconciling the whole state of every dependent resource can be very efficient if the actual state is read from a local in-memory cache (see the discussion of caching below).
One of the reasons is simple: events can be lost. The operator can crash, or its pod can be restarted, at any time. There can be all kinds of network errors. While an operator is down, we lose all the events that we would otherwise receive in an event source. This might not be a problem in some specific cases, and we can also make sure that an event is eventually propagated regarding the latest changes or state of a dependent resource. However, logic that is aware of the events can actually be more complex, since it has to react to all kinds of cases. So not using events is also about the simplicity of the implementation.
Note that with this pattern, when we reconcile all the managed resources in the controller without looking at the events, what happens inside the controller can be very similar to how Terraform implements reconciliation. We just know when it makes sense to reconcile, since we implement intelligent triggers for it in the form of Event Sources.
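A small sketch may make the pattern more concrete. The class `StateBasedReconciler` below is hypothetical: the `actualReplicas` map stands in for the real system (for example the cluster), and the desired replica count stands in for the spec of a custom resource. The point is that the triggering event is accepted but deliberately never inspected.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical reconciler: decides what to do only from desired and actual state.
class StateBasedReconciler {
    // Stand-in for the real system: actual replica count per dependent resource name.
    final Map<String, Integer> actualReplicas = new HashMap<>();

    // The event is only a trigger; its content is ignored on purpose.
    String reconcile(String name, int desiredReplicas, Object ignoredEvent) {
        Integer actual = actualReplicas.get(name);
        if (actual == null) {
            // The dependent resource is missing: never created, or deleted out of band.
            actualReplicas.put(name, desiredReplicas);
            return "created";
        }
        if (actual != desiredReplicas) {
            // The dependent resource drifted from the desired state.
            actualReplicas.put(name, desiredReplicas);
            return "updated";
        }
        return "in-sync";
    }
}
```

Because the decision depends only on the current state, a lost or duplicated event does no harm: the next execution, whatever triggered it, still converges to the desired state.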
Typically when we work with Kubernetes (but possibly with other systems too), we manage the objects in a declarative way. This is also true for Event Sources. For example, if we watch for changes of a Kubernetes Deployment object in an event source, we always receive the whole object from the Kubernetes API. Later, when we reconcile in the controller (not using events) and want to check the state of this deployment (and of other dependent resources), we could read the object again from the Kubernetes API. However, since we watch for the changes, we know that we always receive the most up-to-date version in the Event Source. So naturally, what we can do is cache the latest received objects (in the Event Source) and read them from there when needed.
We don’t provide tooling or any support for caching for now, but the controller has direct access to the Event Sources, so it is up to the user to implement an Event Source with support for caching.
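One way a user could implement such a cache is sketched below. `CachingEventSource` is a hypothetical class, not something the SDK provides; it assumes that each watched object can be keyed by the UID of the custom resource that owns it and that the watch delivers the whole object on every change.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical event source that remembers the latest version of each watched object.
class CachingEventSource<T> {
    private final Map<String, T> cache = new ConcurrentHashMap<>();
    // Stand-in for the framework: receives the UID of the custom resource to reconcile.
    private final Consumer<String> handler;

    CachingEventSource(Consumer<String> handler) {
        this.handler = handler;
    }

    // Called by the watch with the full, most recent object.
    void onAddOrUpdate(String ownerUid, T latest) {
        cache.put(ownerUid, latest);
        handler.accept(ownerUid);
    }

    // Called by the watch when the object is gone.
    void onDelete(String ownerUid) {
        cache.remove(ownerUid);
        handler.accept(ownerUid);
    }

    // Used by the controller during reconciliation instead of calling the API again.
    Optional<T> getLatest(String ownerUid) {
        return Optional.ofNullable(cache.get(ownerUid));
    }
}
```

The cache is updated before the trigger is sent, so a controller execution started by that trigger sees at least the state that caused it.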
When we watch a Kubernetes object (or any other resource) in an Event Source, we do not necessarily want to propagate an event for every change we receive for the object, only for changes that we know could trigger some meaningful action within a controller. So it’s again up to the implementation of an Event Source to provide a (possibly extendable/reusable) interface that makes it easy to filter out events we are not interested in.
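A filtering interface of this kind could look like the sketch below. Everything here is an assumption made for illustration: `FilteringEventSource` is not an SDK class, `DeploymentState` is a simplified stand-in for a watched object, and the filter is a plain `BiPredicate` comparing the previous and the current version.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Consumer;

// Simplified stand-in for a watched object.
record DeploymentState(int readyReplicas, String resourceVersion) {}

// Hypothetical event source that propagates only changes the filter considers relevant.
class FilteringEventSource<T> {
    private final Map<String, T> lastSeen = new HashMap<>();
    // Receives (previous, current) and decides whether the controller should be triggered.
    private final BiPredicate<T, T> isRelevant;
    // Stand-in for the framework: receives the UID of the custom resource to reconcile.
    private final Consumer<String> handler;

    FilteringEventSource(BiPredicate<T, T> isRelevant, Consumer<String> handler) {
        this.isRelevant = isRelevant;
        this.handler = handler;
    }

    // Called by the watch on every change; most changes may be dropped here.
    synchronized void onChange(String ownerUid, T current) {
        T previous = lastSeen.put(ownerUid, current);
        // The first observation is always propagated, later ones only if relevant.
        if (previous == null || isRelevant.test(previous, current)) {
            handler.accept(ownerUid);
        }
    }
}
```

For example, a filter such as `(prev, cur) -> prev.readyReplicas() != cur.readyReplicas()` would ignore updates that only change the resource version and trigger the controller when the number of ready replicas changes.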
Event Sources are quite powerful components that support efficient implementation of controllers. We did not want to restrict the APIs or access to information, such as which events triggered a controller, since some corner-case implementations might need it. However, this leads to more possible patterns for implementing a controller, which are now also a source of debate about what the proper way to do it is. In the future we also want to create higher-level abstractions that will hopefully lead to easier and more straightforward ways of reasoning.