OpenKruise v1.3, New Custom Pod Probe Capabilities and Significant Performance Improvements for Large-Scale Clusters
We’re pleased to announce the release of OpenKruise 1.3, which is a CNCF Sandbox level project.
OpenKruise is an extended component suite for Kubernetes, which mainly focuses on application automations, such as deployment, upgrade, ops and availability protection. Mostly features provided by OpenKruise are built primarily based on CRD extensions. They can work in pure Kubernetes clusters without any other dependences.
What's new?
In release v1.3, OpenKruise provides a new CRD named PodProbeMarker
, improves its performance in large-scale clusters, Advanced DaemonSet support pre-download image,
and some new features have been added to CloneSet, WorkloadSpread, AdvancedCronJob, SidecarSet etc.
Here we are going to introduce some changes of it.
1. New CRD and Controller: PodProbeMarker
Kubernetes provides three Pod lifecycle management:
- Readiness Probe Used to determine whether the business container is ready to respond to user requests. If the probe fails, the Pod will be removed from Service Endpoints.
- Liveness Probe Used to determine the health status of the container. If the probe fails, the kubelet will restart the container.
- Startup Probe Used to know when a container application has started. If such a probe is configured, it disables liveness and readiness checks until it succeeds.
So the Probe capabilities provided in Kubernetes have defined specific semantics and related behaviors. In addition, there is actually a need to customize Probe semantics and related behaviors, such as:
- GameServer defines Idle Probe to determine whether the Pod currently has a game match, if not, from the perspective of cost optimization, the Pod can be scaled down.
- K8S Operator defines the main-secondary probe to determine the role of the current Pod (main or secondary). When upgrading, the secondary can be upgraded first, so as to achieve the behavior of selecting the main only once during the upgrade process, reducing the service interruption time during the upgrade process.
OpenKruise provides the ability to customize the Probe and return the result to the Pod Status, and the user can decide the follow-up behavior based on the probe result.
An object of PodProbeMarker may look like this:
apiVersion: apps.kruise.io/v1alpha1
kind: PodProbeMarker
metadata:
name: game-server-probe
namespace: ns
spec:
selector:
matchLabels:
app: game-server
probes:
- name: Idle
containerName: game-server
probe:
exec:
command:
- /home/game/idle.sh
initialDelaySeconds: 10
timeoutSeconds: 3
periodSeconds: 10
successThreshold: 1
failureThreshold: 3
markerPolicy:
- state: Succeeded
labels:
gameserver-idle: 'true'
annotations:
controller.kubernetes.io/pod-deletion-cost: '-10'
- state: Failed
labels:
gameserver-idle: 'false'
annotations:
controller.kubernetes.io/pod-deletion-cost: '10'
podConditionType: game.io/idle
PodProbeMarker results can be viewed at Pod Object:
apiVersion: v1
kind: Pod
metadata:
labels:
app: game-server
gameserver-idle: 'true'
annotations:
controller.kubernetes.io/pod-deletion-cost: '-10'
name: game-server-58cb9f5688-7sbd8
namespace: ns
spec:
...
status:
conditions:
# podConditionType
- type: game.io/idle
# Probe State 'Succeeded' indicates 'True', and 'Failed' indicates 'False'
status: "True"
lastProbeTime: "2022-09-09T07:13:04Z"
lastTransitionTime: "2022-09-09T07:13:04Z"
# If the probe fails to execute, the message is stderr
message: ""
2. Performance optimization: significant performance improvements for large-scale clusters
- #1026 The introduction of a delayed queueing mechanism significantly optimizes the CloneSet controller work queue buildup problem when kruise-manager is pulled up in large-scale application clusters, ideally reducing initialization time by more than 80%.
- #1027 Optimize PodUnavailableBudget controller Event Handler logic to reduce the number of irrelevant Pods in the queue.
- #1011 The caching mechanism optimizes the CPU and Memory consumption of Advanced DaemonSet's repetitive simulation of Pod scheduling computations in large-scale clusters.
- #1015, #1068 Significantly reduce runtime memory consumption in large clusters. Complete the Disable DeepCopy feature started in v1.1, and reduce the conversion consumption of expressions type label selector.
3. SidecarSet support inject specific historical version
SidecarSet will record historical revision of some fields such as containers
, volumes
, initContainers
, imagePullSecrets
and patchPodMetadata
via ControllerRevision.
Based on this feature, you can easily select a specific historical revision to inject when creating Pods, rather than always inject the latest revision of sidecar.
SidecarSet records ControllerRevision in the same namespace as Kruise Manager. You can execute kubectl get controllerrevisions -n kruise-system -l kruise.io/sidecarset-name=<your-sidecarset-name>
to list the ControllerRevisions of your SidecarSet.
Moreover, the ControllerRevision name of current SidecarSet revision is shown in status.latestRevision
field, so you can record it very easily.
There are two configuration methods as follows:
select revision via ControllerRevision name
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
spec:
...
updateStrategy:
partition: 90%
injectionStrategy:
revision:
revisionName: <specific-controllerrevision-name>
select revision via custom version label
You can add or update the label apps.kruise.io/sidecarset-custom-version=<your-version-id>
to SidecarSet when creating or publishing SidecarSet, to mark each historical revision.
SidecarSet will bring this label down to the corresponding ControllerRevision object, and you can easily use the <your-version-id>
to describe which historical revision you want to inject.
Assume that you are publishing version-2
in canary way (you wish only 10% Pods will be upgraded), and you want to inject the stable version-1
to newly-created Pods to reduce the risk of the canary publishing:
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
metadata:
name: sidecarset
labels:
apps.kruise.io/sidecarset-custom-version: example/version-2
spec:
...
updateStrategy:
partition: 90%
injectionStrategy:
revision:
customVersion: example/version-1
4. SidecarSet support inject pod annotations
SidecarSet support inject pod annotations, as follows:
apiVersion: apps.kruise.io/v1alpha1
kind: SidecarSet
spec:
containers:
...
patchPodMetadata:
- annotations:
oom-score: '{"log-agent": 1}'
custom.example.com/sidecar-configuration: '{"command": "/home/admin/bin/start.sh", "log-level": "3"}'
patchPolicy: MergePatchJson
- annotations:
apps.kruise.io/container-launch-priority: Ordered
patchPolicy: Overwrite | Retain
patchPolicy is the injected policy, as follows:
- Retain: By default, if annotation[key]=value exists in the Pod, the original value of the Pod will be retained. Inject annotations[key]=value2 only if annotation[key] does not exist in the Pod.
- Overwrite: Corresponding to Retain, when annotation[key]=value exists in the Pod, it will be overwritten value2.
- MergePatchJson: Corresponding to Overwrite, the annotations value is a json string. If the annotations[key] does not exist in the Pod, it will be injected directly. If it exists, do a json value merge. For example: annotations[oom-score]='{"main": 2}' exists in the Pod, after injection, the value json is merged into annotations[oom-score]='{"log-agent": 1, "main": 2}'.
Note: When the patchPolicy is Overwrite and MergePatchJson, the annotations can be updated synchronously when the SidecarSet in-place update the Sidecar Container. However, if only the annotations are modified, it will not take effect. It must be in-place update together with the sidecar container image. When patchPolicy is Retain, the annotations will not be updated when the SidecarSet in-place update the Sidecar Container.
According to the above configuration, when the sidecarSet is injected into the sidecar container, it will inject Pod annotations synchronously, as follows:
apiVersion: v1
kind: Pod
metadata:
annotations:
apps.kruise.io/container-launch-priority: Ordered
oom-score: '{"log-agent": 1, "main": 2}'
custom.example.com/sidecar-configuration: '{"command": "/home/admin/bin/start.sh", "log-level": "3"}'
name: test-pod
spec:
containers:
...
Note: SidecarSet should not modify any configuration outside the sidecar container for security consideration, so if you want to use this capability, you need to first configure SidecarSet_PatchPodMetadata_WhiteList whitelist or turn off whitelist checks via Feature-gate SidecarSetPatchPodMetadataDefaultsAllowed=true.
5. Advanced DaemonSet support pre-downloading image for update
If you have enabled the PreDownloadImageForDaemonSetUpdate
feature-gate,
DaemonSet controller will automatically pre-download the image you want to update to the nodes of all old Pods.
It is quite useful to accelerate the progress of applications upgrade.
The parallelism of each new image pre-downloading by DaemonSet is 1
, which means the image is downloaded on nodes one by one.
You can change the parallelism using apps.kruise.io/image-predownload-parallelism
annotation on DaemonSet according to the capability of image registry,
for registries with more bandwidth and P2P image downloading ability, a larger parallelism can speed up the pre-download process.
apiVersion: apps.kruise.io/v1alpha1
kind: DaemonSet
metadata:
annotations:
apps.kruise.io/image-predownload-parallelism: "10"
6. CloneSet Scaling with PreparingDelete
CloneSet considers Pods in PreparingDelete
state as normal by default, which means these Pods will still be calculated in the replicas
number.
In this situation:
- if you scale down
replicas
fromN
toN-1
, when the Pod to be deleted is still inPreparingDelete
, you scale upreplicas
toN
, the CloneSet will move the Pod back toNormal
. - if you scale down
replicas
fromN
toN-1
and put a Pod intopodsToDelete
, when the specific Pod is still inPreparingDelete
, you scale upreplicas
toN
, the CloneSet will not create a new Pod until the specific Pod goes into terminating. - if you specifically delete a Pod without
replicas
changed, when the specific Pod is still inPreparingDelete
, the CloneSet will not create a new Pod until the specific Pod goes into terminating.
Since Kruise v1.3.0, you can put a apps.kruise.io/cloneset-scaling-exclude-preparing-delete: "true"
label into CloneSet, which indicates Pods in PreparingDelete
will not be calculated in the replicas
number.
In this situation:
- if you scale down
replicas
fromN
toN-1
, when the Pod to be deleted is still inPreparingDelete
, you scale upreplicas
toN
, the CloneSet will move the Pod back toNormal
. - if you scale down
replicas
fromN
toN-1
and put a Pod intopodsToDelete
, even if the specific Pod is still inPreparingDelete
, you scale upreplicas
toN
, the CloneSet will create a new Pod immediately. - if you specifically delete a Pod without
replicas
changed, even if the specific Pod is still inPreparingDelete
, the CloneSet will create a new Pod immediately.
7. Advanced CronJob Time zones
All CronJob schedule: times are based on the timezone of the kruise-controller-manager by default, which means the timezone set for the kruise-controller-manager container determines the timezone that the cron job controller uses.
However, we have introduce a spec.timeZone
field in v1.3.0.
You can set it to the name of a valid time zone name. For example, setting spec.timeZone: "Etc/UTC"
instructs Kruise to interpret the schedule relative to Coordinated Universal Time.
A time zone database from the Go standard library is included in the binaries and used as a fallback in case an external database is not available on the system.
8. Other changes
For more changes, their authors and commits, you can read the Github release.
Get Involved
Welcome to get involved with OpenKruise by joining us in Github/Slack/DingTalk/WeChat. Have something you’d like to broadcast to our community? Share your voice at our Bi-weekly community meeting (Chinese), or through the channels below:
- Join the community on Slack (English).
- Join the community on DingTalk: Search GroupID
23330762
(Chinese). - Join the community on WeChat (new): Search User
openkruise
and let the robot invite you (Chinese).