If you use remote worker nodes, consider which objects to use to run your applications.
It is recommend to use daemon sets or static pods based on the behavior you want in the event of network issues or power loss. In addition, you can use Kubernetes zones and tolerations to control or avoid pod evictions if the control plane cannot reach remote worker nodes.
- Daemon sets
-
Daemon sets are the best approach to managing pods on remote worker nodes for the following reasons:
-
Daemon sets do not typically need rescheduling behavior. If a node disconnects from the cluster, pods on the node can continue to run. OpenShift Container Platform does not change the state of daemon set pods, and leaves the pods in the state they last reported. For example, if a daemon set pod is in the Running state, when a node stops communicating, the pod keeps running and is assumed to be running by OpenShift Container Platform.
-
Daemon set pods, by default, are created with NoExecute tolerations for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints with no tolerationSeconds value. These default values ensure that daemon set pods are never evicted if the control plane cannot reach a node. For example:
Tolerations added to daemon set pods by default
tolerations:
- key: node.kubernetes.io/not-ready
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/unreachable
operator: Exists
effect: NoExecute
- key: node.kubernetes.io/disk-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/pid-pressure
operator: Exists
effect: NoSchedule
- key: node.kubernetes.io/unschedulable
operator: Exists
effect: NoSchedule
-
Daemon sets can use labels to ensure that a workload runs on a matching worker node.
-
You can use an OpenShift Container Platform service endpoint to load balance daemon set pods.
|
|
Daemon sets do not schedule pods after a reboot of the node if OpenShift Container Platform cannot reach the node.
|
- Static pods
-
If you want pods restart if a node reboots, after a power loss for example, consider static pods. The kubelet on a node automatically restarts static pods as node restarts.
|
|
Static pods cannot use secrets and config maps.
|
- Kubernetes zones
-
Kubernetes zones can slow down the rate or, in some cases, completely stop pod evictions.
When the control plane cannot reach a node, the node controller, by default, applies node.kubernetes.io/unreachable taints and evicts pods at a rate of 0.1 nodes per second. However, in a cluster that uses Kubernetes zones, pod eviction behavior is altered.
If a zone is fully disrupted, where all nodes in the zone have a Ready condition that is False or Unknown, the control plane does not apply the node.kubernetes.io/unreachable taint to the nodes in that zone.
For partially disrupted zones, where more than 55% of the nodes have a False or Unknown condition, the pod eviction rate is reduced to 0.01 nodes per second. Nodes in smaller clusters, with fewer than 50 nodes, are not tainted. Your cluster must have more than three zones for these behavior to take effect.
You assign a node to a specific zone by applying the topology.kubernetes.io/region label in the node specification.
Sample node labels for Kubernetes zones
kind: Node
apiVersion: v1
metadata:
labels:
topology.kubernetes.io/region=east
You can adjust the amount of time that the kubelet checks the state of each node.
To set the interval that affects the timing of when the on-premise node controller marks nodes with the Unhealthy or Unreachable condition, create a KubeletConfig object that contains the node-status-update-frequency and node-status-report-frequency parameters.
The kubelet on each node determines the node status as defined by the node-status-update-frequency setting and reports that status to the cluster based on the node-status-report-frequency setting. By default, the kubelet determines the pod status every 10 seconds and reports the status every minute. However, if the node state changes, the kubelet reports the change to the cluster immediately. OpenShift Container Platform uses the node-status-report-frequency setting only when the Node Lease feature gate is enabled, which is the default state in OpenShift Container Platform clusters. If the Node Lease feature gate is disabled, the node reports its status based on the node-status-update-frequency setting.
Example kubelet config
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: disable-cpu-units
spec:
machineConfigPoolSelector:
matchLabels:
machineconfiguration.openshift.io/role: worker (1)
kubeletConfig:
node-status-update-frequency: (2)
- "10s"
node-status-report-frequency: (3)
- "1m"
| 1 |
Specify the type of node type to which this KubeletConfig object applies using the label from the MachineConfig object. |
| 2 |
Specify the frequency that the kubelet checks the status of a node associated with this MachineConfig object. The default value is 10s. If you change this default, the node-status-report-frequency value is changed to the same value. |
| 3 |
Specify the frequency that the kubelet reports the status of a node associated with this MachineConfig object. The default value is 1m. |
The node-status-update-frequency parameter works with the node-monitor-grace-period and pod-eviction-timeout parameters.
-
The node-monitor-grace-period parameter specifies how long OpenShift Container Platform waits after a node associated with a MachineConfig object is marked Unhealthy if the controller manager does not receive the node heartbeat. Workloads on the node continue to run after this time. If the remote worker node rejoins the cluster after node-monitor-grace-period expires, pods continue to run. New pods can be scheduled to that node. The node-monitor-grace-period interval is 40s. The node-status-update-frequency value must be lower than the node-monitor-grace-period value.
-
The pod-eviction-timeout parameter specifies the amount of time OpenShift Container Platform waits after marking a node that is associated with a MachineConfig object as Unreachable to start marking pods for eviction. Evicted pods are rescheduled on other nodes. If the remote worker node rejoins the cluster after pod-eviction-timeout expires, the pods running on the remote worker node are terminated because the node controller has evicted the pods on-premise. Pods can then be rescheduled to that node. The pod-eviction-timeout interval is 5m0s.
|
|
Modifying the node-monitor-grace-period and pod-eviction-timeout parameters is not supported.
|
- Tolerations
-
You can use pod tolerations to mitigate the effects if the on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to a node it cannot reach.
A taint with the NoExecute effect affects pods that are running on the node in the following ways:
-
Pods that do not tolerate the taint are queued for eviction.
-
Pods that tolerate the taint without specifying a tolerationSeconds value in their toleration specification remain bound forever.
-
Pods that tolerate the taint with a specified tolerationSeconds value remain bound for the specified amount of time. After the time elapses, the pods are queued for eviction.
You can delay or avoid pod eviction by configuring pods tolerations with the NoExecute effect for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints.
Example toleration in a pod spec
...
tolerations:
- key: "node.kubernetes.io/unreachable"
operator: "Exists"
effect: "NoExecute" (1)
- key: "node.kubernetes.io/not-ready"
operator: "Exists"
effect: "NoExecute" (2)
tolerationSeconds: 600
...
| 1 |
The NoExecute effect without tolerationSeconds lets pods remain forever if the control plane cannot reach the node. |
| 2 |
The NoExecute effect with tolerationSeconds: 600 lets pods remain for 10 minutes if the control plane marks the node as Unhealthy. |
OpenShift Container Platform uses the tolerationSeconds value after the pod-eviction-timeout value elapses.
- Other types of OpenShift Container Platform objects
-
You can use replica sets, deployments, and replication controllers. The scheduler can reschedule these pods onto other nodes after the node is disconnected for five minutes. Rescheduling onto other nodes can be beneficial for some workloads, such as REST APIs, where an administrator can guarantee a specific number of pods are running and accessible.
|
|
When working with remote worker nodes, rescheduling pods on different nodes might not be acceptable if remote worker nodes are intended to be reserved for specific functions.
|
stateful sets do not get restarted when there is an outage. The pods remain in the terminating state until the control plane can acknowledge that the pods are terminated.
To avoid scheduling a to a node that does not have access to the same type of persistent storage, OpenShift Container Platform cannot migrate pods that require persistent volumes to other zones in the case of network separation.