Node Disruption Policies

TOC

Understanding Node Disruption in Machine Configuration

By default, when you make certain changes to the systemd units section of a MachineConfig object, the Machine Configuration Operator will drain and reboot the nodes associated with that MachineConfig. However, changes to regular file entries typically do not cause a reboot, which may result in the configuration not taking effect as expected. To address this, you can define node disruption policies to specify which types of changes should trigger node reboots or other disruption actions. These policies are configured in the MachineConfiguration object located in the cpaas-system namespace. See the example below for configuring a node disruption policy.

Once defined, the Machine Configuration Operator validates the policy to detect issues such as invalid formatting. Then, it populates the policy into the status.nodeDisruptionPolicyStatus field of the MachineConfiguration object. These user-defined policies override the cluster's default disruption settings.

A default MachineConfiguration custom resource named cluster is installed with the cluster. You can configure node disruption behavior on this resource.


What You Can Control with Node Disruption Policies

Node disruption policies allow you to define what happens when changes are made to the following configuration areas:

  • Files: You can define behavior for file changes (excluding changes to the root directory). By default, file changes do not trigger any disruption. You can modify this behavior using the spec.defaultNodeDisruptionPolicySpecAction.files field.

  • Systemd Units: You can create or modify systemd services, including enabling, disabling, or changing their state. By default, changes to systemd units trigger a node drain and reboot.

  • SSH Public Keys: You can add or update SSH keys for the boot user. These changes are applied immediately by default and do not trigger a reboot or drain.

Each change is evaluated against the node disruption policy, which can trigger one or more of the following actions:

  • Reboot: Drain and reboot the node.
  • None: No disruption is triggered; the change is applied silently.
  • Drain: Drain the node without rebooting.
  • Restart: Restart the specified systemd service.
  • DaemonReload: Reload all systemd unit configurations.

Example: Default Node Disruption Policy

The following is the default configuration for the cluster MachineConfiguration resource after installation:

apiVersion: machineconfiguration.alauda.io/v1alpha1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  defaultNodeDisruptionPolicySpecAction:
    files:
    - type: None
    units:
    - type: Reboot
  nodeDisruptionPolicy:
    sshkey:
      actions:
      - type: None

This configuration means:

  • File changes trigger no action.
  • Systemd unit changes cause a node drain and reboot.
  • SSH key changes are applied without disruption.

Example: Customizing File Behavior

You can change the default action for file changes to trigger a reboot:

apiVersion: machineconfiguration.alauda.io/v1alpha1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  defaultNodeDisruptionPolicySpecAction:
    files:
    - type: Reboot
    units:
    - type: Reboot
  nodeDisruptionPolicy:
    sshkey:
      actions:
      - type: None

With this configuration, any change to a managed file will cause the Machine Configuration Operator to drain and reboot the affected node.


Example: Applying Policy to a Specific File

You can also define disruption actions for a specific file path. In the following example, changes to /usr/local/bin/myapp.sh will not trigger a node reboot, but instead reload the systemd configuration and restart the related service:

apiVersion: machineconfiguration.alauda.io/v1alpha1
kind: MachineConfiguration
metadata:
  name: cluster
spec:
  defaultNodeDisruptionPolicySpecAction:
    files:
    - type: Reboot
    units:
    - type: Reboot
  nodeDisruptionPolicy:
    files:
    - path: /usr/local/bin/myapp.sh
      actions:
      - type: DaemonReload
      - type: Restart
        restart:
          serviceName: myapp.service
    sshkey:
      actions:
      - type: None

In this case, when /usr/local/bin/myapp.sh is updated, the Machine Configuration Operator will reload all systemd units and restart the myapp.service—without draining or rebooting the node.