Configure Fleet Monitoring

Overview

Fleet Monitoring uses two services:

  • Alauda Container Platform Fleet Monitoring Central Service runs on the Global cluster and enables the Global side of Fleet Monitoring.
  • Alauda Container Platform Fleet Monitoring Cluster Service runs on every cluster that you want to include in Fleet Monitoring.

After Fleet Monitoring is enabled, connected clusters write fleet-level metrics to the Global VictoriaMetrics storage. The built-in Fleet Monitoring dashboards use these metrics to display fleet health, resource usage, data freshness, and project quota usage.

Before You Begin

Before configuring Fleet Monitoring, make sure the following requirements are met:

  • You have platform administrator permissions or the required permissions to install Operators and create the Fleet Monitoring resources.
  • The Global cluster has an available remote write endpoint for Fleet Monitoring data. VictoriaMetrics with a write endpoint is supported. Prometheus or Thanos-based Monitoring works only if the endpoint exposed by the Monitoring feature accepts remote write traffic; a query-only endpoint, such as Thanos Query, cannot receive Fleet Monitoring data.
  • Each cluster that you want to connect is managed by the Global cluster.
  • Each cluster that you want to connect has the Monitoring feature enabled.
  • The Fleet Monitoring Operator packages have been pushed to the required clusters or are already available in OperatorHub. Fleet Monitoring is delivered as Agnostic Operators and is not included by default with the platform installation.
  • The New Web Console plugin deployment capability is available. Fleet Monitoring Operators are deployed through the New Web Console OperatorHub workflow. For more information, see Install the New Web Console.

Push Fleet Monitoring Operator Packages

Before installing Fleet Monitoring from OperatorHub, push the Fleet Monitoring Operator packages with violet.

Push Alauda Container Platform Fleet Monitoring Central Service to the Global cluster:

violet push <path/to/fleet-monitoring-central-service-operator-package> \
  --target-catalog-source "platform" \
  --platform-address "https://<your-platform-domain>" \
  --platform-token "<platform_token>" \
  --clusters "global"

Push Alauda Container Platform Fleet Monitoring Cluster Service to every cluster where the Cluster Service will be installed. If the Global cluster also needs to be included in Fleet Monitoring data, include global in the cluster list:

violet push <path/to/fleet-monitoring-cluster-service-operator-package> \
  --target-catalog-source "platform" \
  --platform-address "https://<your-platform-domain>" \
  --platform-token "<platform_token>" \
  --clusters "global,<workload-cluster-1>,<workload-cluster-2>"

After the package is pushed, the corresponding Operator appears in Marketplace > OperatorHub on the selected cluster. For more information about violet push, see Upload Packages.
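
If you push the packages repeatedly, for example from a pipeline, the two commands above can be wrapped in a small helper. The following is a minimal sketch that reuses only the violet flags shown above. The function name push_fleet_package and the environment variables PLATFORM_ADDRESS and PLATFORM_TOKEN are hypothetical names chosen for this example and are not part of violet.

```shell
# Sketch: push one Fleet Monitoring Operator package to a list of clusters.
# push_fleet_package, PLATFORM_ADDRESS, and PLATFORM_TOKEN are hypothetical
# names used only in this example.
push_fleet_package() {
  package_path="${1:-}"   # path to the Operator package
  clusters="${2:-}"       # comma-separated cluster names, e.g. "global,cluster-1"

  if [ -z "$package_path" ] || [ -z "$clusters" ]; then
    echo "usage: push_fleet_package <package-path> <clusters>" >&2
    return 2
  fi
  if [ -z "${PLATFORM_ADDRESS:-}" ] || [ -z "${PLATFORM_TOKEN:-}" ]; then
    echo "set PLATFORM_ADDRESS and PLATFORM_TOKEN first" >&2
    return 2
  fi

  # Same flags as the commands in this section.
  violet push "$package_path" \
    --target-catalog-source "platform" \
    --platform-address "$PLATFORM_ADDRESS" \
    --platform-token "$PLATFORM_TOKEN" \
    --clusters "$clusters"
}
```

With this helper, the Central Service package would be pushed with the cluster list "global", and the Cluster Service package with the full list of clusters to connect.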

Enable Fleet Monitoring on the Global Cluster

  1. Go to Administrator.

  2. In the left navigation bar, click Marketplace > OperatorHub.

  3. At the top of the page, select the global cluster.

  4. Search for Alauda Container Platform Fleet Monitoring Central Service.

  5. If the Operator status is not Installed, click Install and keep the default installation configuration unless your environment requires a different channel, namespace, or upgrade strategy.

    Parameter          | Recommended configuration
    -------------------+--------------------------
    Channel            | Use the default channel provided by the Operator package.
    Installation Mode  | Cluster. The Operator manages Fleet Monitoring resources for the cluster.
    Installation Place | Use the recommended namespace provided by the Operator package.
    Upgrade Strategy   | Manual, unless your platform upgrade policy requires automatic upgrades.

  6. Verify that Alauda Container Platform Fleet Monitoring Central Service is Installed in OperatorHub.

    If you select the Manual upgrade strategy and OperatorHub shows a pending install plan, approve the install plan to complete the installation.

  7. Verify that the Global cluster has an available VictoriaMetrics write endpoint.

    Fleet Monitoring uses the Global VictoriaMetrics write endpoint to receive data from connected clusters. If the Global cluster has only Prometheus or Thanos Query available, the connected clusters cannot write Fleet Monitoring data to the Global cluster.

  8. Create the FleetMonitoringHub resource on the Global cluster.

    apiVersion: monitoring.alauda.io/v1alpha1
    kind: FleetMonitoringHub
    metadata:
      name: fleet-monitoring-hub
    spec: {}

    FleetMonitoringHub is a cluster-scoped resource. Do not set metadata.namespace.

    Field              | Required | Description
    -------------------+----------+------------
    metadata.name      | Yes      | Must be fleet-monitoring-hub.
    metadata.namespace | No       | Do not set this field. FleetMonitoringHub is cluster-scoped.
    spec               | Yes      | Use an empty object {}. Creating the resource enables the Global side of Fleet Monitoring.

  9. Verify the FleetMonitoringHub status.

    kubectl get fleetmonitoringhub fleet-monitoring-hub -o yaml

    Check that the following conditions are ready:

    Condition          | Description
    -------------------+------------
    ConfigurationReady | The Global storage and database information are available.
    DashboardsReady    | Built-in Fleet Monitoring dashboards are applied.
    GlobalRulesReady   | Global recording rules are applied.

    You can also check the phase:

    kubectl get fleetmonitoringhub fleet-monitoring-hub -o jsonpath='{.status.phase}{"\n"}'

    The expected phase is Ready.
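
To avoid reading the full YAML output, the condition check can be scripted. The following is a minimal sketch that assumes status.conditions uses the usual Kubernetes condition fields (type and status); this guide does not confirm that shape, so treat it as an assumption. The function names are hypothetical.

```shell
# Sketch: list FleetMonitoringHub conditions that are not yet "True".
# Assumes each entry in .status.conditions has "type" and "status" fields.

# Reads "Type=Status" lines on stdin and prints the types whose status is not True.
not_ready_conditions() {
  while IFS='=' read -r cond_type cond_status; do
    [ -n "$cond_type" ] || continue
    [ "$cond_status" = "True" ] || echo "$cond_type"
  done
}

# Prints one "Type=Status" line per condition of the hub resource.
hub_conditions() {
  kubectl get fleetmonitoringhub fleet-monitoring-hub \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
}

# Usage: hub_conditions | not_ready_conditions
```

An empty result means that every reported condition is True. A condition that is missing from the status entirely is not detected by this sketch.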

Connect a Cluster to Fleet Monitoring

Repeat the following steps on every cluster that you want to include in Fleet Monitoring.

  1. Go to Administrator.

  2. In the left navigation bar, click Marketplace > OperatorHub.

  3. At the top of the page, select the target cluster.

  4. Search for Alauda Container Platform Fleet Monitoring Cluster Service.

  5. If the Operator status is not Installed, click Install and keep the default installation configuration unless your environment requires a different channel, namespace, or upgrade strategy.

    Parameter          | Recommended configuration
    -------------------+--------------------------
    Channel            | Use the default channel provided by the Operator package.
    Installation Mode  | Cluster. The Operator manages Fleet Monitoring resources for the selected cluster.
    Installation Place | Use the recommended namespace provided by the Operator package.
    Upgrade Strategy   | Manual, unless your platform upgrade policy requires automatic upgrades.

  6. Verify that Alauda Container Platform Fleet Monitoring Cluster Service is Installed in OperatorHub.

    If you select the Manual upgrade strategy and OperatorHub shows a pending install plan, approve the install plan to complete the installation.

  7. Create the FleetMonitoringAgent resource on the target cluster.

    apiVersion: monitoring.alauda.io/v1alpha1
    kind: FleetMonitoringAgent
    metadata:
      name: fleet-monitoring-agent
    spec:
      interval: 5m

    FleetMonitoringAgent is a cluster-scoped resource. Do not set metadata.namespace.

    Field              | Required | Description
    -------------------+----------+------------
    metadata.name      | Yes      | Must be fleet-monitoring-agent.
    metadata.namespace | No       | Do not set this field. FleetMonitoringAgent is cluster-scoped.
    spec.interval      | No       | Collection interval. Supported values are 5m, 10m, 15m, and 30m. The default is 5m.

    To include the Global cluster itself in Fleet Monitoring data, also install Alauda Container Platform Fleet Monitoring Cluster Service on the Global cluster and create a FleetMonitoringAgent resource there.

  8. Verify the FleetMonitoringAgent status.

    kubectl get fleetmonitoringagent fleet-monitoring-agent -o yaml

    Check the following conditions:

    Condition          | Description
    -------------------+------------
    ConfigurationReady | The cluster name, local Monitoring access information, and database information are available.
    ResourcesApplied   | Fleet Monitoring resources, such as rules and workload resources, are applied.
    Ready              | The cluster is ready to write Fleet Monitoring data, or the cluster is intentionally skipped for a supported reason.

    You can also check the phase and the detected cluster name:

    kubectl get fleetmonitoringagent fleet-monitoring-agent -o jsonpath='Phase: {.status.phase}{"\n"}Cluster: {.status.cluster}{"\n"}'

    The expected phase is Ready. Common Ready reasons are:

    • WorkloadReady: The cluster deploys a VMAgent and is ready to write Fleet Monitoring data.
    • SkippedForGlobal: The cluster is the Global cluster, so the Cluster Service skips deploying a VMAgent back to the same Global storage.
    • SkippedBackendReuse: The cluster reuses the Global VictoriaMetrics backend, so the Cluster Service skips deploying a Fleet Monitoring VMAgent to avoid a write loop.

    If the target cluster writes data to the Global storage, verify that the database information is available:

    kubectl -n <fleet-monitoring-namespace> get secret fleet-monitoring-database

    Replace <fleet-monitoring-namespace> with the namespace where Alauda Container Platform Fleet Monitoring Cluster Service is installed.

  9. Open Platform > Observe > Fleet Monitoring and verify that the cluster appears in the dashboard data.
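
The Ready reasons listed in step 8 determine whether a Fleet Monitoring VMAgent is expected on the cluster. The following is a minimal sketch that maps each reason to that expectation. It assumes the reason is exposed in the reason field of the Ready condition, which follows the usual Kubernetes condition layout but is not confirmed by this guide. The function names are hypothetical.

```shell
# Sketch: interpret the Ready reason of a FleetMonitoringAgent.
# The three known reasons come from step 8; anything else is reported as unknown.
describe_ready_reason() {
  case "${1:-}" in
    WorkloadReady)
      echo "vmagent-deployed: the cluster writes Fleet Monitoring data through its own VMAgent" ;;
    SkippedForGlobal)
      echo "vmagent-skipped: this is the Global cluster" ;;
    SkippedBackendReuse)
      echo "vmagent-skipped: the cluster reuses the Global VictoriaMetrics backend" ;;
    *)
      echo "unknown: check the condition message in the FleetMonitoringAgent status"
      return 1 ;;
  esac
}

# Reads the reason of the Ready condition from the cluster.
# The jsonpath filter assumes the standard condition fields (type, reason).
agent_ready_reason() {
  kubectl get fleetmonitoringagent fleet-monitoring-agent \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].reason}'
}

# Usage: describe_ready_reason "$(agent_ready_reason)"
```

A vmagent-skipped result explains why no fleet-monitoring-vmagent resources exist on a cluster that is otherwise Ready.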

Configure the Collection Interval

The collection interval is configured on the FleetMonitoringAgent resource of each connected cluster.

Supported values:

  • 5m
  • 10m
  • 15m
  • 30m

Example:

apiVersion: monitoring.alauda.io/v1alpha1
kind: FleetMonitoringAgent
metadata:
  name: fleet-monitoring-agent
spec:
  interval: 10m

After you update spec.interval, the Fleet Monitoring Cluster Service reconciles the local collection configuration.

On clusters that deploy a Fleet Monitoring VMAgent, the VMAgent collection interval follows spec.interval, while Fleet Monitoring recording rules continue to evaluate at the system-managed interval used for federation. On the Global cluster and on clusters that reuse the Global VictoriaMetrics backend, where no Fleet Monitoring VMAgent is deployed, local Fleet Monitoring rules follow spec.interval.
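
Instead of reapplying the full manifest, spec.interval can be changed with a merge patch. The following is a minimal sketch that first rejects values outside the supported list and then patches the resource. The function names are hypothetical; kubectl patch with --type merge is standard kubectl behavior.

```shell
# Sketch: validate a collection interval, then patch spec.interval.

# Succeeds only for the values listed as supported in this section.
is_supported_interval() {
  case "${1:-}" in
    5m|10m|15m|30m) return 0 ;;
    *) return 1 ;;
  esac
}

set_fleet_interval() {
  interval="${1:-}"
  if ! is_supported_interval "$interval"; then
    echo "unsupported interval: $interval (use 5m, 10m, 15m, or 30m)" >&2
    return 1
  fi
  # Merge patch: only spec.interval is changed; other fields are left as they are.
  kubectl patch fleetmonitoringagent fleet-monitoring-agent \
    --type merge -p "{\"spec\":{\"interval\":\"$interval\"}}"
}

# Usage: set_fleet_interval 10m
```

Run the helper against the cluster whose FleetMonitoringAgent you want to change, because the interval is configured per connected cluster.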

Configure Custom Metrics and Recording Rules

Fleet Monitoring includes built-in metrics and recording rules. Cluster administrators can append custom metrics and recording rules on each connected cluster.

Use the following workflow when you want to report a user-defined metric into Fleet Monitoring:

  1. On the connected workload cluster, define a local Fleet recording rule that converts the source metric into a Fleet metric name.
  2. Add that recorded metric name to the Fleet allowlist so the Fleet Monitoring VMAgent federates and remote-writes it to the Global cluster.
  3. If you need a Fleet-level rollup such as a 1-hour aggregate, add a separate custom Hub-side PrometheusRule on the Global cluster.
  4. Verify the recorded metric and rollup metric by using Fleet Monitoring queries or dashboards.

Decide Whether You Need Agent-side Rules Only or Both Agent-side and Hub-side Rules

Choose one of the following patterns:

  • Use only the connected-cluster ConfigMap when you need the raw Fleet metric on the Global cluster and can query it directly.
  • Use both the connected-cluster ConfigMap and a Global-cluster custom PrometheusRule when you also need Fleet-level rollups such as 1-hour aggregates for dashboards or long-range views.

Example target:

  • Source metric on the connected cluster: node_load15
  • Fleet metric recorded on the connected cluster: fleet:node:node_load15:avg
  • Optional 1-hour rollup on the Global cluster: fleet:node:node_load15:avg:avg_over_time_1h

Configure the Connected Cluster

Create or update the fleet-monitoring-custom-metrics ConfigMap in the namespace where Alauda Container Platform Fleet Monitoring Cluster Service is installed on the connected cluster.

This ConfigMap has two roles:

  • metrics.yaml adds metric names to the Fleet Monitoring VMAgent federate allowlist.
  • recording-rules-prometheus.yaml or recording-rules-victoriametrics.yaml defines the local recording rule that produces the Fleet metric.

Example:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fleet-monitoring-custom-metrics
  namespace: <fleet-monitoring-namespace>
data:
  metrics.yaml: |
    default:
      - fleet:node:node_load15:avg
  recording-rules-prometheus.yaml: |
    groups:
      - name: fmval-custom
        rules:
          - record: fleet:node:node_load15:avg
            expr: avg by (node) (node_load15)
  recording-rules-victoriametrics.yaml: |
    groups:
      - name: fmval-custom
        rules:
          - record: fleet:node:node_load15:avg
            expr: avg by (node) (node_load15)

Replace <fleet-monitoring-namespace> with the namespace where Alauda Container Platform Fleet Monitoring Cluster Service is installed.

metrics.yaml appends metric names to the built-in allowlist.

For recording rules, the Cluster Service loads only the key that matches the local Monitoring stack:

  • recording-rules-prometheus.yaml on Prometheus-based clusters
  • recording-rules-victoriametrics.yaml on VictoriaMetrics-based clusters

Custom configuration can append metrics and rules. It does not remove or override built-in defaults.

If the ConfigMap has an invalid format or contains invalid rules, Fleet Monitoring keeps the built-in defaults and reports the error in the FleetMonitoringAgent status.

On a connected workload cluster that deploys a Fleet Monitoring VMAgent, the Agent reconciler normalizes the interval of the rendered Fleet Monitoring recording-rule group to 1m. A custom interval value in the ConfigMap does not control the interval of the rendered local Fleet Monitoring rule, so do not rely on it.

For custom Fleet metrics, use the Fleet naming convention for the recorded metric, for example fleet:node:node_load15:avg. This keeps the metric compatible with Fleet Monitoring dashboards, rollups, and query patterns.

Fleet Monitoring queries and dashboards require the recorded time series to carry the cluster label. The Agent reconciler automatically adds cluster=<local-cluster-name> to rendered Fleet Monitoring recording rules when the rule does not already define that label.

Verify the Connected-cluster Rendering

After you update the ConfigMap, the Fleet Monitoring Agent automatically reconciles the local rule and VMAgent configuration. No restart is required.

Check the rendered local rule:

kubectl -n <fleet-monitoring-namespace> get prometheusrule fleet-monitoring-agent-recording-rules -o yaml

Confirm that:

  • the custom group appears in spec.groups
  • the custom recorded metric appears in the rule list
  • the rendered rule carries labels.cluster=<connected-cluster-name>

Check the rendered VMAgent federate allowlist:

kubectl -n <fleet-monitoring-namespace> get configmap fleet-monitoring-vmagent -o yaml

Confirm that the custom recorded metric appears in data.prometheus.yml under params.match[].
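
The allowlist check can be scripted so that you do not have to read the rendered configuration by eye. The following is a minimal sketch that uses a plain substring search, so it does not depend on the exact layout of params.match[], which this guide does not show. The function names are hypothetical.

```shell
# Sketch: check whether a metric name appears in the rendered VMAgent configuration.

# Reads the rendered prometheus.yml text on stdin and succeeds if it contains $1.
# Substring match: a name that is a prefix of a longer metric name also matches.
allowlist_has_metric() {
  [ -n "${1:-}" ] || return 2
  grep -F -q -- "$1"
}

# Prints the prometheus.yml key of the rendered ConfigMap.
# $1 is the namespace where the Cluster Service is installed.
rendered_vmagent_config() {
  kubectl -n "$1" get configmap fleet-monitoring-vmagent \
    -o jsonpath='{.data.prometheus\.yml}'
}

# Usage:
#   rendered_vmagent_config <fleet-monitoring-namespace> \
#     | allowlist_has_metric 'fleet:node:node_load15:avg' && echo "present"
```

If the metric is not present shortly after the ConfigMap update, check the FleetMonitoringAgent status for a reported configuration error.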

Add a Custom Hub Rollup Rule

If you want a custom Fleet metric to have a Fleet-level rollup such as a 1-hour aggregate, create a separate PrometheusRule on the Global cluster in the namespace where Alauda Container Platform Fleet Monitoring Central Service is installed.

Example:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: fmval-custom-hub-rollup
  namespace: <fleet-monitoring-namespace>
  labels:
    prometheus: kube-prometheus
    rule.cpaas.io/is-record: "true"
    alert.cpaas.io/owner: System
    alert.cpaas.io/project: ""
    monitoring.alauda.io/hub: fleet-monitoring-hub
spec:
  groups:
    - name: fmval-custom-rollup-1h
      interval: 1h
      rules:
        - record: fleet:node:node_load15:avg:avg_over_time_1h
          expr: avg_over_time(fleet:node:node_load15:avg[1h])

Replace <fleet-monitoring-namespace> with the namespace where Alauda Container Platform Fleet Monitoring Central Service is installed.

This custom PrometheusRule is additive. It does not need to copy the built-in Hub rules and should not be created with Fleet Monitoring operator ownership metadata.

Verify the Hub-side Rollup

Check the custom Hub-side rule:

kubectl -n <fleet-monitoring-namespace> get prometheusrule fmval-custom-hub-rollup -o yaml

Confirm that:

  • the rule exists on the Global cluster
  • the rule uses the expected Fleet metric as input
  • the rollup output metric name matches the dashboard or query expression you plan to use

Query the Custom Fleet Metric

When querying Fleet metrics through the platform Monitoring API or Fleet Monitoring dashboards, explicitly include vmcluster=~".*" in the selector. In the current platform query path, omitting this selector can cause the query proxy to narrow the query to the Global monitoring backend and return no Fleet data for connected clusters.

Example queries:

  • Raw custom Fleet metric:

    avg_over_time(fleet:node:node_load15:avg{vmcluster=~".*",cluster="g1-c1"}[1h])
  • 1-hour rollup metric:

    last_over_time(fleet:node:node_load15:avg:avg_over_time_1h{vmcluster=~".*",cluster="g1-c1"}[2h])
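
Because the vmcluster=~".*" matcher is easy to forget, scripts that generate Fleet queries can build the selector in one place. The following is a minimal sketch; fleet_selector is a hypothetical helper name, and the cluster name g1-c1 is taken from the examples above.

```shell
# Sketch: build a Fleet metric selector that always carries vmcluster=~".*".
# $1: Fleet metric name; $2: optional cluster name.
fleet_selector() {
  metric="${1:-}"
  cluster="${2:-}"
  if [ -n "$cluster" ]; then
    printf '%s{vmcluster=~".*",cluster="%s"}\n' "$metric" "$cluster"
  else
    printf '%s{vmcluster=~".*"}\n' "$metric"
  fi
}

# Usage: reproduces the raw custom Fleet metric query shown above.
#   echo "avg_over_time($(fleet_selector fleet:node:node_load15:avg g1-c1)[1h])"
```

Omitting the second argument produces a selector that spans all connected clusters while still including the vmcluster matcher.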

Common Mistakes

Watch for the following issues:

  • Creating fleet-monitoring-custom-metrics in cpaas-system when Fleet Monitoring is installed in another namespace such as fleet-monitoring
  • Adding the source metric name to metrics.yaml instead of the recorded Fleet metric name
  • Defining the local recording rule but not adding the recorded Fleet metric name to metrics.yaml
  • Expecting a custom interval in the connected-cluster ConfigMap to remain effective after rendering
  • Querying Fleet metrics without vmcluster=~".*" in the selector
  • Expecting a Global Fleet rollup metric before creating the corresponding custom Hub-side rule
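
The two allowlist mistakes in this list can be caught before you apply the ConfigMap. The following is a minimal sketch that compares the record names in a recording-rules file with the entries in a metrics.yaml file. It is a line-based text match, not a YAML parser, and it assumes the files follow the layout of the example ConfigMap in this guide (unquoted names, one list item per line). The function names are hypothetical.

```shell
# Sketch: report recorded Fleet metric names that are missing from the allowlist.

# Prints the value of every "record:" line in the rules file $1.
recorded_metrics() {
  sed -n 's/^[[:space:]]*-*[[:space:]]*record:[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$1"
}

# Prints every "- <name>" list item in the metrics file $1.
allowlisted_metrics() {
  sed -n 's/^[[:space:]]*-[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$1"
}

# $1: file with the recording-rules content; $2: file with the metrics.yaml content.
# Prints each recorded metric that has no exact match in the allowlist.
unlisted_recorded_metrics() {
  recorded_metrics "$1" | sort -u | while read -r name; do
    allowlisted_metrics "$2" | grep -F -x -q -- "$name" || echo "$name"
  done
}

# Usage: unlisted_recorded_metrics recording-rules.yaml metrics.yaml
```

An empty result means that every recorded metric is allowlisted. If metrics.yaml lists the source metric (for example node_load15) instead of the recorded Fleet metric, the recorded name is reported as unlisted.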

Verify Data Freshness

After clusters are connected, open Platform > Observe > Fleet Monitoring and check the following information on the overview dashboard:

  • Connected Clusters
  • Stale Clusters
  • Last Write Ago
  • Data Freshness Exceptions

If a cluster appears in the Cluster variable but is not counted as connected, the cluster may be known to the platform but is not writing Fleet Monitoring data. Check whether the cluster has Alauda Container Platform Fleet Monitoring Cluster Service installed from OperatorHub and has a ready FleetMonitoringAgent.

Troubleshooting

No Fleet Monitoring data is displayed

Check the following items:

  • Alauda Container Platform Fleet Monitoring Central Service is installed on the Global cluster.
  • The FleetMonitoringHub resource exists and has ready conditions.
  • The Global cluster has an available VictoriaMetrics storage and write endpoint. Prometheus-only Monitoring cannot receive Fleet Monitoring remote write data.
  • Built-in dashboards and Global rules are applied.

A cluster is missing from Connected Clusters

Check the following items on the target cluster:

  • Alauda Container Platform Fleet Monitoring Cluster Service is installed.
  • The FleetMonitoringAgent resource exists.
  • The cluster is managed by the Global cluster.
  • The Monitoring feature is enabled on the cluster.
  • The FleetMonitoringAgent status does not report missing database information or invalid Monitoring feature information.
  • The fleet-monitoring-database Secret contains a remoteWriteURL that points to the Global VictoriaMetrics write endpoint.
  • The fleet-monitoring-vmagent logs do not report remote write errors. If the logs show 405 Method Not Allowed, the remoteWriteURL may point to a Prometheus or Thanos Query endpoint instead of the VictoriaMetrics write endpoint.
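
The last two checks can be scripted. The following is a minimal sketch that assumes remoteWriteURL is stored as a data key of the fleet-monitoring-database Secret; this guide names the value but not its exact location, so treat the key path as an assumption. The function names are hypothetical.

```shell
# Sketch: inspect the remote write target and scan logs for the 405 symptom.

# Prints the decoded remoteWriteURL. $1 is the Cluster Service namespace.
# Assumes the value is stored under .data.remoteWriteURL (base64-encoded).
show_remote_write_url() {
  kubectl -n "$1" get secret fleet-monitoring-database \
    -o jsonpath='{.data.remoteWriteURL}' | base64 -d
}

# Reads log text on stdin and succeeds if it contains the 405 symptom.
# Plain text search; real log lines may word the error differently.
logs_show_405() {
  grep -F -q -- '405 Method Not Allowed'
}

# Usage:
#   show_remote_write_url <fleet-monitoring-namespace>
#   kubectl -n <fleet-monitoring-namespace> logs <vmagent-pod> | logs_show_405 \
#     && echo "remote write target rejects writes"
```

Replace <vmagent-pod> with the name of the fleet-monitoring-vmagent pod in your cluster; the pod name is not specified in this guide.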

A custom dashboard does not appear in Fleet Monitoring

Check that the dashboard was created on the Global cluster and that the dashboard resource in the cpaas-system namespace has the following label:

cpaas.io/dashboard.tag.fleet-monitoring: "true"

Dashboards without this label do not have the fleet-monitoring tag and are not listed in the Fleet Monitoring Switch list.

Data freshness is abnormal

Check the following items:

  • The connected cluster is running.
  • The Fleet Monitoring Cluster Service pods are healthy.
  • The local Monitoring component on the connected cluster is healthy.
  • The connected cluster can write data to the Global VictoriaMetrics storage.
  • The FleetMonitoringAgent status does not report configuration or resource application errors.

Learn More