Collecting Metrics for Tekton Components


Overview

Tekton components expose Prometheus-compatible metrics over HTTP endpoints. Once ServiceMonitor resources are deployed, Prometheus (or VictoriaMetrics) can discover and scrape these metrics automatically.

Namespace note: This document uses tekton-pipelines as the default namespace for control plane components (Pipelines, Triggers, Results, Chains). The main exception is EventListener services, which run in the application namespaces where the EventListeners are created.

If your deployment uses different namespaces, update both the commands and the namespaceSelector fields in the ServiceMonitor resources below.

This document describes metrics for the following Tekton components:

Tekton Pipelines: PipelineRun / TaskRun execution metrics
Tekton Triggers: metrics for EventListener, TriggerBinding, and related resources
Tekton Results: run deletion and data storage metrics
Tekton Chains: signing and provenance metrics
Controller Framework: infrastructure metrics shared by all controllers

It also covers:

How to tune metric behavior through config-observability
How to deploy ServiceMonitor resources to collect metrics
How to verify that metric collection works

Prerequisites

The Tekton control plane components are installed and running (at least the components whose metrics you plan to collect: Pipelines, Triggers, Results, and/or Chains).
kubectl is configured for the target cluster, and your account can create ServiceMonitor resources in the monitoring namespace.
A monitoring stack is deployed (Prometheus or a compatible VictoriaMetrics) that can discover ServiceMonitor resources (or the equivalent scrape-discovery objects used by your platform) and scrape the targets they select.
The Prometheus/VictoriaMetrics instance is configured to discover the ServiceMonitor objects you create (namespace and label selectors must match).
Network policies and firewalls allow the scraper pods to reach the Tekton metrics ports (9090 for most control plane services, 9000 for the Triggers controller and the EventListener sink).
If you need EventListener sink metrics, the EventListeners must exist in their target namespaces and expose the http-metrics port.
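
For clusters that already restrict ingress with NetworkPolicy, the port requirement above could be met with a policy along these lines. This is a minimal sketch, not a recommended policy: the policy name and the assumption that the scraper runs in a namespace called monitoring are illustrative and do not come from this document.

```yaml
# Hypothetical sketch: allow scraper pods to reach the Tekton metrics ports.
# Assumptions: the scraper runs in a namespace named "monitoring";
# the policy name is illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-metrics-scrape
  namespace: tekton-pipelines
spec:
  # Empty podSelector: the policy applies to every pod in tekton-pipelines.
  podSelector: {}
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          # Label that recent Kubernetes versions set on every namespace.
          kubernetes.io/metadata.name: monitoring
    ports:
    - protocol: TCP
      port: 9090   # most control plane services
    - protocol: TCP
      port: 9000   # Triggers controller and EventListener sink
```

NetworkPolicy rules are additive, but a pod becomes ingress-isolated as soon as any policy selects it. Applying this sketch in a namespace that has no other policies would therefore block all other inbound traffic (for example, to the admission webhook), so treat it as an addition to an existing policy set rather than a standalone change.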

Tekton Pipelines

The Tekton Pipelines component includes several sub-services that expose metrics on port 9090:

Service	Description	Metrics Port
`tekton-pipelines-controller`	Main reconciler for PipelineRun / TaskRun	9090
`tekton-pipelines-webhook`	Admission webhook	9090
`tekton-events-controller`	CloudEvents controller	9090
`tekton-pipelines-remote-resolvers`	Remote resource resolution	9090

Pipeline controller metrics use the tekton_pipelines_controller_ prefix.

PipelineRun Metrics

Metric Name	Type	Description	Labels
`pipelinerun_duration_seconds`	Histogram / LastValue	PipelineRun execution time in seconds	`status`, `namespace`, `pipeline`, `pipelinerun`, `reason`*
`pipelinerun_total`	Counter	Total number of completed PipelineRuns	`status`
`running_pipelineruns`	LastValue (Gauge)	Number of currently running PipelineRuns	Controlled by `metrics.running-pipelinerun.level` (see below)
`running_pipelineruns_waiting_on_pipeline_resolution`	LastValue (Gauge)	PipelineRuns waiting on Pipeline reference resolution	-
`running_pipelineruns_waiting_on_task_resolution`	LastValue (Gauge)	PipelineRuns waiting on Task reference resolution	-

* Labels marked with * are optional and depend on the config-observability configuration.

Label Levels for `running_pipelineruns`

The labels on the running_pipelineruns metric are controlled by the metrics.running-pipelinerun.level setting:

Level	Labels
`""` (default, cluster)	No labels
`"namespace"`	`namespace`
`"pipeline"`	`namespace`, `pipeline`
`"pipelinerun"`	`namespace`, `pipeline`, `pipelinerun`

Values of the `status` Label

For PipelineRun metrics:

success: the PipelineRun completed successfully
failed: the PipelineRun failed
cancelled: the PipelineRun was cancelled

For TaskRun metrics:

success: the TaskRun completed successfully
failed: the TaskRun failed

TaskRun Metrics

Metric Name	Type	Description	Labels
`taskrun_duration_seconds`	Histogram / LastValue	Standalone TaskRun execution time in seconds	`status`, `namespace`, `task`, `taskrun`, `reason`*
`pipelinerun_taskrun_duration_seconds`	Histogram / LastValue	TaskRun execution time when part of a PipelineRun	`status`, `namespace`, `task`, `taskrun`, `pipeline`, `pipelinerun`, `reason`*
`taskrun_total`	Counter	Total number of completed TaskRuns	`status`
`running_taskruns`	LastValue (Gauge)	Number of currently running TaskRuns	-
`running_taskruns_waiting_on_task_resolution_count`	LastValue (Gauge)	TaskRuns waiting on Task reference resolution	-
`running_taskruns_throttled_by_quota`	LastValue (Gauge)	TaskRuns throttled by ResourceQuota	`namespace`*
`running_taskruns_throttled_by_node`	LastValue (Gauge)	TaskRuns throttled by node-level resource constraints	`namespace`*
`taskruns_pod_latency_milliseconds`	LastValue	Pod scheduling latency for TaskRuns in milliseconds	`namespace`, `pod`, `task`, `taskrun`

`config-observability` Configuration

The config-observability ConfigMap in the tekton-pipelines namespace controls metric behavior for the Pipeline controller. This ConfigMap is managed by the Tekton Operator and should be configured through the spec.pipeline.options.configMaps field of the TektonConfig resource. For details, see Adjusting Optional Configuration Items for Subcomponents.

Hot reload behavior: config-observability is watched at runtime. Most key changes (for example, metrics.*) take effect without restarting the Pod. Allow one or two scrape intervals for the changes to show up in dashboards and queries. A restart is needed only when Pod spec settings change (for example, when CONFIG_OBSERVABILITY_NAME is changed in the Deployment).

Example configuration through TektonConfig:

apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  pipeline:
    options:
      disabled: false
      configMaps:
        config-observability:
          data:
            metrics.backend-destination: prometheus

            # PipelineRun metrics aggregation level.
            # Values: "pipelinerun" | "pipeline" (default) | "namespace"
            #   - "pipelinerun": includes pipeline + pipelinerun labels; duration uses LastValue
            #   - "pipeline": includes pipeline label only
            #   - "namespace": no pipeline/pipelinerun labels
            metrics.pipelinerun.level: "pipeline"

            # TaskRun metrics aggregation level.
            # Values: "taskrun" | "task" (default) | "namespace"
            #   - "taskrun": includes task + taskrun labels; duration uses LastValue
            #   - "task": includes task label only
            #   - "namespace": no task/taskrun labels
            metrics.taskrun.level: "task"

            # Duration metric type for PipelineRun / TaskRun.
            # Values: "histogram" (default) | "lastvalue"
            # Note: When pipelinerun.level is "pipelinerun" or taskrun.level is "taskrun",
            #       duration type is forced to "lastvalue" regardless of this setting.
            metrics.pipelinerun.duration-type: "histogram"
            metrics.taskrun.duration-type: "histogram"

            # Running PipelineRun metrics aggregation level.
            # Values: "pipelinerun" | "pipeline" | "namespace" | "" (default, cluster-level)
            metrics.running-pipelinerun.level: ""

            # Include reason label on duration metrics (pipelinerun_duration_seconds,
            # taskrun_duration_seconds, pipelinerun_taskrun_duration_seconds).
            # Values: "true" | "false" (default)
            # Warning: Enabling this increases label cardinality.
            # Note: Despite the key name, this does NOT affect count metrics
            # (pipelinerun_total / taskrun_total), only duration metrics.
            metrics.count.enable-reason: "false"

            # Include namespace label on throttled TaskRun metrics.
            # Values: "true" | "false" (default)
            metrics.taskrun.throttle.enable-namespace: "false"

Histogram Buckets

When the duration type is histogram, the following bucket boundaries are used (in seconds):

10, 30, 60, 300, 900, 1800, 3600, 5400, 10800, 21600, 43200, 86400

These correspond to: 10s, 30s, 1m, 5m, 15m, 30m, 1h, 1.5h, 3h, 6h, 12h, 24h.

Tekton Triggers

The Tekton Triggers component exposes two categories of metrics from different processes.

Controller Metrics (port 9000)

The Triggers controller reports resource-count metrics every 60 seconds.

Service	Metrics Port
`tekton-triggers-controller`	9000

Triggers controller metrics use the controller_ prefix.

Metric Name	Type	Description	Labels
`eventlistener_count`	LastValue (Gauge)	Number of EventListener resources	-
`triggerbinding_count`	LastValue (Gauge)	Number of TriggerBinding resources	-
`clustertriggerbinding_count`	LastValue (Gauge)	Number of ClusterTriggerBinding resources	-
`triggertemplate_count`	LastValue (Gauge)	Number of TriggerTemplate resources	-
`clusterinterceptor_count`	LastValue (Gauge)	Number of ClusterInterceptor resources	-

EventListener Sink Metrics

Each EventListener pod exposes additional HTTP and event-processing metrics. These metrics come from the EventListener sink process (not from the controller). Their Prometheus prefix is eventlistener_.

Metric Name (Prometheus)	Type	Description	Labels
`eventlistener_http_duration_seconds`	Histogram	EventListener HTTP request duration	-
`eventlistener_event_received_count`	Counter	Total events received by the sink	`status`
`eventlistener_triggered_resources`	Counter	Total resources created by triggers	`kind`

Histogram buckets for eventlistener_http_duration_seconds: 0.001, 0.01, 0.1, 1, 10 (seconds)
Values of status for eventlistener_event_received_count: succeeded, failed
Values of kind for eventlistener_triggered_resources: the Kubernetes Kind of the created object (for example, PipelineRun, TaskRun)

These sink metrics are exposed by each EventListener pod individually, not by the central controller. You may need a separate ServiceMonitor or PodMonitor to collect them, provided the EventListener pod exposes a metrics port.
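
If you prefer to scrape the sink pods directly rather than through their Services, a PodMonitor could look roughly like the sketch below. It assumes that EventListener pods carry the eventlistener label and name their metrics container port http-metrics; verify both on your pods before relying on it. The resource name is illustrative.

```yaml
# Hypothetical sketch: scrape EventListener sink pods directly.
# Assumptions: pods carry an "eventlistener" label and expose a
# container port named "http-metrics".
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: tekton-eventlistener-sink-pods
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-eventlistener-sink
    # prometheus: kube-prometheus
spec:
  selector:
    matchExpressions:
    - key: eventlistener
      operator: Exists
  podMetricsEndpoints:
  - port: http-metrics   # assumed container port name
    path: /metrics
    interval: 30s
  namespaceSelector:
    any: true
```

Use either this PodMonitor or a Service-based ServiceMonitor for the sink, not both; otherwise the same pods are scraped twice under different job labels.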

Tekton Results

Tekton Results has two sub-services that expose metrics.

Service	Description	Metrics Port
`tekton-results-watcher`	Watches and cleans up PipelineRun/TaskRun resources	9090
`tekton-results-api`	gRPC/REST API server	9090

Watcher Metrics

Watcher metrics use the watcher_ prefix.

Deletion Metrics

Metric Name	Type	Description	Labels
`pipelinerun_delete_count`	Counter	Total number of deleted PipelineRuns	`status`, `namespace`
`pipelinerun_delete_duration_seconds`	Histogram / LastValue	Time from PipelineRun completion to deletion	`status`, `namespace`, `pipeline`*
`taskrun_delete_count`	Counter	Total number of deleted TaskRuns	`status`, `namespace`
`taskrun_delete_duration_seconds`	Histogram / LastValue	Time from TaskRun completion to deletion	`status`, `namespace`, `pipeline`, `task`

* Optional labels depend on config-observability settings for the Results Watcher.

Note: pipelinerun_delete_count, pipelinerun_delete_duration_seconds, taskrun_delete_count, and taskrun_delete_duration_seconds are recorded only when the Watcher actually deletes a run. These metrics stay empty (no data points) unless the --completed_run_grace_period flag on the tekton-results-watcher Deployment is set to a non-zero value. By default the flag is 0, which disables automatic deletion. Set a positive duration (for example, 10m) to enable deletion after a grace period, or a negative value to delete immediately after archiving.

Values of the status label for the Results Watcher:

success: the run completed successfully
failed: the run failed
cancelled: the run was cancelled

Shared Metrics

These metrics are registered by both the PipelineRun and TaskRun reconcilers in the Watcher and track events related to data storage.

Metric Name	Type	Description	Labels
`runs_not_stored_count`	Counter	Runs deleted without being stored to Results	`kind`, `namespace`
`run_storage_latency_seconds`	Histogram	Time from run completion to successful storage	`kind`, `namespace`

The kind label identifies the run type (PipelineRun / TaskRun in some metric series, pipelinerun / taskrun in others).

Note: runs_not_stored_count is recorded only when a run is deleted externally (for example, with kubectl delete) while the Watcher holds a finalizer to coordinate archiving. It stays empty unless all of the following conditions hold:

The --logs_api flag is false (log storage disabled). When logs are enabled, the Watcher skips finalizer-based coordination entirely.

The --disable_crd_update flag is false (annotation updates enabled).

The --store_deadline flag is set to a non-zero duration. This is the maximum time the Watcher waits for archiving to finish before it gives up and allows deletion.

The run is deleted externally before it has been archived successfully (no results.tekton.dev/stored=true annotation), and the store_deadline has elapsed.

In normal operation (the run is archived before deletion, or deletion is initiated by the Watcher itself through --completed_run_grace_period), this counter stays at zero. A non-zero value indicates possible data loss: a run was deleted before its state could be saved to the Results API.

Quick reproduction (test environment): If you do not see this metric, it usually means the trigger conditions were not met, not that the metric is missing.

Configure the Results Watcher through TektonConfig so that logs_api=false, disable_crd_update=false, and store_deadline is non-zero (for example, 30s).

Temporarily set the Results API replica count to 0 through TektonConfig (spec.result.options.deployments.tekton-results-api.spec.replicas: 0) so that runs cannot be archived.

Create a TaskRun or PipelineRun and wait for it to complete.

Wait for store_deadline to elapse, then delete the run externally (kubectl delete ...).

Check the Watcher's /metrics endpoint or Prometheus for watcher_runs_not_stored_count (the name carries the component prefix in the exposition format); the value should have increased.

Restore the original TektonConfig (re-enable the Results API replicas and the usual logs_api settings).
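
The temporary scale-down in the reproduction steps above might look like this in TektonConfig. It is a sketch built from the field path quoted in those steps; the resource name config is the one used in the Pipeline example earlier in this document, and the other steps (Watcher flags) are not shown.

```yaml
# Sketch for a test environment only: stop the Results API so that
# completed runs cannot be archived.
apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  result:
    options:
      deployments:
        tekton-results-api:
          spec:
            replicas: 0   # restore the previous value after the test
```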

The run_storage_latency_seconds histogram uses the following bucket boundaries (in seconds):

0.1, 0.5, 1, 2, 5, 10, 30, 60, 120, 300, 600, 1800

Watcher `config-observability`

The Results Watcher has its own config-observability ConfigMap (its name is set through the CONFIG_OBSERVABILITY_NAME environment variable, usually tekton-results-config-observability). This ConfigMap is managed by the Tekton Operator and should be configured through the spec.result.options.configMaps field of the TektonConfig resource. For details, see Adjusting Optional Configuration Items for Subcomponents.

Hot reload behavior: The Results Watcher also watches this ConfigMap and applies most key changes without a Pod restart. A restart is needed only when Deployment-level settings change (for example, env vars/args).

The following keys are supported:

Key	Default	Values	Description
`metrics.pipelinerun.level`	`pipeline`	`pipeline`, `namespace`	Controls `pipeline` label on delete duration metrics
`metrics.taskrun.level`	`task`	`task`, `namespace`	Controls `task` label on delete duration metrics
`metrics.pipelinerun.duration-type`	`histogram`	`histogram`, `lastvalue`	Duration metric aggregation type for both PipelineRun and TaskRun deletion
`metrics.taskrun.duration-type`	`histogram`	`histogram`, `lastvalue`	Parsed but currently not used; `metrics.pipelinerun.duration-type` controls both

Note: Unlike Tekton Pipelines, the Results Watcher does not support the pipelinerun / taskrun levels that provide per-run granularity. It also has no metrics.count.enable-reason, metrics.running-pipelinerun.level, or metrics.taskrun.throttle.enable-namespace keys.

Known upstream issue: taskrun_delete_duration_seconds uses metrics.pipelinerun.duration-type (not metrics.taskrun.duration-type) to determine its aggregation type. This appears to be a copy-paste error in the Results source code.
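
As a sketch of how these keys could be set, the fragment below mirrors the shape of the Pipeline example earlier in this document. It assumes the ConfigMap is named tekton-results-config-observability (the usual name mentioned above); confirm the actual name from CONFIG_OBSERVABILITY_NAME in your Watcher Deployment.

```yaml
# Sketch: drop the pipeline/task labels from Results Watcher delete
# duration metrics. The ConfigMap name is an assumption (the usual default).
apiVersion: operator.tekton.dev/v1alpha1
kind: TektonConfig
metadata:
  name: config
spec:
  result:
    options:
      configMaps:
        tekton-results-config-observability:
          data:
            metrics.pipelinerun.level: "namespace"
            metrics.taskrun.level: "namespace"
            # Applies to both PipelineRun and TaskRun delete durations
            # (see the known upstream issue above).
            metrics.pipelinerun.duration-type: "histogram"
```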

API Server Metrics

The API server exposes standard Prometheus gRPC metrics through the go-grpc-prometheus library on port 9090. These include:

grpc_server_handled_total: total number of RPCs completed on the server
grpc_server_started_total: total number of RPCs started on the server
grpc_server_msg_received_total / grpc_server_msg_sent_total: message counts
grpc_server_handling_seconds (if PROMETHEUS_HISTOGRAM is enabled): RPC handling duration

Tekton Chains

Tekton Chains is a security component that generates, signs, and stores provenance for artifacts built with Tekton Pipelines. It watches completed TaskRuns and PipelineRuns, then creates attestations and signatures.

Service	Description	Metrics Port
`tekton-chains-metrics`	Chains watcher/controller	9090 (`http-metrics`)

Chains controller metrics use the watcher_ prefix (the same as the Results Watcher, but the custom metric names differ, so they do not collide).

Chains Metrics

All Chains metrics are Counters without labels.

Metric Name (Prometheus)	Type	Description
`watcher_taskrun_sign_created_total`	Counter	Total signed messages for TaskRuns
`watcher_taskrun_payload_stored_total`	Counter	Total stored payloads for TaskRuns
`watcher_taskrun_marked_signed_total`	Counter	Total TaskRuns marked as signed
`watcher_pipelinerun_sign_created_total`	Counter	Total signed messages for PipelineRuns
`watcher_pipelinerun_payload_stored_total`	Counter	Total stored payloads for PipelineRuns
`watcher_pipelinerun_marked_signed_total`	Counter	Total PipelineRuns marked as signed

Note: The official Tekton Chains documentation also mentions *_signing_failures_total counters for both TaskRun and PipelineRun, but they are absent from the current upstream source code. Verify this against your deployed version.

Controller Framework Metrics

All Tekton controllers automatically expose the following infrastructure metrics. They use the same prefix as the component's custom metrics (for example, tekton_pipelines_controller_, controller_, watcher_).

Metric Name (without prefix)	Type	Description
`client_latency`	Histogram	Kubernetes API client request latency (seconds)
`client_results`	Counter	Kubernetes API request count (by status code)
`workqueue_depth`	Gauge	Current workqueue depth
`workqueue_adds_total`	Counter	Total workqueue additions
`workqueue_queue_latency_seconds`	Histogram	Time items spend waiting in the workqueue
`workqueue_work_duration_seconds`	Histogram	Time spent processing workqueue items
`workqueue_retries_total`	Counter	Total workqueue retries
`workqueue_unfinished_work_seconds`	Histogram	Duration of unfinished workqueue items
`workqueue_longest_running_processor_seconds`	Histogram	Duration of longest running workqueue processor
`reconcile_count`	Counter	Total reconciler invocations (labeled by `reconciler`, `success`, `namespace_name`)
`reconcile_latency`	Histogram	Reconciler invocation latency (labeled by `reconciler`, `success`, `namespace_name`)
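
The workqueue metrics in the table above lend themselves to alerting on controller backlog. The following PrometheusRule is a minimal sketch under stated assumptions: the rule and alert names, the threshold of 100, and the 10m duration are illustrative values, not recommendations from this document. As with ServiceMonitor, the resource labels must match the ruleSelector of your Prometheus CR.

```yaml
# Hypothetical sketch: alert when the Pipelines controller workqueue
# stays deep. Threshold and duration are placeholders to tune.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tekton-controller-workqueue
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-pipelines
    # prometheus: kube-prometheus
spec:
  groups:
  - name: tekton-controller-framework
    rules:
    - alert: TektonPipelinesWorkqueueBacklog
      # Prefix + metric name, as described in this section.
      expr: max(tekton_pipelines_controller_workqueue_depth) > 100
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: Tekton Pipelines controller workqueue has stayed deep for 10 minutes
```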

Configuring ServiceMonitor

To let Prometheus collect Tekton metrics, deploy ServiceMonitor resources.

The prerequisites are listed in the Prerequisites section.

Follow the guidance that matches your monitoring stack:

If you use Prometheus (Prometheus Operator), labels such as metadata.labels.prometheus: kube-prometheus must match spec.serviceMonitorSelector in the Prometheus CR; otherwise the ServiceMonitor will not be scraped.
If you use VictoriaMetrics, labels such as prometheus: kube-prometheus are usually not required; create the ServiceMonitor/VMServiceScrape according to your monitoring setup.

With Prometheus, use the following commands to locate and inspect the selector:

# 1) Locate Prometheus CRs (resource type: monitoring.coreos.com/v1, Kind=Prometheus)
$ kubectl get prometheus -A

# 2) Check ServiceMonitor selector on the target Prometheus instance
$ kubectl get prometheus -n <prometheus-namespace> <prometheus-name> -o yaml | yq '.spec.serviceMonitorSelector'

If your cluster has no Prometheus CR, monitoring is usually managed by the platform (for example, VictoriaMetrics) or implemented differently. In that case, labels such as prometheus: kube-prometheus are usually not required; follow your platform's scrape rules.

For more information, see Integrating External Metrics.
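
On a VictoriaMetrics stack that uses the VictoriaMetrics Operator, the native equivalent of a ServiceMonitor is a VMServiceScrape. The sketch below mirrors the Pipeline ServiceMonitor from this document; it assumes the operator's operator.victoriametrics.com/v1beta1 API is installed in your cluster. Depending on how your platform is configured, the operator may also pick up ServiceMonitor objects directly, in which case this resource is unnecessary.

```yaml
# Hypothetical sketch: VictoriaMetrics Operator equivalent of the
# Pipeline ServiceMonitor. Check that the CRD exists before applying:
#   kubectl get crd vmservicescrapes.operator.victoriametrics.com
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
  name: tekton-pipelines-metrics
  namespace: tekton-pipelines
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-pipelines
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines
```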

Pipeline ServiceMonitor

Pipeline ServiceMonitor YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-pipelines-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-pipelines
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-pipelines
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

This ServiceMonitor matches the Pipeline services labeled app.kubernetes.io/part-of: tekton-pipelines (including remote-resolvers) and scrapes them in the tekton-pipelines namespace.

Triggers ServiceMonitor

Triggers ServiceMonitor YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-triggers-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-triggers
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-triggers
      app.kubernetes.io/component: controller
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

This ServiceMonitor collects only the Triggers controller metrics (controller_*). It does not include EventListener sink metrics.

EventListener Sink ServiceMonitor

EventListener Sink ServiceMonitor YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-eventlistener-sink-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-eventlistener-sink
    # prometheus: kube-prometheus
spec:
  selector:
    matchExpressions:
    - key: eventlistener
      operator: Exists
    - key: app.kubernetes.io/managed-by
      operator: In
      values:
      - EventListener
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    any: true

EventListener services usually run in application namespaces, so this example uses namespaceSelector.any: true to scrape across namespaces. If you need a tighter scope, switch to matchNames and list the allowed namespaces explicitly.
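
A tighter-scoped variant of the ServiceMonitor above might look like the following sketch. The namespace names team-a and team-b are hypothetical placeholders for the application namespaces that host your EventListeners.

```yaml
# Sketch: same selector as above, restricted to explicitly listed
# namespaces. "team-a" and "team-b" are placeholder names.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-eventlistener-sink-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-eventlistener-sink
    # prometheus: kube-prometheus
spec:
  selector:
    matchExpressions:
    - key: eventlistener
      operator: Exists
    - key: app.kubernetes.io/managed-by
      operator: In
      values:
      - EventListener
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - team-a
    - team-b
```

With matchNames, EventListeners created in a new namespace are not scraped until that namespace is added to the list.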

Results ServiceMonitor

Results services carry both the app.kubernetes.io/part-of: tekton-results label and an app.kubernetes.io/name label. To target the API and the Watcher precisely (and exclude Postgres), this example selects on app.kubernetes.io/name:

Results ServiceMonitor YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-results-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-results
    # prometheus: kube-prometheus
spec:
  selector:
    matchExpressions:
    - key: app.kubernetes.io/name
      operator: In
      values:
      - tekton-results-api
      - tekton-results-watcher
  endpoints:
  - port: prometheus
    path: /metrics
    interval: 30s
  - port: metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

The Results API server uses the port name prometheus (9090), while the Watcher uses the port name metrics (9090). Each service exposes only one of these port names, so only the matching endpoint is scraped.

Chains ServiceMonitor

Chains ServiceMonitor YAML

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: tekton-chains-metrics
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-chains
    # prometheus: kube-prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/part-of: tekton-chains
  endpoints:
  - port: http-metrics
    path: /metrics
    interval: 30s
  namespaceSelector:
    matchNames:
    - tekton-pipelines

Verification

After deploying the ServiceMonitor resources, verify that Prometheus is scraping the targets.

Checking the Metrics Endpoints Directly

# Pipeline controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-pipelines-controller 9090:9090
$ curl -s http://localhost:9090/metrics | grep tekton_pipelines_controller_

# HELP tekton_pipelines_controller_client_latency How long Kubernetes API requests take
# TYPE tekton_pipelines_controller_client_latency histogram
tekton_pipelines_controller_client_latency_bucket{name="",le="1e-05"} 0
tekton_pipelines_controller_client_latency_bucket{name="",le="0.0001"} 0
tekton_pipelines_controller_client_latency_bucket{name="",le="0.001"} 0

# Triggers controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-triggers-controller 9000:9000
$ curl -s http://localhost:9000/metrics | grep controller_

# HELP controller_client_latency How long Kubernetes API requests take
# TYPE controller_client_latency histogram
controller_client_latency_bucket{name="",le="1e-05"} 0
controller_client_latency_bucket{name="",le="0.0001"} 1
controller_client_latency_bucket{name="",le="0.001"} 2

# EventListener sink metrics (replace namespace/service)
$ kubectl port-forward -n <eventlistener-namespace> svc/<eventlistener-service> 9000:9000
$ curl -s http://localhost:9000/metrics | grep eventlistener_

# HELP eventlistener_client_latency How long Kubernetes API requests take
# TYPE eventlistener_client_latency histogram
eventlistener_client_latency_bucket{name="",le="1e-05"} 0
eventlistener_client_latency_bucket{name="",le="0.0001"} 0
eventlistener_client_latency_bucket{name="",le="0.001"} 0

# HELP eventlistener_triggered_resources Count of the number of triggered eventlistener resources
# TYPE eventlistener_triggered_resources counter
eventlistener_triggered_resources{kind="PipelineRun"} 10

# Results watcher
$ kubectl port-forward -n tekton-pipelines svc/tekton-results-watcher 9091:9090
$ curl -s http://localhost:9091/metrics | grep watcher_

# HELP watcher_client_latency How long Kubernetes API requests take
# TYPE watcher_client_latency histogram
watcher_client_latency_bucket{name="",le="1e-05"} 0
watcher_client_latency_bucket{name="",le="0.0001"} 0
watcher_client_latency_bucket{name="",le="0.001"} 0

# Results API
$ kubectl port-forward -n tekton-pipelines svc/tekton-results-api-service 9092:9090
$ curl -s http://localhost:9092/metrics | grep grpc_server_

# HELP grpc_server_handled_total Total number of RPCs completed on the server, regardless of success or failure.
# TYPE grpc_server_handled_total counter
grpc_server_handled_total{grpc_code="Aborted",grpc_method="Check",grpc_service="grpc.health.v1.Health",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="CreateRecord",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 0
grpc_server_handled_total{grpc_code="Aborted",grpc_method="CreateResult",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 0

# HELP grpc_server_started_total Total number of RPCs started on the server.
# TYPE grpc_server_started_total counter
grpc_server_started_total{grpc_method="Check",grpc_service="grpc.health.v1.Health",grpc_type="unary"} 337606
grpc_server_started_total{grpc_method="CreateRecord",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 10301
grpc_server_started_total{grpc_method="CreateResult",grpc_service="tekton.results.v1alpha2.Results",grpc_type="unary"} 832

# Chains controller
$ kubectl port-forward -n tekton-pipelines svc/tekton-chains-metrics 9093:9090
$ curl -s http://localhost:9093/metrics | grep watcher_

# HELP watcher_client_latency How long Kubernetes API requests take
# TYPE watcher_client_latency histogram
watcher_client_latency_bucket{name="",le="1e-05"} 0
watcher_client_latency_bucket{name="",le="0.0001"} 0
watcher_client_latency_bucket{name="",le="0.001"} 0

EventListener sink metrics such as eventlistener_event_received_count and eventlistener_http_duration_seconds are produced only by incoming requests. Send at least one request to the EventListener before checking them.

Checking Prometheus Targets

# Verify ServiceMonitor resources exist
$ kubectl get servicemonitor -n tekton-pipelines

NAME                                AGE
tekton-chains-metrics               10m
tekton-eventlistener-sink-metrics   10m
tekton-pipelines-metrics            10m
tekton-results-metrics              10m
tekton-triggers-metrics             10m

# Check Prometheus targets (via Prometheus UI or API)
# Look for targets with job labels matching the ServiceMonitor names

Example PromQL Queries

# PipelineRun cumulative success rate (avoids misinterpretation in empty completion windows)
100 * sum(tekton_pipelines_controller_pipelinerun_total{status="success"}) / clamp_min(sum(tekton_pipelines_controller_pipelinerun_total), 1)

# Completed PipelineRuns in the last 5 minutes (throughput)
round(sum(increase(tekton_pipelines_controller_pipelinerun_total[5m])))

# PipelineRun duration P95 (histogram mode)
histogram_quantile(0.95,
  rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])
)

# TaskRun duration P95 (histogram mode, includes standalone + in-pipeline TaskRuns)
histogram_quantile(0.95,
  (
    sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m]))
    +
    sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m]))
  )
)

# PipelineRun duration (lastvalue mode)
avg_over_time(tekton_pipelines_controller_pipelinerun_duration_seconds[5m])

# Currently running PipelineRuns (single series to avoid duplicate legends)
max(tekton_pipelines_controller_running_pipelineruns)

# TaskRuns throttled by resource quota
max(tekton_pipelines_controller_running_taskruns_throttled_by_quota)

# Trigger resource counts
controller_eventlistener_count
controller_triggertemplate_count

# Chains signing activity
watcher_taskrun_sign_created_total
watcher_pipelinerun_sign_created_total
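
If some of the queries above are evaluated often (for example, on dashboards), they could be precomputed as recording rules. The PrometheusRule below is a sketch under stated assumptions: the resource name, group name, and recorded series names are illustrative, and the P95 rule adds sum by (le) so that it yields a single cluster-wide series, which differs slightly from the per-series query shown above. It requires metrics.pipelinerun.duration-type=histogram.

```yaml
# Hypothetical sketch: recording rules for two of the queries above.
# All rule and series names are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tekton-pipelines-recording
  namespace: tekton-pipelines
  labels:
    app.kubernetes.io/name: tekton-pipelines
    # prometheus: kube-prometheus
spec:
  groups:
  - name: tekton-pipelines.recording
    interval: 30s
    rules:
    # Cumulative PipelineRun success rate, in percent.
    - record: tekton:pipelinerun_success:percent
      expr: |
        100 * sum(tekton_pipelines_controller_pipelinerun_total{status="success"})
          / clamp_min(sum(tekton_pipelines_controller_pipelinerun_total), 1)
    # Cluster-wide PipelineRun duration P95 over 5 minutes (histogram mode).
    - record: tekton:pipelinerun_duration_seconds:p95_5m
      expr: |
        histogram_quantile(0.95,
          sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m]))
        )
```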

Примеры MonitorDashboard

Следующие ресурсы MonitorDashboard предоставляют готовые к использованию dashboard для мониторинга компонентов Tekton. Разверните их в namespace cpaas-system в папке tekton.

Важно: Каждый panel должен содержать id (уникальное целое число), datasource: prometheus и transformations: []. Каждый target должен содержать datasource: prometheus и refId. Панели Duration P50/P95 в этом документе используют запросы к *_bucket и требуют metrics.*.duration-type=histogram; если вы используете lastvalue, замените эти запросы выражениями в стиле LastValue, такими как avg_over_time(...).

Tekton Pipeline Dashboard

Tekton Pipeline Dashboard YAML

kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-pipeline
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Pipeline Overview
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: PipelineRun Total (by status)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: sum by (status) (tekton_pipelines_controller_pipelinerun_total)
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TaskRun Total (by status)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 0 }
        targets:
          - datasource: prometheus
            expr: sum by (status) (tekton_pipelines_controller_taskrun_total)
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: PipelineRun Success Rate (cumulative)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 4, x: 16, y: 0 }
        targets:
          - datasource: prometheus
            expr: "100 * sum(tekton_pipelines_controller_pipelinerun_total{status=\"success\"}) / clamp_min(sum(tekton_pipelines_controller_pipelinerun_total), 1)"
            refId: A
        fieldConfig:
          defaults:
            unit: percent
            color: { mode: thresholds }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds:
              mode: absolute
              steps:
                - { color: red, value: null }
                - { color: orange, value: 80 }
                - { color: green, value: 95 }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 12
        title: Completed PipelineRuns (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 4, x: 20, y: 0 }
        targets:
          - datasource: prometheus
            expr: "round(sum(increase(tekton_pipelines_controller_pipelinerun_total[5m])))"
            legendFormat: completed
            refId: A
        fieldConfig:
          defaults:
            unit: short
            decimals: 0
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: Running PipelineRuns
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns)
            legendFormat: running
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: Running TaskRuns
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns)
            legendFormat: running
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: TaskRuns Throttled
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 8 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_throttled_by_quota)
            legendFormat: by quota
            refId: A
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_throttled_by_node)
            legendFormat: by node
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: orange, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 7
        title: PipelineRun Duration P50 / P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 8
        title: TaskRun Duration P50 / P95 (Standalone)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 13
        title: TaskRun Duration P50 / P95 (In-Pipeline)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 16 }
        targets:
          - datasource: prometheus
            expr: (histogram_quantile(0.5, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P50
            refId: A
          - datasource: prometheus
            expr: (histogram_quantile(0.95, sum by (le) (rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket[5m])))) and on() (sum(rate(tekton_pipelines_controller_pipelinerun_taskrun_duration_seconds_bucket{le="+Inf"}[5m])) > 0)
            legendFormat: P95
            refId: B
        fieldConfig:
          defaults:
            unit: s
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 9
        title: Workqueue Depth
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 0, y: 24 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_workqueue_depth)
            legendFormat: depth
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 10
        title: Reconcile Count (by success)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 8, y: 24 }
        targets:
          - datasource: prometheus
            expr: sum(increase(tekton_pipelines_controller_reconcile_count{success="true"}[5m]))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: sum(increase(tekton_pipelines_controller_reconcile_count{success="false"}[5m]))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 11
        title: Resolution Waiting
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 8, x: 16, y: 24 }
        targets:
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns_waiting_on_pipeline_resolution)
            legendFormat: PR waiting pipeline
            refId: A
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_pipelineruns_waiting_on_task_resolution)
            legendFormat: PR waiting task
            refId: B
          - datasource: prometheus
            expr: max(tekton_pipelines_controller_running_taskruns_waiting_on_task_resolution_count)
            legendFormat: TR waiting task
            refId: C
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: orange, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

Interpreting the Tekton Pipeline Dashboard (FAQ)

PipelineRun Total (by status) is a counter of completion events recorded by the controller, not the total number of PipelineRun objects. In the current implementation, user-initiated cancellation (spec.status=Cancelled) may not reach this counting path, so the cancelled series may be missing. To verify the volume of cancellations, inspect the PipelineRun objects and events.
Running PipelineRuns is a real-time snapshot (how many are running right now). It can change independently of PipelineRun Total.
Completed PipelineRuns (last 5m) is throughput (how many runs completed in the last 5 minutes). A value of 0 under low load or during idle periods is normal.
PipelineRun Success Rate (cumulative) is computed cumulatively since the controller started, not over a 5-minute window. A short-lived failure does not cause an immediate sharp change.
Reconcile Count (by success) measures the controller's reconcile cycles, not the number of PipelineRuns.
The status series are shown only for label values that have samples in the selected time range. If the window contains no samples for a given status, the corresponding curve and legend entry do not appear.
TaskRun Duration P50 / P95 (Standalone) and TaskRun Duration P50 / P95 (In-Pipeline) are split to avoid the instability of a mixed query. In environments where only one histogram family is available, the other panel may be empty; this is expected.
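
To check which status values currently have series, and how recent completions split by status, queries along the following lines can help. This is a sketch that reuses the counter behind the PipelineRun Total panel; the 5-minute window simply mirrors the throughput panel.

```promql
# Number of series per status value currently exposed by the controller
count by (status) (tekton_pipelines_controller_pipelinerun_total)

# PipelineRuns completed in the last 5 minutes, split by status
round(sum by (status) (increase(tekton_pipelines_controller_pipelinerun_total[5m])))
```

A status value missing from the first query has no samples at evaluation time, which is consistent with the note about the cancelled series.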

Tekton Triggers Dashboard

Tekton Triggers Dashboard YAML

kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-triggers
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Triggers Overview
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: EventListener Count
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_eventlistener_count
            legendFormat: EventListener
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TriggerTemplate Count
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 5, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_triggertemplate_count
            legendFormat: TriggerTemplate
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: TriggerBinding Count
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 10, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_triggerbinding_count
            legendFormat: TriggerBinding
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: ClusterTriggerBinding
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 5, x: 15, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_clustertriggerbinding_count
            legendFormat: ClusterTriggerBinding
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: ClusterInterceptor
        type: timeseries
        datasource: prometheus
        gridPos: { h: 6, w: 4, x: 20, y: 0 }
        targets:
          - datasource: prometheus
            expr: controller_clusterinterceptor_count
            legendFormat: ClusterInterceptor
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: All Trigger Resource Counts (trend)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 24, x: 0, y: 6 }
        targets:
          - datasource: prometheus
            expr: controller_eventlistener_count
            legendFormat: EventListener
            refId: A
          - datasource: prometheus
            expr: controller_triggertemplate_count
            legendFormat: TriggerTemplate
            refId: B
          - datasource: prometheus
            expr: controller_triggerbinding_count
            legendFormat: TriggerBinding
            refId: C
          - datasource: prometheus
            expr: controller_clustertriggerbinding_count
            legendFormat: ClusterTriggerBinding
            refId: D
          - datasource: prometheus
            expr: controller_clusterinterceptor_count
            legendFormat: ClusterInterceptor
            refId: E
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

Interpreting the Tekton Triggers Dashboard (FAQ)

EventListener Count, TriggerTemplate Count, TriggerBinding Count, ClusterTriggerBinding, and ClusterInterceptor are snapshots of object counts, not request volume or event-processing throughput.
All Trigger Resource Counts (trend) shows the combined trend of the same resource counters. Brief deviations from the single-resource panels within one scrape interval are normal.
Showing 0 when no Triggers resources exist is normal and does not indicate a scrape failure.
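
To tell a genuine zero apart from a missing series, the PromQL absent() function can be used. A minimal sketch with one of the counters from this dashboard:

```promql
# Returns 1 when no controller_eventlistener_count series exists at all
# (for example, the target is not being scraped); returns no data otherwise.
absent(controller_eventlistener_count)
```

If this query returns 1 while the panels are empty, the Prometheus target is the first place to look, rather than the Triggers resources themselves.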

Tekton Results Dashboard

Tekton Results Dashboard YAML

kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-results
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Results Overview
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: PipelineRun Reconcile Count (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler",success="true"}[5m])))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler",success="false"}[5m])))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: TaskRun Reconcile Count (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler",success="true"}[5m])))
            legendFormat: success=true
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_reconcile_count{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler",success="false"}[5m])))
            legendFormat: success=false
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: PipelineRun Reconcile Latency P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: histogram_quantile(0.95, sum by (le) (rate(watcher_reconcile_latency_bucket{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler"}[5m])))
            legendFormat: P95
            refId: A
        fieldConfig:
          defaults:
            unit: ms
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: TaskRun Reconcile Latency P95
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 8 }
        targets:
          - datasource: prometheus
            expr: histogram_quantile(0.95, sum by (le) (rate(watcher_reconcile_latency_bucket{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler"}[5m])))
            legendFormat: P95
            refId: A
        fieldConfig:
          defaults:
            unit: ms
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 5
        title: Workqueue Depth (PipelineRun vs TaskRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 16 }
        targets:
          - datasource: prometheus
            expr: sum(watcher_work_queue_depth{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler"})
            legendFormat: pipelinerun
            refId: A
          - datasource: prometheus
            expr: sum(watcher_work_queue_depth{reconciler="github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler"})
            legendFormat: taskrun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 6
        title: Workqueue Adds (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 16 }
        targets:
          - datasource: prometheus
            expr: round(sum(increase(watcher_workqueue_adds_total{name=~"github.com.tektoncd.results.pkg.watcher.reconciler.pipelinerun.Reconciler-(consumer|fast|slow)"}[5m])))
            legendFormat: pipelinerun adds
            refId: A
          - datasource: prometheus
            expr: round(sum(increase(watcher_workqueue_adds_total{name=~"github.com.tektoncd.results.pkg.watcher.reconciler.taskrun.Reconciler-(consumer|fast|slow)"}[5m])))
            legendFormat: taskrun adds
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 7
        title: gRPC Request Rate (Results API)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 24 }
        targets:
          - datasource: prometheus
            expr: "sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\"}[5m]))"
            legendFormat: requests
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 8
        title: gRPC Error Percentage (Results API, excl. NotFound/Canceled)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 24 }
        targets:
          - datasource: prometheus
            expr: "100 * ((sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\",grpc_code!~\"OK|NotFound|Canceled\"}[5m])) or vector(0)) / clamp_min((sum(rate(grpc_server_handled_total{grpc_service=~\"tekton.results.*\"}[5m])) or vector(0)), 0.001))"
            legendFormat: error %
            refId: A
        fieldConfig:
          defaults:
            unit: percent
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: red, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

Interpreting the Tekton Results Dashboard (FAQ)

This version of the dashboard is based on the Results Watcher reconcile/workqueue metrics and the Results API gRPC metrics, so it stays populated in typical deployments (logs_api=true, automatic deletion disabled).
PipelineRun Reconcile Count (last 5m) and TaskRun Reconcile Count (last 5m) show separate 5-minute increases for success=true and success=false.
PipelineRun Reconcile Latency P95 and TaskRun Reconcile Latency P95 are computed from the Watcher reconcile latency histogram. Under low traffic the line may be sparse.
Workqueue Depth shows the current queue depth, and Workqueue Adds (last 5m) shows the enqueue volume over the last 5 minutes.
gRPC Error Percentage (Results API, excl. NotFound/Canceled) is the share of abnormal errors among all requests; the routine business return codes NotFound and Canceled are not counted as errors.
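
When the error percentage rises, breaking the failures down by gRPC status code shows which codes drive it. The following sketch reuses the metric and labels from the gRPC panel queries:

```promql
# Results API request rate per non-OK gRPC code, including NotFound and Canceled
sum by (grpc_code) (rate(grpc_server_handled_total{grpc_service=~"tekton.results.*",grpc_code!="OK"}[5m]))
```

NotFound and Canceled are kept in this breakdown on purpose, so their volume can be compared with the codes that the percentage panel counts as errors.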

Tekton Chains Dashboard

Tekton Chains Dashboard YAML

kind: MonitorDashboard
apiVersion: ait.alauda.io/v1alpha2
metadata:
  labels:
    cpaas.io/dashboard.folder: tekton
    cpaas.io/dashboard.is.home.dashboard: "false"
    cpaas.io/dashboard.tag.tekton: "true"
  name: tekton-chains
  namespace: cpaas-system
spec:
  body:
    titleZh: Tekton Chains Overview
    tags:
      - tekton
    time:
      from: now-1h
      to: now
    templating:
      list: []
    panels:
      - id: 1
        title: TaskRun Signatures Created (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_sign_created_total[5m]))
            legendFormat: sign created
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 2
        title: PipelineRun Signatures Created (last 5m)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 0 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_sign_created_total[5m]))
            legendFormat: sign created
            refId: A
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 3
        title: Payloads Stored (last 5m, TaskRun vs PipelineRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 0, y: 8 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_payload_stored_total[5m]))
            legendFormat: TaskRun
            refId: A
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_payload_stored_total[5m]))
            legendFormat: PipelineRun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []
      - id: 4
        title: Marked Signed (last 5m, TaskRun vs PipelineRun)
        type: timeseries
        datasource: prometheus
        gridPos: { h: 8, w: 12, x: 12, y: 8 }
        targets:
          - datasource: prometheus
            expr: round(increase(watcher_taskrun_marked_signed_total[5m]))
            legendFormat: TaskRun
            refId: A
          - datasource: prometheus
            expr: round(increase(watcher_pipelinerun_marked_signed_total[5m]))
            legendFormat: PipelineRun
            refId: B
        fieldConfig:
          defaults:
            color: { mode: palette-classic }
            custom: { drawStyle: line, fillOpacity: 0, lineWidth: 1, spanNulls: false }
            thresholds: { mode: absolute, steps: [{ color: green, value: null }] }
          overrides: []
        options:
          legend: { calcs: [latest], displayMode: list, placement: bottom, showLegend: true }
          tooltip: { mode: multi, sort: desc }
        transformations: []

Interpreting the Tekton Chains Dashboard (FAQ)

TaskRun Signatures Created (last 5m), PipelineRun Signatures Created (last 5m), Payloads Stored (last 5m), and Marked Signed (last 5m) use increase(...[5m]) and show the increase over the last five minutes.
When there is no new signing or storage activity, these lines drop to 0; this does not indicate a component failure.
Payloads Stored and Marked Signed reflect different processing stages, so their values do not always have to match.
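
To see why an idle period produces a flat 0 rather than a gap, it helps to model what a windowed increase does to a counter. The Python sketch below is a simplified stand-in for Prometheus `increase()`: it ignores the extrapolation Prometheus applies at window boundaries, and the function name `counter_increase` and the sample values are hypothetical.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Sum the growth of a counter across consecutive samples in one window.

    A drop between two samples is treated as a counter reset (for example
    after a controller restart), so the value after the reset counts in full.
    """
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    return total


# Hypothetical scrapes of a *_sign_created_total counter within a 5-minute window.
active = [10, 12, 15, 15]   # signatures were created during the window
idle = [15, 15, 15, 15]     # nothing new was signed
restart = [15, 17, 1, 3]    # counter reset to 0 mid-window, then resumed

print(counter_increase(active))   # 5.0
print(counter_increase(idle))     # 0.0
print(counter_increase(restart))  # 5.0
```

The idle window still has samples, so the result is 0 and the panel keeps drawing a line. A component that stops exposing metrics would instead yield no samples and an empty series.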
