Prometheus Metrics

Prometheus metrics exported by Kueue

Kueue exposes Prometheus metrics to monitor the health of the system and the status of ClusterQueues and LocalQueues.

Kueue health

Use the following metrics to monitor the health of the Kueue controllers:

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| kueue_admission_attempts_total | Counter | The total number of attempts to admit workloads. Each admission attempt might try to admit more than one workload. | result: possible values are success or inadmissible |
| kueue_admission_attempt_duration_seconds | Histogram | The latency of an admission attempt. | result: possible values are success or inadmissible |
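
As an illustration of how the result label can be used, the following minimal sketch computes the share of admission attempts that succeeded from a scrape in the Prometheus text format. The parser is deliberately simplified (it assumes no timestamps and no escaped quotes in label values), and the helper names and the sample values in SAMPLE_SCRAPE are hypothetical, not taken from a real cluster.

```python
import re

# One sample line of the Prometheus text format: name{labels} value
# Simplifying assumption: no trailing timestamp, no escaped quotes in labels.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_samples(text):
    """Yield (name, labels, value) for each sample line, skipping comments."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # line does not fit the simplified format
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        yield match.group("name"), labels, float(match.group("value"))


def admission_success_ratio(text):
    """Return success / (success + inadmissible), or None without attempts."""
    counts = {"success": 0.0, "inadmissible": 0.0}
    for name, labels, value in parse_samples(text):
        if name == "kueue_admission_attempts_total":
            result = labels.get("result")
            if result in counts:
                counts[result] += value
    total = sum(counts.values())
    return counts["success"] / total if total else None


# Hypothetical scrape output used only for illustration.
SAMPLE_SCRAPE = """
# TYPE kueue_admission_attempts_total counter
kueue_admission_attempts_total{result="success"} 30
kueue_admission_attempts_total{result="inadmissible"} 10
"""

print(admission_success_ratio(SAMPLE_SCRAPE))
```

Because the metric is a counter, the ratio computed this way covers everything since the counter was last reset. To look at a recent window, take the difference between two scrapes first.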

ClusterQueue status

Use the following metrics to monitor the status of your ClusterQueues:

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| kueue_pending_workloads | Gauge | The number of pending workloads. | cluster_queue: the name of the ClusterQueue; status: possible values are active or inadmissible |
| kueue_quota_reserved_workloads_total | Counter | The total number of workloads that got a quota reservation. | cluster_queue: the name of the ClusterQueue |
| kueue_quota_reserved_wait_time_seconds | Histogram | The time from when a workload was created or requeued until it got a quota reservation. | cluster_queue: the name of the ClusterQueue |
| kueue_admitted_workloads_total | Counter | The total number of admitted workloads. | cluster_queue: the name of the ClusterQueue |
| kueue_evicted_workloads_total | Counter | The total number of evicted workloads. | cluster_queue: the name of the ClusterQueue; reason: possible values are Preempted, PodsReadyTimeout, AdmissionCheck, ClusterQueueStopped or Deactivated |
| kueue_admission_wait_time_seconds | Histogram | The time from when a workload was created or requeued until admission. | cluster_queue: the name of the ClusterQueue |
| kueue_admission_checks_wait_time_seconds | Histogram | The time from when a workload got the quota reservation until admission. | cluster_queue: the name of the ClusterQueue |
| kueue_admitted_active_workloads | Gauge | The number of admitted Workloads that are active (unsuspended and not finished). | cluster_queue: the name of the ClusterQueue |
| kueue_cluster_queue_status | Gauge | Reports the status of the ClusterQueue. | cluster_queue: the name of the ClusterQueue; status: possible values are pending, active or terminated. For a given ClusterQueue, only one of the statuses reports a value of 1. |
| kueue_reserving_active_workloads | Gauge | The number of Workloads that are reserving quota, per cluster_queue. | cluster_queue: the name of the ClusterQueue |
| kueue_admission_cycle_preemption_skips | Gauge | The number of Workloads in the ClusterQueue that got preemption candidates but had to be skipped because other ClusterQueues needed the same resources in the same cycle. | cluster_queue: the name of the ClusterQueue |
| kueue_preempted_workloads_total | Counter | The number of preempted workloads per preempting_cluster_queue. | preempting_cluster_queue: the name of the preempting ClusterQueue; reason: possible values are InClusterQueue (preempted by a workload in the same ClusterQueue), InCohortReclamation (preempted by a workload in the same cohort due to reclamation of nominal quota), InCohortFairSharing (preempted by a workload in the same cohort due to fair sharing) or InCohortReclaimWhileBorrowing (preempted by a workload in the same cohort due to reclamation of nominal quota while borrowing) |
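
Several of the metrics above are histograms, which Prometheus exposes as cumulative bucket counts. The sketch below shows one way to estimate a quantile (for example the median of kueue_admission_wait_time_seconds) from such buckets using linear interpolation, in the spirit of PromQL's histogram_quantile. The function name, the bucket boundaries and the counts in BUCKETS are hypothetical and chosen only for illustration.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs and should
    include the (math.inf, total_count) bucket.
    """
    if not buckets:
        return None
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return None
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assume the first bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile falls in the +Inf bucket: fall back to the
                # highest finite bound.
                return prev_bound
            if count == prev_count:
                return bound  # empty bucket, nothing to interpolate
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets: (upper bound in seconds, workloads so far).
BUCKETS = [(1.0, 2), (5.0, 6), (10.0, 9), (math.inf, 10)]

print(estimate_quantile(0.5, BUCKETS))
print(estimate_quantile(0.95, BUCKETS))
```

The result is only as precise as the bucket boundaries allow, so treat it as an estimate of the wait time rather than an exact value.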

LocalQueue status (alpha)

The following metrics are available only if the LocalQueueMetrics feature gate is enabled. See the "Change the feature gates configuration" section of the Installation guide for details.

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| kueue_local_queue_pending_workloads | Gauge | The number of pending workloads, per LocalQueue and status. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; status: possible values are active or inadmissible |
| kueue_local_queue_quota_reserved_workloads_total | Counter | The total number of workloads that got a quota reservation in a LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| kueue_local_queue_quota_reserved_wait_time_seconds | Histogram | The time from when a workload was created or requeued until it got a quota reservation, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| kueue_local_queue_admitted_workloads_total | Counter | The total number of admitted workloads, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| kueue_local_queue_admission_checks_wait_time_seconds | Histogram | The time from when a workload got the quota reservation until admission, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| kueue_local_queue_admission_wait_time_seconds | Histogram | The time from when a workload was created or requeued until admission, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| kueue_local_queue_evicted_workloads_total | Counter | The total number of evicted workloads, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; reason: the reason the workload was evicted. Possible values are Preempted, PodsReadyTimeout, AdmissionCheck, ClusterQueueStopped or Deactivated |
| kueue_local_queue_reserving_active_workloads | Gauge | The number of Workloads that are reserving quota, per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| kueue_local_queue_admitted_active_workloads | Gauge | The number of admitted Workloads that are active (unsuspended and not finished), per LocalQueue. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in |
| kueue_local_queue_status | Gauge | Reports a LocalQueue’s active status (ability to schedule workloads). | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; active: one of True, False or Unknown. Exactly one of them is positive at any given time. |
| kueue_local_queue_resource_reservation | Gauge | Reports the LocalQueue’s total resource reservation within all the flavors. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; flavor: the name of the flavor from which resources are being consumed; resource: the resource that is being consumed |
| kueue_local_queue_resource_usage | Gauge | Reports the LocalQueue’s total resource usage within all the flavors. | name: the name of the LocalQueue; namespace: the namespace that the LocalQueue resides in; flavor: the name of the flavor from which resources are being consumed; resource: the resource that is being consumed |
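
The kueue_local_queue_status metric encodes a state as several series, one per value of the active label, of which exactly one is positive. The following minimal sketch collapses those series back into a single state per LocalQueue. It assumes the samples have already been parsed into (labels, value) pairs and that the positive series carries the value 1. The function name, queue names and namespaces are hypothetical.

```python
def local_queue_states(samples):
    """Map (namespace, name) to the value of the active label that is set.

    samples is an iterable of (labels, value) pairs for
    kueue_local_queue_status, where labels is a dict.
    """
    states = {}
    for labels, value in samples:
        if value == 1:  # assumption: the positive series is reported as 1
            states[(labels["namespace"], labels["name"])] = labels["active"]
    return states


# Hypothetical samples for two LocalQueues.
STATUS_SAMPLES = [
    ({"name": "team-a", "namespace": "research", "active": "True"}, 1),
    ({"name": "team-a", "namespace": "research", "active": "False"}, 0),
    ({"name": "team-a", "namespace": "research", "active": "Unknown"}, 0),
    ({"name": "nightly", "namespace": "batch", "active": "True"}, 0),
    ({"name": "nightly", "namespace": "batch", "active": "False"}, 1),
    ({"name": "nightly", "namespace": "batch", "active": "Unknown"}, 0),
]

print(local_queue_states(STATUS_SAMPLES))
```

The same approach should carry over to kueue_cluster_queue_status, which uses a status label in the same one-of-many fashion.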

Cohort status

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| kueue_cohort_weighted_share | Gauge | Reports a value that represents the maximum of the ratios of usage above nominal quota to the lendable resources in the Cohort, among all the resources provided by the Cohort, divided by the weight. A value of zero means that the usage of the Cohort is below the nominal quota. If the Cohort has a weight of zero, the metric reports 9223372036854775807, the maximum possible share value. | cohort: the name of the Cohort |
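
To make the definition of the weighted share concrete, the sketch below follows the description literally: for each resource it takes the usage above nominal quota, divides it by the lendable amount, keeps the largest ratio and divides that by the weight. This is an illustration of the wording only. Kueue's actual implementation may scale, round or order the checks differently, and the function name, parameter names and numbers are hypothetical.

```python
# Value the description gives for a weight of zero.
MAX_SHARE = 9223372036854775807


def weighted_share(usage, nominal, lendable, weight):
    """Compute a weighted share from per-resource dicts, per the description.

    usage, nominal and lendable map resource names to quantities expressed
    in the same unit per resource.
    """
    if weight == 0:
        return MAX_SHARE
    worst_ratio = 0.0
    for resource, used in usage.items():
        above_nominal = max(0.0, used - nominal.get(resource, 0.0))
        lend = lendable.get(resource, 0.0)
        if above_nominal > 0 and lend > 0:
            worst_ratio = max(worst_ratio, above_nominal / lend)
    return worst_ratio / weight


# Hypothetical quantities: CPU is 2 above nominal, memory is below nominal.
share = weighted_share(
    usage={"cpu": 12, "memory": 8},
    nominal={"cpu": 10, "memory": 16},
    lendable={"cpu": 40, "memory": 64},
    weight=2,
)
print(share)
```

In this example only CPU exceeds its nominal quota, so it alone determines the result. With all usage at or below nominal quota the function returns zero, matching the description.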

Optional metrics

The following metrics are available only if metrics.enableClusterQueueResources is enabled in the manager’s configuration.

| Metric name | Type | Description | Labels |
| --- | --- | --- | --- |
| kueue_cluster_queue_resource_reservation | Gauge | Reports the ClusterQueue’s total resource reservation within all the flavors. | cohort: the cohort to which the queue belongs; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
| kueue_cluster_queue_resource_usage | Gauge | Reports the ClusterQueue’s total resource usage. | cohort: the cohort to which the queue belongs; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
| kueue_cluster_queue_nominal_quota | Gauge | Reports the ClusterQueue’s resource quota. | cohort: the cohort to which the queue belongs; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
| kueue_cluster_queue_borrowing_limit | Gauge | Reports the ClusterQueue’s resource borrowing limit. | cohort: the cohort to which the queue belongs; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
| kueue_cluster_queue_lending_limit | Gauge | Reports the ClusterQueue’s resource lending limit within all the flavors. | cohort: the cohort to which the queue belongs; cluster_queue: the name of the ClusterQueue; flavor: the referenced flavor; resource: the resource name |
| kueue_cluster_queue_weighted_share | Gauge | Reports a value that represents the maximum of the ratios of usage above nominal quota to the lendable resources in the cohort, among all the resources provided by the ClusterQueue. | cluster_queue: the name of the ClusterQueue |
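
Because the resource metrics share the cluster_queue, flavor and resource labels, they can be joined on those labels. The sketch below divides kueue_cluster_queue_resource_usage by kueue_cluster_queue_nominal_quota to get a utilization ratio per ClusterQueue, flavor and resource. It assumes both gauges are expressed in the same unit and that the samples have already been parsed into (labels, value) pairs. The function name, queue names and values are hypothetical.

```python
def quota_utilization(usage_samples, quota_samples):
    """Return usage / nominal quota keyed by (cluster_queue, flavor, resource).

    Both arguments are iterables of (labels, value) pairs, where labels is a
    dict. Series without a matching, non-zero quota are skipped.
    """
    def key(labels):
        return (labels["cluster_queue"], labels["flavor"], labels["resource"])

    quota = {key(labels): value for labels, value in quota_samples}
    ratios = {}
    for labels, used in usage_samples:
        nominal = quota.get(key(labels))
        if nominal:  # skip missing or zero quota to avoid dividing by zero
            ratios[key(labels)] = used / nominal
    return ratios


# Hypothetical samples for two ClusterQueues sharing one flavor.
USAGE = [
    ({"cluster_queue": "cq-a", "flavor": "default", "resource": "cpu"}, 6),
    ({"cluster_queue": "cq-b", "flavor": "default", "resource": "cpu"}, 12),
]
QUOTA = [
    ({"cluster_queue": "cq-a", "flavor": "default", "resource": "cpu"}, 8),
    ({"cluster_queue": "cq-b", "flavor": "default", "resource": "cpu"}, 8),
]

print(quota_utilization(USAGE, QUOTA))
```

A ratio above 1 would indicate usage beyond the nominal quota, for example through borrowing within a cohort.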