GreenThread Docs

GreenThread integrates with the kube-prometheus-stack for metrics, dashboards, and alerting. When monitoring is enabled, GreenThread automatically deploys ServiceMonitors, PodMonitors, and Grafana dashboards.

Install order

Install the monitoring stack before installing GreenThread, or before running helm upgrade with monitoring.enabled=true, so that GreenThread can detect the Prometheus Operator CRDs at install time.

Install kube-prometheus-stack

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install kube-prom prometheus-community/kube-prometheus-stack \
  --version 82.5.0 \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false

Cross-namespace scraping

Both NilUsesHelmValues=false flags are required. Without them, Prometheus selects only ServiceMonitors and PodMonitors that carry the kube-prometheus-stack Helm release label, so it ignores the monitors that GreenThread creates in the greenthread-system namespace.
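
The same two settings can also be kept in a values file rather than passed as --set flags, which is easier to review and keep in version control. A minimal sketch: the file name values-monitoring.yaml is an assumption, and the keys are the ones used in the install command above.

```yaml
# values-monitoring.yaml (hypothetical file name)
# Equivalent to the two --set flags in the helm install command above.
prometheus:
  prometheusSpec:
    # false: do not restrict ServiceMonitor selection to the Helm release label
    serviceMonitorSelectorNilUsesHelmValues: false
    # false: do not restrict PodMonitor selection to the Helm release label
    podMonitorSelectorNilUsesHelmValues: false
```

Pass the file with -f values-monitoring.yaml in place of the two --set lines.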

Verify the stack

kubectl get pods -n monitoring

All pods should reach Running within a few minutes. The stack includes Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter.

Enable monitoring in GreenThread

If you did not enable monitoring during the initial install, upgrade your GreenThread release:

helm upgrade greenthread \
  oci://licence.greenthread.ai/greenthread/charts/greenthread \
  --namespace greenthread-system \
  --reuse-values \
  --set monitoring.enabled=true

This deploys the following monitoring resources:

| Resource | Type | Description |
| --- | --- | --- |
| greenthread-sidecar | PodMonitor | Scrapes sidecar metrics from every model pod |
| greenthread-controller | ServiceMonitor | Scrapes controller metrics |
| greenthread-apiserver | ServiceMonitor | Scrapes API server metrics |
| greenthread-dcgm | PodMonitor | Scrapes DCGM GPU metrics |
| greenthread-vllm | PodMonitor | Scrapes vLLM inference engine metrics |
| GreenThread GPU Dashboard | ConfigMap (Grafana) | GPU utilization, memory, temperature |
| GreenThread Models Dashboard | ConfigMap (Grafana) | Model lifecycle, wake/sleep times, request rates |
| GreenThread System Dashboard | ConfigMap (Grafana) | Controller reconciliation, queue depth, errors |
| GreenThread vLLM Dashboard | ConfigMap (Grafana) | vLLM inference latency, throughput, KV cache |
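
Under Helm's usual mapping of dotted keys, the --set monitoring.enabled=true flag corresponds to the values fragment below. This is only a sketch for keeping the switch in a values file: the file name is an assumption, and any other keys the chart may accept under monitoring are not covered here.

```yaml
# greenthread-values.yaml (hypothetical file name)
# Same effect as --set monitoring.enabled=true
monitoring:
  enabled: true
```

Supply it to helm install or helm upgrade with -f greenthread-values.yaml.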

Access Grafana

kubectl port-forward svc/kube-prom-grafana 3000:80 -n monitoring

Open http://localhost:3000.

Retrieve the admin password:

# Username: admin
kubectl get secret -n monitoring kube-prom-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo

The GreenThread dashboards are automatically provisioned and appear under the GreenThread folder in Grafana.

Sidecar metrics

The sidecar exposes the following Prometheus metrics on each model pod:

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gthread_sidecar_state | Gauge | state | Current sidecar state (sleeping, pending, waking, serving, deactivating). Exactly one label is 1 at any time. |
| gthread_sidecar_queue_depth | Gauge | | Current request queue depth |
| gthread_sidecar_in_flight_requests | Gauge | | Number of in-flight requests being processed |
| gthread_sidecar_wake_duration_seconds | Histogram | | Time to wake a model from sleeping state |
| gthread_sidecar_sleep_duration_seconds | Histogram | | Time to sleep a model (drain + checkpoint) |
| gthread_sidecar_requests_total | Counter | status | Total requests processed (success/error) |
| gthread_sidecar_gpu_memory_reserved_bytes | Gauge | gpu | Reserved GPU memory in bytes per GPU index |
| gthread_sidecar_gpu_memory_serving_bytes | Gauge | gpu | Measured serving GPU memory in bytes per GPU index |
| gthread_sidecar_preemptions_total | Counter | role | Preemption events (preempted/preempting) |
| gthread_sidecar_preemption_barrier_timeouts_total | Counter | | Drain timeouts during preemption barrier |
| gthread_sidecar_wake_deduplications_total | Counter | | Times a wake request was coalesced with an in-flight wake |
| gthread_sidecar_gpu_cas_conflicts_total | Counter | | Optimistic CAS conflicts on GPU CRDs |

Example Prometheus queries

Wake latency (p99)

histogram_quantile(0.99, rate(gthread_sidecar_wake_duration_seconds_bucket[5m]))

Request rate per model

sum by (pod) (rate(gthread_sidecar_requests_total{status="success"}[5m]))

Models currently serving

count(gthread_sidecar_state{state="serving"} == 1)

GPU memory utilization

gthread_sidecar_gpu_memory_serving_bytes / on(gpu) gpu_memory_total_bytes

Queue depth across all models

sum(gthread_sidecar_queue_depth)
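
Queries that dashboards or alerts evaluate repeatedly can be precomputed with recording rules. The manifest below is a sketch that wraps three of the queries above. The resource name, group name, and recorded series names are hypothetical, and the release label assumes the kube-prom release name from the install step (with the chart's default rule selector, Prometheus loads only rules that carry that label).

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: greenthread-recording-rules   # hypothetical name
  namespace: monitoring
  labels:
    # Assumes the Helm release is named kube-prom, as in the install step.
    release: kube-prom
spec:
  groups:
    - name: greenthread-recording      # hypothetical group name
      rules:
        # p99 wake latency, same expression as the query above
        - record: gthread:sidecar_wake_duration_seconds:p99_5m
          expr: histogram_quantile(0.99, rate(gthread_sidecar_wake_duration_seconds_bucket[5m]))
        # Successful request rate per pod
        - record: gthread:sidecar_requests_success:rate5m
          expr: sum by (pod) (rate(gthread_sidecar_requests_total{status="success"}[5m]))
        # Total queue depth across all models
        - record: gthread:sidecar_queue_depth:sum
          expr: sum(gthread_sidecar_queue_depth)
```

Once applied with kubectl apply -f, the recorded series can be queried by name like any other metric.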

Alerting

You can create PrometheusRule resources to alert on GreenThread conditions. Example alerts:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
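metadata:
  name: greenthread-alerts
  namespace: monitoring
  labels:
    # Needed with the install command above: the chart's default rule selector
    # only loads PrometheusRules that carry the Helm release label.
    release: kube-prom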
spec:
  groups:
    - name: greenthread
      rules:
        - alert: ModelWakeLatencyHigh
          expr: histogram_quantile(0.99, rate(gthread_sidecar_wake_duration_seconds_bucket[5m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Model wake latency p99 exceeds 10 seconds"

        - alert: ModelStuckWaking
          expr: gthread_sidecar_state{state="waking"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Model stuck in waking state for over 5 minutes"

        - alert: HighPreemptionRate
          expr: rate(gthread_sidecar_preemptions_total[5m]) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High preemption rate — consider adding GPU capacity"
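
Other sidecar metrics can be alerted on in the same way. As one illustration, the rule below fires on a sustained error ratio per pod. The alert name, the 5% threshold, and the 10-minute window are assumptions to tune for your workload, and the status="error" value is inferred from the success/error values listed for gthread_sidecar_requests_total.

```yaml
        # Append under spec.groups[0].rules in the manifest above.
        # Threshold and window are illustrative assumptions.
        - alert: ModelErrorRatioHigh
          expr: |
            sum by (pod) (rate(gthread_sidecar_requests_total{status="error"}[5m]))
              /
            sum by (pod) (rate(gthread_sidecar_requests_total[5m]))
              > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "More than 5% of requests to a model pod are failing"
```

When a pod receives no traffic the ratio is undefined, so the comparison yields no result and the alert stays silent.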

Next steps