GreenThread integrates with the kube-prometheus-stack for metrics, dashboards, and alerting. When monitoring is enabled, GreenThread automatically deploys ServiceMonitors, PodMonitors, and Grafana dashboards.
Install the monitoring stack before installing GreenThread (or before running helm upgrade with monitoring.enabled=true) so that GreenThread can detect the Prometheus Operator CRDs at install or upgrade time.
Install kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prom prometheus-community/kube-prometheus-stack \
  --version 82.5.0 \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
The two NilUsesHelmValues=false flags are critical. Without them, Prometheus only scrapes ServiceMonitors and PodMonitors that carry the Helm release label, which means it will not pick up GreenThread's monitors from the greenthread-system namespace.
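One way to confirm the flags took effect is to inspect the selectors on the Prometheus resource that the chart created. The sketch below assumes a recent kubectl that prints an empty selector as {} in jsonpath output. The helper names selector_scope and show_selectors are illustrative and are not part of GreenThread or the chart.

```shell
# Classify a selector as printed by kubectl's jsonpath output.
# "{}" is an empty selector, which matches every monitor.
selector_scope() {
  case "$1" in
    '{}') echo "all" ;;
    '')   echo "unset" ;;
    *)    echo "restricted" ;;
  esac
}

# Print the scope of both monitor selectors on the first Prometheus
# resource in the monitoring namespace (assumed to be the one from the stack).
show_selectors() {
  for field in serviceMonitorSelector podMonitorSelector; do
    value=$(kubectl get prometheus -n monitoring \
      -o jsonpath="{.items[0].spec.${field}}")
    echo "${field}: $(selector_scope "$value")"
  done
}
```

Calling show_selectors against the cluster should print all for both fields when the flags are set. A result of restricted suggests Prometheus is still filtering monitors by label.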
Verify the stack
kubectl get pods -n monitoring
All pods should reach Running within a few minutes. The stack includes Prometheus, Alertmanager, Grafana, kube-state-metrics, and node-exporter.
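Because GreenThread looks for the Prometheus Operator CRDs, it can also help to confirm they are registered before enabling monitoring. The sketch below checks the two CRD kinds that GreenThread deploys (ServiceMonitor and PodMonitor); which CRDs GreenThread actually probes for is an assumption, and missing_crds is a hypothetical helper name.

```shell
# Print the name of each expected Prometheus Operator CRD that is
# not registered in the cluster. Prints nothing when all are present.
missing_crds() {
  for crd in servicemonitors podmonitors; do
    if ! kubectl get crd "${crd}.monitoring.coreos.com" >/dev/null 2>&1; then
      echo "${crd}.monitoring.coreos.com"
    fi
  done
}
```

An empty result from missing_crds means both CRDs exist. Any printed name points to a CRD that the stack has not installed yet.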
Enable monitoring in GreenThread
If you haven't already enabled monitoring during the initial install, upgrade your GreenThread release:
helm upgrade greenthread \
  oci://licence.greenthread.ai/greenthread/charts/greenthread \
  --namespace greenthread-system \
  --reuse-values \
  --set monitoring.enabled=true
This deploys the following monitoring resources:
| Resource | Type | Description |
|---|---|---|
| greenthread-sidecar | PodMonitor | Scrapes sidecar metrics from every model pod |
| greenthread-controller | ServiceMonitor | Scrapes controller metrics |
| greenthread-apiserver | ServiceMonitor | Scrapes API server metrics |
| greenthread-dcgm | PodMonitor | Scrapes DCGM GPU metrics |
| greenthread-vllm | PodMonitor | Scrapes vLLM inference engine metrics |
| GreenThread GPU Dashboard | ConfigMap (Grafana) | GPU utilization, memory, temperature |
| GreenThread Models Dashboard | ConfigMap (Grafana) | Model lifecycle, wake/sleep times, request rates |
| GreenThread System Dashboard | ConfigMap (Grafana) | Controller reconciliation, queue depth, errors |
| GreenThread vLLM Dashboard | ConfigMap (Grafana) | vLLM inference latency, throughput, KV cache |
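To check that the monitors listed in the table exist after the upgrade, a loop over the expected names is usually enough. This sketch assumes the Resource column holds the Kubernetes object names and that the monitors live in the greenthread-system namespace; check_monitors is a hypothetical helper name.

```shell
# Report whether each expected monitor exists in greenthread-system.
# Returns non-zero if any of them is missing.
check_monitors() {
  status=0
  while read -r kind name; do
    # </dev/null keeps kubectl from consuming the list being read.
    if kubectl get "$kind" "$name" -n greenthread-system >/dev/null 2>&1 </dev/null; then
      echo "ok $kind/$name"
    else
      echo "missing $kind/$name"
      status=1
    fi
  done <<'EOF'
podmonitor greenthread-sidecar
servicemonitor greenthread-controller
servicemonitor greenthread-apiserver
podmonitor greenthread-dcgm
podmonitor greenthread-vllm
EOF
  return "$status"
}
```

If a monitor is reported as missing, confirming that the Prometheus Operator CRDs were present when the upgrade ran is a reasonable first step.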
Access Grafana
kubectl port-forward svc/kube-prom-grafana 3000:80 -n monitoring
Open http://localhost:3000.
Retrieve the admin password:
# Username: admin
kubectl get secret -n monitoring kube-prom-grafana \
  -o jsonpath='{.data.admin-password}' | base64 -d; echo
The GreenThread dashboards are automatically provisioned and appear under the GreenThread folder in Grafana.
Sidecar metrics
The sidecar exposes the following Prometheus metrics on each model pod:
| Metric | Type | Labels | Description |
|---|---|---|---|
| gthread_sidecar_state | Gauge | state | Current sidecar state (sleeping, pending, waking, serving, deactivating). Exactly one state value reports 1 at any time. |
| gthread_sidecar_queue_depth | Gauge | — | Current request queue depth |
| gthread_sidecar_in_flight_requests | Gauge | — | Number of in-flight requests being processed |
| gthread_sidecar_wake_duration_seconds | Histogram | — | Time to wake a model from sleeping state |
| gthread_sidecar_sleep_duration_seconds | Histogram | — | Time to sleep a model (drain + checkpoint) |
| gthread_sidecar_requests_total | Counter | status | Total requests processed (success/error) |
| gthread_sidecar_gpu_memory_reserved_bytes | Gauge | gpu | Reserved GPU memory in bytes per GPU index |
| gthread_sidecar_gpu_memory_serving_bytes | Gauge | gpu | Measured serving GPU memory in bytes per GPU index |
| gthread_sidecar_preemptions_total | Counter | role | Preemption events (preempted/preempting) |
| gthread_sidecar_preemption_barrier_timeouts_total | Counter | — | Drain timeouts during preemption barrier |
| gthread_sidecar_wake_deduplications_total | Counter | — | Times a wake request was coalesced with an in-flight wake |
| gthread_sidecar_gpu_cas_conflicts_total | Counter | — | Optimistic CAS conflicts on GPU CRDs |
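When debugging a single pod, the state gauge can be read straight from the sidecar's metrics output in Prometheus text format. The sketch below extracts the state whose sample value is 1. The helper name active_state is hypothetical, and how you obtain the metrics text (port and path of the sidecar endpoint) is not specified here.

```shell
# Read Prometheus text-format metrics on stdin and print the label value
# of every gthread_sidecar_state sample whose value is 1.
active_state() {
  awk '
    /^gthread_sidecar_state\{/ && $NF == 1 {
      # Pull the value out of state="..." (7 chars of prefix, 1 closing quote).
      if (match($0, /state="[^"]*"/)) {
        print substr($0, RSTART + 7, RLENGTH - 8)
      }
    }'
}
```

Piping a saved scrape into active_state should print a single state name. More than one line of output, or none, would contradict the one-active-state behavior described in the table.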
Example Prometheus queries
Wake latency (p99)
histogram_quantile(0.99, rate(gthread_sidecar_wake_duration_seconds_bucket[5m]))
Request rate per model
sum by (pod) (rate(gthread_sidecar_requests_total{status="success"}[5m]))
Models currently serving
count(gthread_sidecar_state{state="serving"} == 1)
GPU memory utilization
gthread_sidecar_gpu_memory_serving_bytes / on(gpu) gpu_memory_total_bytes
Queue depth across all models
sum(gthread_sidecar_queue_depth)
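These queries can also be run from a terminal through the Prometheus HTTP API, which is handy for quick checks without opening Grafana. The sketch below is a thin curl wrapper; prom_query and the PROM_URL variable are illustrative names, and the default URL assumes a local port-forward to Prometheus.

```shell
# In another terminal, forward the Prometheus API first, for example:
#   kubectl port-forward svc/prometheus-operated 9090:9090 -n monitoring

# Run an instant PromQL query and print the raw JSON response.
# PROM_URL (hypothetical variable) overrides the default local address.
prom_query() {
  curl -fsS --get "${PROM_URL:-http://localhost:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}
```

For example, prom_query 'sum(gthread_sidecar_queue_depth)' returns a JSON document whose data.result field holds the current value. Using --data-urlencode avoids hand-escaping braces and quotes in label matchers.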
Alerting
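You can create PrometheusRule resources to alert on GreenThread conditions. The install command above leaves the stack's default rule selector in place, so Prometheus loads only PrometheusRule objects labeled release: kube-prom unless you also set prometheus.prometheusSpec.ruleSelectorNilUsesHelmValues=false. Example alerts: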
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: greenthread-alerts
  namespace: monitoring
  labels:
    release: kube-prom  # matches the stack's default ruleSelector
spec:
  groups:
    - name: greenthread
      rules:
        - alert: ModelWakeLatencyHigh
          expr: histogram_quantile(0.99, rate(gthread_sidecar_wake_duration_seconds_bucket[5m])) > 10
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Model wake latency p99 exceeds 10 seconds"
        - alert: ModelStuckWaking
          expr: gthread_sidecar_state{state="waking"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Model stuck in waking state for over 5 minutes"
        - alert: HighPreemptionRate
          expr: rate(gthread_sidecar_preemptions_total[5m]) > 0.1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High preemption rate: consider adding GPU capacity"
Next steps
- Metrics & Usage — JSON metrics APIs and usage tracking
- Model States & Lifecycle — Understanding model state transitions
- Fairness Policy — GPU scheduling and preemption
