GreenThread exposes Prometheus metrics from every model pod's sidecar. When monitoring is enabled, these metrics are automatically scraped by the kube-prometheus-stack and visualized in pre-built Grafana dashboards.
For installation of the kube-prometheus-stack, Grafana access, and alerting configuration, see Monitoring.
Sidecar metrics
Each model pod's sidecar exposes the following Prometheus metrics:
State and lifecycle
| Metric | Type | Labels | Description |
|---|---|---|---|
| gthread_sidecar_state | Gauge | state | Current sidecar state. Exactly one label value is 1 at any time: sleeping, pending, waking, serving, deactivating. |
| gthread_sidecar_wake_duration_seconds | Histogram | - | Time to wake a model from sleeping to serving |
| gthread_sidecar_sleep_duration_seconds | Histogram | - | Time to sleep a model (drain + GPU release) |
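The one-hot encoding of gthread_sidecar_state can be checked directly against a scrape. The following is a minimal sketch, assuming the sidecar serves standard Prometheus text exposition and that state is the only label on the series; the sample payload and the helper name active_state are illustrative, not part of GreenThread.

```python
import re

# Hypothetical scrape output; a real sidecar may attach additional labels.
SAMPLE = """\
# TYPE gthread_sidecar_state gauge
gthread_sidecar_state{state="sleeping"} 0
gthread_sidecar_state{state="pending"} 0
gthread_sidecar_state{state="waking"} 0
gthread_sidecar_state{state="serving"} 1
gthread_sidecar_state{state="deactivating"} 0
"""

# Matches one sample line and captures the state label value and the gauge value.
STATE_LINE = re.compile(r'^gthread_sidecar_state\{state="([^"]+)"\}\s+(\S+)$')


def active_state(exposition: str) -> str:
    """Return the single state whose gauge is 1; raise ValueError otherwise."""
    active = []
    for line in exposition.splitlines():
        match = STATE_LINE.match(line.strip())
        if match and float(match.group(2)) == 1.0:
            active.append(match.group(1))
    if len(active) != 1:
        raise ValueError(f"expected exactly one active state, found {active}")
    return active[0]


print(active_state(SAMPLE))
```

A check like this is mainly useful in smoke tests, where a scrape showing zero or several active states would point to a bug or a partial scrape.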
Request handling
| Metric | Type | Labels | Description |
|---|---|---|---|
| gthread_sidecar_requests_total | Counter | status | Total requests processed. status values: success, error. |
| gthread_sidecar_queue_depth | Gauge | - | Current request queue depth |
| gthread_sidecar_in_flight_requests | Gauge | - | Number of requests currently being processed |
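Because gthread_sidecar_requests_total is a counter split by status, an error ratio comes from the increase between two scrapes rather than from the raw values. The sketch below illustrates that arithmetic under the assumption that a decreasing value means the counter was reset (for example, after a pod restart); the function names and sample numbers are hypothetical.

```python
def counter_increase(prev: float, curr: float) -> float:
    """Increase between two samples; a drop is treated as a counter reset."""
    return curr if curr < prev else curr - prev


def error_ratio(prev: dict[str, float], curr: dict[str, float]) -> float:
    """Fraction of requests that failed between two scrapes.

    Both arguments map a status label value ("success" or "error")
    to the counter value observed at that scrape.
    """
    errors = counter_increase(prev.get("error", 0.0), curr.get("error", 0.0))
    successes = counter_increase(prev.get("success", 0.0), curr.get("success", 0.0))
    total = errors + successes
    return errors / total if total else 0.0


# 90 new successes and 10 new errors between the two scrapes.
print(error_ratio({"success": 100, "error": 5}, {"success": 190, "error": 15}))
```

In PromQL the same idea is usually expressed with rate() on both series, which applies comparable reset handling.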
GPU memory
| Metric | Type | Labels | Description |
|---|---|---|---|
| gthread_sidecar_gpu_memory_reserved_bytes | Gauge | gpu | Reserved GPU memory in bytes per GPU index |
| gthread_sidecar_gpu_memory_serving_bytes | Gauge | gpu | Measured serving GPU memory in bytes per GPU index |
Preemption and scheduling
| Metric | Type | Labels | Description |
|---|---|---|---|
| gthread_sidecar_preemptions_total | Counter | role | Preemption events. role values: preempted (this model was evicted), preemptor (this model evicted another). |
| gthread_sidecar_preemption_barrier_timeouts_total | Counter | - | Drain timeouts during preemption barrier |
| gthread_sidecar_wake_deduplications_total | Counter | - | Times a wake request was coalesced with an in-flight wake |
| gthread_sidecar_gpu_cas_conflicts_total | Counter | - | Optimistic CAS conflicts on GPU CRD updates |
Scrape configuration
When monitoring.enabled=true in the Helm chart, GreenThread automatically deploys:
| Resource | Type | Description |
|---|---|---|
| greenthread-sidecar | PodMonitor | Scrapes sidecar metrics from every model pod |
| greenthread-controller | ServiceMonitor | Scrapes controller metrics |
| greenthread-apiserver | ServiceMonitor | Scrapes API server metrics |
| greenthread-dcgm | PodMonitor | Scrapes DCGM GPU metrics |
| greenthread-vllm | PodMonitor | Scrapes vLLM inference engine metrics |
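No manual Prometheus scrape configuration is needed: the kube-prometheus-stack picks up the monitors automatically, provided it was installed with the selector settings below.
The kube-prometheus-stack must be installed with prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false and prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false so that Prometheus discovers the monitors in the greenthread-system namespace rather than only those labeled by its own Helm release. See Monitoring for installation details.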
Grafana dashboards
GreenThread deploys four dashboards as ConfigMaps that are automatically provisioned into Grafana:
| Dashboard | Description |
|---|---|
| GreenThread GPU Dashboard | GPU utilization, memory, temperature across all nodes |
| GreenThread Models Dashboard | Model lifecycle, wake/sleep times, request rates per model |
| GreenThread System Dashboard | Controller reconciliation, queue depth, errors |
| GreenThread vLLM Dashboard | vLLM inference latency, throughput, KV cache utilization |
Dashboards appear under the GreenThread folder in Grafana.
Example queries
Wake latency (p99)
histogram_quantile(0.99, rate(gthread_sidecar_wake_duration_seconds_bucket[5m]))
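To make the p99 figure easier to interpret, here is a rough Python sketch of the linear interpolation that histogram_quantile performs over cumulative buckets. It is a simplified model rather than Prometheus's implementation, and the bucket boundaries and counts are made-up values, since the actual boundaries of gthread_sidecar_wake_duration_seconds are not listed here.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    The bucket list must include the +Inf bucket (math.inf), as Prometheus
    histograms always do. Values are assumed to be non-negative.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]  # number of observations at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical wake-duration buckets: (le in seconds, cumulative count).
wake_buckets = [(1.0, 10), (2.0, 60), (5.0, 95), (10.0, 99), (math.inf, 100)]
print(histogram_quantile(0.5, wake_buckets))
print(histogram_quantile(0.99, wake_buckets))
```

The takeaway is that the reported quantile can never be more precise than the bucket it falls into, so a p99 that lands in a wide bucket should be read as an estimate.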
Successful request rate per model pod
sum by (pod) (rate(gthread_sidecar_requests_total{status="success"}[5m]))
Models currently serving
count(gthread_sidecar_state{state="serving"} == 1)
Queue depth across all models
sum(gthread_sidecar_queue_depth)
Preemption rate
sum by (role) (rate(gthread_sidecar_preemptions_total[5m]))
GPU memory in use per model (bytes)
gthread_sidecar_gpu_memory_serving_bytes
Sleep/wake frequency
# Wakes per minute
sum(rate(gthread_sidecar_wake_duration_seconds_count[5m])) * 60
# Sleeps per minute
sum(rate(gthread_sidecar_sleep_duration_seconds_count[5m])) * 60
vLLM metrics
In addition to sidecar metrics, the vLLM inference engine exposes its own metrics (scraped by the greenthread-vllm PodMonitor). Key vLLM metrics include:
| Metric | Type | Description |
|---|---|---|
| vllm:request_success_total | Counter | Successful inference requests |
| vllm:request_duration_seconds | Histogram | End-to-end request duration |
| vllm:time_to_first_token_seconds | Histogram | Time to first token (TTFT) |
| vllm:time_per_output_token_seconds | Histogram | Inter-token latency |
| vllm:num_requests_running | Gauge | Currently running requests |
| vllm:num_requests_waiting | Gauge | Requests waiting in queue |
| vllm:gpu_cache_usage_perc | Gauge | KV cache utilization (fraction; 1 = 100%) |
| vllm:prompt_tokens_total | Counter | Total prompt tokens processed |
| vllm:generation_tokens_total | Counter | Total completion tokens generated |
DCGM metrics
GPU hardware metrics are exposed via DCGM (Data Center GPU Manager), scraped by the greenthread-dcgm PodMonitor:
| Metric | Description |
|---|---|
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization (%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MiB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MiB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (°C) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) |
GPU CRD as live state
Beyond Prometheus metrics, the GPU custom resources provide a live view of the scheduling state:
kubectl get gpu -A
Each GPU CRD's status includes:
| Field | Description |
|---|---|
| stats.usedMemoryBytes | Actual measured GPU memory in use (NVML) |
| stats.freeMemoryBytes | Free GPU memory (NVML) |
| stats.utilizationPercent | GPU compute utilization |
| stats.temperatureCelsius | GPU temperature |
| stats.powerWatts | Current power draw |
| occupants | List of models loaded on this GPU with their state and reserved memory |
This provides real-time scheduling visibility without needing Prometheus:
# Quick overview of GPU occupants
kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {range .status.occupants[*]}{.modelName}({.state}) {end}{"\n"}{end}'
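For scripting, the same overview can be built from the JSON form of the list. The sketch below is illustrative only: it assumes `kubectl get gpu -A -o json` returns a standard Kubernetes list whose items carry status.occupants entries with the modelName and state fields used in the jsonpath query above. The sample document, the GPU names, and the helper name summarize_occupants are hypothetical.

```python
import json

# Hypothetical output of `kubectl get gpu -A -o json`, trimmed to the fields used here.
SAMPLE_JSON = """
{"items": [
  {"metadata": {"name": "node-a-gpu-0"},
   "status": {"occupants": [
     {"modelName": "llama", "state": "serving"},
     {"modelName": "mistral", "state": "sleeping"}]}},
  {"metadata": {"name": "node-a-gpu-1"}, "status": {}}
]}
"""


def summarize_occupants(gpu_list_json: str) -> dict[str, list[str]]:
    """Map each GPU resource name to its occupants, formatted as model(state)."""
    summary = {}
    for item in json.loads(gpu_list_json).get("items", []):
        # A GPU with no loaded models may omit the occupants field entirely.
        occupants = item.get("status", {}).get("occupants") or []
        summary[item["metadata"]["name"]] = [
            f'{o["modelName"]}({o["state"]})' for o in occupants
        ]
    return summary


for gpu, models in summarize_occupants(SAMPLE_JSON).items():
    print(f"{gpu}: {' '.join(models) or '(empty)'}")
```

Against a live cluster, the input string would come from running the kubectl command (for example through subprocess.run with capture_output=True) instead of the embedded sample.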