GreenThread Docs

GreenThread exposes Prometheus metrics from every model pod's sidecar. When monitoring is enabled, these metrics are automatically scraped by the kube-prometheus-stack and visualized in pre-built Grafana dashboards.

Full monitoring setup

For installation of the kube-prometheus-stack, Grafana access, and alerting configuration, see Monitoring.

Sidecar metrics

Each model pod's sidecar exposes the following Prometheus metrics:

State and lifecycle

Metric	Type	Labels	Description
`gthread_sidecar_state`	Gauge	`state`	Current sidecar state. Exactly one `state` series has the value `1` at any time. Possible values of `state`: `sleeping`, `pending`, `waking`, `serving`, `deactivating`.
`gthread_sidecar_wake_duration_seconds`	Histogram	—	Time to wake a model from sleeping to serving
`gthread_sidecar_sleep_duration_seconds`	Histogram	—	Time to sleep a model (drain + GPU release)
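
Because exactly one `state` series is `1` at a time, the current state can be read straight from a scrape. The following is a minimal sketch, assuming the sidecar serves the standard Prometheus text exposition format; the `SAMPLE` payload and the `current_state` helper are hypothetical and only illustrate the idea.

```python
import re

# Matches lines such as: gthread_sidecar_state{state="serving"} 1
_STATE_LINE = re.compile(
    r'^gthread_sidecar_state\{[^}]*\bstate="([^"]+)"[^}]*\}\s+(\S+)'
)


def current_state(metrics_text: str) -> str:
    """Return the single state whose gauge value is 1."""
    active = []
    for line in metrics_text.splitlines():
        match = _STATE_LINE.match(line.strip())
        if match and float(match.group(2)) == 1.0:
            active.append(match.group(1))
    if len(active) != 1:
        raise ValueError(f"expected exactly one active state, found {active}")
    return active[0]


# Hypothetical scrape output, written by hand for illustration.
SAMPLE = """\
# TYPE gthread_sidecar_state gauge
gthread_sidecar_state{state="sleeping"} 0
gthread_sidecar_state{state="pending"} 0
gthread_sidecar_state{state="waking"} 0
gthread_sidecar_state{state="serving"} 1
gthread_sidecar_state{state="deactivating"} 0
"""

print(current_state(SAMPLE))
```

Raising an error when zero or several states are active makes a broken scrape visible instead of silently picking one.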

Request handling

Metric	Type	Labels	Description
`gthread_sidecar_requests_total`	Counter	`status`	Total requests processed. Values of `status`: `success`, `error`.
`gthread_sidecar_queue_depth`	Gauge	—	Current request queue depth
`gthread_sidecar_in_flight_requests`	Gauge	—	Number of requests currently being processed

GPU memory

Metric	Type	Labels	Description
`gthread_sidecar_gpu_memory_reserved_bytes`	Gauge	`gpu`	Reserved GPU memory in bytes per GPU index
`gthread_sidecar_gpu_memory_serving_bytes`	Gauge	`gpu`	Measured serving GPU memory in bytes per GPU index

Preemption and scheduling

Metric	Type	Labels	Description
`gthread_sidecar_preemptions_total`	Counter	`role`	Preemption events. Values of `role`: `preempted` (this model was evicted), `preemptor` (this model evicted another).
`gthread_sidecar_preemption_barrier_timeouts_total`	Counter	—	Drain timeouts during preemption barrier
`gthread_sidecar_wake_deduplications_total`	Counter	—	Times a wake request was coalesced with an in-flight wake
`gthread_sidecar_gpu_cas_conflicts_total`	Counter	—	Optimistic CAS conflicts on GPU CRD updates

Scrape configuration

When `monitoring.enabled=true` is set in the Helm chart, GreenThread automatically deploys:

Resource	Type	Description
`greenthread-sidecar`	PodMonitor	Scrapes sidecar metrics from every model pod
`greenthread-controller`	ServiceMonitor	Scrapes controller metrics
`greenthread-apiserver`	ServiceMonitor	Scrapes API server metrics
`greenthread-dcgm`	PodMonitor	Scrapes DCGM GPU metrics
`greenthread-vllm`	PodMonitor	Scrapes vLLM inference engine metrics

No manual Prometheus configuration is needed — the monitors are picked up automatically by the kube-prometheus-stack.

Cross-namespace scraping

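The kube-prometheus-stack must be installed with `serviceMonitorSelectorNilUsesHelmValues=false` and `podMonitorSelectorNilUsesHelmValues=false` (both under `prometheus.prometheusSpec` in the chart values). With these set, Prometheus also selects monitors that do not carry the stack's own Helm release label, including the ones in the greenthread-system namespace. See Monitoring for installation details.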

Grafana dashboards

GreenThread deploys four dashboards as ConfigMaps that are automatically provisioned into Grafana:

Dashboard	Description
GreenThread GPU Dashboard	GPU utilization, memory, temperature across all nodes
GreenThread Models Dashboard	Model lifecycle, wake/sleep times, request rates per model
GreenThread System Dashboard	Controller reconciliation, queue depth, errors
GreenThread vLLM Dashboard	vLLM inference latency, throughput, KV cache utilization

Dashboards appear under the GreenThread folder in Grafana.

Example queries

Wake latency (p99)

histogram_quantile(0.99, rate(gthread_sidecar_wake_duration_seconds_bucket[5m]))
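
To make the p99 query easier to interpret, here is a rough sketch of the linear interpolation that `histogram_quantile` applies to cumulative buckets. The bucket bounds and counts in `BUCKETS` are invented for illustration and are not GreenThread's actual bucket layout.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket,
                # so the highest finite upper bound is returned.
                return buckets[-2][0]
            in_bucket = count - prev_count
            # Interpolate linearly inside the bucket that contains the rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Hypothetical cumulative wake-duration buckets: (upper bound in seconds, count).
BUCKETS = [(1.0, 10), (2.0, 50), (5.0, 90), (10.0, 99), (math.inf, 100)]

print(histogram_quantile(0.5, BUCKETS))
print(histogram_quantile(0.99, BUCKETS))
```

The estimate can never be finer than the bucket boundaries, so a p99 that sits exactly on a bucket's upper bound often means the true value lies somewhere inside or beyond that bucket.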

Request rate per model

sum by (pod) (rate(gthread_sidecar_requests_total{status="success"}[5m]))

Models currently serving

count(gthread_sidecar_state{state="serving"} == 1)

Queue depth across all models

sum(gthread_sidecar_queue_depth)

Preemption rate

sum by (role) (rate(gthread_sidecar_preemptions_total[5m]))

GPU memory usage per model

gthread_sidecar_gpu_memory_serving_bytes

Sleep/wake frequency

# Wakes per minute
sum(rate(gthread_sidecar_wake_duration_seconds_count[5m])) * 60

# Sleeps per minute
sum(rate(gthread_sidecar_sleep_duration_seconds_count[5m])) * 60
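
The queries in this section can also be run outside Grafana through the Prometheus HTTP API (`/api/v1/query`). The sketch below uses only the Python standard library. It assumes Prometheus is reachable at `http://localhost:9090`, for example through a port-forward; that address, the helper names, and the `SAMPLE_RESPONSE` payload are illustrative assumptions.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here, e.g. through a port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_vector(payload):
    """Convert an instant-vector response into {sorted label tuple: float}."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    return {
        tuple(sorted(sample["metric"].items())): float(sample["value"][1])
        for sample in data["result"]
    }


def instant_query(promql, base_url=PROMETHEUS_URL):
    """Run one instant query and return the parsed samples."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_vector(json.load(response))


# Hand-written response in the shape the API returns, for illustration only.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"role": "preempted"}, "value": [1700000000, "0.5"]},
            {"metric": {"role": "preemptor"}, "value": [1700000000, "0.25"]},
        ],
    },
}

print(parse_vector(SAMPLE_RESPONSE))
```

With a reachable Prometheus, `instant_query('sum(gthread_sidecar_queue_depth)')` would return a single sample keyed by an empty label tuple.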

vLLM metrics

In addition to sidecar metrics, the vLLM inference engine exposes its own metrics (scraped by the greenthread-vllm PodMonitor). Key vLLM metrics include:

Metric	Type	Description
`vllm:request_success_total`	Counter	Successful inference requests
`vllm:request_duration_seconds`	Histogram	End-to-end request duration
`vllm:time_to_first_token_seconds`	Histogram	Time to first token (TTFT)
`vllm:time_per_output_token_seconds`	Histogram	Inter-token latency
`vllm:num_requests_running`	Gauge	Currently running requests
`vllm:num_requests_waiting`	Gauge	Requests waiting in queue
`vllm:gpu_cache_usage_perc`	Gauge	KV cache utilization as a fraction (1 = 100%)
`vllm:prompt_tokens_total`	Counter	Total prompt tokens processed
`vllm:generation_tokens_total`	Counter	Total completion tokens generated

DCGM metrics

GPU hardware metrics are exposed via DCGM (Data Center GPU Manager), scraped by the greenthread-dcgm PodMonitor:

Metric	Description
`DCGM_FI_DEV_GPU_UTIL`	GPU compute utilization (%)
`DCGM_FI_DEV_MEM_COPY_UTIL`	Memory copy utilization (%)
`DCGM_FI_DEV_FB_USED`	Framebuffer memory used (MiB)
`DCGM_FI_DEV_FB_FREE`	Framebuffer memory free (MiB)
`DCGM_FI_DEV_GPU_TEMP`	GPU temperature (°C)
`DCGM_FI_DEV_POWER_USAGE`	Power draw (W)

GPU CRD as live state

Beyond Prometheus metrics, the GPU custom resources provide a live view of the scheduling state:

kubectl get gpu -A

Each GPU CRD's status includes:

Field	Description
`stats.usedMemoryBytes`	Actual measured GPU memory in use (NVML)
`stats.freeMemoryBytes`	Free GPU memory (NVML)
`stats.utilizationPercent`	GPU compute utilization
`stats.temperatureCelsius`	GPU temperature
`stats.powerWatts`	Current power draw
`occupants`	List of models loaded on this GPU with their state and reserved memory

This provides real-time scheduling visibility without needing Prometheus:

# Quick overview of GPU occupants
kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {range .status.occupants[*]}{.modelName}({.state}) {end}{"\n"}{end}'
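
For scripting, the same overview can be built from `kubectl get gpu -A -o json`. The sketch below is one possible approach: it relies only on the `modelName` and `state` fields used in the jsonpath expression above, while the GPU and model names in `SAMPLE` are made up.

```python
import json


def summarize_occupants(gpu_list_json: str) -> dict:
    """Map each GPU resource name to a list of 'model(state)' strings."""
    items = json.loads(gpu_list_json).get("items", [])
    summary = {}
    for item in items:
        # A GPU with nothing loaded may have no occupants field at all.
        occupants = item.get("status", {}).get("occupants") or []
        summary[item["metadata"]["name"]] = [
            f'{occupant.get("modelName")}({occupant.get("state")})'
            for occupant in occupants
        ]
    return summary


# Hypothetical, trimmed-down output of `kubectl get gpu -A -o json`.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "node-a-gpu-0"},
            "status": {
                "occupants": [
                    {"modelName": "model-one", "state": "serving"},
                    {"modelName": "model-two", "state": "sleeping"},
                ]
            },
        },
        {"metadata": {"name": "node-a-gpu-1"}, "status": {}},
    ]
})

for gpu_name, models in summarize_occupants(SAMPLE).items():
    print(f"{gpu_name}: {' '.join(models)}")
```

To use it against a live cluster, pass the kubectl output to `summarize_occupants` instead of `SAMPLE`, for instance by reading it from standard input.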