GreenThread Docs

GreenThread exposes Prometheus metrics from every model pod's sidecar. When monitoring is enabled, these metrics are automatically scraped by the kube-prometheus-stack and visualized in pre-built Grafana dashboards.

Full monitoring setup

For installation of the kube-prometheus-stack, Grafana access, and alerting configuration, see Monitoring.

Sidecar metrics

Each model pod's sidecar exposes the following Prometheus metrics:

State and lifecycle

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gthread_sidecar_state | Gauge | state | Current sidecar state. Exactly one label value is 1 at any time: sleeping, pending, waking, serving, deactivating. |
| gthread_sidecar_wake_duration_seconds | Histogram | | Time to wake a model from sleeping to serving |
| gthread_sidecar_sleep_duration_seconds | Histogram | | Time to sleep a model (drain + GPU release) |
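
Because exactly one state series carries the value 1, the current state can be read from a raw scrape by finding that series. The following is a minimal Python sketch, assuming the sidecar serves the standard Prometheus text exposition format. The sample payload and the helper name current_state are hypothetical, and real output may carry additional labels.

```python
import re

# Hypothetical scrape output; real sidecar output may include extra labels.
SAMPLE = """\
# TYPE gthread_sidecar_state gauge
gthread_sidecar_state{state="sleeping"} 0
gthread_sidecar_state{state="pending"} 0
gthread_sidecar_state{state="waking"} 0
gthread_sidecar_state{state="serving"} 1
gthread_sidecar_state{state="deactivating"} 0
"""

# Captures the state label value and the sample value of each state series.
_LINE = re.compile(r'^gthread_sidecar_state\{[^}]*state="([^"]+)"[^}]*\}\s+(\S+)')


def current_state(exposition: str) -> str:
    """Return the state whose gauge value is 1."""
    active = []
    for line in exposition.splitlines():
        match = _LINE.match(line)
        if match and float(match.group(2)) == 1.0:
            active.append(match.group(1))
    if len(active) != 1:
        # The metric contract says exactly one state is active at a time.
        raise ValueError(f"expected exactly one active state, got {active}")
    return active[0]


print(current_state(SAMPLE))
```

Raising on zero or multiple active states makes a broken scrape visible instead of silently reporting a wrong state.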

Request handling

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gthread_sidecar_requests_total | Counter | status | Total requests processed. The status label is either success or error. |
| gthread_sidecar_queue_depth | Gauge | | Current request queue depth |
| gthread_sidecar_in_flight_requests | Gauge | | Number of requests currently being processed |
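
The status label on gthread_sidecar_requests_total makes it possible to derive an error ratio from two successive readings of the counter. The sketch below illustrates the arithmetic in Python. The function name error_ratio and the sample values are hypothetical, and the reset handling is a simplification of what Prometheus does when it computes rates.

```python
def error_ratio(prev: dict, curr: dict) -> float:
    """Fraction of requests that failed between two counter readings.

    Both arguments map a status value ("success" or "error") to the
    counter value observed at that scrape.
    """
    deltas = {}
    for status in ("success", "error"):
        before = prev.get(status, 0.0)
        after = curr.get(status, 0.0)
        # Counters only increase; a lower value means the counter was reset
        # (for example after a restart), so count from zero.
        deltas[status] = after - before if after >= before else after
    total = deltas["success"] + deltas["error"]
    return deltas["error"] / total if total else 0.0


# Hypothetical readings taken one scrape interval apart.
print(error_ratio({"success": 100, "error": 5}, {"success": 190, "error": 15}))
```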

GPU memory

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gthread_sidecar_gpu_memory_reserved_bytes | Gauge | gpu | Reserved GPU memory in bytes per GPU index |
| gthread_sidecar_gpu_memory_serving_bytes | Gauge | gpu | Measured serving GPU memory in bytes per GPU index |

Preemption and scheduling

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| gthread_sidecar_preemptions_total | Counter | role | Preemption events. The role label is either preempted (this model was evicted) or preemptor (this model evicted another). |
| gthread_sidecar_preemption_barrier_timeouts_total | Counter | | Drain timeouts during preemption barrier |
| gthread_sidecar_wake_deduplications_total | Counter | | Times a wake request was coalesced with an in-flight wake |
| gthread_sidecar_gpu_cas_conflicts_total | Counter | | Optimistic CAS conflicts on GPU CRD updates |

Scrape configuration

When monitoring.enabled=true in the Helm chart, GreenThread automatically deploys:

| Resource | Type | Description |
| --- | --- | --- |
| greenthread-sidecar | PodMonitor | Scrapes sidecar metrics from every model pod |
| greenthread-controller | ServiceMonitor | Scrapes controller metrics |
| greenthread-apiserver | ServiceMonitor | Scrapes API server metrics |
| greenthread-dcgm | PodMonitor | Scrapes DCGM GPU metrics |
| greenthread-vllm | PodMonitor | Scrapes vLLM inference engine metrics |

No manual Prometheus configuration is needed: the kube-prometheus-stack picks up these monitors automatically, provided it is installed with the selector settings described under Cross-namespace scraping.

Cross-namespace scraping

The kube-prometheus-stack must be installed with serviceMonitorSelectorNilUsesHelmValues=false and podMonitorSelectorNilUsesHelmValues=false to scrape monitors from the greenthread-system namespace. See Monitoring for installation details.

Grafana dashboards

GreenThread deploys four dashboards as ConfigMaps that are automatically provisioned into Grafana:

| Dashboard | Description |
| --- | --- |
| GreenThread GPU Dashboard | GPU utilization, memory, temperature across all nodes |
| GreenThread Models Dashboard | Model lifecycle, wake/sleep times, request rates per model |
| GreenThread System Dashboard | Controller reconciliation, queue depth, errors |
| GreenThread vLLM Dashboard | vLLM inference latency, throughput, KV cache utilization |

Dashboards appear under the GreenThread folder in Grafana.

Example queries

Wake latency (p99)

histogram_quantile(0.99, rate(gthread_sidecar_wake_duration_seconds_bucket[5m]))
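
The p99 above is an estimate: histogram_quantile finds the bucket that contains the requested rank and interpolates linearly inside it, so accuracy depends on the bucket boundaries. The Python sketch below is a simplified illustration of that interpolation for classic cumulative buckets. It is not GreenThread or Prometheus code, and the bucket boundaries and counts are hypothetical.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound) or count == lower_count:
                # Rank falls in the +Inf bucket (or an empty bucket):
                # report the highest bound seen so far.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical wake-duration buckets: (upper bound in seconds, cumulative count).
WAKE_BUCKETS = [(1.0, 40), (2.0, 80), (5.0, 95), (10.0, 100), (math.inf, 100)]
print(f"{histogram_quantile(0.99, WAKE_BUCKETS):.2f}")
```

With these sample buckets the 99th-percentile rank lands in the 5 to 10 second bucket, so the estimate can never be more precise than that bucket's width.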

Successful request rate per model pod

sum by (pod) (rate(gthread_sidecar_requests_total{status="success"}[5m]))

Models currently serving

count(gthread_sidecar_state{state="serving"} == 1)

Queue depth across all models

sum(gthread_sidecar_queue_depth)

Preemption rate

sum(rate(gthread_sidecar_preemptions_total[5m])) by (role)

GPU memory usage per model (bytes)

gthread_sidecar_gpu_memory_serving_bytes

Sleep/wake frequency

# Wakes per minute
sum(rate(gthread_sidecar_wake_duration_seconds_count[5m])) * 60

# Sleeps per minute
sum(rate(gthread_sidecar_sleep_duration_seconds_count[5m])) * 60
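
Any of the queries in this section can also be run outside Grafana through the Prometheus HTTP API instant-query endpoint (/api/v1/query). The following Python sketch uses only the standard library. The base URL is an assumption (for example, a local port-forward to the Prometheus service), and the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here, e.g. through a port-forward.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def instant_query(base_url: str, promql: str) -> list:
    """Run an instant query and return the list of result series."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Only builds the URL; call instant_query() against a live Prometheus.
print(build_query_url(PROMETHEUS_URL, "sum(gthread_sidecar_queue_depth)"))
```

Calling instant_query(PROMETHEUS_URL, "sum(gthread_sidecar_queue_depth)") against a reachable server returns the result series as a list of dictionaries.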

vLLM metrics

In addition to sidecar metrics, the vLLM inference engine exposes its own metrics (scraped by the greenthread-vllm PodMonitor). Key vLLM metrics include:

| Metric | Type | Description |
| --- | --- | --- |
| vllm:request_success_total | Counter | Successful inference requests |
| vllm:request_duration_seconds | Histogram | End-to-end request duration |
| vllm:time_to_first_token_seconds | Histogram | Time to first token (TTFT) |
| vllm:time_per_output_token_seconds | Histogram | Inter-token latency |
| vllm:num_requests_running | Gauge | Currently running requests |
| vllm:num_requests_waiting | Gauge | Requests waiting in queue |
| vllm:gpu_cache_usage_perc | Gauge | KV cache utilization percentage |
| vllm:prompt_tokens_total | Counter | Total prompt tokens processed |
| vllm:generation_tokens_total | Counter | Total completion tokens generated |

DCGM metrics

GPU hardware metrics are exposed via DCGM (Data Center GPU Manager), scraped by the greenthread-dcgm PodMonitor:

| Metric | Description |
| --- | --- |
| DCGM_FI_DEV_GPU_UTIL | GPU compute utilization (%) |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory copy utilization (%) |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (MB) |
| DCGM_FI_DEV_FB_FREE | Framebuffer memory free (MB) |
| DCGM_FI_DEV_GPU_TEMP | GPU temperature (C) |
| DCGM_FI_DEV_POWER_USAGE | Power draw (W) |

GPU CRD as live state

Beyond Prometheus metrics, the GPU custom resources provide a live view of the scheduling state:

kubectl get gpu -A

Each GPU CRD's status includes:

| Field | Description |
| --- | --- |
| stats.usedMemoryBytes | Actual measured GPU memory in use (NVML) |
| stats.freeMemoryBytes | Free GPU memory (NVML) |
| stats.utilizationPercent | GPU compute utilization |
| stats.temperatureCelsius | GPU temperature |
| stats.powerWatts | Current power draw |
| occupants | List of models loaded on this GPU with their state and reserved memory |

This provides real-time scheduling visibility without needing Prometheus:

# Quick overview of GPU occupants
kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {range .status.occupants[*]}{.modelName}({.state}) {end}{"\n"}{end}'
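
For scripting, the same overview can be built from the JSON output of kubectl get gpu -A -o json. The Python sketch below assumes the field paths used in the jsonpath expression above (metadata.name, status.occupants[*].modelName and .state). The sample payload, GPU names, and model names are hypothetical.

```python
import json

# Hypothetical shape of `kubectl get gpu -A -o json`; names are made up.
SAMPLE = json.dumps({
    "items": [
        {
            "metadata": {"name": "node-a-gpu-0"},
            "status": {"occupants": [
                {"modelName": "llama-8b", "state": "serving"},
                {"modelName": "mistral-7b", "state": "sleeping"},
            ]},
        },
        {"metadata": {"name": "node-a-gpu-1"}, "status": {}},
    ]
})


def occupants_by_gpu(raw: str) -> dict:
    """Map each GPU name to a list of 'model(state)' strings."""
    result = {}
    for item in json.loads(raw).get("items", []):
        # A GPU with no loaded models may have no occupants field at all.
        occupants = item.get("status", {}).get("occupants") or []
        result[item["metadata"]["name"]] = [
            f'{o.get("modelName")}({o.get("state")})' for o in occupants
        ]
    return result


for gpu, models in occupants_by_gpu(SAMPLE).items():
    print(f"{gpu}: {' '.join(models) or '-'}")
```

To use it against a live cluster, replace SAMPLE with the text read from standard input and pipe the kubectl output into the script.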