Models in GreenThread have two layers of state: the Model CRD phase (managed by the controller) and the sidecar state (managed by the sidecar proxy in each model pod). Understanding both helps you monitor deployments and debug issues.
Model CRD phases
The Model custom resource progresses through lifecycle phases as the controller reconciles it.
| Phase | Description |
|---|---|
| Pending | Model resource created, waiting for processing |
| Converting | Downloading from HuggingFace and converting to gthread load format on NVMe |
| Staging | Weights are being staged (verified on disk or loaded into CPU memory) |
| Ready | Model is ready for inference: the pod is running, either serving or sleeping |
| Failed | Conversion or pod creation failed |
Check the current phase:
kubectl get model -A
NAMESPACE NAME PHASE REPLICAS SERVING SLEEPING NODE GPUS VRAM/GPU
greenthread-system llama-3-1-8b Ready 1 0 1 gpu-node1 0 17.2 GiB
greenthread-system qwen-3-5-35b-a3b Ready 1 1 0 gpu-node1 1 88.2 GiB
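For scripted rollouts it can help to block until a model settles. The following is a minimal sketch, not part of GreenThread: `wait_for_settled_phase` is a hypothetical helper, and the caller supplies `get_phase`, for example by wrapping a kubectl call. The field path that holds the phase is not shown on this page, so none is assumed here. Treating both Ready and Failed as stopping points is also an assumption.

```python
import time
from typing import Callable

# Phases worth stopping on (assumption: the script does not wait past Failed).
SETTLED_PHASES = {"Ready", "Failed"}


def wait_for_settled_phase(
    get_phase: Callable[[], str],
    timeout_s: float = 600.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_phase() until it returns Ready or Failed, else raise TimeoutError."""
    deadline = time.monotonic() + timeout_s
    while True:
        phase = get_phase()
        if phase in SETTLED_PHASES:
            return phase
        if time.monotonic() >= deadline:
            raise TimeoutError(f"model still in phase {phase!r} after {timeout_s}s")
        sleep(interval_s)
```

The `sleep` parameter exists only so the loop can be exercised without real delays.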
Conversion
During the Converting phase, the controller creates a Kubernetes Job that:
- Downloads model weights from HuggingFace (using the huggingfaceToken from Helm values)
- Converts weights to the gthread load format optimized for fast sleep/wake
- Stores the converted model on NVMe at the host path configured by storage.modelsPath
Track conversion progress:
kubectl get model llama-3-1-8b -n greenthread-system -o jsonpath='{.status.conversion}'
Once conversion completes on one node, the storage agent distributes the model to other GPU nodes via peer-to-peer transfer.
Multi-node distribution
After conversion, the model may need to be available on multiple nodes. The status tracks this:
kubectl get model llama-3-1-8b -n greenthread-system \
-o jsonpath='{.status.distribution.availableOnNodes}'
Storage agents on each node discover peers via DNS and can pull models from any node that has them, avoiding redundant HuggingFace downloads.
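To see which nodes are still waiting for a copy, the jsonpath output can be compared against the GPU nodes you expect. This is a small sketch that assumes `availableOnNodes` is a list of node-name strings (kubectl prints jsonpath lists as JSON arrays). `nodes_missing_model` is a hypothetical helper name.

```python
import json
from typing import List


def nodes_missing_model(available_json: str, gpu_nodes: List[str]) -> List[str]:
    """Return the GPU nodes not yet listed in status.distribution.availableOnNodes.

    available_json is the raw kubectl jsonpath output; an empty string is
    treated as "no nodes yet" (assumption: the field is absent before
    distribution starts).
    """
    available = set(json.loads(available_json or "[]"))
    return [node for node in gpu_nodes if node not in available]
```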
Sidecar states
Once a model reaches the Ready phase, its pod runs a sidecar proxy that manages the model's runtime state. The sidecar has five states:
| State | Description | VRAM Usage |
|---|---|---|
| sleeping | Process running, GPU memory fully released | 0 bytes |
| pending | Sleeping, waiting for GPU memory (preemption may be needed) | 0 bytes |
| waking | Acquiring GPU memory and reloading weights | Growing |
| serving | Actively handling inference requests | Full allocation |
| deactivating | Draining in-flight requests before sleep | Full allocation |
A sleeping model releases all of its GPU memory, not just most of it: it holds exactly 0 bytes of VRAM, leaving the GPU fully available to other models.
Check sidecar state
Each model exposes a /status endpoint through the gateway:
GATEWAY_URL=$(kubectl get gateway -n greenthread-system greenthread-gateway \
-o jsonpath='{.status.addresses[0].value}')
curl http://$GATEWAY_URL/llama-3-1-8b/status
{
"model": "llama-3-1-8b",
"state": "sleeping",
"bootReady": true,
"queue": {
"inFlight": 0,
"barriered": false
}
}
| Field | Description |
|---|---|
| state | Current sidecar state |
| bootReady | true once the sidecar has completed initial boot |
| queue.inFlight | Number of requests currently being processed |
| queue.barriered | true during sleep drain; new requests are rejected with 503 |
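A monitoring script might turn the /status payload into a one-line summary. The sketch below only reads the four fields documented above. The helper name `summarize_status` and the wording of its messages are illustrative.

```python
import json


def summarize_status(raw: str) -> str:
    """Summarize a /status response body using only the documented fields."""
    status = json.loads(raw)
    queue = status.get("queue", {})
    if not status.get("bootReady", False):
        return "sidecar still booting"
    if queue.get("barriered", False):
        # The barrier is up during sleep drain, so new requests are rejected.
        return "draining for sleep, new requests get 503"
    return "{} with {} in flight".format(status["state"], queue.get("inFlight", 0))
```

Fed the sample response above, this returns `sleeping with 0 in flight`.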
Sleep flow
When a model transitions from serving to sleeping (either voluntary idle timeout or preemption):
- State → deactivating: the sidecar activates the preemption barrier
- Barrier stops new requests: TryClaimRequest() returns false, new requests get 503
- Drain in-flight requests: waits up to drainTimeout for active requests to complete
- Release GPU: GreenThread releases all GPU memory (true zero VRAM)
- Release memory: updates GPU CRD occupant from serving → sleeping with reservedMemoryBytes: 0
- State → sleeping: model is fully asleep, GPU memory is free
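The barrier-and-drain part of this flow can be pictured as a small request gate. This is an illustrative sketch, not the sidecar's code: `RequestGate` and its method names are hypothetical (only `TryClaimRequest()` and `drainTimeout` come from the text above). What the sidecar does when the drain times out is not described here, so the sketch simply reports it.

```python
import threading


class RequestGate:
    """Counts in-flight requests and refuses new ones once barriered."""

    def __init__(self) -> None:
        self._idle = threading.Condition()
        self.in_flight = 0
        self.barriered = False

    def try_claim_request(self) -> bool:
        """Mirror of TryClaimRequest(): False means the caller should answer 503."""
        with self._idle:
            if self.barriered:
                return False
            self.in_flight += 1
            return True

    def release_request(self) -> None:
        """Call when a claimed request has finished."""
        with self._idle:
            self.in_flight -= 1
            if self.in_flight == 0:
                self._idle.notify_all()

    def drain(self, drain_timeout_s: float) -> bool:
        """Raise the barrier, then wait for in-flight requests to finish.

        Returns True if everything drained in time, False on timeout.
        """
        with self._idle:
            self.barriered = True
            return self._idle.wait_for(
                lambda: self.in_flight == 0, timeout=drain_timeout_s
            )
```

Raising the barrier and counting requests under the same lock means a request is either claimed before the barrier and waited for, or rejected. It cannot slip in between.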
Idle timeout
The sidecar monitors request activity. If no requests arrive for sleep.idleTimeout (default: 5 minutes), the model voluntarily sleeps. The idle timer resets on every request.
The model must also have been serving for at least fairness.minRuntime before it can sleep voluntarily, preventing rapid sleep/wake cycles.
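Both conditions must hold before a voluntary sleep, which can be written as a single predicate. This is a sketch with hypothetical names. Times are plain seconds, and since the default for fairness.minRuntime is not stated here, the caller must pass it.

```python
def may_sleep_voluntarily(
    now: float,
    last_request_at: float,
    became_serving_at: float,
    min_runtime_s: float,
    idle_timeout_s: float = 300.0,  # sleep.idleTimeout default: 5 minutes
) -> bool:
    """True when the model has been idle and serving for long enough."""
    idle_long_enough = now - last_request_at >= idle_timeout_s
    served_long_enough = now - became_serving_at >= min_runtime_s
    return idle_long_enough and served_long_enough
```

Because every request moves `last_request_at` forward, a steady trickle of traffic keeps the first condition false indefinitely.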
Preemption-triggered sleep
When another model needs GPU memory and triggers preemption, its sidecar writes a PreemptionIntent to the GPU CRD. The victim model's sidecar detects the intent via its preemption watcher (polling every 2 seconds) and initiates an involuntary sleep, following the same barrier, drain, and release steps as the sleep flow.
Wake flow
When an inference request arrives for a sleeping model:
- State → pending: request is queued, sidecar checks if model fits on GPU
- Check fit: reads GPU CRDs to see if there's enough available memory
- If fits → waking: acquires wake locks, claims memory budget on GPU CRDs
- If doesn't fit → preemption flow:
  - Adds PreemptionIntent to GPU CRDs (signals other sidecars)
  - Waits fairness.maxWaitTime, since another model might voluntarily sleep
  - If still doesn't fit, selects LRU non-popular victim and signals preemption
  - Waits for victim to release memory (polls GPU CRD occupant state)
- Reload weights: calls vLLM to reload model weights via the storage agent
- Wake vLLM: restores KV cache and inference state
- State → serving: requests are dequeued and processed
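The fit check and victim selection can be sketched over the occupant fields the GPU CRD exposes (state, reservedMemoryBytes, popular, lastAccessed). This is an illustration under assumptions, not the sidecar's algorithm. GPU capacity is passed in because its field name is not documented here. lastAccessed is assumed to be a UTC timestamp string in one consistent format, so string order matches time order. Only serving occupants are considered as victims, since sleeping ones already hold no memory.

```python
from typing import Dict, List, Optional

Occupant = Dict[str, object]


def fits(required_bytes: int, capacity_bytes: int, occupants: List[Occupant]) -> bool:
    """Check whether the GPU has enough unreserved memory for the waking model."""
    reserved = sum(int(o["reservedMemoryBytes"]) for o in occupants)
    return capacity_bytes - reserved >= required_bytes


def pick_victim(occupants: List[Occupant]) -> Optional[Occupant]:
    """Pick the least recently accessed serving occupant that is not popular."""
    candidates = [
        o for o in occupants if o["state"] == "serving" and not o["popular"]
    ]
    if not candidates:
        return None  # every serving occupant is protected from preemption
    return min(candidates, key=lambda o: str(o["lastAccessed"]))
```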
Wake deduplication
If multiple requests arrive while a model is waking, they share the same wake operation. The sidecar tracks an in-flight wake result and additional callers wait for the same outcome. This prevents redundant wake attempts.
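This is the pattern often called single-flight. The following is a minimal sketch of it using a shared future. `WakeDeduplicator` and `wake_fn` are hypothetical names, and the real sidecar may structure this differently.

```python
import threading
from concurrent.futures import Future
from typing import Callable, Optional


class WakeDeduplicator:
    """Ensures concurrent callers share one in-flight wake operation."""

    def __init__(self, wake_fn: Callable[[], str]) -> None:
        self._wake_fn = wake_fn
        self._lock = threading.Lock()
        self._in_flight: Optional[Future] = None

    def wake(self) -> str:
        with self._lock:
            future = self._in_flight
            leader = future is None
            if leader:
                # First caller creates the shared result and runs the wake.
                future = self._in_flight = Future()
        if leader:
            try:
                future.set_result(self._wake_fn())
            except Exception as exc:
                future.set_exception(exc)  # followers see the same failure
            finally:
                with self._lock:
                    self._in_flight = None
        # Leader and followers all read the same outcome.
        return future.result()
```

Clearing `_in_flight` after completion means a later request, arriving once the model has gone back to sleep, starts a fresh wake instead of reusing a stale result.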
Wake times
Wake latency depends on the storage tier where model weights are staged:
| Source | Approximate wake time | Description |
|---|---|---|
| CPU pinned RAM | ~600ms | DMA transfer from pinned host memory to GPU (~65 GB/s) |
| NVMe disk (GDS) | ~2 seconds | GPU Direct Storage transfer (~20 GB/s) |
| NVMe disk (no GDS) | ~5-6 seconds | Memory-mapped copy fallback |
Configure fast wake by setting staging.pinned: true in the Model CRD. See Storage & Pinning.
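As a rough sanity check, the raw transfer time is the weight size divided by the tier's bandwidth. The sketch below uses the bandwidths from the table and assumes GB means 10^9 bytes. It covers only the copy, so actual wake time will be higher once memory acquisition and vLLM restore are included.

```python
GIB = 1024 ** 3

# Approximate bandwidths from the wake-time table, in bytes per second.
BANDWIDTH_BYTES_PER_S = {
    "pinned_ram": 65e9,  # CPU pinned RAM, DMA to GPU
    "nvme_gds": 20e9,    # NVMe with GPU Direct Storage
}


def transfer_seconds(weights_gib: float, source: str) -> float:
    """Estimate the time to copy weights_gib of weights from the given tier."""
    return weights_gib * GIB / BANDWIDTH_BYTES_PER_S[source]


if __name__ == "__main__":
    for tier in BANDWIDTH_BYTES_PER_S:
        print(f"17.2 GiB from {tier}: {transfer_seconds(17.2, tier):.2f} s")
```

At these rates, the table's ~600 ms and ~2 second figures correspond to roughly 39 to 40 GB of weights. Smaller models should wake faster and larger ones slower.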
GPU CRD occupant tracking
The GPU custom resource tracks which models occupy each GPU:
kubectl get gpu -A -o yaml
Each GPU's status.occupants list shows:
| Field | Description |
|---|---|
| podName | Pod that owns this occupant slot |
| modelName | Model loaded by this occupant |
| state | serving, sleeping, or converting |
| reservedMemoryBytes | GPU memory reserved (0 when sleeping) |
| popular | Whether the model is protected from preemption |
| lastAccessed | Timestamp of last inference request |
| becameServingAt | When this occupant last transitioned to serving |
The sidecar updates these fields via optimistic CAS (compare-and-swap) updates to the GPU CRD status. If a concurrent update conflicts, the sidecar retries.
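The retry loop behind an optimistic update looks roughly like the sketch below. It uses an in-memory stand-in for the API server: `FakeGpuStore`, `Conflict`, and `update_occupant` are all hypothetical. In Kubernetes the version check is done with the object's resourceVersion, and a stale write is rejected with HTTP 409 Conflict.

```python
import copy
from typing import Dict, Tuple


class Conflict(Exception):
    """Raised when the stored version changed since it was read."""


class FakeGpuStore:
    """In-memory stand-in for a GPU CRD status with a version counter."""

    def __init__(self, status: Dict) -> None:
        self._status = status
        self._version = 1

    def get(self) -> Tuple[int, Dict]:
        return self._version, copy.deepcopy(self._status)

    def update(self, expected_version: int, status: Dict) -> None:
        if expected_version != self._version:
            raise Conflict()
        self._status = copy.deepcopy(status)
        self._version += 1


def update_occupant(store, pod_name: str, changes: Dict, max_attempts: int = 5) -> int:
    """Apply changes to one occupant, re-reading and retrying on conflict.

    Returns the number of attempts it took.
    """
    for attempt in range(1, max_attempts + 1):
        version, status = store.get()  # always start from the latest state
        for occupant in status["occupants"]:
            if occupant["podName"] == pod_name:
                occupant.update(changes)
        try:
            store.update(version, status)
            return attempt
        except Conflict:
            continue  # someone else wrote first; read again and retry
    raise RuntimeError(f"gave up after {max_attempts} conflicting updates")
```

Re-reading before each retry is the important part. Reapplying the change to a stale copy would overwrite whatever the concurrent writer stored.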
Troubleshooting states
Model stuck in Converting
Check the conversion job:
kubectl get jobs -n greenthread-system | grep <model-name>
kubectl logs job/<model-name>-convert -n greenthread-system
Common causes: HuggingFace token missing or invalid, model requires access approval, insufficient disk space.
Model stuck in Pending
The controller hasn't started processing yet. Check controller logs:
kubectl logs -n greenthread-system -l app=gthread-controller --tail=30
Sidecar stuck in waking
The model can't acquire GPU memory. Check GPU occupants:
kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {.status.occupants}{"\n"}{end}'
If all GPUs are occupied by popular models, preemption cannot proceed. Either remove the popular flag from a model or add more GPU capacity.
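To check this condition across a cluster, the output of `kubectl get gpu -A -o json` can be scanned for any serving occupant without the popular flag. This is a hedged sketch: `has_preemptible_victim` is a hypothetical helper, and it assumes occupants sit under `status.occupants` with the fields listed in the occupant table.

```python
import json


def has_preemptible_victim(gpu_list_json: str) -> bool:
    """True if any GPU has a serving occupant that is not marked popular."""
    for gpu in json.loads(gpu_list_json).get("items", []):
        for occupant in gpu.get("status", {}).get("occupants", []):
            if occupant.get("state") == "serving" and not occupant.get("popular", False):
                return True
    return False
```

A False result while a model is waiting for memory points at the popular-flag situation described above.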
Sidecar stuck in pending
As with a sidecar stuck in waking, the model is waiting for GPU memory. Check for preemption intents:
kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {.status.preemptionIntents}{"\n"}{end}'