GreenThread Docs

Models in GreenThread have two layers of state: the Model CRD phase (managed by the controller) and the sidecar state (managed by the sidecar proxy in each model pod). Understanding both helps you monitor deployments and debug issues.

Model CRD phases

The Model custom resource progresses through lifecycle phases as the controller reconciles it.

Phase	Description
`Pending`	Model resource created, waiting for processing
`Converting`	Downloading from HuggingFace and converting to `gthread` load format on NVMe
`Staging`	Weights are being staged (verified on disk or loaded into CPU memory)
`Ready`	Model is ready for inference — pod is running, either serving or sleeping
`Failed`	Conversion or pod creation failed

Check the current phase:

kubectl get model -A

NAMESPACE            NAME                PHASE   REPLICAS   SERVING   SLEEPING   NODE        GPUS   VRAM/GPU
greenthread-system   llama-3-1-8b        Ready   1          0         1          gpu-node1   0      17.2 GiB
greenthread-system   qwen-3-5-35b-a3b    Ready   1          1         0          gpu-node1   1      88.2 GiB
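For quick scripting, the table above can be turned into records. The sketch below is a minimal parser for the column layout shown in the sample output; `parse_model_table` and `models_not_ready` are hypothetical helper names, not part of GreenThread, and parsing `-o json` output is likely more robust if the exact field paths are known.

```python
import re


def parse_model_table(text: str) -> list[dict[str, str]]:
    """Parse column-aligned `kubectl get model -A` output into dicts.

    Assumes columns are separated by two or more spaces and that no cell
    is empty. Values such as "17.2 GiB" contain a single space only, so
    they stay in one piece.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []

    def split(line: str) -> list[str]:
        return re.split(r"\s{2,}", line.strip())

    # Header names become lower-case keys: "namespace", "name", "phase", ...
    header = [name.lower() for name in split(lines[0])]
    return [dict(zip(header, split(line))) for line in lines[1:]]


def models_not_ready(rows: list[dict[str, str]]) -> list[str]:
    """Return the names of models whose phase is anything other than Ready."""
    return [row["name"] for row in rows if row["phase"] != "Ready"]
```

Feeding it the captured output of `kubectl get model -A` gives one dict per model, which makes it easy to alert on models that sit in `Pending`, `Converting`, `Staging`, or `Failed`.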

Conversion

During the Converting phase, the controller creates a Kubernetes Job that:

Downloads model weights from HuggingFace (using the huggingfaceToken from Helm values)
Converts weights to the gthread load format optimized for fast sleep/wake
Stores the converted model on NVMe at the host path configured by storage.modelsPath

Track conversion progress:

kubectl get model llama-3-1-8b -n greenthread-system -o jsonpath='{.status.conversion}'

Once conversion completes on one node, the storage agent distributes the model to other GPU nodes via peer-to-peer transfer.

Multi-node distribution

After conversion, the model may need to be available on multiple nodes. The Model status lists the nodes that already hold a copy:

kubectl get model llama-3-1-8b -n greenthread-system \
  -o jsonpath='{.status.distribution.availableOnNodes}'

Storage agents on each node discover peers via DNS and can pull models from any node that has them, avoiding redundant HuggingFace downloads.

Sidecar states

Once a model reaches the Ready phase, its pod runs a sidecar proxy that manages the model's runtime state. The sidecar has five states:

State	Description	VRAM Usage
`sleeping`	Process running, GPU memory fully released	0 bytes
`pending`	Sleeping, waiting for GPU memory (preemption may be needed)	0 bytes
`waking`	Acquiring GPU memory and reloading weights	Growing
`serving`	Actively handling inference requests	Full allocation
`deactivating`	Draining in-flight requests before sleep	Full allocation
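The five states and the transitions described in the sleep and wake flows on this page can be written down as a small state machine. This is a sketch for reasoning about the lifecycle, not the sidecar's implementation: the transition table contains only the edges this page mentions, and whether the real sidecar allows others (for example, falling back to `sleeping` after a failed wake) is an assumption left open here.

```python
from enum import Enum


class SidecarState(str, Enum):
    SLEEPING = "sleeping"
    PENDING = "pending"
    WAKING = "waking"
    SERVING = "serving"
    DEACTIVATING = "deactivating"


# Only the edges named in the documented sleep and wake flows.
# Assumption: no other transitions exist.
TRANSITIONS = {
    SidecarState.SLEEPING: {SidecarState.PENDING},
    SidecarState.PENDING: {SidecarState.WAKING},
    SidecarState.WAKING: {SidecarState.SERVING},
    SidecarState.SERVING: {SidecarState.DEACTIVATING},
    SidecarState.DEACTIVATING: {SidecarState.SLEEPING},
}


def can_transition(current: SidecarState, target: SidecarState) -> bool:
    """True if the documented flows include a direct current -> target edge."""
    return target in TRANSITIONS[current]


def uses_vram(state: SidecarState) -> bool:
    """Per the table: sleeping and pending hold 0 bytes, the rest hold some."""
    return state not in (SidecarState.SLEEPING, SidecarState.PENDING)
```

Because the enum values match the strings reported by the sidecar, `SidecarState("serving")` converts a reported state directly.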

True zero VRAM

When a model sleeps, GreenThread releases all of its GPU memory, not just most of it. A sleeping model consumes exactly 0 bytes of VRAM, which makes the GPU fully available to other models.

Check sidecar state

Each model exposes a /status endpoint through the gateway:

GATEWAY_URL=$(kubectl get gateway -n greenthread-system greenthread-gateway \
  -o jsonpath='{.status.addresses[0].value}')

curl "http://${GATEWAY_URL}/llama-3-1-8b/status"

{
  "model": "llama-3-1-8b",
  "state": "sleeping",
  "bootReady": true,
  "queue": {
    "inFlight": 0,
    "barriered": false
  }
}

Field	Description
`state`	Current sidecar state
`bootReady`	`true` once the sidecar has completed initial boot
`queue.inFlight`	Number of requests currently being processed
`queue.barriered`	`true` during sleep drain — new requests are rejected with `503`
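When the status is consumed from a script, it helps to parse it into a typed record. The sketch below covers only the fields shown in the sample response; `SidecarStatus`, `parse_status`, and `summarize` are hypothetical names, and the real payload may carry additional fields.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class SidecarStatus:
    model: str
    state: str
    boot_ready: bool
    in_flight: int
    barriered: bool


def parse_status(body: str) -> SidecarStatus:
    """Parse a /status response body; raises KeyError if a field is missing."""
    data = json.loads(body)
    queue = data["queue"]
    return SidecarStatus(
        model=data["model"],
        state=data["state"],
        boot_ready=data["bootReady"],
        in_flight=queue["inFlight"],
        barriered=queue["barriered"],
    )


def summarize(status: SidecarStatus) -> str:
    """One-line summary suitable for logs or dashboards."""
    text = f"{status.model}: {status.state}, {status.in_flight} in flight"
    if status.barriered:
        # Documented behavior: the barrier rejects new requests with 503.
        text += " (draining, new requests get 503)"
    return text
```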

Sleep flow

When a model transitions from serving to sleeping (through either a voluntary idle timeout or preemption), the sidecar runs these steps in order:

State → deactivating: the sidecar activates the preemption barrier.
Barrier stops new requests: TryClaimRequest() returns false, so new requests get 503.
Drain in-flight requests: the sidecar waits up to drainTimeout for active requests to complete.
Release GPU: GreenThread releases all GPU memory (true zero VRAM).
Release reservation: the sidecar updates the GPU CRD occupant from serving to sleeping with reservedMemoryBytes: 0.
State → sleeping: the model is fully asleep and its GPU memory is free.
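The barrier and drain steps above can be sketched with a counter guarded by a condition variable. This is an illustration under assumptions, not the sidecar's code: `RequestBarrier` and its methods are hypothetical names loosely modeled on `TryClaimRequest()`, and what the sidecar does when `drainTimeout` expires is not documented here, so `drain` simply reports whether the queue emptied in time.

```python
import threading


class RequestBarrier:
    """Tracks in-flight requests and blocks new ones while draining."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._barriered = False

    def try_claim_request(self) -> bool:
        """Claim a slot for a new request; False means reply with 503."""
        with self._cond:
            if self._barriered:
                return False
            self._in_flight += 1
            return True

    def release_request(self) -> None:
        """Mark one claimed request as finished."""
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def activate(self) -> None:
        """Raise the barrier so no new requests are admitted."""
        with self._cond:
            self._barriered = True

    def drain(self, timeout: float) -> bool:
        """Wait up to `timeout` seconds; True if all requests completed."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=timeout
            )

    def lift(self) -> None:
        """Lower the barrier again (for example, after the next wake)."""
        with self._cond:
            self._barriered = False
```

A request handler would call `try_claim_request()` first and `release_request()` in a `finally` block, so the drain count stays accurate even when a request fails.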

Idle timeout

The sidecar monitors request activity. If no requests arrive for sleep.idleTimeout (default: 5 minutes), the model voluntarily sleeps. The idle timer resets on every request.

The model must also have been serving for at least fairness.minRuntime before it can sleep voluntarily, preventing rapid sleep/wake cycles.
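The two conditions combine into a single check. The sketch below assumes both are evaluated against the same clock; `may_sleep_voluntarily` is a hypothetical helper, the 5-minute default comes from this page, and `min_runtime` has no default here because this page does not state one.

```python
from datetime import datetime, timedelta

DEFAULT_IDLE_TIMEOUT = timedelta(minutes=5)  # documented default for sleep.idleTimeout


def may_sleep_voluntarily(
    now: datetime,
    last_request_at: datetime,
    became_serving_at: datetime,
    min_runtime: timedelta,
    idle_timeout: timedelta = DEFAULT_IDLE_TIMEOUT,
) -> bool:
    """True when the model has been idle and serving for long enough."""
    idle_long_enough = now - last_request_at >= idle_timeout
    served_long_enough = now - became_serving_at >= min_runtime
    return idle_long_enough and served_long_enough
```

For example, a model that woke 6 minutes ago and has been idle ever since would not sleep yet under a `min_runtime` of 10 minutes, even though the idle timeout has elapsed.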

Preemption-triggered sleep

When another model needs GPU memory and triggers preemption, its sidecar writes a PreemptionIntent to the GPU CRD. The victim model's sidecar detects the intent through its preemption watcher (which polls every 2 seconds) and initiates an involuntary sleep, following the same sleep flow described above.

Wake flow

When an inference request arrives for a sleeping model:

State → pending: the request is queued and the sidecar checks whether the model fits on the GPU.
Check fit: the sidecar reads the GPU CRDs to see whether enough memory is available.
If it fits → waking: the sidecar acquires wake locks and claims its memory budget on the GPU CRDs.
If it doesn't fit → preemption flow:
- Adds a PreemptionIntent to the GPU CRDs (signals other sidecars)
- Waits fairness.maxWaitTime, since another model might sleep voluntarily in the meantime
- If the model still doesn't fit, selects the least recently used (LRU) non-popular victim and signals preemption
- Waits for the victim to release memory (polls the GPU CRD occupant state)
Reload weights: calls vLLM to reload model weights via the storage agent.
Wake vLLM: restores the KV cache and inference state.
State → serving: queued requests are dequeued and processed.
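The fit check and victim selection in the flow above can be sketched from the occupant fields documented under GPU CRD occupant tracking. This is a simplified model under assumptions: `capacity_bytes` is a hypothetical input (this page does not say where total GPU memory is recorded), only a single victim is chosen, and the real selection logic may weigh more than `lastAccessed`.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Occupant:
    model_name: str
    state: str                  # "serving", "sleeping", ...
    reserved_memory_bytes: int  # 0 when sleeping
    popular: bool               # protected from preemption
    last_accessed: datetime


def fits(capacity_bytes: int, occupants: list[Occupant], needed_bytes: int) -> bool:
    """True if the unreserved memory on the GPU covers `needed_bytes`."""
    reserved = sum(o.reserved_memory_bytes for o in occupants)
    return capacity_bytes - reserved >= needed_bytes


def select_victim(occupants: list[Occupant]) -> Optional[Occupant]:
    """Pick the least recently used serving occupant that is not popular."""
    candidates = [o for o in occupants if o.state == "serving" and not o.popular]
    if not candidates:
        return None  # nothing can be preempted
    return min(candidates, key=lambda o: o.last_accessed)
```

A `None` result corresponds to the case where every serving occupant is popular, in which preemption cannot proceed.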

Wake deduplication

If multiple requests arrive while a model is waking, they share the same wake operation. The sidecar tracks an in-flight wake result and additional callers wait for the same outcome. This prevents redundant wake attempts.
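This pattern is often called single-flight. A minimal asyncio sketch of the idea follows; `WakeCoordinator` and `ensure_awake` are hypothetical names, and the actual sidecar may implement deduplication differently.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class WakeCoordinator:
    """Lets concurrent callers share one in-flight wake operation."""

    def __init__(self, wake_fn: Callable[[], Awaitable[str]]) -> None:
        self._wake_fn = wake_fn  # coroutine function that performs the wake
        self._in_flight: Optional[asyncio.Task] = None

    async def ensure_awake(self) -> str:
        # No await between the check and the assignment, so only the first
        # caller on the event loop starts a wake.
        if self._in_flight is None:
            self._in_flight = asyncio.create_task(self._run())
        # shield() keeps the shared wake running if one waiter is cancelled.
        return await asyncio.shield(self._in_flight)

    async def _run(self) -> str:
        try:
            return await self._wake_fn()
        finally:
            # Clear the slot so a later sleep/wake cycle starts a fresh wake.
            self._in_flight = None


async def demo() -> tuple[int, list[str]]:
    """Five concurrent requests should trigger exactly one wake."""
    calls = 0

    async def fake_wake() -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # stands in for reloading weights
        return "serving"

    coordinator = WakeCoordinator(fake_wake)
    results = await asyncio.gather(
        *(coordinator.ensure_awake() for _ in range(5))
    )
    return calls, list(results)
```

Running `asyncio.run(demo())` exercises the shared path: every caller receives the result of the same wake, and a failure in the wake would propagate to all of them.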

Wake times

Wake latency depends on the storage tier where model weights are staged:

Source	Approximate wake time	Description
CPU pinned RAM	~600ms	DMA transfer from pinned host memory to GPU (~65 GB/s)
NVMe disk (GDS)	~2 seconds	GPU Direct Storage transfer (~20 GB/s)
NVMe disk (no GDS)	~5-6 seconds	Memory-mapped copy fallback

Configure fast wake by setting staging.pinned: true in the Model CRD. See Storage & Pinning.
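The bandwidth figures in the table give a rough lower bound on transfer time for a given model size. The sketch below is plain arithmetic under assumptions: it treats GB/s as decimal gigabytes per second, ignores every cost other than the copy itself, and omits the no-GDS tier because the table quotes no bandwidth for it.

```python
GIB = 2**30  # bytes in one GiB

# Bandwidths quoted in the wake-time table, in bytes per second.
# Assumption: "GB/s" means 10**9 bytes per second.
BANDWIDTH_BYTES_PER_S = {
    "cpu_pinned": 65e9,
    "nvme_gds": 20e9,
}


def transfer_seconds(model_bytes: float, tier: str) -> float:
    """Time to move `model_bytes` at the quoted bandwidth for `tier`."""
    return model_bytes / BANDWIDTH_BYTES_PER_S[tier]
```

Taking the 17.2 GiB VRAM/GPU figure from the earlier sample output as a stand-in for weight size, `transfer_seconds(17.2 * GIB, "cpu_pinned")` comes to about 0.28 s and the `"nvme_gds"` tier to about 0.92 s. Larger models scale linearly, so the approximate wake times in the table should be read as dependent on model size.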

GPU CRD occupant tracking

The GPU custom resource tracks which models occupy each GPU:

kubectl get gpu -A -o yaml

Each GPU's status.occupants list shows:

Field	Description
`podName`	Pod that owns this occupant slot
`modelName`	Model loaded by this occupant
`state`	`serving`, `sleeping`, or `converting`
`reservedMemoryBytes`	GPU memory reserved (0 when sleeping)
`popular`	Whether the model is protected from preemption
`lastAccessed`	Timestamp of last inference request
`becameServingAt`	When this occupant last transitioned to serving

The sidecar changes these fields with optimistic compare-and-swap (CAS) writes to the GPU CRD status. If a concurrent update conflicts, the sidecar retries.
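The read, modify, write, retry loop can be sketched against an in-memory stand-in for the API server. Everything here is illustrative: `FakeGpuStatusStore`, `Conflict`, and `update_occupant` are hypothetical names, the version counter plays the role that a resource version plays in Kubernetes, and the retry limit is an arbitrary choice.

```python
import copy


class Conflict(Exception):
    """Raised when a write is based on a stale version."""


class FakeGpuStatusStore:
    """In-memory stand-in for a GPU CRD status with a version counter."""

    def __init__(self, occupants: list[dict], inject_conflicts: int = 0) -> None:
        self.version = 1
        self._occupants = copy.deepcopy(occupants)
        self._inject_conflicts = inject_conflicts

    def read(self) -> tuple[int, list[dict]]:
        return self.version, copy.deepcopy(self._occupants)

    def write(self, expected_version: int, occupants: list[dict]) -> None:
        if self._inject_conflicts > 0:
            # Simulate another sidecar winning the race just before us.
            self._inject_conflicts -= 1
            self.version += 1
        if expected_version != self.version:
            raise Conflict(f"expected {expected_version}, found {self.version}")
        self._occupants = copy.deepcopy(occupants)
        self.version += 1


def update_occupant(store, pod_name: str, changes: dict, max_attempts: int = 5) -> bool:
    """Apply `changes` to one occupant, re-reading and retrying on conflict."""
    for _ in range(max_attempts):
        version, occupants = store.read()
        for occupant in occupants:
            if occupant["podName"] == pod_name:
                occupant.update(changes)
        try:
            store.write(version, occupants)
            return True
        except Conflict:
            continue  # someone else wrote first; start over from a fresh read
    return False
```

Re-reading before every attempt is the essential part: the change is always applied to the latest occupant list, so a concurrent writer's update is never overwritten.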

Troubleshooting states

Model stuck in `Converting`

Check the conversion job:

kubectl get jobs -n greenthread-system | grep <model-name>
kubectl logs job/<model-name>-convert -n greenthread-system

Common causes: HuggingFace token missing or invalid, model requires access approval, insufficient disk space.

Model stuck in `Pending`

The controller hasn't started processing the model yet. Check the controller logs:

kubectl logs -n greenthread-system -l app=gthread-controller --tail=30

Sidecar stuck in `waking`

The model can't acquire GPU memory. Check GPU occupants:

kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {.status.occupants}{"\n"}{end}'

If all GPUs are occupied by popular models, preemption cannot proceed. Either remove the popular flag from a model or add more GPU capacity.
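To check this condition from a script, the same data can be read as JSON (`kubectl get gpu -A -o json`) and filtered. The sketch below relies only on the `metadata.name`, `status.occupants`, `state`, and `popular` fields mentioned on this page; `gpus_without_victims` is a hypothetical helper name.

```python
def gpus_without_victims(gpu_list: dict) -> list[str]:
    """Names of GPUs whose serving occupants are all marked popular.

    `gpu_list` is the parsed JSON from `kubectl get gpu -A -o json`.
    GPUs with no serving occupant are skipped, since nothing on them
    needs to be preempted.
    """
    blocked = []
    for item in gpu_list.get("items", []):
        occupants = item.get("status", {}).get("occupants") or []
        serving = [o for o in occupants if o.get("state") == "serving"]
        if serving and all(o.get("popular") for o in serving):
            blocked.append(item["metadata"]["name"])
    return blocked
```

If the returned list contains every GPU in the cluster, no preemption victim is available anywhere, which matches the situation described above.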

Sidecar stuck in `pending`

As with a sidecar stuck in `waking`, the model is waiting for GPU memory. Check for preemption intents:

kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {.status.preemptionIntents}{"\n"}{end}'