Models in GreenThread have two layers of state: the Model CRD phase (managed by the controller) and the sidecar state (managed by the sidecar proxy in each model pod). Understanding both helps you monitor deployments and debug issues.

Model CRD phases

The Model custom resource progresses through lifecycle phases as the controller reconciles it.

Phase       Description
Pending     Model resource created, waiting for processing
Converting  Downloading from HuggingFace and converting to gthread load format on NVMe
Staging     Weights are being staged (verified on disk or loaded into CPU memory)
Ready       Model is ready for inference: pod is running, either serving or sleeping
Failed      Conversion or pod creation failed

Check the current phase:

kubectl get model -A
NAMESPACE            NAME                PHASE   REPLICAS   SERVING   SLEEPING   NODE        GPUS   VRAM/GPU
greenthread-system   llama-3-1-8b        Ready   1          0         1          gpu-node1   0      17.2 GiB
greenthread-system   qwen-3-5-35b-a3b    Ready   1          1         0          gpu-node1   1      88.2 GiB

Conversion

During the Converting phase, the controller creates a Kubernetes Job that:

  1. Downloads model weights from HuggingFace (using the huggingfaceToken from Helm values)
  2. Converts weights to the gthread load format optimized for fast sleep/wake
  3. Stores the converted model on NVMe at the host path configured by storage.modelsPath

Track conversion progress:

kubectl get model llama-3-1-8b -n greenthread-system -o jsonpath='{.status.conversion}'

Once conversion completes on one node, the storage agent distributes the model to other GPU nodes via peer-to-peer transfer.

Multi-node distribution

After conversion, the model may need to be available on multiple nodes. The status tracks this:

kubectl get model llama-3-1-8b -n greenthread-system \
  -o jsonpath='{.status.distribution.availableOnNodes}'

Storage agents on each node discover peers via DNS and can pull models from any node that has them, avoiding redundant HuggingFace downloads.

Sidecar states

Once a model reaches the Ready phase, its pod runs a sidecar proxy that manages the model's runtime state. The sidecar has five states:

State         Description                                                  VRAM Usage
sleeping      Process running, GPU memory fully released                   0 bytes
pending       Sleeping, waiting for GPU memory (preemption may be needed)  0 bytes
waking        Acquiring GPU memory and reloading weights                   Growing
serving       Actively handling inference requests                         Full allocation
deactivating  Draining in-flight requests before sleep                     Full allocation

True zero VRAM: When sleeping, GreenThread releases all GPU memory, not just most of it. Sleeping models consume exactly 0 bytes of VRAM, making the GPU fully available for other models.

Check sidecar state

Each model exposes a /status endpoint through the gateway:

GATEWAY_URL=$(kubectl get gateway -n greenthread-system greenthread-gateway \
  -o jsonpath='{.status.addresses[0].value}')

curl http://$GATEWAY_URL/llama-3-1-8b/status
{
  "model": "llama-3-1-8b",
  "state": "sleeping",
  "bootReady": true,
  "queue": {
    "inFlight": 0,
    "barriered": false
  }
}

Field            Description
state            Current sidecar state
bootReady        true once the sidecar has completed initial boot
queue.inFlight   Number of requests currently being processed
queue.barriered  true during sleep drain; new requests are rejected with 503
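
For scripted checks, the state field can be polled until it reaches a target value. The following is a minimal Python sketch, not part of GreenThread: wait_for_state and its parameters are hypothetical names, and fetch_status stands for any callable that returns the parsed /status JSON (for example, a wrapper around an HTTP GET to the gateway URL shown above). The demo replays canned responses instead of calling a real gateway.

```python
import time
from typing import Any, Callable, Dict


def wait_for_state(
    fetch_status: Callable[[], Dict[str, Any]],
    target: str,
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll fetch_status() until its "state" field equals target."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("state") == target:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"state is {status.get('state')!r}, wanted {target!r}"
            )
        sleep(interval_s)


# Stand-in for real /status calls: replays a fixed sequence of states.
_responses = (
    {"model": "llama-3-1-8b", "state": s, "bootReady": True}
    for s in ("sleeping", "pending", "waking", "serving")
)
final = wait_for_state(lambda: next(_responses), "serving", sleep=lambda _: None)
print(final["state"])  # serving
```

Injecting fetch_status and sleep keeps the helper independent of any particular HTTP client and makes it easy to exercise without a cluster.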

Sleep flow

When a model transitions from serving to sleeping (either voluntary idle timeout or preemption):

  1. State → deactivating: the sidecar activates the preemption barrier
  2. Barrier stops new requests: TryClaimRequest() returns false and new requests get 503
  3. Drain in-flight requests: waits up to drainTimeout for active requests to complete
  4. Release GPU: GreenThread releases all GPU memory (true zero VRAM)
  5. Release memory: updates the GPU CRD occupant from serving → sleeping with reservedMemoryBytes: 0
  6. State → sleeping: the model is fully asleep and GPU memory is free
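
The barrier-then-drain part of this flow (steps 1 to 3) can be illustrated with a small Python sketch. This is not the sidecar's code: RequestBarrier and its methods are hypothetical, with try_claim_request only loosely mirroring the TryClaimRequest() call named above. What happens when the drain timeout expires is not described here, so the sketch simply reports it.

```python
import threading


class RequestBarrier:
    """Counts in-flight requests and refuses new ones once the barrier is raised."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0
        self._barriered = False

    def try_claim_request(self) -> bool:
        """Admit a request unless the barrier is up (the caller would answer 503)."""
        with self._cond:
            if self._barriered:
                return False
            self._in_flight += 1
            return True

    def release_request(self) -> None:
        """Mark one admitted request as finished and wake any drain waiter."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def raise_and_drain(self, drain_timeout_s: float) -> bool:
        """Raise the barrier, then wait for in-flight requests to reach zero.

        Returns True if drained in time, False if the timeout expired first.
        """
        with self._cond:
            self._barriered = True
            return self._cond.wait_for(
                lambda: self._in_flight == 0, timeout=drain_timeout_s
            )


barrier = RequestBarrier()
assert barrier.try_claim_request()  # admitted while serving
finisher = threading.Timer(0.05, barrier.release_request)  # request ends shortly
finisher.start()
drained = barrier.raise_and_drain(drain_timeout_s=2.0)
finisher.join()
print(drained, barrier.try_claim_request())  # True False
```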

Idle timeout

The sidecar monitors request activity. If no requests arrive for sleep.idleTimeout (default: 5 minutes), the model voluntarily sleeps. The idle timer resets on every request.

The model must also have been serving for at least fairness.minRuntime before it can sleep voluntarily, preventing rapid sleep/wake cycles.
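
Taken together, the two conditions amount to a simple check. The Python sketch below is illustrative only: may_sleep_voluntarily and its parameter names are hypothetical, the 300-second default comes from the documented sleep.idleTimeout, and the 600-second fairness.minRuntime used in the examples is an assumed value, not a documented default.

```python
def may_sleep_voluntarily(
    now_s: float,
    last_request_s: float,
    became_serving_s: float,
    min_runtime_s: float,
    idle_timeout_s: float = 300.0,  # 5 minutes, the documented default
) -> bool:
    """True when the model has been idle long enough AND serving long enough."""
    idle_long_enough = now_s - last_request_s >= idle_timeout_s
    ran_long_enough = now_s - became_serving_s >= min_runtime_s
    return idle_long_enough and ran_long_enough


# Idle for 350 s but serving for only 500 s of an assumed 600 s minimum.
print(may_sleep_voluntarily(1000, 650, 500, min_runtime_s=600))   # False
# Idle for 550 s and serving for 700 s: both conditions hold.
print(may_sleep_voluntarily(1200, 650, 500, min_runtime_s=600))   # True
# A recent request (200 s ago) resets the idle timer.
print(may_sleep_voluntarily(1200, 1000, 500, min_runtime_s=600))  # False
```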

Preemption-triggered sleep

When another model needs GPU memory and triggers preemption, its sidecar writes a PreemptionIntent to the GPU CRD. The victim model's sidecar detects this via the preemption watcher (polling every 2 seconds) and initiates involuntary sleep, following the same sleep flow as above: barrier, drain, release.

Wake flow

When an inference request arrives for a sleeping model:

  1. State → pending: the request is queued and the sidecar checks whether the model fits on the GPU
  2. Check fit: reads GPU CRDs to see whether enough memory is available
  3. If it fits → waking: acquires wake locks and claims its memory budget on the GPU CRDs
  4. If it doesn't fit → preemption flow:
    • Adds a PreemptionIntent to the GPU CRDs (signals other sidecars)
    • Waits fairness.maxWaitTime in case another model sleeps voluntarily
    • If the model still doesn't fit, selects the least recently used (LRU) non-popular victim and signals preemption
    • Waits for the victim to release memory (polls the GPU CRD occupant state)
  5. Reload weights: calls vLLM to reload model weights via the storage agent
  6. Wake vLLM: restores KV cache and inference state
  7. State → serving: requests are dequeued and processed
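
The sleep and wake flows together imply a cycle through the five sidecar states. The Python sketch below encodes only the transitions named in those two flows. Error paths (for example, a failed wake) are not described on this page and are deliberately left out, and the names SidecarState, TRANSITIONS, and can_transition are hypothetical.

```python
from enum import Enum


class SidecarState(Enum):
    SLEEPING = "sleeping"
    PENDING = "pending"
    WAKING = "waking"
    SERVING = "serving"
    DEACTIVATING = "deactivating"


# Happy-path transitions taken from the documented sleep and wake flows.
TRANSITIONS = {
    SidecarState.SLEEPING: {SidecarState.PENDING},       # request arrives
    SidecarState.PENDING: {SidecarState.WAKING},         # model fits on GPU
    SidecarState.WAKING: {SidecarState.SERVING},         # weights reloaded
    SidecarState.SERVING: {SidecarState.DEACTIVATING},   # idle or preempted
    SidecarState.DEACTIVATING: {SidecarState.SLEEPING},  # drained, GPU released
}


def can_transition(src: SidecarState, dst: SidecarState) -> bool:
    """True if dst directly follows src in the documented flows."""
    return dst in TRANSITIONS[src]


print(can_transition(SidecarState.SLEEPING, SidecarState.PENDING))  # True
print(can_transition(SidecarState.SLEEPING, SidecarState.SERVING))  # False
```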

Wake deduplication

If multiple requests arrive while a model is waking, they share the same wake operation. The sidecar tracks an in-flight wake result and additional callers wait for the same outcome. This prevents redundant wake attempts.
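
The idea of sharing one in-flight wake can be sketched in a few lines of Python with asyncio. This illustrates the general deduplication pattern rather than the sidecar's implementation: WakeDeduplicator and fake_wake are hypothetical, and cancellation handling is omitted for brevity.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class WakeDeduplicator:
    """Lets concurrent callers share a single in-flight wake operation."""

    def __init__(self, do_wake: Callable[[], Awaitable[str]]) -> None:
        self._do_wake = do_wake
        self._in_flight: Optional[asyncio.Task] = None

    async def wake(self) -> str:
        # The first caller starts the wake; later callers await the same task.
        if self._in_flight is None:
            self._in_flight = asyncio.create_task(self._run())
        return await self._in_flight

    async def _run(self) -> str:
        try:
            return await self._do_wake()
        finally:
            # Clear the slot so a later sleep/wake cycle starts a fresh wake.
            self._in_flight = None


wake_calls: List[int] = []


async def fake_wake() -> str:
    """Stand-in for the real wake: records the call and pauses briefly."""
    wake_calls.append(1)
    await asyncio.sleep(0.01)
    return "serving"


async def main() -> List[str]:
    dedup = WakeDeduplicator(fake_wake)
    # Five requests arrive while the model is still waking.
    return await asyncio.gather(*(dedup.wake() for _ in range(5)))


results = asyncio.run(main())
print(results, len(wake_calls))
# ['serving', 'serving', 'serving', 'serving', 'serving'] 1
```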

Wake times

Wake latency depends on the storage tier where model weights are staged:

Source              Approximate wake time  Description
CPU pinned RAM      ~600 ms                DMA transfer from pinned host memory to GPU (~65 GB/s)
NVMe disk (GDS)     ~2 seconds             GPUDirect Storage transfer (~20 GB/s)
NVMe disk (no GDS)  ~5-6 seconds           Memory-mapped copy fallback
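
As a rough sanity check, transfer time is approximately weight size divided by bandwidth. The Python sketch below uses the bandwidth figures quoted above and a hypothetical 40 GB of weights. The page does not state which model size the table assumes, so treat the result as a lower-bound estimate that ignores any fixed overhead.

```python
GB = 1e9  # decimal gigabytes, matching the GB/s figures in the table


def transfer_seconds(weight_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Lower-bound transfer time: size divided by sustained bandwidth."""
    return weight_bytes / bandwidth_bytes_per_s


weights = 40 * GB  # hypothetical model size, not taken from the docs
pinned = transfer_seconds(weights, 65 * GB)  # CPU pinned RAM
gds = transfer_seconds(weights, 20 * GB)     # NVMe with GDS
print(f"{pinned:.2f}s {gds:.2f}s")  # 0.62s 2.00s
```

At that size the estimates land close to the ~600 ms and ~2 seconds listed above; smaller models should wake proportionally faster.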

Configure fast wake by setting staging.pinned: true in the Model CRD. See Storage & Pinning.

GPU CRD occupant tracking

The GPU custom resource tracks which models occupy each GPU:

kubectl get gpu -A -o yaml

Each GPU's status.occupants list shows:

Field                Description
podName              Pod that owns this occupant slot
modelName            Model loaded by this occupant
state                serving, sleeping, or converting
reservedMemoryBytes  GPU memory reserved (0 when sleeping)
popular              Whether the model is protected from preemption
lastAccessed         Timestamp of last inference request
becameServingAt      When this occupant last transitioned to serving

The sidecar updates these fields via optimistic CAS (compare-and-swap) updates to the GPU CRD status. If a concurrent update conflicts, the sidecar retries.
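
The read-modify-write-retry pattern behind optimistic updates can be sketched in Python as follows. Kubernetes supports this style of update by rejecting writes that carry a stale metadata.resourceVersion. Here an in-memory FakeGpuStore with a version counter plays that role, and every name in the sketch (FakeGpuStore, Conflict, update_with_retry) is hypothetical rather than part of GreenThread.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


class Conflict(Exception):
    """Raised when a write is based on a stale version."""


@dataclass
class FakeGpuStore:
    """In-memory stand-in for one GPU object: a status dict plus a version."""

    status: Dict[str, int] = field(default_factory=dict)
    version: int = 0

    def read(self) -> Tuple[Dict[str, int], int]:
        return dict(self.status), self.version

    def write(self, new_status: Dict[str, int], expected_version: int) -> None:
        if expected_version != self.version:
            raise Conflict("version changed since read")
        self.status = dict(new_status)
        self.version += 1


def update_with_retry(
    store: FakeGpuStore,
    mutate: Callable[[Dict[str, int]], None],
    max_attempts: int = 5,
) -> int:
    """Apply mutate to a fresh copy and write it back, retrying on conflict."""
    for attempt in range(1, max_attempts + 1):
        status, version = store.read()
        mutate(status)
        try:
            store.write(status, version)
            return attempt
        except Conflict:
            continue  # someone else wrote first: re-read and try again
    raise RuntimeError("gave up after repeated conflicts")


store = FakeGpuStore({"reservedMemoryBytes": 1000})
seen = {"interfered": False}


def release_memory(status: Dict[str, int]) -> None:
    if not seen["interfered"]:
        seen["interfered"] = True
        store.write(*store.read())  # simulate a concurrent writer sneaking in
    status["reservedMemoryBytes"] = 0


attempts = update_with_retry(store, release_memory)
print(attempts, store.status)  # 2 {'reservedMemoryBytes': 0}
```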

Troubleshooting states

Model stuck in Converting

Check the conversion job:

kubectl get jobs -n greenthread-system | grep <model-name>
kubectl logs job/<model-name>-convert -n greenthread-system

Common causes: HuggingFace token missing or invalid, model requires access approval, insufficient disk space.

Model stuck in Pending

The controller hasn't started processing yet. Check controller logs:

kubectl logs -n greenthread-system -l app=gthread-controller --tail=30

Sidecar stuck in waking

The model can't acquire GPU memory. Check GPU occupants:

kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {.status.occupants}{"\n"}{end}'

If all GPUs are occupied by popular models, preemption cannot proceed. Either remove the popular flag from a model or add more GPU capacity.
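
To see why popular models block preemption, it can help to sketch victim selection over the occupant fields described earlier. The Python below is an assumption-laden illustration, not GreenThread's algorithm: pick_victim is a hypothetical name, restricting candidates to serving occupants is an assumption (sleeping occupants reserve 0 bytes, so evicting them would free nothing), and lastAccessed is assumed to be a uniformly formatted UTC timestamp string, which sorts chronologically.

```python
from typing import Any, Dict, List, Optional


def pick_victim(occupants: List[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the least recently used serving occupant that is not popular."""
    candidates = [
        o for o in occupants if o["state"] == "serving" and not o["popular"]
    ]
    if not candidates:
        return None  # nothing can be preempted
    return min(candidates, key=lambda o: o["lastAccessed"])


occupants = [
    {"modelName": "model-a", "state": "serving", "popular": True,
     "lastAccessed": "2024-01-01T09:00:00Z"},
    {"modelName": "model-b", "state": "serving", "popular": False,
     "lastAccessed": "2024-01-01T10:05:00Z"},
    {"modelName": "model-c", "state": "serving", "popular": False,
     "lastAccessed": "2024-01-01T10:01:00Z"},
    {"modelName": "model-d", "state": "sleeping", "popular": False,
     "lastAccessed": "2024-01-01T08:00:00Z"},
]

victim = pick_victim(occupants)
print(victim["modelName"])  # model-c

all_popular = [dict(o, popular=True) for o in occupants]
print(pick_victim(all_popular))  # None
```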

Sidecar stuck in pending

As with a sidecar stuck in waking, the model is waiting for GPU memory. Check for preemption intents:

kubectl get gpu -A -o jsonpath='{range .items[*]}{.metadata.name}: {.status.preemptionIntents}{"\n"}{end}'