When multiple models compete for limited GPU memory, GreenThread uses a fairness-based preemption algorithm to decide which model sleeps and which wakes. The policy ensures that no model waits indefinitely while another monopolizes GPU resources.
How preemption works
When a sleeping model receives an inference request and there isn't enough GPU memory available:
- Check fit: The waking sidecar reads GPU CRDs to see if the model fits alongside current occupants.
- Add preemption intent: If it doesn't fit, the sidecar writes a `PreemptionIntent` to the GPU CRDs, signaling its need for memory.
- Fairness wait: The sidecar waits for `maxWaitTime`, re-checking fit every second. Another model might voluntarily sleep during this window (e.g. idle timeout), avoiding forced preemption.
- Select victim: After the fairness period, the sidecar selects the best preemption candidate.
- Signal and wait: The victim's preemption watcher detects the intent and initiates sleep. The waking sidecar polls until the victim releases memory.
- Wake: Once memory is available, the waking sidecar claims it and loads the model.
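The sequence above can be sketched as a small in-memory simulation. This is an illustration under stated assumptions, not GreenThread's implementation: `Gpu`, `wake`, and the `tick` and `preempt` callbacks are hypothetical names, and the one-second sleep is injected as `tick` so the example runs instantly.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Gpu:
    """Hypothetical in-memory stand-in for a GPU CRD."""
    total: int
    reserved: int = 0                       # memory held by serving occupants
    intents: List[str] = field(default_factory=list)

    @property
    def available(self) -> int:
        return self.total - self.reserved


def wake(model: str, needed: int, gpu: Gpu, max_wait_s: int,
         tick: Callable[[], None], preempt: Callable[[Gpu], None]) -> str:
    """Return how the model obtained memory: 'fit', 'voluntary', or 'preempted'."""
    if gpu.available >= needed:             # check fit
        outcome = "fit"
    else:
        gpu.intents.append(model)           # add preemption intent
        outcome = "preempted"
        for _ in range(max_wait_s):         # fairness wait, re-check every second
            tick()                          # stands in for sleeping one second
            if gpu.available >= needed:
                outcome = "voluntary"       # someone slept on their own
                break
        else:
            preempt(gpu)                    # select victim, wait for it to release
        gpu.intents.remove(model)
        if gpu.available < needed:
            raise RuntimeError("victim did not free enough memory")
    gpu.reserved += needed                  # claim memory and load the model
    return outcome
```

The `for ... else` form keeps forced preemption on the path that runs only when the whole fairness window elapsed without a fit.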
Configuration
Fairness is configured per-model in the Model CRD:
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  fairness:
    minRuntime: 10s
    maxWaitTime: 5s
    popular: false
| Field | Type | Default | Description |
|---|---|---|---|
| `minRuntime` | duration | 10s | Minimum time a model must serve before it can be preempted. Prevents rapid sleep/wake thrashing. |
| `maxWaitTime` | duration | 5s | How long a waiting model waits before forcing preemption. During this window, voluntary sleeps may free memory without forced eviction. |
| `popular` | bool | false | If true, the model is never preempted. It stays on GPU permanently until manually scaled down. |
Victim selection algorithm
When preemption is required, the algorithm follows this priority:
- Skip popular models: models with `fairness.popular: true` are never evicted.
- Skip recently woken models: models that have been serving for less than their `minRuntime` are skipped, preventing thrashing.
- Skip non-serving models: only `serving` occupants are candidates (sleeping models already have 0 VRAM).
- LRU ordering: among eligible candidates, the model with the oldest `lastAccessed` timestamp is evicted first.
- Same-GPU constraint: only models on the same GPU(s) that the waking model needs are considered.
candidates = GPU occupants
  → filter: state == serving
  → filter: not popular
  → filter: serving for >= minRuntime
  → filter: not self
  → sort: oldest lastAccessed first (LRU)
  → evict first candidate
If no eligible candidates exist (e.g. all occupants are popular or recently woken), the wake fails with an error.
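The filter chain above translates almost directly into code. The following is a minimal sketch, assuming a hypothetical `Occupant` record whose fields (`serving_since`, `last_accessed`, `min_runtime`) are illustrative names rather than the real CRD schema. It expects the caller to pass only occupants of the GPU(s) the waking model needs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Occupant:
    """Hypothetical view of one occupant entry on a GPU CRD."""
    name: str
    state: str             # "serving" or "sleeping"
    popular: bool
    serving_since: float   # seconds, same clock as `now`
    last_accessed: float   # seconds, same clock as `now`
    min_runtime: float     # fairness.minRuntime in seconds


def select_victim(occupants: List[Occupant], waker: str,
                  now: float) -> Optional[Occupant]:
    """Apply the filters, then pick the least recently used candidate."""
    candidates = [
        o for o in occupants
        if o.state == "serving"                        # sleeping models hold no VRAM
        and not o.popular                              # popular models are never evicted
        and now - o.serving_since >= o.min_runtime     # protect recently woken models
        and o.name != waker                            # never evict self
    ]
    if not candidates:
        return None        # caller fails the wake with an error
    return min(candidates, key=lambda o: o.last_accessed)
```

Returning `None` instead of raising leaves it to the caller to decide how the wake failure is reported.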
Popular model protection
Mark a model as popular to prevent it from ever being preempted:
spec:
  fairness:
    popular: true
Popular models stay on GPU permanently. Use this for your most critical, always-hot models.
If all GPUs are occupied by popular models, no other model can wake. Ensure you leave enough GPU capacity for non-popular models to be scheduled.
Preemption barrier
The preemption barrier ensures graceful request handling during model eviction:
- Barrier activates: the victim sidecar stops accepting new requests (`503` response).
- Drain in-flight: the sidecar waits up to `sleep.drainTimeout` for active requests to complete.
- Release GPU: once drained, the model releases all VRAM.
- Barrier deactivates: cleanup runs after sleep completes.
If the drain timeout expires with requests still in-flight, the sleep proceeds anyway to prevent starvation of the waiting model.
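The barrier sequence, including the timeout fallback, can be illustrated with a condition variable. This is a sketch only: `PreemptionBarrier` and its methods are hypothetical names, and the real sidecar may track in-flight requests differently.

```python
import threading
from typing import Callable


class PreemptionBarrier:
    """Hypothetical request gate for a victim sidecar."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._active = False
        self._in_flight = 0

    def try_begin_request(self) -> int:
        """Return an HTTP-style status: 200 if admitted, 503 if the barrier is up."""
        with self._cond:
            if self._active:
                return 503
            self._in_flight += 1
            return 200

    def end_request(self) -> None:
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def sleep(self, drain_timeout_s: float, release_gpu: Callable[[], None]) -> bool:
        """Run the barrier sequence; return True if all requests drained in time."""
        with self._cond:
            self._active = True                       # barrier activates
            drained = self._cond.wait_for(            # drain in-flight requests
                lambda: self._in_flight == 0, timeout=drain_timeout_s
            )
        release_gpu()                                 # proceeds even if not drained
        with self._cond:
            self._active = False                      # barrier deactivates
        return drained
```

`sleep` returns `False` when the timeout expired with requests still in flight, which a caller could log before the VRAM is reclaimed.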
GPU CRD coordination
All scheduling decisions are coordinated through GPU custom resources using optimistic concurrency (CAS updates):
kubectl get gpu -A
NAMESPACE NAME NODE INDEX PRODUCT AVAILABLE
greenthread-system gpu-node1-0 gpu-node1 0 NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition 102641958912
greenthread-system gpu-node1-1 gpu-node1 1 NVIDIA-RTX-PRO-6000-Blackwell-Server-Edition 102641958912
Each GPU CRD tracks:
- Occupants — which models are loaded (serving or sleeping) with their reserved memory
- Available memory — total memory minus sum of serving occupants' reserved memory
- Preemption intents — which models are waiting for memory on this GPU
- Wake lock — which pod is currently in the process of waking (prevents concurrent wake races)
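The available-memory rule is simple arithmetic, shown here as a minimal sketch. The dictionary keys (`state`, `reservedBytes`) and the 80 GiB total are assumptions for illustration, not the actual GPU CRD fields.

```python
from typing import Dict, List


def available_memory(total_bytes: int, occupants: List[Dict]) -> int:
    """Total memory minus the reservations of serving occupants only."""
    serving = sum(o["reservedBytes"] for o in occupants if o["state"] == "serving")
    return total_bytes - serving


def fits(needed_bytes: int, total_bytes: int, occupants: List[Dict]) -> bool:
    """True if a waking model fits alongside the current occupants."""
    return needed_bytes <= available_memory(total_bytes, occupants)


GIB = 1024 ** 3
occupants = [
    {"model": "model-a", "state": "serving", "reservedBytes": 40 * GIB},
    {"model": "model-b", "state": "sleeping", "reservedBytes": 30 * GIB},
]
# The sleeping model-b is not counted against the GPU.
print(available_memory(80 * GIB, occupants) // GIB)  # prints 40
```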
Wake locks
Before loading weights, the waking sidecar acquires a wake lock on each target GPU CRD. This prevents two models from simultaneously trying to claim the same memory. The lock is released after the wake completes (or fails).
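The acquire-all, release-always pattern can be sketched with a context manager. This example mutates plain dictionaries in memory; a real sidecar would write the lock through GPU CRD updates, and the `wakeLock` key and `wake_locks` helper are hypothetical names.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


@contextmanager
def wake_locks(gpus: List[Dict], pod: str) -> Iterator[None]:
    """Hold the wake lock on every target GPU for the duration of the wake."""
    acquired: List[Dict] = []
    try:
        for gpu in gpus:
            if gpu.get("wakeLock") is not None:
                raise RuntimeError(f"{gpu['name']} is locked by {gpu['wakeLock']}")
            gpu["wakeLock"] = pod
            acquired.append(gpu)
        yield
    finally:
        for gpu in acquired:          # released on success and on failure
            gpu["wakeLock"] = None
```

Used as `with wake_locks(target_gpus, pod_name): ...`, the `finally` clause also releases any locks already taken when a later GPU turns out to be locked by another pod.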
CAS conflicts
GPU CRD updates use optimistic concurrency: if another sidecar modified the GPU CRD between read and write, the update fails and is retried. Each conflict is tracked by the `gthread_sidecar_gpu_cas_conflicts_total` metric.
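A read-modify-write retry loop of this kind might look like the following sketch. It uses an in-memory `FakeStore` that rejects stale writes by comparing a `resourceVersion`, which mirrors how the Kubernetes API server detects conflicting updates. `FakeStore`, `Conflict`, and `cas_update` are hypothetical names, and the retry limit is an arbitrary choice.

```python
import copy
from typing import Callable, Dict


class Conflict(Exception):
    """Raised when a write is based on a stale resourceVersion."""


class FakeStore:
    """In-memory stand-in for the API server holding one GPU object."""

    def __init__(self, obj: Dict) -> None:
        self._obj = dict(copy.deepcopy(obj), resourceVersion=1)

    def get(self) -> Dict:
        return copy.deepcopy(self._obj)

    def update(self, obj: Dict) -> None:
        if obj["resourceVersion"] != self._obj["resourceVersion"]:
            raise Conflict()
        self._obj = dict(copy.deepcopy(obj),
                         resourceVersion=obj["resourceVersion"] + 1)


def cas_update(store: FakeStore, mutate: Callable[[Dict], None],
               max_retries: int = 5) -> int:
    """Read, mutate, write; retry on conflict. Returns the conflicts seen."""
    conflicts = 0
    for _ in range(max_retries):
        obj = store.get()              # fresh read on every attempt
        mutate(obj)
        try:
            store.update(obj)
            return conflicts
        except Conflict:
            conflicts += 1             # a real sidecar would bump its metric here
    raise RuntimeError("too many CAS conflicts")
```

Re-reading the object on every attempt is what makes the retry safe: the mutation is always applied on top of the latest state rather than replayed over a stale copy.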
Memory-aware scheduling
The scheduler uses measured VRAM rather than theoretical estimates:
- On first boot, the sidecar measures actual GPU memory consumption via NVML after vLLM loads the model.
- This measured value (`status.servingMemoryPerGPU`) is stored on the Model CRD.
- All subsequent scheduling decisions use this measured value for accurate bin packing.
This means the first time a model boots, it may need the full GPU. After that, the scheduler knows exactly how much memory it needs and can pack multiple models per GPU.
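The effect of measurement on packing can be shown with a small sketch. It assumes an unmeasured model is budgeted as the whole GPU (the text above says only that it may need it) and that the measured value is available as an integer byte count. Both function names are hypothetical, and the greedy order is for illustration only.

```python
from typing import List, Optional, Tuple


def required_per_gpu(measured_bytes: Optional[int], gpu_total_bytes: int) -> int:
    """Use the measured footprint when known; otherwise assume the whole GPU."""
    if measured_bytes is None:            # first boot: nothing measured yet
        return gpu_total_bytes
    return measured_bytes


def models_that_fit(requirements: List[Tuple[str, Optional[int]]],
                    gpu_total_bytes: int) -> List[str]:
    """Greedy packing in the given order; returns the names placed on one GPU."""
    placed: List[str] = []
    free = gpu_total_bytes
    for name, measured in requirements:
        need = required_per_gpu(measured, gpu_total_bytes)
        if need <= free:
            placed.append(name)
            free -= need
    return placed
```

With a 96 GiB GPU, two measured models of 30 GiB and 40 GiB pack together, while a never-booted model is placed only when the GPU is otherwise empty.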
Tuning recommendations
Low-latency models
For models that must always respond quickly:
fairness:
  popular: true # Never evicted
staging:
  pinned: true # Fast wake if it somehow sleeps
Background/batch models
For models used occasionally where latency is less critical:
fairness:
  minRuntime: 30s # Serve for at least 30s before eviction
  maxWaitTime: 10s # Wait longer before forcing preemption
sleep:
  idleTimeout: 2m # Sleep quickly when idle
staging:
  pinned: false # Don't hold CPU RAM
Shared GPU environments
When packing multiple models per GPU:
spec:
  gpuMemoryUtilization: "0.50" # Leave room for other models
  fairness:
    minRuntime: 60s # Give each model decent serving time
    maxWaitTime: 15s # Be patient before preempting