Models are deployed as Model custom resources in the greenthread-system namespace. The Model CRD configures how a model is served, including GPU memory allocation, tensor parallelism, sleep behavior, fairness policy, and LoRA adapter support.
Full spec
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  dtype: bfloat16
  gpuMemoryUtilization: "0.90"
  tensorParallelSize: 1
  replicas: 1
  quantization: ""
  extraArgs:
    - "--enforce-eager"
  sleep:
    idleTimeout: 5m
    drainTimeout: 30s
  fairness:
    minRuntime: 10s
    maxWaitTime: 5s
    popular: false
  staging:
    pinned: false
  lora:
    enabled: false
    maxRank: 64
    cacheDir: ""
    downloadTimeout: 5m
  kvCache:
    enabled: false
    maxSize: ""
  resources: {}
Core fields
modelName
modelName: meta-llama/Llama-3.1-8B-Instruct
Required. The Hugging Face model identifier. GreenThread downloads the model, converts it to the gthread load format, and stores it on NVMe.
dtype
dtype: bfloat16
Data type for model weights. Affects VRAM usage and inference precision.
| Value | Description |
|---|---|
| bfloat16 | Default. Good balance of precision and memory usage. |
| float16 | Standard half precision. |
| float32 | Full precision. Double the VRAM of bf16/fp16. |
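As a rough guide to how dtype affects weight memory, the sketch below multiplies a parameter count by the bytes each value occupies (2 for bfloat16 and float16, 4 for float32). The helper name and the 8-billion-parameter figure are illustrative assumptions, not part of GreenThread, and the estimate covers weights only: KV cache, activations, and runtime overhead come on top.

```python
# Bytes per stored weight for each dtype value listed in the table above.
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}


def weight_bytes(num_params: int, dtype: str = "bfloat16") -> int:
    """Estimate memory used by the weights alone (hypothetical helper)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    return num_params * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    params = 8_000_000_000  # illustrative parameter count for an 8B model
    for dtype in BYTES_PER_PARAM:
        gib = weight_bytes(params, dtype) / 2**30
        print(f"{dtype}: {gib:.1f} GiB of weights")
```

The measured figure reported in status.servingMemoryPerGPU is the authoritative number once a model has booted; this arithmetic is only useful for a first sizing pass.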
gpuMemoryUtilization
gpuMemoryUtilization: "0.90"
Fraction of GPU memory vLLM is allowed to use. Range: (0, 1]. Stored as a string to preserve decimal precision.
| Value | Use case |
|---|---|
| "0.50" – "0.70" | Multi-tenant: multiple models per GPU |
| "0.85" – "0.95" | Single model: maximize KV cache and throughput |
Lower values leave room for other models to share the same GPU via sleep/wake scheduling.
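Because the value is a string, it can be worth checking it before applying a manifest. The following is a minimal sketch of such a check, assuming only the documented range (0, 1]; the function name is hypothetical and this is not the validation GreenThread itself performs.

```python
from decimal import Decimal, InvalidOperation


def parse_gpu_memory_utilization(value: str) -> Decimal:
    """Parse the string-encoded fraction and check that it lies in (0, 1]."""
    try:
        fraction = Decimal(value)
    except InvalidOperation as exc:
        raise ValueError(f"not a decimal number: {value!r}") from exc
    # Reject NaN/Infinity first, then enforce the half-open range (0, 1].
    if not fraction.is_finite() or not (Decimal(0) < fraction <= Decimal(1)):
        raise ValueError(f"gpuMemoryUtilization must be in (0, 1], got {value!r}")
    return fraction


if __name__ == "__main__":
    for candidate in ("0.90", "1", "0", "1.5", "ninety"):
        try:
            print(candidate, "->", parse_gpu_memory_utilization(candidate))
        except ValueError as err:
            print(candidate, "-> rejected:", err)
```

Decimal is used rather than float so that a value such as "0.90" is compared exactly as written.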
tensorParallelSize
tensorParallelSize: 2
Number of GPUs to shard the model across using tensor parallelism. Must be a power of 2.
| Size | Use case |
|---|---|
| 1 | Models up to ~80 GB VRAM (e.g. 8B, 20B models) |
| 2 | Models 70B+ in bf16 |
| 4 | Very large models (100B+) |
| 8 | Maximum sharding for the largest models |
Each replica uses tensorParallelSize GPUs. The scheduler assigns contiguous GPU indices on the same node.
replicas
replicas: 1
Number of serving pods to create. Each replica independently claims GPUs and handles requests.
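Since each replica claims tensorParallelSize GPUs, the total GPU demand of a Model is the product of the two fields. A minimal sketch of that arithmetic, together with the power-of-2 check on tensorParallelSize, might look like this (the function names are hypothetical, and the replicas check is an assumption of this sketch rather than a documented rule):

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ...: a positive integer with exactly one bit set."""
    return n > 0 and (n & (n - 1)) == 0


def gpus_required(tensor_parallel_size: int, replicas: int = 1) -> int:
    """Total GPUs claimed when every replica shards across tensor_parallel_size GPUs."""
    if not is_power_of_two(tensor_parallel_size):
        raise ValueError(
            f"tensorParallelSize must be a power of 2, got {tensor_parallel_size}"
        )
    if replicas < 0:  # assumption: a negative replica count is never meaningful
        raise ValueError(f"replicas must not be negative, got {replicas}")
    return tensor_parallel_size * replicas


if __name__ == "__main__":
    print(gpus_required(tensor_parallel_size=2, replicas=3))
```

Note that the GPUs of one replica must sit on the same node, so the total alone does not guarantee that the cluster can place every replica.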
quantization
quantization: mxfp4
Quantization method applied during model conversion. Must be compatible with the model weights.
| Method | Description |
|---|---|
| mxfp4 | Microscaling FP4 quantization |
| awq | Activation-aware weight quantization |
| gptq | Post-training quantization |
| (empty) | No quantization; weights keep the precision set by dtype |
extraArgs
extraArgs:
  - "--enforce-eager"
  - "--max-model-len"
  - "8192"
  - "--max-num-seqs"
  - "128"
Raw CLI arguments passed directly to the vLLM inference engine. Each argument is a separate list item.
Extra args bypass validation. Invalid arguments will cause the vLLM process to fail on startup. Check vLLM documentation for supported flags.
Common extraArgs:
| Argument | Description |
|---|---|
| --enforce-eager | Disable CUDA graphs. Recommended for sleep/wake compatibility. |
| --trust-remote-code | Allow custom model code from Hugging Face. Required for some architectures. |
| --max-model-len <n> | Override the maximum sequence length. Reduces VRAM usage. |
| --max-num-seqs <n> | Maximum concurrent sequences. Lower values reduce VRAM. |
| --reasoning-parser <name> | Enable reasoning/thinking token parsing (e.g. qwen3, deepseek). |
| --enable-auto-tool-choice | Enable automatic tool/function calling. |
| --tool-call-parser <name> | Tool call parser (e.g. qwen3_coder, hermes). |
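The rule that each argument is a separate list item means a flag and its value occupy two entries. When manifests are generated from code, a small helper can do that flattening. This is an illustrative sketch with a hypothetical function name; it does not check whether vLLM accepts the flags.

```python
def to_extra_args(flags: dict[str, object]) -> list[str]:
    """Flatten {flag: value} into the one-token-per-item list used by extraArgs."""
    args: list[str] = []
    for flag, value in flags.items():
        if value is None or value is False:
            continue  # switch turned off: emit nothing
        args.append(flag)
        if value is not True:
            args.append(str(value))  # the value becomes its own list item
    return args


if __name__ == "__main__":
    extra_args = to_extra_args(
        {
            "--enforce-eager": True,
            "--max-model-len": 8192,
            "--max-num-seqs": 128,
        }
    )
    for item in extra_args:
        print(f'- "{item}"')
```

Emitting every item as a quoted string keeps numeric values such as 8192 from being read as YAML integers.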
Sleep spec
Controls when and how models go to sleep. See Lifecycle & States for details on sleep/wake transitions.
sleep:
  idleTimeout: 5m
  drainTimeout: 30s
| Field | Type | Default | Description |
|---|---|---|---|
| idleTimeout | duration | 5m | Duration with no requests before the model voluntarily sleeps |
| drainTimeout | duration | 30s | Maximum time to wait for in-flight requests to complete during sleep |
When a model sleeps, GreenThread releases all VRAM, making the GPU available for other models.
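To make the idleTimeout semantics concrete, the sketch below parses duration strings such as 5m or 30s and compares them with the time since the last request. It assumes Go-style duration syntax limited to the h, m, s, and ms units, and it treats reaching the timeout exactly as idle; both are assumptions of this example, and the function names are hypothetical.

```python
import re
from datetime import timedelta

_UNIT_SECONDS = {"h": 3600.0, "m": 60.0, "s": 1.0, "ms": 0.001}
# "ms" is listed before "m" so that "5ms" is not read as 5 minutes plus a stray "s".
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ms|h|m|s)")


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "5m", "30s" or "1m30s" into a timedelta."""
    pos, total = 0, 0.0
    for match in _TOKEN.finditer(text):
        if match.start() != pos:
            break  # unparsed characters between tokens
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0 or pos != len(text):
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=total)


def should_sleep(idle_for: timedelta, idle_timeout: str) -> bool:
    """True once the model has gone idle_timeout without serving a request."""
    return idle_for >= parse_duration(idle_timeout)


if __name__ == "__main__":
    print(should_sleep(timedelta(minutes=6), "5m"))
    print(should_sleep(timedelta(seconds=45), "5m"))
```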
Fairness spec
Controls preemption behavior when multiple models compete for GPU memory. See GPU Scheduling for the full algorithm.
fairness:
  minRuntime: 10s
  maxWaitTime: 5s
  popular: false
| Field | Type | Default | Description |
|---|---|---|---|
| minRuntime | duration | 10s | Minimum time a model must serve before it can be preempted |
| maxWaitTime | duration | 5s | How long a waiting model waits before it can preempt others |
| popular | bool | false | If true, the model is never preempted and stays on the GPU permanently |
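One way to read the three fairness fields together is as a single eligibility test: a waiting model may displace a running one only if the running model is not popular, has served for at least its minRuntime, and the waiter has waited at least its own maxWaitTime. The sketch below encodes that reading. It is an interpretation of the field descriptions, not the scheduler's actual algorithm (see GPU Scheduling), and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningModel:
    name: str
    served_seconds: float       # time since this model started serving
    min_runtime_seconds: float  # fairness.minRuntime of the running model
    popular: bool = False       # fairness.popular of the running model


def may_preempt(
    running: RunningModel, waited_seconds: float, max_wait_seconds: float
) -> bool:
    """Illustrative check; max_wait_seconds is the waiting model's maxWaitTime."""
    if running.popular:
        return False  # popular models are never preempted
    if running.served_seconds < running.min_runtime_seconds:
        return False  # the running model has not had its minimum runtime yet
    return waited_seconds >= max_wait_seconds


if __name__ == "__main__":
    running = RunningModel("llama-3-1-8b", served_seconds=12, min_runtime_seconds=10)
    print(may_preempt(running, waited_seconds=5, max_wait_seconds=5))
```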
Staging spec
Controls how model weights are staged in CPU memory for fast wake. See Storage & Pinning for details.
staging:
  pinned: false
| Field | Type | Default | Description |
|---|---|---|---|
| pinned | bool | false | If true, model weights stay in pinned CPU RAM after they are loaded onto the GPU, giving wake times of about 600 ms. If false, weights are evicted to disk after GPU load, and wake takes about 2 s with GDS and longer without it. |
LoRA spec
Configures LoRA adapter support. See LoRA Adapters for the integration pattern.
lora:
  enabled: true
  maxRank: 64
  cacheDir: /tmp/lora-cache
  downloadTimeout: 5m
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable LoRA adapter loading |
| maxRank | int | 64 | Maximum LoRA rank supported (must be >= rank of any adapter you serve) |
| cacheDir | string | "" | Local path for downloaded LoRA adapter files |
| downloadTimeout | duration | 5m | Maximum time to wait for adapter download from object storage |
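The maxRank constraint is easy to check ahead of deployment if the ranks of the adapters are known. A minimal sketch, with a hypothetical function name and made-up adapter names and ranks:

```python
def adapters_exceeding_max_rank(max_rank: int, adapter_ranks: dict[str, int]) -> list[str]:
    """Return the names of adapters whose rank is larger than lora.maxRank."""
    return sorted(name for name, rank in adapter_ranks.items() if rank > max_rank)


if __name__ == "__main__":
    # Hypothetical adapters; in practice the rank comes from each adapter's config.
    ranks = {"support-bot": 16, "sql-gen": 128, "summarizer": 64}
    too_large = adapters_exceeding_max_rank(64, ranks)
    if too_large:
        print("raise lora.maxRank or drop:", ", ".join(too_large))
```

An adapter whose rank equals maxRank passes, matching the ">=" wording in the table.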
KV cache spec
Configures KV cache persistence via LMCache. When enabled, vLLM stores computed KV cache entries in Redis so that repeated prompt prefixes skip recomputation.
kvCache:
  enabled: true
  maxSize: "10Gi"
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | bool | false | Enable KV cache persistence |
| maxSize | string | "" | Maximum KV cache size per replica (e.g. "10Gi") |
KV cache persistence requires lmcache.enabled=true in the Helm chart values to deploy Redis, and uses the LMCacheConnectorV1 vLLM connector. See KV Cache Persistence for the full setup guide, configuration details, and troubleshooting.
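For capacity planning it can help to turn a maxSize string into bytes. The sketch below assumes that maxSize follows Kubernetes-style binary suffixes (Ki, Mi, Gi, Ti) with an integer amount, which the "10Gi" example suggests but this page does not state; the function name is hypothetical.

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_size(text: str) -> int:
    """Convert strings such as "10Gi" or "512Mi" into a byte count."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            # int() raises ValueError for a missing or non-integer amount.
            return int(text[: -len(suffix)]) * factor
    return int(text)  # no suffix: interpret as plain bytes


if __name__ == "__main__":
    per_replica = parse_size("10Gi")
    replicas = 2  # illustrative replica count
    print(f"per replica: {per_replica} bytes, total: {per_replica * replicas} bytes")
```

Because the limit is per replica, the total cache footprint scales with spec.replicas.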
Pod placement
Node selector
nodeSelector:
  nvidia.com/gpu.present: "true"
  topology.kubernetes.io/zone: "us-west-2a"
Standard Kubernetes node selector to constrain which nodes the model pod can run on.
Tolerations
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
Standard Kubernetes tolerations for GPU node taints.
Service account
serviceAccountName: my-model-sa
Kubernetes ServiceAccount for the model pod. Used for IRSA/workload identity when the pod needs to download LoRA adapters from object storage (e.g. S3).
Image overrides
sidecarImage: registry.example.com/greenthread-sidecar:v1.2.0
vllmImage: registry.example.com/vllm:v0.8.0
Override the default sidecar or vLLM container images. If empty, the controller uses its built-in defaults.
Resource overrides
resources:
  requests:
    cpu: "4"
    memory: "32Gi"
  limits:
    cpu: "8"
    memory: "64Gi"
Standard Kubernetes resource requirements for the vLLM container.
Status fields
The controller and sidecar populate these status fields. They are read-only.
kubectl get model llama-3-1-8b -n greenthread-system -o yaml
| Field | Description |
|---|---|
| status.phase | Current lifecycle phase: Pending, Converting, Staging, Ready, Failed |
| status.url | Full inference endpoint URL (e.g. http://<gateway>/llama-3-1-8b/v1) |
| status.servingMemoryPerGPU | Measured VRAM per GPU in bytes (set by sidecar on first boot) |
| status.memory.servingHuman | Human-readable serving VRAM (e.g. "88.4 GiB") |
| status.replicas | Total pod replicas |
| status.readyReplicas | Pods that have completed boot |
| status.servingReplicas | Pods actively serving requests |
| status.sleepingReplicas | Pods in sleep state |
| status.node | Node where the model pod is running |
| status.gpus | Comma-separated GPU indices (e.g. "0,1") |
| status.conversion.status | Conversion state: Pending, Running, Complete, Failed |
| status.conversion.totalSizeBytes | Total size of converted model |
| status.distribution.availableOnNodes | Nodes where the model is available |
| status.distribution.stagingTier | Per-node staging tier (system or disk) |
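Scripts that consume these fields usually read them from the JSON form of the resource (for example the output of kubectl get model with -o json). The sketch below summarizes a few of them from an already-parsed object. The sample values are made up, the function name is hypothetical, and the exact JSON types of each field are assumed from the descriptions above.

```python
import json

# Made-up sample shaped like the status fields listed above.
SAMPLE = """
{
  "status": {
    "phase": "Ready",
    "replicas": 1,
    "readyReplicas": 1,
    "servingReplicas": 1,
    "sleepingReplicas": 0,
    "node": "gpu-node1",
    "gpus": "0,1"
  }
}
"""


def summarize_status(model: dict) -> dict:
    """Pick out phase, GPU indices, and replica counts from a Model object."""
    status = model.get("status", {})
    # status.gpus is a comma-separated string such as "0,1".
    gpus = [int(i) for i in status.get("gpus", "").split(",") if i.strip()]
    replicas = status.get("replicas", 0)
    return {
        "phase": status.get("phase"),
        "gpus": gpus,
        "serving": status.get("servingReplicas", 0),
        "sleeping": status.get("sleepingReplicas", 0),
        "all_ready": replicas > 0 and status.get("readyReplicas", 0) == replicas,
    }


if __name__ == "__main__":
    print(summarize_status(json.loads(SAMPLE)))
```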
kubectl output
kubectl get model -A
NAMESPACE            NAME               PHASE   REPLICAS   SERVING   SLEEPING   NODE        GPUS   VRAM/GPU   URL
greenthread-system   llama-3-1-8b       Ready   1          0         1          gpu-node1   0      17.2 GiB   http://<gateway>/llama-3-1-8b/v1
greenthread-system   qwen-3-5-35b-a3b   Ready   1          1         0          gpu-node1   1      88.2 GiB   http://<gateway>/qwen-3-5-35b-a3b/v1
