
Models are deployed as Model custom resources in the greenthread-system namespace. The Model CRD controls every aspect of how your model is served — from GPU memory allocation and tensor parallelism to sleep behavior, fairness policy, and LoRA adapter support.

Full spec

apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  dtype: bfloat16
  gpuMemoryUtilization: "0.90"
  tensorParallelSize: 1
  replicas: 1
  quantization: ""
  extraArgs:
    - "--enforce-eager"
  sleep:
    idleTimeout: 5m
    drainTimeout: 30s
  fairness:
    minRuntime: 10s
    maxWaitTime: 5s
    popular: false
  staging:
    pinned: false
  lora:
    enabled: false
    maxRank: 64
    cacheDir: ""
    downloadTimeout: 5m
  kvCache:
    enabled: false
    maxSize: ""
  resources: {}

Core fields

modelName

modelName: meta-llama/Llama-3.1-8B-Instruct

The HuggingFace model identifier. GreenThread downloads the model, converts it to the gthread load format, and stores it on NVMe. This field is required.

dtype

dtype: bfloat16

Data type for model weights. Affects VRAM usage and inference precision.

| Value | Description |
| --- | --- |
| bfloat16 | Default. Good balance of precision and memory usage. |
| float16 | Standard half precision. |
| float32 | Full precision. Double the VRAM of bf16/fp16. |
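
As a rough illustration of how dtype drives VRAM, the sketch below estimates the memory taken by weights alone from a parameter count. The helper name and the 8-billion parameter figure are assumptions made for the example, not part of GreenThread; actual usage is higher because of KV cache and activations.

```python
# Bytes per parameter for each dtype value accepted by the dtype field.
BYTES_PER_PARAM = {"bfloat16": 2, "float16": 2, "float32": 4}


def estimate_weight_bytes(num_params: int, dtype: str = "bfloat16") -> int:
    """Estimate VRAM for weights only (hypothetical helper, not a GreenThread API)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unsupported dtype: {dtype!r}")
    return num_params * BYTES_PER_PARAM[dtype]


if __name__ == "__main__":
    params = 8_000_000_000  # assumed parameter count for an "8B" model
    for dtype in BYTES_PER_PARAM:
        gib = estimate_weight_bytes(params, dtype) / 2**30
        print(f"{dtype}: {gib:.1f} GiB of weights")
```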

gpuMemoryUtilization

gpuMemoryUtilization: "0.90"

Fraction of GPU memory vLLM is allowed to use. Range: (0, 1]. Stored as a string to preserve decimal precision.

| Value | Use case |
| --- | --- |
| "0.50" to "0.70" | Multi-tenant: multiple models per GPU |
| "0.85" to "0.95" | Single model: maximize KV cache and throughput |

Lower values leave room for other models to share the same GPU via sleep/wake scheduling.
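
To make the range rule concrete, here is a minimal sketch that parses the string value, enforces the documented range (0, 1], and converts it into a byte budget. The function names and the 80 GiB GPU size are illustrative assumptions; the controller's own validation may differ.

```python
from decimal import Decimal, InvalidOperation


def parse_gpu_memory_utilization(value: str) -> Decimal:
    """Parse the string field and enforce the documented range (0, 1]."""
    try:
        fraction = Decimal(value)
    except InvalidOperation as exc:
        raise ValueError(f"not a decimal number: {value!r}") from exc
    # Reject NaN/Infinity first, then check the half-open range.
    if not fraction.is_finite() or not (0 < fraction <= 1):
        raise ValueError(f"out of range (0, 1]: {value!r}")
    return fraction


def vram_budget_bytes(value: str, gpu_total_bytes: int) -> int:
    """Bytes of GPU memory the engine may use (hypothetical helper)."""
    return int(parse_gpu_memory_utilization(value) * gpu_total_bytes)


if __name__ == "__main__":
    gpu_bytes = 80 * 2**30  # assumed 80 GiB GPU
    for setting in ("0.50", "0.90"):
        print(setting, vram_budget_bytes(setting, gpu_bytes) / 2**30, "GiB")
```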

tensorParallelSize

tensorParallelSize: 2

Number of GPUs to shard the model across using tensor parallelism. Must be a power of 2.

| Size | Use case |
| --- | --- |
| 1 | Models up to ~80 GB VRAM (e.g. 8B, 20B models) |
| 2 | Models 70B+ in bf16 |
| 4 | Very large models (100B+) |
| 8 | Maximum sharding for the largest models |

Each replica uses tensorParallelSize GPUs. The scheduler assigns contiguous GPU indices on the same node.

replicas

replicas: 1

Number of serving pods to create. Each replica independently claims GPUs and handles requests.
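
Because each replica claims tensorParallelSize GPUs, the total GPU demand of a Model is the product of the two fields. The following sketch computes it and checks the power-of-2 rule; the function names are hypothetical and only illustrate the arithmetic.

```python
def is_power_of_two(n: int) -> bool:
    """True for 1, 2, 4, 8, ... (bit trick: a power of two has a single set bit)."""
    return n > 0 and (n & (n - 1)) == 0


def total_gpus(replicas: int, tensor_parallel_size: int) -> int:
    """GPUs claimed by a Model: replicas x tensorParallelSize."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if not is_power_of_two(tensor_parallel_size):
        raise ValueError("tensorParallelSize must be a power of 2")
    return replicas * tensor_parallel_size


if __name__ == "__main__":
    # Two replicas, each sharded across 4 GPUs.
    print(total_gpus(replicas=2, tensor_parallel_size=4))
```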

quantization

quantization: mxfp4

Quantization method applied during model conversion. Must be compatible with the model weights.

| Method | Description |
| --- | --- |
| mxfp4 | Microscaling FP4 quantization |
| awq | Activation-aware weight quantization |
| gptq | Post-training quantization |
| (empty) | No quantization. Weights are unquantized and use the configured dtype. |

extraArgs

extraArgs:
  - "--enforce-eager"
  - "--max-model-len"
  - "8192"
  - "--max-num-seqs"
  - "128"

Raw CLI arguments passed directly to the vLLM inference engine. Each argument is a separate list item.

Use with care

Extra args bypass validation. Invalid arguments cause the vLLM process to fail on startup. Check the vLLM documentation for supported flags.

Common extraArgs:

| Argument | Description |
| --- | --- |
| --enforce-eager | Disable CUDA graphs. Recommended for sleep/wake compatibility. |
| --trust-remote-code | Allow custom model code from HuggingFace. Required for some architectures. |
| --max-model-len <n> | Override the maximum sequence length. Reduces VRAM usage. |
| --max-num-seqs <n> | Maximum concurrent sequences. Lower values reduce VRAM. |
| --reasoning-parser <name> | Enable reasoning/thinking token parsing (e.g. qwen3, deepseek). |
| --enable-auto-tool-choice | Enable automatic tool/function calling. |
| --tool-call-parser <name> | Tool call parser (e.g. qwen3_coder, hermes). |
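
Since each flag and each value must be its own list item, it can help to generate the extraArgs list rather than write it by hand. The sketch below is one way to do that; build_extra_args is a hypothetical helper, and it performs no check that a flag is actually supported by vLLM.

```python
def build_extra_args(flags: dict[str, object]) -> list[str]:
    """Flatten {flag: value} into the one-token-per-item list extraArgs expects.

    True  -> bare switch (flag only)
    False/None -> flag omitted
    other -> flag followed by its value as a separate string item
    """
    args: list[str] = []
    for name, value in flags.items():
        if value is None or value is False:
            continue
        args.append(name)
        if value is not True:
            args.append(str(value))
    return args


if __name__ == "__main__":
    for item in build_extra_args(
        {"--enforce-eager": True, "--max-model-len": 8192, "--max-num-seqs": 128}
    ):
        print(f'  - "{item}"')
```

Printed this way, the items can be pasted under extraArgs in the Model spec.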

Sleep spec

Controls when and how models go to sleep. See Lifecycle & States for details on sleep/wake transitions.

sleep:
  idleTimeout: 5m
  drainTimeout: 30s

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| idleTimeout | duration | 5m | Duration with no requests before the model voluntarily sleeps |
| drainTimeout | duration | 30s | Maximum time to wait for in-flight requests to complete during sleep |

When a model sleeps, GreenThread releases all VRAM, making the GPU available for other models.
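
The idle rule can be sketched as a simple comparison against idleTimeout. This example assumes durations use a single integer with an s, m, or h suffix, as in the values shown on this page; the accepted duration grammar may be broader, and both function names are hypothetical.

```python
import re

# Seconds per unit for the suffixes that appear in this page's examples.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Convert a value like "5m" or "30s" to seconds (assumed single-unit format)."""
    match = re.fullmatch(r"(\d+)([smh])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def should_sleep(seconds_since_last_request: float, idle_timeout: str = "5m") -> bool:
    """True once the model has been idle for at least idleTimeout."""
    return seconds_since_last_request >= parse_duration(idle_timeout)


if __name__ == "__main__":
    print(should_sleep(120))  # idle for 2 minutes with the default 5m timeout
    print(should_sleep(301))  # idle for just over 5 minutes
```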

Fairness spec

Controls preemption behavior when multiple models compete for GPU memory. See GPU Scheduling for the full algorithm.

fairness:
  minRuntime: 10s
  maxWaitTime: 5s
  popular: false

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| minRuntime | duration | 10s | Minimum time a model must serve before it can be preempted |
| maxWaitTime | duration | 5s | How long a waiting model waits before it can preempt others |
| popular | bool | false | If true, the model is never preempted and stays on GPU permanently |
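
One possible reading of how these three fields interact is sketched below. It is not the scheduler's actual algorithm, which is described in GPU Scheduling; the class, the function names, and the assumption that minRuntime and popular come from the running model while maxWaitTime comes from the waiting model are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class Fairness:
    """Mirror of the fairness spec, with durations already converted to seconds."""
    min_runtime_s: float = 10.0   # minRuntime
    max_wait_time_s: float = 5.0  # maxWaitTime
    popular: bool = False         # popular


def can_be_preempted(running: Fairness, served_for_s: float) -> bool:
    """A popular model is never preempted; others only after minRuntime."""
    if running.popular:
        return False
    return served_for_s >= running.min_runtime_s


def may_preempt(waiting: Fairness, waited_for_s: float) -> bool:
    """A waiting model may preempt once it has waited maxWaitTime."""
    return waited_for_s >= waiting.max_wait_time_s


def preemption_allowed(running: Fairness, served_for_s: float,
                       waiting: Fairness, waited_for_s: float) -> bool:
    """Both conditions must hold under this simplified reading."""
    return can_be_preempted(running, served_for_s) and may_preempt(waiting, waited_for_s)


if __name__ == "__main__":
    defaults = Fairness()
    print(preemption_allowed(defaults, 12.0, defaults, 6.0))
    print(preemption_allowed(Fairness(popular=True), 600.0, defaults, 60.0))
```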

Staging spec

Controls how model weights are staged in CPU memory for fast wake. See Storage & Pinning for details.

staging:
  pinned: false

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| pinned | bool | false | If true, model weights stay in pinned CPU RAM after wake, giving ~600ms wake times. If false, weights are evicted to disk after the GPU load, and waking takes ~2s with GDS (GPUDirect Storage) or longer without. |

LoRA spec

Configures LoRA adapter support. See LoRA Adapters for the integration pattern.

lora:
  enabled: true
  maxRank: 64
  cacheDir: /tmp/lora-cache
  downloadTimeout: 5m

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable LoRA adapter loading |
| maxRank | int | 64 | Maximum LoRA rank supported (must be >= rank of any adapter you serve) |
| cacheDir | string | "" | Local path for downloaded LoRA adapter files |
| downloadTimeout | duration | 5m | Maximum time to wait for adapter download from object storage |
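
The maxRank constraint is easy to check before deploying. The sketch below flags adapters whose rank exceeds the configured maximum; the adapter names and ranks are made-up inputs, and the helper is hypothetical rather than part of GreenThread.

```python
def adapters_exceeding_max_rank(adapter_ranks: dict[str, int], max_rank: int = 64) -> list[str]:
    """Return the names of adapters whose rank is larger than maxRank."""
    return sorted(name for name, rank in adapter_ranks.items() if rank > max_rank)


if __name__ == "__main__":
    # Hypothetical adapters and their LoRA ranks.
    ranks = {"support-bot": 16, "legal-summaries": 64, "code-review": 128}
    too_large = adapters_exceeding_max_rank(ranks, max_rank=64)
    if too_large:
        print("raise lora.maxRank or retrain:", ", ".join(too_large))
```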

KV cache spec

Configures KV cache persistence via LMCache. When enabled, vLLM stores computed KV cache entries in Redis so that repeated prompt prefixes skip recomputation.

kvCache:
  enabled: true
  maxSize: "10Gi"

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | bool | false | Enable KV cache persistence |
| maxSize | string | "" | Maximum KV cache size per replica (e.g. "10Gi") |

LMCache setup required

KV cache persistence requires lmcache.enabled=true in the Helm chart values to deploy Redis, and uses the LMCacheConnectorV1 vLLM connector. See KV Cache Persistence for the full setup guide, configuration details, and troubleshooting.

Pod placement

Node selector

nodeSelector:
  nvidia.com/gpu.present: "true"
  topology.kubernetes.io/zone: "us-west-2a"

Standard Kubernetes node selector to constrain which nodes the model pod can run on.

Tolerations

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Standard Kubernetes tolerations for GPU node taints.

Service account

serviceAccountName: my-model-sa

Kubernetes ServiceAccount for the model pod. Used for IRSA/workload identity when LoRA adapters need to download from object storage (e.g. S3).

Image overrides

sidecarImage: registry.example.com/greenthread-sidecar:v1.2.0
vllmImage: registry.example.com/vllm:v0.8.0

Override the default sidecar or vLLM container images. If empty, the controller uses its built-in defaults.

Resource overrides

resources:
  requests:
    cpu: "4"
    memory: "32Gi"
  limits:
    cpu: "8"
    memory: "64Gi"

Standard Kubernetes resource requirements for the vLLM container.

Status fields

The controller and sidecar populate these status fields. They are read-only.

kubectl get model llama-3-1-8b -n greenthread-system -o yaml

| Field | Description |
| --- | --- |
| status.phase | Current lifecycle phase: Pending, Converting, Staging, Ready, Failed |
| status.url | Full inference endpoint URL (e.g. http://<gateway>/llama-3-1-8b/v1) |
| status.servingMemoryPerGPU | Measured VRAM per GPU in bytes (set by sidecar on first boot) |
| status.memory.servingHuman | Human-readable serving VRAM (e.g. "88.4 GiB") |
| status.replicas | Total pod replicas |
| status.readyReplicas | Pods that have completed boot |
| status.servingReplicas | Pods actively serving requests |
| status.sleepingReplicas | Pods in sleep state |
| status.node | Node where the model pod is running |
| status.gpus | Comma-separated GPU indices (e.g. "0,1") |
| status.conversion.status | Conversion state: Pending, Running, Complete, Failed |
| status.conversion.totalSizeBytes | Total size of converted model |
| status.distribution.availableOnNodes | Nodes where the model is available |
| status.distribution.stagingTier | Per-node staging tier (system or disk) |
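
For scripting against these fields, the Model object can be fetched with -o json instead of -o yaml and parsed with the standard library. The sketch below works on an already-parsed dictionary; the sample object is hand-written to match the field names above and is not real cluster output, and the function names are hypothetical.

```python
def summarize_model(model: dict) -> str:
    """Build a one-line summary from a Model object parsed from JSON."""
    name = model.get("metadata", {}).get("name", "<unknown>")
    status = model.get("status", {})
    return (
        f"{name}: phase={status.get('phase', 'Unknown')} "
        f"replicas={status.get('replicas', 0)} "
        f"serving={status.get('servingReplicas', 0)} "
        f"sleeping={status.get('sleepingReplicas', 0)}"
    )


def is_ready(model: dict) -> bool:
    """True when status.phase reports Ready."""
    return model.get("status", {}).get("phase") == "Ready"


# Hand-written sample using the documented status field names.
SAMPLE = {
    "metadata": {"name": "llama-3-1-8b"},
    "status": {
        "phase": "Ready",
        "replicas": 1,
        "servingReplicas": 0,
        "sleepingReplicas": 1,
    },
}

if __name__ == "__main__":
    print(summarize_model(SAMPLE))
```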

kubectl output

kubectl get model -A
NAMESPACE            NAME                       PHASE   REPLICAS   SERVING   SLEEPING   NODE        GPUS   VRAM/GPU     URL
greenthread-system   llama-3-1-8b               Ready   1          0         1          gpu-node1   0      17.2 GiB     http://<gateway>/llama-3-1-8b/v1
greenthread-system   qwen-3-5-35b-a3b           Ready   1          1         0          gpu-node1   1      88.2 GiB     http://<gateway>/qwen-3-5-35b-a3b/v1