GreenThread Docs

Models are deployed as Model custom resources in the greenthread-system namespace. The Model CRD controls every aspect of how your model is served — from GPU memory allocation and tensor parallelism to sleep behavior, fairness policy, and LoRA adapter support.

Full spec

apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  dtype: bfloat16
  gpuMemoryUtilization: "0.90"
  tensorParallelSize: 1
  replicas: 1
  quantization: ""
  extraArgs:
    - "--enforce-eager"
  sleep:
    idleTimeout: 5m
    drainTimeout: 30s
  fairness:
    minRuntime: 10s
    maxWaitTime: 5s
    popular: false
  staging:
    pinned: false
  lora:
    enabled: false
    maxRank: 64
    cacheDir: ""
    downloadTimeout: 5m
  kvCache:
    enabled: false
    maxSize: ""
  resources: {}
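
If you generate manifests from code rather than writing YAML by hand, the spec above maps directly onto a nested dictionary. The following is a minimal sketch, not an official client: the helper name model_manifest is hypothetical, and it assumes that any field you omit falls back to the defaults described in this page.

```python
import json


def model_manifest(name, model_name, **spec_overrides):
    """Build a Model manifest as a plain dict (hypothetical helper)."""
    # modelName is the only field this page marks as required.
    spec = {"modelName": model_name}
    spec.update(spec_overrides)
    return {
        "apiVersion": "greenthread.ai/v1alpha1",
        "kind": "Model",
        "metadata": {"name": name, "namespace": "greenthread-system"},
        "spec": spec,
    }


manifest = model_manifest(
    "llama-3-1-8b",
    "meta-llama/Llama-3.1-8B-Instruct",
    dtype="bfloat16",
    gpuMemoryUtilization="0.90",  # kept as a string, as in the spec above
    sleep={"idleTimeout": "5m", "drainTimeout": "30s"},
)
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed output can be saved to a file and applied with kubectl apply -f.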

Core fields

`modelName`

modelName: meta-llama/Llama-3.1-8B-Instruct

The Hugging Face model identifier. GreenThread downloads the model, converts it to the gthread load format, and stores it on NVMe. This field is required.

`dtype`

dtype: bfloat16

Data type for model weights. Affects VRAM usage and inference precision.

Value	Description
`bfloat16`	Default. Good balance of precision and memory usage.
`float16`	Standard half precision.
`float32`	Full precision. Double the VRAM of bf16/fp16.
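
To see why float32 doubles VRAM use, a rough estimate of weight memory is the parameter count multiplied by bytes per element. The sketch below covers weights only, so it ignores KV cache, activations, and runtime overhead. The 8e9 parameter count is a nominal figure chosen for illustration, not a measured value.

```python
# Bytes used to store one weight in each supported dtype.
BYTES_PER_ELEMENT = {"bfloat16": 2, "float16": 2, "float32": 4}


def weight_gib(num_params, dtype):
    """Approximate size of the model weights alone, in GiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


for dtype in BYTES_PER_ELEMENT:
    print(f"{dtype}: {weight_gib(8e9, dtype):.1f} GiB")
```

For a nominal 8B-parameter model this gives about 14.9 GiB in bfloat16 or float16 and about 29.8 GiB in float32.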

`gpuMemoryUtilization`

gpuMemoryUtilization: "0.90"

Fraction of GPU memory vLLM is allowed to use. Range: (0, 1]. Stored as a string to preserve decimal precision.

Value	Use case
`"0.50"` – `"0.70"`	Multi-tenant — multiple models per GPU
`"0.85"` – `"0.95"`	Single model — maximize KV cache and throughput

Lower values leave room for other models to share the same GPU via sleep/wake scheduling.
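
Because the value is a string, it is easy to pass something that is not a valid fraction. The following is a sketch of a client-side check for the documented range (0, 1], plus the VRAM budget it implies. Both function names are hypothetical, and the controller's own validation may differ.

```python
from decimal import Decimal, InvalidOperation


def parse_gpu_memory_utilization(value):
    """Parse the string form and enforce the documented range (0, 1]."""
    try:
        fraction = Decimal(value)
    except InvalidOperation:
        raise ValueError(f"not a decimal number: {value!r}")
    # Reject NaN and Infinity before comparing.
    if not fraction.is_finite() or not (Decimal(0) < fraction <= Decimal(1)):
        raise ValueError(f"must be in (0, 1], got {value!r}")
    return fraction


def vram_budget_gib(value, gpu_total_gib):
    """VRAM the engine may use on a GPU with gpu_total_gib of memory."""
    return float(parse_gpu_memory_utilization(value)) * gpu_total_gib


print(vram_budget_gib("0.50", 80))  # half of an 80 GiB GPU
```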

`tensorParallelSize`

tensorParallelSize: 2

Number of GPUs to shard the model across using tensor parallelism. Must be a power of 2.

Size	Use case
`1`	Models that need up to ~80 GB of VRAM (e.g. 8B or 20B models)
`2`	Models 70B+ in bf16
`4`	Very large models (100B+)
`8`	Maximum sharding for the largest models

Each replica uses tensorParallelSize GPUs. The scheduler assigns contiguous GPU indices on the same node.

`replicas`

replicas: 1

Number of serving pods to create. Each replica independently claims GPUs and handles requests.
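
Since each replica claims tensorParallelSize GPUs, the total GPU demand of a Model is the product of the two fields. The sketch below checks the power-of-2 rule and computes that total. The function names are hypothetical.

```python
def is_power_of_two(n):
    """True for 1, 2, 4, 8, ... using the single-set-bit trick."""
    return n > 0 and (n & (n - 1)) == 0


def total_gpus(replicas, tensor_parallel_size):
    """GPUs claimed by a Model: one group of GPUs per replica."""
    if not is_power_of_two(tensor_parallel_size):
        raise ValueError("tensorParallelSize must be a power of 2")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas * tensor_parallel_size


print(total_gpus(replicas=2, tensor_parallel_size=4))
```

With replicas: 2 and tensorParallelSize: 4, the Model needs 8 GPUs, four per pod.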

`quantization`

quantization: mxfp4

Quantization method applied during model conversion. Must be compatible with the model weights.

Method	Description
`mxfp4`	Microscaling FP4 quantization
`awq`	Activation-aware weight quantization
`gptq`	Post-training quantization
(empty)	No quantization. Weights keep the precision set by `dtype`.

`extraArgs`

extraArgs:
  - "--enforce-eager"
  - "--max-model-len"
  - "8192"
  - "--max-num-seqs"
  - "128"

Raw CLI arguments passed directly to the vLLM inference engine. Each token is a separate list item, so a flag and its value (for example --max-model-len and 8192) occupy two consecutive items.

Use with care

Extra args bypass validation. Invalid arguments will cause the vLLM process to fail on startup. Check the vLLM documentation for supported flags.

Common extraArgs:

Argument	Description
`--enforce-eager`	Disable CUDA graphs. Recommended for sleep/wake compatibility.
`--trust-remote-code`	Allow custom model code from HuggingFace. Required for some architectures.
`--max-model-len <n>`	Override the maximum sequence length. Reduces VRAM usage.
`--max-num-seqs <n>`	Maximum concurrent sequences. Lower values reduce VRAM.
`--reasoning-parser <name>`	Enable reasoning/thinking token parsing (e.g. `qwen3`, `deepseek`).
`--enable-auto-tool-choice`	Enable automatic tool/function calling.
`--tool-call-parser <name>`	Tool call parser (e.g. `qwen3_coder`, `hermes`).
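
When extraArgs is generated from code, it is easy to forget that a flag and its value must be separate list items. Here is a small sketch of a flattening helper. The name build_extra_args and the convention of using True for valueless flags are assumptions made for this example.

```python
def build_extra_args(flags):
    """Flatten {flag: value} into the token list extraArgs expects."""
    args = []
    for flag, value in flags.items():
        if value is False or value is None:
            continue  # flag disabled: emit nothing
        args.append(flag)
        if value is not True:
            # Values become their own list item, always as strings.
            args.append(str(value))
    return args


extra_args = build_extra_args({
    "--enforce-eager": True,
    "--max-model-len": 8192,
    "--max-num-seqs": 128,
})
print(extra_args)
```

This produces the same five items as the extraArgs example at the top of this section.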

Sleep spec

Controls when and how models go to sleep. See Lifecycle & States for details on sleep/wake transitions.

sleep:
  idleTimeout: 5m
  drainTimeout: 30s

Field	Type	Default	Description
`idleTimeout`	duration	`5m`	Duration with no requests before the model voluntarily sleeps
`drainTimeout`	duration	`30s`	Maximum time to wait for in-flight requests to complete before the model sleeps

When a model sleeps, GreenThread releases all VRAM, making the GPU available for other models.
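
The idle check can be pictured as a comparison between the time since the last request and idleTimeout. The sketch below assumes duration strings use Go-style unit suffixes (s, m, h), since this page only shows values such as 5m and 30s. It also assumes the comparison is inclusive. Both functions are hypothetical and do not reflect the sidecar's actual code.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text):
    """Convert strings such as '30s', '5m' or '1m30s' to whole seconds."""
    if not re.fullmatch(r"(\d+[smh])+", text):
        raise ValueError(f"unsupported duration: {text!r}")
    parts = re.findall(r"(\d+)([smh])", text)
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def should_sleep(idle_seconds, idle_timeout):
    """True once the model has been idle for at least idle_timeout."""
    return idle_seconds >= parse_duration(idle_timeout)


print(parse_duration("5m"), should_sleep(301, "5m"))
```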

Fairness spec

Controls preemption behavior when multiple models compete for GPU memory. See GPU Scheduling for the full algorithm.

fairness:
  minRuntime: 10s
  maxWaitTime: 5s
  popular: false

Field	Type	Default	Description
`minRuntime`	duration	`10s`	Minimum time a model must serve before it can be preempted
`maxWaitTime`	duration	`5s`	How long a model must wait for GPU memory before it can preempt others
`popular`	bool	`false`	If `true`, the model is never preempted — stays on GPU permanently
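
One way to read the three fields together is as a single eligibility test. The sketch below is an interpretation of the field descriptions only, not the scheduler's real algorithm, which is covered in GPU Scheduling. It assumes minRuntime and popular are taken from the running model and maxWaitTime from the waiting model. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Fairness:
    min_runtime_s: float = 10.0   # minRuntime default
    max_wait_time_s: float = 5.0  # maxWaitTime default
    popular: bool = False


def may_preempt(running, running_for_s, waiter, waiting_for_s):
    """Can the waiting model evict the running one under this reading?"""
    if running.popular:
        return False  # popular models are never preempted
    if running_for_s < running.min_runtime_s:
        return False  # the running model has not had its minimum turn
    # The waiter must have waited at least its own maxWaitTime.
    return waiting_for_s >= waiter.max_wait_time_s


print(may_preempt(Fairness(), 12.0, Fairness(), 6.0))
```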

Staging spec

Controls how model weights are staged in CPU memory for fast wake. See Storage & Pinning for details.

staging:
  pinned: false

Field	Type	Default	Description
`pinned`	bool	`false`	If `true`, keep model weights in CPU pinned RAM after wake for ~600ms wake times. If `false`, weights are evicted to disk after GPU load (~2s wake with GPUDirect Storage (GDS), longer without).

LoRA spec

Configures LoRA adapter support. See LoRA Adapters for the integration pattern.

lora:
  enabled: true
  maxRank: 64
  cacheDir: /tmp/lora-cache
  downloadTimeout: 5m

Field	Type	Default	Description
`enabled`	bool	`false`	Enable LoRA adapter loading
`maxRank`	int	`64`	Maximum LoRA rank supported (must be >= rank of any adapter you serve)
`cacheDir`	string	`""`	Local path for downloaded LoRA adapter files
`downloadTimeout`	duration	`5m`	Maximum time to wait for adapter download from object storage
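
Before serving an adapter, you can compare its rank against maxRank. The sketch below assumes the adapter ships a PEFT-style adapter_config.json whose r key holds the LoRA rank. The helper name adapter_fits is hypothetical.

```python
import json


def adapter_fits(adapter_config, max_rank=64):
    """True if the adapter's LoRA rank does not exceed maxRank."""
    return int(adapter_config["r"]) <= max_rank


# Stand-in for the parsed contents of an adapter_config.json file.
config = json.loads('{"peft_type": "LORA", "r": 16, "lora_alpha": 32}')
print(adapter_fits(config, max_rank=64))
```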

KV cache spec

Configures KV cache persistence via LMCache. When enabled, vLLM stores computed KV cache entries in Redis so that repeated prompt prefixes skip recomputation.

kvCache:
  enabled: true
  maxSize: "10Gi"

Field	Type	Default	Description
`enabled`	bool	`false`	Enable KV cache persistence
`maxSize`	string	`""`	Maximum KV cache size per replica (e.g. `"10Gi"`)
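
The example value "10Gi" looks like a Kubernetes-style quantity with a binary suffix. Assuming that format, the sketch below converts such a string to bytes. It handles only whole numbers with Ki, Mi, Gi, or Ti suffixes, and the function name is hypothetical.

```python
import re

_BINARY_SUFFIX = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_size(text):
    """Convert a string such as '10Gi' to a byte count."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)", text)
    if match is None:
        raise ValueError(f"unsupported size: {text!r}")
    return int(match.group(1)) * _BINARY_SUFFIX[match.group(2)]


print(parse_size("10Gi"))
```

Here "10Gi" is 10 × 2^30 = 10737418240 bytes.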

LMCache setup required

KV cache persistence requires lmcache.enabled=true in the Helm chart values to deploy Redis, and uses the LMCacheConnectorV1 vLLM connector. See KV Cache Persistence for the full setup guide, configuration details, and troubleshooting.

Pod placement

Node selector

nodeSelector:
  nvidia.com/gpu.present: "true"
  topology.kubernetes.io/zone: "us-west-2a"

Standard Kubernetes node selector to constrain which nodes the model pod can run on.

Tolerations

tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule

Standard Kubernetes tolerations for GPU node taints.

Service account

serviceAccountName: my-model-sa

Kubernetes ServiceAccount for the model pod. Used for IRSA (IAM Roles for Service Accounts) or other workload identity mechanisms when LoRA adapters need to download from object storage (e.g. S3).

Image overrides

sidecarImage: registry.example.com/greenthread-sidecar:v1.2.0
vllmImage: registry.example.com/vllm:v0.8.0

Override the default sidecar or vLLM container images. If empty, the controller uses its built-in defaults.

Resource overrides

resources:
  requests:
    cpu: "4"
    memory: "32Gi"
  limits:
    cpu: "8"
    memory: "64Gi"

Standard Kubernetes resource requirements for the vLLM container.

Status fields

The controller and sidecar populate these status fields. They are read-only.

kubectl get model llama-3-1-8b -n greenthread-system -o yaml

Field	Description
`status.phase`	Current lifecycle phase: `Pending`, `Converting`, `Staging`, `Ready`, `Failed`
`status.url`	Full inference endpoint URL (e.g. `http://<gateway>/llama-3-1-8b/v1`)
`status.servingMemoryPerGPU`	Measured VRAM per GPU in bytes (set by sidecar on first boot)
`status.memory.servingHuman`	Human-readable serving VRAM (e.g. `"88.4 GiB"`)
`status.replicas`	Total pod replicas
`status.readyReplicas`	Pods that have completed boot
`status.servingReplicas`	Pods actively serving requests
`status.sleepingReplicas`	Pods in sleep state
`status.node`	Node where the model pod is running
`status.gpus`	Comma-separated GPU indices (e.g. `"0,1"`)
`status.conversion.status`	Conversion state: `Pending`, `Running`, `Complete`, `Failed`
`status.conversion.totalSizeBytes`	Total size of converted model
`status.distribution.availableOnNodes`	Nodes where the model is available
`status.distribution.stagingTier`	Per-node staging tier (`system` or `disk`)
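
Scripts that consume the status block usually need the GPU list as integers and the byte count in readable form. The sketch below works on a status dictionary such as the one returned by kubectl get model with -o json. The sample values are invented for illustration, and the GiB formatting is an assumption about how status.memory.servingHuman is produced.

```python
def gpu_indices(gpus):
    """Split the comma-separated status.gpus string into integers."""
    return [int(index) for index in gpus.split(",") if index.strip()]


def human_gib(num_bytes):
    """Format a byte count as GiB with one decimal place."""
    return f"{num_bytes / 2**30:.1f} GiB"


# Hypothetical status block, shaped after the table above.
status = {
    "phase": "Ready",
    "gpus": "0,1",
    "servingMemoryPerGPU": 94918777242,
    "servingReplicas": 1,
}

is_serving = status["phase"] == "Ready" and status["servingReplicas"] > 0
print(gpu_indices(status["gpus"]), human_gib(status["servingMemoryPerGPU"]), is_serving)
```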

kubectl output

kubectl get model -A

NAMESPACE            NAME                       PHASE   REPLICAS   SERVING   SLEEPING   NODE        GPUS   VRAM/GPU     URL
greenthread-system   llama-3-1-8b               Ready   1          0         1          gpu-node1   0      17.2 GiB     http://<gateway>/llama-3-1-8b/v1
greenthread-system   qwen-3-5-35b-a3b           Ready   1          1         0          gpu-node1   1      88.2 GiB     http://<gateway>/qwen-3-5-35b-a3b/v1