GreenThread Docs

Two ways to deploy a Model:

Path	When
kubectl apply a `Model` CR	GitOps, automation, you already think in YAML.
LiquidCompute UI / API	You want a Project/Service abstraction, a UI, custom domains. See LiquidCompute.

Both paths produce the same underlying Model CR; the LiquidCompute (LC) API is just a thin facade over it. This page covers the direct CR path.

Deploy a Model

Create a Model resource in any namespace (the engine reconciles cluster-wide):

# llama-3-8b.yaml
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
  namespace: default
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  modelType: chat                  # chat | tts | asr | embedding | reranker
  dtype: bfloat16
  loadFormat: auto                 # auto (vLLM default) | gthread (fast DMA, requires storage server)
  endpoints:
    - chat_completions
    - completions
  parallelism:
    tensor: 1
    pipeline: 1
  replicas: 1
  fairness:
    popular: false                 # false = preemptable; true = pinned, never slept
  sleep: {}                        # use defaults: idleTimeout=10m
  product: NVIDIA-A40              # pod only lands on nodes whose nvidia.com/gpu.product label matches

Apply it:

kubectl apply -f llama-3-8b.yaml

Gated models

HuggingFace-gated models (for example Llama and Gemma) need a HuggingFace (HF) token. Pass it at install time via --set controller.huggingFaceToken=hf_xxxx; the controller passes it to every conversion Job as HF_TOKEN.

Multiple models in one file

---
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata: { name: llama-3-8b, namespace: default }
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  modelType: chat
  dtype: bfloat16
  endpoints: [chat_completions, completions]
  parallelism: { tensor: 1, pipeline: 1 }
  replicas: 1
---
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata: { name: whisper-turbo, namespace: default }
spec:
  modelName: openai/whisper-large-v3-turbo
  modelType: asr
  dtype: bfloat16
  endpoints: [audio_transcriptions, audio_translations]
  fairness: { popular: true }      # ASR latency matters — pin it
  replicas: 1

Watch the model converge

kubectl get models -A -w

Models progress through:

Phase	Meaning
`Pending`	CR accepted, waiting for the controller to start.
`Converting`	Downloading from HuggingFace and converting weights to `gthread` load format (only when `loadFormat: gthread`).
`Staging`	Weights are on disk; pods are coming up.
`Ready`	Pods are up. Phase flips between `Serving` and `Sleeping` based on traffic.
`Failed`	Conversion or pod startup failed. Check `kubectl describe`.

See Lifecycle & States for the full state machine and the sidecar sub-states.
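For scripts and CI, the phases above can be polled until the Model settles. The following is a minimal sketch, assuming the phase is exposed at .status.phase (not confirmed on this page; check with kubectl get model <name> -o yaml). phase_outcome and wait_for_model are hypothetical helper names, not part of GreenThread.

```shell
# Map a phase from the table above to a coarse outcome.
# Serving and Sleeping are treated as ready, following the Ready row.
phase_outcome() {
  case "$1" in
    Ready|Serving|Sleeping) echo ready ;;
    Failed)                 echo failed ;;
    *)                      echo waiting ;;   # Pending, Converting, Staging, or empty
  esac
}

# Poll a Model until it is ready (exit 0), failed (exit 1), or timed out (exit 2).
# ASSUMPTION: the phase lives at .status.phase on the Model CR.
# Args: <name> [namespace=default] [tries=60] [delay-seconds=5]
wait_for_model() {
  name=$1; ns=${2:-default}; tries=${3:-60}; delay=${4:-5}
  i=0
  phase=""
  while [ "$i" -lt "$tries" ]; do
    phase=$(kubectl get model "$name" -n "$ns" -o jsonpath='{.status.phase}' 2>/dev/null) || phase=""
    case "$(phase_outcome "$phase")" in
      ready)  echo "$name is $phase"; return 0 ;;
      failed) echo "$name failed; run: kubectl describe model $name -n $ns" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for $name (last phase: ${phase:-unknown})" >&2
  return 2
}

# Usage:
#   wait_for_model llama-3-8b default
```

The three exit codes let a pipeline distinguish a failed conversion from a slow one.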

What "running" looks like

$ kubectl get models -A
NAMESPACE   NAME                 PHASE      REPLICAS   READY   NODE            GPUS   VRAM/GPU
default     gemma-4-26b-a4b-it   Sleeping   1          1       gt-gpu-dev-01   1      42.2 GiB
default     gpt-oss-20b          Sleeping   1          1       gt-gpu-dev-01   1      41.2 GiB
default     trinity-mini         Sleeping   1          1       gt-gpu-dev-01   1      42.6 GiB
default     whisper-turbo        Ready      1          1       blackwell-0     0      20.0 GiB
default     z-image-turbo        Ready      1          1       blackwell-1     0      20.4 GiB

Sleeping Models still count as "ready" — they wake on demand and consume zero VRAM until then.
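On a larger cluster it can help to tally Models by phase rather than read the list. This sketch assumes the column layout shown above, where PHASE is the third column when -A is used; count_by_phase is a hypothetical helper name.

```shell
# Read `kubectl get models -A` output on stdin and print "<phase> <count>" lines.
# ASSUMPTION: PHASE is the third column, as in the sample output above.
count_by_phase() {
  awk 'NR > 1 { n[$3]++ } END { for (p in n) print p, n[p] }' | sort
}

# Usage:
#   kubectl get models -A | count_by_phase
```

For the sample output above this prints `Ready 2` and `Sleeping 3`.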

Watch the GPUs

kubectl get gpus

NAME                  NODE            INDEX   PRODUCT                         MODE        HEALTH    AVAILABLE
gpu-blackwell-0-0     blackwell-0     0       NVIDIA-RTX-PRO-4000-Blackwell   inference   Healthy   4178575360
gpu-blackwell-1-0     blackwell-1     0       NVIDIA-RTX-PRO-4000-Blackwell   inference   Healthy   3740270592
gpu-gt-gpu-dev-01-0   gt-gpu-dev-01   0       NVIDIA-A40                      inference   Healthy   2856321024
gpu-gt-gpu-dev-01-1   gt-gpu-dev-01   1       NVIDIA-A40                      inference   Healthy   51527024640

MODE=inference means at least one Model is bound to that GPU. AVAILABLE is free VRAM in bytes.
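Raw byte counts are hard to compare against the GiB figures in kubectl get models. This sketch converts the AVAILABLE column to GiB, assuming it is the last column as shown above; available_gib is a hypothetical helper name.

```shell
# Read `kubectl get gpus` output on stdin and print "<gpu-name> <free GiB>".
# ASSUMPTION: AVAILABLE (bytes) is the last column, as in the sample output above.
available_gib() {
  awk 'NR > 1 { printf "%s %.2f\n", $1, $NF / (1024 * 1024 * 1024) }'
}

# Usage:
#   kubectl get gpus | available_gib
```

For the sample output above, the last row prints `gpu-gt-gpu-dev-01-1 47.99`.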

In the request body, model is the upstream model ID (the HuggingFace name, e.g. meta-llama/Llama-3.1-8B-Instruct). In the URL path, <model-name> is the Kubernetes Model.metadata.name (e.g. llama-3-8b). Don't mix them up.
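If you only know the CR name, the upstream ID can be read back from the Model's spec.modelName field (the field set in the manifest above). This is a minimal sketch; upstream_model_id is a hypothetical helper name.

```shell
# Print the upstream (HuggingFace) model ID for a Model CR.
# Args: <cr-name> [namespace=default]
upstream_model_id() {
  kubectl get model "$1" -n "${2:-default}" -o jsonpath='{.spec.modelName}'
}

# Usage:
#   upstream_model_id llama-3-8b default
```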

Via LiquidCompute (single `/v1/*` endpoint)

If you've installed LiquidCompute, there's a single /v1/* endpoint, and the request's model field is the Model CR name:

curl https://app.example.com/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Update a Model

Edit the YAML and re-apply:

kubectl apply -f llama-3-8b.yaml

Field changed	Effect
`fairness`, `sleep`, `extraArgs`, `endpoints`	Pod rolls on next reconcile.
`dtype`, `parallelism`, `vllmImage`	Pod rolls.
`modelName`, `loadFormat: gthread` source	Re-conversion job runs first, then pod rolls.
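Because some field changes roll the pod or trigger re-conversion, it can be worth previewing a change before re-applying. This sketch uses kubectl diff, which exits 0 when there are no differences, 1 when there are differences, and greater than 1 on error; apply_if_changed is a hypothetical helper name.

```shell
# Show the pending diff for a manifest and apply it only if something changed.
# Args: <manifest-file>
apply_if_changed() {
  file=$1
  rc=0
  kubectl diff -f "$file" || rc=$?
  case "$rc" in
    0) echo "no changes in $file" ;;
    1) kubectl apply -f "$file" ;;
    *) echo "kubectl diff failed (exit $rc)" >&2
       return "$rc" ;;
  esac
}

# Usage:
#   apply_if_changed llama-3-8b.yaml
```

The diff output names the changed fields, which you can check against the table above to see whether a pod roll or re-conversion will follow.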

Delete a Model

kubectl delete model llama-3-8b -n default

The finalizer cleans up: the pod terminates, GPU memory is released, and the Ingress/HTTPRoute is removed. Converted weights on /mnt/models remain, so a redeployed Model boots fast.

Real-world examples

See Recipes for ready-to-paste Models for popular setups: chat (Llama / Qwen / Gemma), TTS (Z-Image, Qwen-Omni), STT (Whisper), embeddings, rerankers.

Next steps

Model CRD Reference — every spec field documented
Lifecycle & States — phases, sub-states, conditions
Fairness Policy — preemption, popularity, multi-tenant sharing
Inference API — supported endpoints and protocols