GreenThread Docs

Two ways to deploy a Model:

| Path | When |
| --- | --- |
| kubectl apply a Model CR | GitOps, automation, you already think in YAML. |
| LiquidCompute UI / API | You want a Project/Service abstraction, a UI, custom domains. See LiquidCompute. |

Both paths produce the same underlying Model CR; the LiquidCompute API is just a thin facade over it. This page covers the direct CR path.

Deploy a Model

Create a Model resource in any namespace (the engine reconciles cluster-wide):

# llama-3-8b.yaml
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
  namespace: default
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  modelType: chat                  # chat | tts | asr | embedding | reranker
  dtype: bfloat16
  loadFormat: auto                 # auto (vLLM default) | gthread (fast DMA, requires storage server)
  endpoints:
    - chat_completions
    - completions
  parallelism:
    tensor: 1
    pipeline: 1
  replicas: 1
  fairness:
    popular: false                 # false = preemptable; true = pinned, never slept
  sleep: {}                        # use defaults: idleTimeout=10m
  product: NVIDIA-A40              # nvidia.com/gpu.product label the pod must land on

Apply it:

kubectl apply -f llama-3-8b.yaml

Gated models

HuggingFace-gated models (Llama, Gemma) need an HF token. Pass it at install time via --set controller.huggingFaceToken=hf_xxxx; the controller threads it into every conversion Job as HF_TOKEN.

Multiple models in one file

---
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata: { name: llama-3-8b, namespace: default }
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  modelType: chat
  dtype: bfloat16
  endpoints: [chat_completions, completions]
  parallelism: { tensor: 1, pipeline: 1 }
  replicas: 1
---
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata: { name: whisper-turbo, namespace: default }
spec:
  modelName: openai/whisper-large-v3-turbo
  modelType: asr
  dtype: bfloat16
  endpoints: [audio_transcriptions, audio_translations]
  fairness: { popular: true }      # ASR latency matters — pin it
  replicas: 1

Watch the model converge

kubectl get models -A -w

Models progress through:

| Phase | Meaning |
| --- | --- |
| Pending | CR accepted, waiting for the controller to start. |
| Converting | Downloading from HuggingFace and converting weights to gthread load format (only when loadFormat: gthread). |
| Staging | Weights are on disk; pods are coming up. |
| Ready | Pods are up. Phase flips between Serving and Sleeping based on traffic. |
| Failed | Conversion or pod startup failed. Check kubectl describe. |

See Lifecycle & States for the full state machine and the sidecar sub-states.
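
For scripted rollouts, the phase table above can be turned into a small wait loop. This is only a sketch: it assumes the phase is exposed at `.status.phase` on the Model resource (not confirmed on this page), and the names `phase_kind` and `wait_for_model` are hypothetical helpers, not part of GreenThread.

```shell
# Map a phase name from the table above to one of: settled | failed | pending.
phase_kind() {
  case "$1" in
    Ready|Serving|Sleeping) echo settled ;;
    Failed)                 echo failed ;;
    *)                      echo pending ;;  # Pending, Converting, Staging, or not yet set
  esac
}

# Poll a Model until it settles, fails, or the timeout (seconds) expires.
# ASSUMPTION: the phase lives at .status.phase; adjust the jsonpath if your CRD differs.
wait_for_model() {
  wfm_name="$1"; wfm_ns="${2:-default}"; wfm_timeout="${3:-600}"; wfm_waited=0
  while [ "$wfm_waited" -lt "$wfm_timeout" ]; do
    wfm_phase=$(kubectl get model "$wfm_name" -n "$wfm_ns" \
      -o jsonpath='{.status.phase}' 2>/dev/null || true)
    case "$(phase_kind "$wfm_phase")" in
      settled) echo "$wfm_name is $wfm_phase"; return 0 ;;
      failed)  echo "$wfm_name failed; try: kubectl describe model $wfm_name -n $wfm_ns" >&2
               return 1 ;;
    esac
    sleep 5
    wfm_waited=$((wfm_waited + 5))
  done
  echo "timed out after ${wfm_timeout}s waiting for $wfm_name" >&2
  return 1
}

# Usage (requires cluster access):
#   wait_for_model llama-3-8b default 900
```

Sleeping is treated as settled here because a sleeping Model still wakes on demand.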

What "running" looks like

$ kubectl get models -A
NAMESPACE   NAME                 PHASE      REPLICAS   READY   NODE            GPUS   VRAM/GPU
default     gemma-4-26b-a4b-it   Sleeping   1          1       gt-gpu-dev-01   1      42.2 GiB
default     gpt-oss-20b          Sleeping   1          1       gt-gpu-dev-01   1      41.2 GiB
default     trinity-mini         Sleeping   1          1       gt-gpu-dev-01   1      42.6 GiB
default     whisper-turbo        Ready      1          1       blackwell-0     0      20.0 GiB
default     z-image-turbo        Ready      1          1       blackwell-1     0      20.4 GiB

Sleeping Models still count as "ready" — they wake on demand and consume zero VRAM until then.
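
Since Sleeping and Ready both mean the Model is usable, a per-phase tally can be a quicker health summary than scanning the table by eye. The sketch below assumes the column layout shown above for `kubectl get models -A` (PHASE in the third column); `count_by_phase` is a hypothetical helper name.

```shell
# Tally Models per phase from `kubectl get models -A` output on stdin.
# ASSUMPTION: PHASE is the third column, as in the sample output above.
# Without -A the NAMESPACE column is absent, so PHASE would be column 2 instead.
count_by_phase() {
  awk 'NR > 1 && NF { count[$3]++ } END { for (p in count) print p, count[p] }' | sort
}

# Usage (requires cluster access):
#   kubectl get models -A | count_by_phase
```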

Watch the GPUs

$ kubectl get gpus
NAME                  NODE            INDEX   PRODUCT                         MODE        HEALTH    AVAILABLE
gpu-blackwell-0-0     blackwell-0     0       NVIDIA-RTX-PRO-4000-Blackwell   inference   Healthy   4178575360
gpu-blackwell-1-0     blackwell-1     0       NVIDIA-RTX-PRO-4000-Blackwell   inference   Healthy   3740270592
gpu-gt-gpu-dev-01-0   gt-gpu-dev-01   0       NVIDIA-A40                      inference   Healthy   2856321024
gpu-gt-gpu-dev-01-1   gt-gpu-dev-01   1       NVIDIA-A40                      inference   Healthy   51527024640

MODE=inference means at least one Model is bound to that GPU. AVAILABLE is free VRAM in bytes.
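
Raw byte counts are hard to compare at a glance. As a minimal sketch, the following converts the AVAILABLE column to GiB; it assumes AVAILABLE is the last column, as in the output above, and `gpu_free_gib` is a hypothetical helper name.

```shell
# Print each GPU name with its free VRAM in GiB, reading `kubectl get gpus` output on stdin.
# ASSUMPTION: NAME is the first column and AVAILABLE (bytes) is the last.
gpu_free_gib() {
  awk 'NR > 1 && NF { printf "%s %.2f GiB\n", $1, $NF / (1024 * 1024 * 1024) }'
}

# Usage (requires cluster access):
#   kubectl get gpus | gpu_free_gib
```

Applied to the sample output above, gpu-gt-gpu-dev-01-1 has roughly 48 GiB free while the other three GPUs each have under 4 GiB.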

Verify inference

The engine emits an Ingress / Gateway route per Model. The path is /<model-name>/v1/* on whatever hostname your cluster's Ingress controller serves.

Direct (engine only)

# Pick up the Ingress address (depends on your ingress class)
INGRESS=$(kubectl get ingress -A -l greenthread.ai/model=llama-3-8b \
  -o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')

# Status — should show sleeping/serving
curl http://$INGRESS/llama-3-8b/status
# → {"model":"llama-3-8b","state":"sleeping","bootReady":true,"queue":{"inFlight":0,"barriered":false}}

# Fire a real request — sleeping models wake automatically
curl http://$INGRESS/llama-3-8b/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 128
  }'

# Status now serving
curl http://$INGRESS/llama-3-8b/status
# → {"model":"llama-3-8b","state":"serving",…}

Model field naming

In the request body, model is the upstream model ID (the HuggingFace name, e.g. meta-llama/Llama-3.1-8B-Instruct). In the URL path, <model-name> is the Kubernetes Model.metadata.name (e.g. llama-3-8b). Don't mix them up.
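
One way to keep the two names apart in scripts is to build the URL and the body from separate, clearly named inputs. This is a sketch only: `chat_url` and `chat_body` are hypothetical helpers, and the body builder does no JSON escaping, so it is only suitable for simple prompts without quotes or backslashes.

```shell
# $1 = ingress address, $2 = Model CR name (metadata.name); the CR name goes in the URL path.
chat_url() {
  printf 'http://%s/%s/v1/chat/completions\n' "$1" "$2"
}

# $1 = upstream model ID (spec.modelName), $2 = prompt; the upstream ID goes in the body.
# NOTE: no JSON escaping is done here.
chat_body() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}],"max_tokens":128}\n' "$1" "$2"
}

# Usage (requires cluster access); the upstream ID is read back from the CR's spec.modelName:
#   MODEL_ID=$(kubectl get model llama-3-8b -n default -o jsonpath='{.spec.modelName}')
#   curl "$(chat_url "$INGRESS" llama-3-8b)" \
#     -H 'Content-Type: application/json' \
#     -d "$(chat_body "$MODEL_ID" 'Hello!')"
```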

Via LiquidCompute (single /v1/* endpoint)

If you've installed LiquidCompute, there's a single /v1/* endpoint, and the model field in the request body is the Model CR name (not the HuggingFace ID):

curl https://app.example.com/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

Update a Model

Edit the YAML and re-apply:

kubectl apply -f llama-3-8b.yaml

| Field changed | Effect |
| --- | --- |
| fairness, sleep, extraArgs, endpoints | Pod rolls on next reconcile. |
| dtype, parallelism, vllmImage | Pod rolls. |
| modelName, loadFormat: gthread source | Re-conversion job runs first, then pod rolls. |
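
In a CI pipeline it can help to flag changes that trigger a re-conversion before applying them. The sketch below simply encodes the table above as a lookup; `update_effect` is a hypothetical helper, and mapping any `loadFormat` change to re-conversion is an assumption based on the table's last row.

```shell
# Return the documented effect of changing a top-level spec field.
update_effect() {
  case "$1" in
    fairness|sleep|extraArgs|endpoints) echo "roll-on-next-reconcile" ;;
    dtype|parallelism|vllmImage)        echo "roll" ;;
    modelName|loadFormat)               echo "reconvert-then-roll" ;;  # ASSUMPTION for loadFormat
    *)                                  echo "unknown" ;;              # not covered by the table
  esac
}

# Usage: preview the change with `kubectl diff -f llama-3-8b.yaml`, then check each changed field:
#   update_effect modelName
```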

Delete a Model

kubectl delete model llama-3-8b -n default

The finalizer cleans up: pod terminates, GPU memory releases, Ingress/HTTPRoute removed. Converted weights on /mnt/models remain so a redeployed Model boots fast.

Real-world examples

See Recipes for ready-to-paste Models for popular setups: chat (Llama / Qwen / Gemma), TTS (Z-Image, Qwen-Omni), STT (Whisper), embeddings, rerankers.

Next steps