Two ways to deploy a Model:
| Path | When |
|---|---|
| kubectl apply a Model CR | GitOps, automation, you already think in YAML. |
| LiquidCompute UI / API | You want a Project/Service abstraction, a UI, custom domains. See LiquidCompute. |
Both paths produce the same underlying Model CR; the LiquidCompute (LC) API is just a thin facade over it. This page covers the direct CR path.
Deploy a Model
Create a Model resource in any namespace (the engine reconciles cluster-wide):
# llama-3-8b.yaml
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-8b
  namespace: default
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  modelType: chat # chat | tts | asr | embedding | reranker
  dtype: bfloat16
  loadFormat: auto # auto (vLLM default) | gthread (fast DMA, requires storage server)
  endpoints:
    - chat_completions
    - completions
  parallelism:
    tensor: 1
    pipeline: 1
  replicas: 1
  fairness:
    popular: false # false = preemptable; true = pinned, never slept
  sleep: {} # use defaults: idleTimeout=10m
  product: NVIDIA-A40 # nvidia.com/gpu.product label the pod must land on
Apply it:
kubectl apply -f llama-3-8b.yaml
HuggingFace-gated models (Llama, Gemma) need an HF token. Pass it at install time via --set controller.huggingFaceToken=hf_xxxx; the controller threads it into every conversion Job as HF_TOKEN.
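When manifests come out of automation rather than a text editor, it can help to build them programmatically. The following is a minimal sketch that assumes only the fields shown in the example manifest above. `model_manifest` is a hypothetical helper, not part of any shipped tooling, and the allowed values are taken from the inline comments in that YAML. It prints JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json

# Allowed values as listed in the comments of the example manifest above.
MODEL_TYPES = {"chat", "tts", "asr", "embedding", "reranker"}
LOAD_FORMATS = {"auto", "gthread"}


def model_manifest(name, model_name, model_type, endpoints,
                   namespace="default", dtype="bfloat16",
                   load_format="auto", replicas=1, popular=False):
    """Build a Model CR as a plain dict (hypothetical helper)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown modelType: {model_type!r}")
    if load_format not in LOAD_FORMATS:
        raise ValueError(f"unknown loadFormat: {load_format!r}")
    return {
        "apiVersion": "greenthread.ai/v1alpha1",
        "kind": "Model",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "modelName": model_name,
            "modelType": model_type,
            "dtype": dtype,
            "loadFormat": load_format,
            "endpoints": list(endpoints),
            "parallelism": {"tensor": 1, "pipeline": 1},
            "replicas": replicas,
            "fairness": {"popular": popular},
        },
    }


manifest = model_manifest(
    name="llama-3-8b",
    model_name="meta-llama/Llama-3.1-8B-Instruct",
    model_type="chat",
    endpoints=["chat_completions", "completions"],
)
print(json.dumps(manifest, indent=2))
```

Saved as a script, the output could be piped straight into `kubectl apply -f -`. Rejecting unknown enum values locally is a design choice that surfaces typos before the CR reaches the cluster.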
Multiple models in one file
---
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata: { name: llama-3-8b, namespace: default }
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  modelType: chat
  dtype: bfloat16
  endpoints: [chat_completions, completions]
  parallelism: { tensor: 1, pipeline: 1 }
  replicas: 1
---
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata: { name: whisper-turbo, namespace: default }
spec:
  modelName: openai/whisper-large-v3-turbo
  modelType: asr
  dtype: bfloat16
  endpoints: [audio_transcriptions, audio_translations]
  fairness: { popular: true } # ASR latency matters, so pin it
  replicas: 1
Watch the model converge
kubectl get models -A -w
Models progress through:
| Phase | Meaning |
|---|---|
| Pending | CR accepted, waiting for the controller to start. |
| Converting | Downloading from HuggingFace and converting weights to gthread load format (only when loadFormat: gthread). |
| Staging | Weights are on disk; pods are coming up. |
| Ready | Pods are up. Phase flips between Serving and Sleeping based on traffic. |
| Failed | Conversion or pod startup failed. Check kubectl describe. |
See Lifecycle & States for the full state machine and the sidecar sub-states.
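In a deploy script you usually want to block until the Model is up rather than watch by hand. The sketch below polls a phase source until it reports one of the "up" phases from the table, and stops early on Failed. `wait_for_model` and its parameters are hypothetical. How you read the phase (for instance by shelling out to `kubectl get model`) is left to the `get_phase` callable, because the exact status field is not specified on this page.

```python
import time

# Phases named in the table above; Serving and Sleeping are treated as
# "up" because a Ready model flips between them.
UP_PHASES = {"Ready", "Serving", "Sleeping"}


def wait_for_model(get_phase, timeout=600.0, interval=5.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_phase() until the Model is up, fails, or the timeout expires.

    get_phase is any zero-argument callable returning the current phase
    string. Returns the final phase on success.
    """
    deadline = clock() + timeout
    while True:
        phase = get_phase()
        if phase in UP_PHASES:
            return phase
        if phase == "Failed":
            raise RuntimeError("Model entered Failed; check kubectl describe")
        if clock() >= deadline:
            raise TimeoutError(f"Model still {phase!r} after {timeout}s")
        sleep(interval)


# Example with a canned sequence standing in for real cluster reads.
phases = iter(["Pending", "Converting", "Staging", "Ready"])
result = wait_for_model(lambda: next(phases), sleep=lambda s: None)
print(result)
```

Injecting `sleep` and `clock` keeps the loop testable without a cluster or real delays.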
What "running" looks like
$ kubectl get models -A
NAMESPACE   NAME                 PHASE      REPLICAS   READY   NODE            GPUS   VRAM/GPU
default     gemma-4-26b-a4b-it   Sleeping   1          1       gt-gpu-dev-01   1      42.2 GiB
default     gpt-oss-20b          Sleeping   1          1       gt-gpu-dev-01   1      41.2 GiB
default     trinity-mini         Sleeping   1          1       gt-gpu-dev-01   1      42.6 GiB
default     whisper-turbo        Ready      1          1       blackwell-0     0      20.0 GiB
default     z-image-turbo        Ready      1          1       blackwell-1     0      20.4 GiB
Sleeping Models still count as "ready" — they wake on demand and consume zero VRAM until then.
Watch the GPUs
kubectl get gpus
NAME                  NODE            INDEX   PRODUCT                         MODE        HEALTH    AVAILABLE
gpu-blackwell-0-0     blackwell-0     0       NVIDIA-RTX-PRO-4000-Blackwell   inference   Healthy   4178575360
gpu-blackwell-1-0     blackwell-1     0       NVIDIA-RTX-PRO-4000-Blackwell   inference   Healthy   3740270592
gpu-gt-gpu-dev-01-0   gt-gpu-dev-01   0       NVIDIA-A40                      inference   Healthy   2856321024
gpu-gt-gpu-dev-01-1   gt-gpu-dev-01   1       NVIDIA-A40                      inference   Healthy   51527024640
MODE=inference means at least one Model is bound to that GPU. AVAILABLE is free VRAM in bytes.
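Raw byte counts are hard to compare at a glance. As a small illustration, the sketch below parses text in the column layout shown above and converts AVAILABLE to GiB (1 GiB = 1024³ bytes). `free_vram_gib` is a hypothetical helper, and the sample is two rows copied from the output above. It assumes no field contains spaces.

```python
GIB = 1024 ** 3

SAMPLE = """\
NAME NODE INDEX PRODUCT MODE HEALTH AVAILABLE
gpu-blackwell-0-0 blackwell-0 0 NVIDIA-RTX-PRO-4000-Blackwell inference Healthy 4178575360
gpu-gt-gpu-dev-01-1 gt-gpu-dev-01 1 NVIDIA-A40 inference Healthy 51527024640
"""


def free_vram_gib(table_text):
    """Map GPU name -> free VRAM in GiB from `kubectl get gpus` text output.

    Assumes the column layout shown above, with no spaces inside any field.
    """
    lines = table_text.strip().splitlines()
    header = lines[0].split()
    name_col = header.index("NAME")
    avail_col = header.index("AVAILABLE")
    result = {}
    for line in lines[1:]:
        fields = line.split()
        result[fields[name_col]] = int(fields[avail_col]) / GIB
    return result


free = free_vram_gib(SAMPLE)
for name, gib in free.items():
    print(f"{name}: {gib:.2f} GiB free")
```

For the two sample rows this works out to roughly 3.89 GiB and 47.99 GiB free.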
Verify inference
The engine emits an Ingress / Gateway route per Model. The path is /<model-name>/v1/* on whatever hostname your cluster's Ingress controller serves.
Direct (engine only)
# Pick up the Ingress address (depends on your ingress class)
INGRESS=$(kubectl get ingress -A -l greenthread.ai/model=llama-3-8b \
-o jsonpath='{.items[0].status.loadBalancer.ingress[0].ip}')
# Status — should show sleeping/serving
curl http://$INGRESS/llama-3-8b/status
# → {"model":"llama-3-8b","state":"sleeping","bootReady":true,"queue":{"inFlight":0,"barriered":false}}
# Fire a real request — sleeping models wake automatically
curl http://$INGRESS/llama-3-8b/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [{"role": "user", "content": "Hello!"}],
"max_tokens": 128
}'
# Status now serving
curl http://$INGRESS/llama-3-8b/status
# → {"model":"llama-3-8b","state":"serving",…}
In the request body, model is the upstream model ID (the HuggingFace name, e.g. meta-llama/Llama-3.1-8B-Instruct). In the URL path, <model-name> is the Kubernetes Model.metadata.name (e.g. llama-3-8b). Don't mix them up.
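To make the two names harder to confuse in client code, one option is to keep them as separate, explicitly named parameters. The sketch below builds (but does not send) the same request as the curl example for the direct engine path, using only the Python standard library. `direct_chat_request` is a hypothetical helper, and the ingress address is a placeholder.

```python
import json
import urllib.request


def direct_chat_request(ingress, cr_name, upstream_id, prompt, max_tokens=128):
    """Build (but do not send) a chat request for the direct engine path.

    cr_name     -> goes in the URL path (Model.metadata.name)
    upstream_id -> goes in the JSON body (HuggingFace model ID)
    """
    url = f"http://{ingress}/{cr_name}/v1/chat/completions"
    body = {
        "model": upstream_id,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# 203.0.113.10 is a placeholder documentation address, not a real ingress.
req = direct_chat_request(
    ingress="203.0.113.10",
    cr_name="llama-3-8b",
    upstream_id="meta-llama/Llama-3.1-8B-Instruct",
    prompt="Hello!",
)
print(req.full_url)
print(json.loads(req.data)["model"])
```

Passing `req` to `urllib.request.urlopen` would send it. Against a sleeping Model, the first call may take longer while the model wakes.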
Via LiquidCompute (single /v1/* endpoint)
If you've installed LiquidCompute, there is a single /v1/* endpoint, and the model field in the request body is the Model CR name (e.g. llama-3-8b), not the HuggingFace ID:
curl https://app.example.com/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model": "llama-3-8b",
"messages": [{"role": "user", "content": "Hello!"}]
}'
Update a Model
Edit the YAML and re-apply:
kubectl apply -f llama-3-8b.yaml
| Field changed | Effect |
|---|---|
| fairness, sleep, extraArgs, endpoints | Pod rolls on next reconcile. |
| dtype, parallelism, vllmImage | Pod rolls. |
| modelName, loadFormat: gthread source | Re-conversion job runs first, then pod rolls. |
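Before re-applying, it can be useful to know which row of the table a change falls under. The sketch below is one reading of that table, not the controller's actual logic: it diffs two spec dicts and reports the strongest effect among the changed fields. `predict_effect` is hypothetical, and treating any loadFormat change as a re-conversion trigger is an assumption.

```python
# Field groups copied from the table above.
ROLL_ON_RECONCILE = {"fairness", "sleep", "extraArgs", "endpoints"}
ROLL = {"dtype", "parallelism", "vllmImage"}
RECONVERT = {"modelName", "loadFormat"}  # assumption: any loadFormat change


def predict_effect(old_spec, new_spec):
    """Return the strongest effect the table implies for a spec change."""
    changed = {
        key for key in old_spec.keys() | new_spec.keys()
        if old_spec.get(key) != new_spec.get(key)
    }
    if not changed:
        return "no change"
    if changed & RECONVERT:
        return "re-conversion job, then pod rolls"
    if changed & ROLL:
        return "pod rolls"
    if changed & ROLL_ON_RECONCILE:
        return "pod rolls on next reconcile"
    return "not covered by the table"


old = {"dtype": "bfloat16", "fairness": {"popular": False}}
new = {"dtype": "float16", "fairness": {"popular": True}}
print(predict_effect(old, new))
```

Here both dtype and fairness changed, so the sketch reports the dtype row ("pod rolls") as the stronger of the two effects.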
Delete a Model
kubectl delete model llama-3-8b -n default
The finalizer cleans up: pod terminates, GPU memory releases, Ingress/HTTPRoute removed. Converted weights on /mnt/models remain so a redeployed Model boots fast.
Real-world examples
See Recipes for ready-to-paste Models for popular setups: chat (Llama / Qwen / Gemma), TTS (Z-Image, Qwen-Omni), STT (Whisper), embeddings, rerankers.
Next steps
- Model CRD Reference — every spec field documented
- Lifecycle & States — phases, sub-states, conditions
- Fairness Policy — preemption, popularity, multi-tenant sharing
- Inference API — supported endpoints and protocols
