This page covers the engine specifically — the inferencing platform that owns Model lifecycle, GPU scheduling, and sleep/wake. For the layers above (Projects / Services UI, customer-facing API keys) see LiquidCompute and AI Console. For the whole picture see Platform Overview.
Components
The engine ships as four Go binaries from a single Helm chart:
| Binary | Type | Role |
|---|---|---|
| controller | Deployment (2 replicas, leader-elected) | Reconciles Model, GPUShareClaim, GPU, StorageNode. The leader phones home to the licence server hourly. |
| webhook | Deployment (2 replicas) | Validating admission for Model + GPUShareClaim. |
| agent | DaemonSet (every GPU node) | NVML discovery, per-GPU C scheduler daemons, DRA kubelet plugin, ResourceSlice publisher, GPU CRD reporter. |
| sidecar | Injected into every Model pod | HTTP front-door, sleep/wake state machine, vLLM coordination, GPU CRD CAS for fairness / preemption. |
Routing — getting an HTTP request from the cluster edge to the right Model pod — is not owned by the engine. Use a standard Ingress controller, or layer LiquidCompute on top for a unified /v1/* proxy.
High-level flow
- You declare a
Model. The webhook validates cross-field constraints (parallelism, sleep config, role layout). - The controller renders a pod with two containers: vLLM + the GreenThread sidecar. It also renders a per-Model
ResourceClaimagainst thegreenthread.aiDeviceClass. - When the pod is scheduled, the agent's DRA kubelet plugin receives
NodePrepareResources, reserves a slot on the requested GPU, and writes a per-claim CDI spec. - containerd reads the CDI spec and injects
libgreenthread.soplus GPU device nodes into the workload. - The sidecar takes over: it owns the HTTP front-door, the sleep/wake state machine, and coordinates with the per-GPU scheduler daemon for fair multi-tenant access.
Custom resources
| Kind | Scope | Owned by | Purpose |
|---|---|---|---|
Model | Namespaced | You | Declarative spec for a model deployment (modelName, dtype, parallelism, fairness, sleep, …). |
GPUShareClaim | Namespaced | You / LiquidCompute | A simpler-than-DRA way to ask for a GPU slot. The controller compiles it into a ResourceClaimTemplate. |
GPU | Cluster | Engine | One per physical GPU. Agent owns spec + telemetry; DRA driver owns mode; sidecar owns coordination. kubectl get gpus is the canonical fleet view. |
GPUQuota | Namespaced | You | Optional per-namespace caps (maxVRAMGiB, maxGPUs). Webhook enforces at admission. |
StorageNode | Cluster | Engine | Health-marker CR for the optional per-node storage server. Legacy — see callout below. |
$ kubectl get crds | grep greenthread
gpuquota.greenthread.ai
gpus.greenthread.ai
gpushareclaims.greenthread.ai
models.greenthread.ai
storagenodes.greenthread.ai
The Model CR
Most users only ever write Model. Example from a live cluster:
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
name: whisper-turbo
namespace: default
spec:
modelName: openai/whisper-large-v3-turbo
modelType: asr # chat | tts | asr | embedding | reranker
dtype: bfloat16
loadFormat: gthread # gthread (fast DMA) | auto (vLLM default)
endpoints: [audio_transcriptions, audio_translations]
parallelism: { tensor: 1, pipeline: 1 }
replicas: 1
fairness: { popular: true } # pin to GPU; never preempted
sleep: {} # defaults: idleTimeout=10m, evacuateContext=false
product: NVIDIA-RTX-PRO-4000-Blackwell
uuids:
- GPU-f78cce19-fbed-7f66-d72f-a24cae367fc8
Status reflects live state:
status:
phase: Ready
readyReplicas: 1
replicaStatus:
- podName: whisper-turbo-0
phase: Serving # Serving | Sleeping | Loading | Waking | Failed
message: sidecar state=serving
memory:
servingHuman: "20.0 GiB"
sleepingHuman: "0 B"
node: blackwell-0
See Model CRD Reference for the full field list and Lifecycle & States for the state machine.
GPU sharing modes
The DRA driver gates each physical GPU into one of:
| Mode | Behaviour | Used by |
|---|---|---|
any | Idle. Any sharing-class can land. | Free GPUs |
off | Whole-GPU. One non-shared claim only. | SharingClass=off claims |
inference | GreenThread Models (lockless coordination). | Engine Models |
spatial | Co-tenant Models (MPS-style spatial partitioning, gated by libgreenthread). | SharingClass=spatial |
temporal | Co-tenant Models with time-slicing. | SharingClass=temporal |
Once a GPU's mode is set, only matching-class claims are eligible — the slice's CEL selectors enforce this at the scheduler. See Fairness Policy for how the per-GPU scheduler arbitrates between siblings.
Supported APIs
Every Model exposes standard endpoints. By default the controller routes them at /<model-name>/v1/* on the cluster Ingress; layer LiquidCompute for a unified /v1/* with model selection by request body.
OpenAI-compatible
| Endpoint | Method | Notes |
|---|---|---|
/v1/chat/completions | POST | Multi-turn, streaming, tool calling |
/v1/completions | POST | Text completions |
/v1/responses | POST | OpenAI Responses API |
/v1/embeddings | POST | Text embeddings |
/v1/audio/transcriptions | POST | Whisper-compatible |
/v1/audio/translations | POST | |
/v1/audio/speech | POST | TTS (modelType: tts) |
/v1/models | GET | List models |
Anthropic-compatible
| Endpoint | Method | Notes |
|---|---|---|
/v1/messages | POST | Anthropic Messages |
/v1/messages/count_tokens | POST |
Scoring / reranking / tokenization
| Endpoint | Method | Notes |
|---|---|---|
/v1/score | POST | |
/v1/rerank | POST | |
/tokenize / /detokenize | POST | |
/classify | POST |
See API Reference for full details, request/response examples, and SDK usage.
Key features
True zero VRAM sleep
When a Model sleeps, the engine releases all GPU memory — not 90%, not "most of it" — truly zero bytes. The GPU is fully available for other Models with no wasted memory.
Sub-second wake
Wake from CPU RAM in ~600 ms. Clients see slightly higher latency on the first request, but the response format is identical. Subsequent requests are at full vLLM speed.
Fair GPU sharing
The fairness policy ensures no Model starves. Models idle longest are slept first; fairness.popular: true pins a Model to its GPU and exempts it from preemption.
Per-Model routing
Each Model gets its own URL path (/<model-name>/v1/...) by default. Or use LiquidCompute for a single /v1/* with selection by request model field.
Monitoring built in
Prometheus metrics, Grafana dashboards, and alerting are available out of the box. See Monitoring for setup.
Older deployments ran a separate gthread-storage DaemonSet that did weight staging over gRPC. Weight handling has been folded into the per-Model sidecar; the new install path skips the storage DaemonSet entirely. loadFormat: gthread still works against the legacy storage server when present (some dev clusters carry it); loadFormat: auto works everywhere with no extra components.
Deployment options
GreenThread deploys on any environment with NVIDIA GPUs. Pick your platform:
| Platform | Status | Guide |
|---|---|---|
| AWS (EKS) | Available | Get started |
| Bare metal / kubeadm | Available | Prerequisites |
| Azure (AKS) | Coming soon | — |
| GCP (GKE) | Coming soon | — |
Next steps
- Prerequisites — GPU Operator + DRA + cert-manager
- Install GreenThread — Helm install and verify
- LiquidCompute — add the UI + unified
/v1/*on top - AI Console — customer-facing portal
- Deploying Models — deploy your first models
