GreenThread Docs

This page covers the engine specifically — the inferencing platform that owns Model lifecycle, GPU scheduling, and sleep/wake. For the layers above (Projects / Services UI, customer-facing API keys) see LiquidCompute and AI Console. For the whole picture see Platform Overview.

Components

The engine ships as four Go binaries from a single Helm chart:

Binary	Type	Role
controller	Deployment (2 replicas, leader-elected)	Reconciles `Model`, `GPUShareClaim`, `GPU`, `StorageNode`. The leader phones home to the licence server hourly.
webhook	Deployment (2 replicas)	Validating admission for `Model` + `GPUShareClaim`.
agent	DaemonSet (every GPU node)	NVML discovery, per-GPU C scheduler daemons, DRA kubelet plugin, `ResourceSlice` publisher, GPU CRD reporter.
sidecar	Injected into every Model pod	HTTP front-door, sleep/wake state machine, vLLM coordination, GPU CRD CAS for fairness / preemption.

Routing — getting an HTTP request from the cluster edge to the right Model pod — is not owned by the engine. Use a standard Ingress controller, or layer LiquidCompute on top for a unified /v1/* proxy.

High-level flow

1. You declare a Model. The webhook validates cross-field constraints (parallelism, sleep config, role layout).
2. The controller renders a pod with two containers: vLLM and the GreenThread sidecar. It also renders a per-Model ResourceClaim against the greenthread.ai DeviceClass.
3. When the pod is scheduled, the agent's DRA kubelet plugin receives NodePrepareResources, reserves a slot on the requested GPU, and writes a per-claim CDI spec.
4. containerd reads the CDI spec and injects libgreenthread.so plus the GPU device nodes into the workload.
5. The sidecar takes over: it owns the HTTP front-door and the sleep/wake state machine, and it coordinates with the per-GPU scheduler daemon for fair multi-tenant access.

Custom resources

Kind	Scope	Owned by	Purpose
`Model`	Namespaced	You	Declarative spec for a model deployment (modelName, dtype, parallelism, fairness, sleep, …).
`GPUShareClaim`	Namespaced	You / LiquidCompute	A simpler-than-DRA way to ask for a GPU slot. The controller compiles it into a `ResourceClaimTemplate`.
`GPU`	Cluster	Engine	One per physical GPU. Agent owns spec + telemetry; DRA driver owns `mode`; sidecar owns coordination. `kubectl get gpus` is the canonical fleet view.
`GPUQuota`	Namespaced	You	Optional per-namespace caps (`maxVRAMGiB`, `maxGPUs`). Webhook enforces at admission.
`StorageNode`	Cluster	Engine	Health-marker CR for the optional per-node storage server. Legacy; see "Storage server (legacy)" below.

$ kubectl get crds | grep greenthread
gpuquotas.greenthread.ai
gpus.greenthread.ai
gpushareclaims.greenthread.ai
models.greenthread.ai
storagenodes.greenthread.ai
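
To make the GPUQuota row concrete, here is a minimal Python sketch of the kind of check an admission webhook could perform. Only `maxVRAMGiB` and `maxGPUs` come from the table above. The `Usage` record, the `admit` function, and the rule that current usage plus the new request must not exceed either cap are assumptions for illustration, not the engine's actual logic.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class GPUQuota:
    """Per-namespace caps; field names mirror the GPUQuota row above."""
    max_vram_gib: Optional[float] = None  # maxVRAMGiB; None means "no cap"
    max_gpus: Optional[int] = None        # maxGPUs; None means "no cap"


@dataclass(frozen=True)
class Usage:
    """Hypothetical tally of what a namespace already holds or is requesting."""
    vram_gib: float
    gpus: int


def admit(quota: GPUQuota, current: Usage, request: Usage) -> Tuple[bool, str]:
    """Return (allowed, reason). Assumed rule: current + request <= cap."""
    vram = current.vram_gib + request.vram_gib
    gpus = current.gpus + request.gpus
    if quota.max_vram_gib is not None and vram > quota.max_vram_gib:
        return False, f"VRAM {vram:g} GiB exceeds maxVRAMGiB={quota.max_vram_gib:g}"
    if quota.max_gpus is not None and gpus > quota.max_gpus:
        return False, f"{gpus} GPUs exceeds maxGPUs={quota.max_gpus}"
    return True, "within quota"
```

A quota with neither cap set admits everything in this sketch, which matches the table's description of the caps as optional.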

The Model CR

Most users only ever write Model. Example from a live cluster:

apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: whisper-turbo
  namespace: default
spec:
  modelName: openai/whisper-large-v3-turbo
  modelType: asr                 # chat | tts | asr | embedding | reranker
  dtype: bfloat16
  loadFormat: gthread            # gthread (fast DMA) | auto (vLLM default)
  endpoints: [audio_transcriptions, audio_translations]
  parallelism: { tensor: 1, pipeline: 1 }
  replicas: 1
  fairness: { popular: true }    # pin to GPU; never preempted
  sleep: {}                      # defaults: idleTimeout=10m, evacuateContext=false
  product: NVIDIA-RTX-PRO-4000-Blackwell
  uuids:
    - GPU-f78cce19-fbed-7f66-d72f-a24cae367fc8
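
The following Python sketch shows what validating a spec like this one might look like. The allowed `modelType` and `loadFormat` values are taken from the comments in the example; everything else is an assumption for illustration. In particular, the rule that pinned `uuids` must cover at least tensor × pipeline GPUs is a guess at one plausible cross-field constraint, not the webhook's documented behaviour.

```python
from typing import Any, Dict, List

MODEL_TYPES = {"chat", "tts", "asr", "embedding", "reranker"}
LOAD_FORMATS = {"gthread", "auto"}


def validate_model_spec(spec: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the spec passed."""
    errors: List[str] = []
    if not spec.get("modelName"):
        errors.append("modelName is required")
    if spec.get("modelType") not in MODEL_TYPES:
        errors.append(f"modelType must be one of {sorted(MODEL_TYPES)}")
    if "loadFormat" in spec and spec["loadFormat"] not in LOAD_FORMATS:
        errors.append(f"loadFormat must be one of {sorted(LOAD_FORMATS)}")

    par = spec.get("parallelism", {})
    degrees = {name: par.get(name, 1) for name in ("tensor", "pipeline")}
    for name, value in degrees.items():
        if type(value) is not int or value < 1:
            errors.append(f"parallelism.{name} must be a positive integer")
    if errors:
        return errors

    # Assumed cross-field rule: pinned GPUs must cover tensor x pipeline.
    needed = degrees["tensor"] * degrees["pipeline"]
    uuids = spec.get("uuids")
    if uuids is not None and len(uuids) < needed:
        errors.append(f"uuids lists {len(uuids)} GPU(s) but parallelism needs {needed}")
    return errors
```

Returning every problem at once, rather than failing on the first, is a design choice of this sketch so that a rejected manifest can be fixed in one pass.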

Status reflects live state:

status:
  phase: Ready
  readyReplicas: 1
  replicaStatus:
    - podName: whisper-turbo-0
      phase: Serving                # Serving | Sleeping | Loading | Waking | Failed
      message: sidecar state=serving
  memory:
    servingHuman: "20.0 GiB"
    sleepingHuman: "0 B"
  node: blackwell-0

See Model CRD Reference for the full field list and Lifecycle & States for the state machine.
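
As an illustration of reading this status from a script, the Python sketch below summarizes a status object that has already been loaded into a dict. It relies only on the fields shown in the example (`phase`, `replicaStatus[].podName`, `replicaStatus[].phase`); the function names are hypothetical, and how the dict is fetched from the cluster is left out.

```python
from collections import Counter
from typing import Any, Dict, List

# Replica phases as listed in the status example above.
REPLICA_PHASES = ("Serving", "Sleeping", "Loading", "Waking", "Failed")


def pods_in_phase(status: Dict[str, Any], phase: str) -> List[str]:
    """Names of replica pods currently reporting the given phase."""
    return [
        r.get("podName", "<unnamed>")
        for r in status.get("replicaStatus", [])
        if r.get("phase") == phase
    ]


def summarize(status: Dict[str, Any]) -> str:
    """One-line summary such as 'Ready: 1 Serving'."""
    counts = Counter(r.get("phase") for r in status.get("replicaStatus", []))
    parts = [f"{counts[p]} {p}" for p in REPLICA_PHASES if counts.get(p)]
    detail = ", ".join(parts) or "no replicas reported"
    return f"{status.get('phase', 'Unknown')}: {detail}"
```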

GPU modes

The DRA driver gates each physical GPU into one of the following modes:

Mode	Behaviour	Used by
`any`	Idle. Any sharing-class can land.	Free GPUs
`off`	Whole-GPU. One non-shared claim only.	`SharingClass=off` claims
`inference`	GreenThread Models (lockless coordination).	Engine Models
`spatial`	Co-tenant Models (MPS-style spatial partitioning, gated by libgreenthread).	`SharingClass=spatial`
`temporal`	Co-tenant Models with time-slicing.	`SharingClass=temporal`

Once a GPU's mode is set, only matching-class claims are eligible — the slice's CEL selectors enforce this at the scheduler. See Fairness Policy for how the per-GPU scheduler arbitrates between siblings.
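
The gating rule can be sketched in a few lines of Python. This is a simplified model built from the table above, not the driver's code: it assumes a claim carries a sharing class whose name equals the mode it needs (including `inference` for engine Models), and it tracks occupancy with a plain counter.

```python
MODES = {"any", "off", "inference", "spatial", "temporal"}
CLAIM_CLASSES = MODES - {"any"}  # assumption: a claim never asks for "any"


def eligible(gpu_mode: str, claim_class: str, active_claims: int) -> bool:
    """Can a claim of this class land on a GPU in this mode?"""
    if gpu_mode not in MODES or claim_class not in CLAIM_CLASSES:
        raise ValueError(f"unknown mode or class: {gpu_mode!r}, {claim_class!r}")
    if gpu_mode == "any":
        return True                      # idle GPU: any sharing class can land
    if gpu_mode != claim_class:
        return False                     # mode is set: only matching classes
    if gpu_mode == "off":
        return active_claims == 0        # whole-GPU: one non-shared claim only
    return True


def mode_after_claim(gpu_mode: str, claim_class: str, active_claims: int) -> str:
    """Mode the GPU is in once the claim has landed."""
    if not eligible(gpu_mode, claim_class, active_claims):
        raise ValueError(f"{claim_class} claim cannot land on a {gpu_mode} GPU")
    return claim_class
```

In the real system the slice's CEL selectors express this match declaratively; the function form here is only meant to make the rule easy to read.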

Supported APIs

Every Model exposes standard endpoints. By default each Model is served under /<model-name>/v1/* behind the cluster Ingress; layer LiquidCompute on top for a unified /v1/* with the Model selected by the request body.

OpenAI-compatible

Endpoint	Method	Notes
`/v1/chat/completions`	POST	Multi-turn, streaming, tool calling
`/v1/completions`	POST	Text completions
`/v1/responses`	POST	OpenAI Responses API
`/v1/embeddings`	POST	Text embeddings
`/v1/audio/transcriptions`	POST	Whisper-compatible
`/v1/audio/translations`	POST
`/v1/audio/speech`	POST	TTS (`modelType: tts`)
`/v1/models`	GET	List models

Anthropic-compatible

Endpoint	Method	Notes
`/v1/messages`	POST	Anthropic Messages
`/v1/messages/count_tokens`	POST

Scoring / reranking / tokenization

Endpoint	Method	Notes
`/v1/score`	POST
`/v1/rerank`	POST
`/tokenize` / `/detokenize`	POST
`/classify`	POST

See API Reference for full details, request/response examples, and SDK usage.
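
Since `/v1/chat/completions` supports streaming, a client needs to reassemble the streamed chunks. The Python sketch below assumes the stream follows the OpenAI server-sent-events convention (`data: {json}` lines ending with `data: [DONE]`), which is what "OpenAI-compatible" suggests but is not spelled out on this page. The function name is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from an OpenAI-style chat completion stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Any iterable of decoded lines works as input, so the same function can be fed from an HTTP response body or from a recorded transcript in a test.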

Key features

True zero VRAM sleep

When a Model sleeps, the engine releases all of its GPU memory: not 90%, not "most of it", but zero bytes. The GPU is then fully available to other Models, with no memory held in reserve.

Sub-second wake

A sleeping Model wakes from CPU RAM in about 600 ms. The first request after a wake sees slightly higher latency, but the response format is identical; subsequent requests run at full vLLM speed.

Fairness policy

The fairness policy ensures that no Model starves. The Models that have been idle longest are put to sleep first; setting fairness.popular: true pins a Model to its GPU and exempts it from preemption.
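
The ordering described above can be sketched in Python as follows. This is a toy model of the policy statement only: the `ModelState` record and its fields are hypothetical, and the real per-GPU scheduler is not described on this page.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ModelState:
    """Hypothetical snapshot of one Model sharing a GPU."""
    name: str
    idle_seconds: float
    popular: bool = False   # fairness.popular: true
    sleeping: bool = False


def sleep_candidates(models: Iterable[ModelState], count: int = 1) -> List[ModelState]:
    """Pick up to `count` Models to sleep, longest-idle first.

    Popular Models are pinned and never chosen; Models that are
    already asleep have nothing left to release.
    """
    awake = [m for m in models if not m.popular and not m.sleeping]
    awake.sort(key=lambda m: m.idle_seconds, reverse=True)
    return awake[:count]
```

If every awake Model is marked popular, the function returns an empty list, which reflects the exemption from preemption.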

Per-Model routing

Each Model gets its own URL path (/<model-name>/v1/...) by default. Alternatively, use LiquidCompute for a single /v1/* endpoint that selects the Model from the request's model field.
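
The difference between the two routing styles shows up in how a client builds its request. The Python sketch below constructs the URL and JSON body for each style without sending anything. The base URL and the Model name `my-chat-model` are placeholders, and treating `<model-name>` as the Model's `metadata.name` is an assumption.

```python
import json
from typing import Any, Dict, Tuple
from urllib.parse import quote


def per_model_request(base_url: str, model_name: str, endpoint: str,
                      body: Dict[str, Any]) -> Tuple[str, bytes]:
    """Per-Model routing: the Model is selected by the URL path."""
    url = f"{base_url.rstrip('/')}/{quote(model_name, safe='')}/v1/{endpoint.lstrip('/')}"
    return url, json.dumps(body).encode("utf-8")


def unified_request(base_url: str, model_name: str, endpoint: str,
                    body: Dict[str, Any]) -> Tuple[str, bytes]:
    """Unified routing: one /v1/* path, Model selected by the body's model field."""
    url = f"{base_url.rstrip('/')}/v1/{endpoint.lstrip('/')}"
    return url, json.dumps({**body, "model": model_name}).encode("utf-8")
```

Either pair can be passed to `urllib.request.Request(url, data=payload, headers={"Content-Type": "application/json"})` to send it.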

Monitoring built in

Prometheus metrics, Grafana dashboards, and alerting are available out of the box. See Monitoring for setup.

Storage server (legacy)

Older deployments ran a separate gthread-storage DaemonSet that staged weights over gRPC. Weight handling has since been folded into the per-Model sidecar, and the current install path skips the storage DaemonSet entirely. loadFormat: gthread still works against the legacy storage server where it is present (some dev clusters still carry it); loadFormat: auto works everywhere with no extra components.

Deployment options

GreenThread runs on Kubernetes clusters with NVIDIA GPUs. Pick your platform:

Platform	Status	Guide
AWS (EKS)	Available	Get started
Bare metal / kubeadm	Available	Prerequisites
Azure (AKS)	Coming soon	—
GCP (GKE)	Coming soon	—

Next steps

Prerequisites — GPU Operator + DRA + cert-manager
Install GreenThread — Helm install and verify
LiquidCompute — add the UI + unified /v1/* on top
AI Console — customer-facing portal
Deploying Models — deploy your first models