GreenThread Docs

This page covers the engine specifically: the inference platform that owns Model lifecycle, GPU scheduling, and sleep/wake. For the layers above (Projects / Services UI, customer-facing API keys), see LiquidCompute and AI Console. For the whole picture, see Platform Overview.

Components

The engine ships as four Go binaries from a single Helm chart:

| Binary | Type | Role |
| --- | --- | --- |
| controller | Deployment (2 replicas, leader-elected) | Reconciles Model, GPUShareClaim, GPU, StorageNode. The leader phones home to the licence server hourly. |
| webhook | Deployment (2 replicas) | Validating admission for Model + GPUShareClaim. |
| agent | DaemonSet (every GPU node) | NVML discovery, per-GPU C scheduler daemons, DRA kubelet plugin, ResourceSlice publisher, GPU CRD reporter. |
| sidecar | Injected into every Model pod | HTTP front-door, sleep/wake state machine, vLLM coordination, GPU CRD CAS for fairness / preemption. |

Routing (getting an HTTP request from the cluster edge to the right Model pod) is not owned by the engine. Use a standard Ingress controller, or layer LiquidCompute on top for a unified /v1/* proxy.

High-level flow

  1. You declare a Model. The webhook validates cross-field constraints (parallelism, sleep config, role layout).
  2. The controller renders a pod with two containers: vLLM + the GreenThread sidecar. It also renders a per-Model ResourceClaim against the greenthread.ai DeviceClass.
  3. When the pod is scheduled, the agent's DRA kubelet plugin receives NodePrepareResources, reserves a slot on the requested GPU, and writes a per-claim CDI spec.
  4. containerd reads the CDI spec and injects libgreenthread.so plus GPU device nodes into the workload.
  5. The sidecar takes over: it owns the HTTP front-door, the sleep/wake state machine, and coordinates with the per-GPU scheduler daemon for fair multi-tenant access.

Custom resources

| Kind | Scope | Owned by | Purpose |
| --- | --- | --- | --- |
| Model | Namespaced | You | Declarative spec for a model deployment (modelName, dtype, parallelism, fairness, sleep, …). |
| GPUShareClaim | Namespaced | You / LiquidCompute | A simpler-than-DRA way to ask for a GPU slot. The controller compiles it into a ResourceClaimTemplate. |
| GPU | Cluster | Engine | One per physical GPU. Agent owns spec + telemetry; DRA driver owns mode; sidecar owns coordination. kubectl get gpus is the canonical fleet view. |
| GPUQuota | Namespaced | You | Optional per-namespace caps (maxVRAMGiB, maxGPUs). Webhook enforces at admission. |
| StorageNode | Cluster | Engine | Health-marker CR for the optional per-node storage server. Legacy (see callout below). |

$ kubectl get crds | grep greenthread
gpuquota.greenthread.ai
gpus.greenthread.ai
gpushareclaims.greenthread.ai
models.greenthread.ai
storagenodes.greenthread.ai
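
The GPUQuota caps listed above (maxVRAMGiB, maxGPUs) suggest a simple admission-time comparison. The sketch below is only an illustration of that idea: the function name, its parameters, and the way current usage is tallied are assumptions, not the webhook's actual logic.

```python
from typing import Optional


def quota_allows(
    used_gpus: int,
    used_vram_gib: float,
    want_gpus: int,
    want_vram_gib: float,
    max_gpus: Optional[int] = None,
    max_vram_gib: Optional[float] = None,
) -> bool:
    """Return True if a new request fits under the namespace caps.

    Hypothetical helper: a cap of None means "no cap", mirroring the
    description of GPUQuota as optional. How the real webhook measures
    current usage is not documented here.
    """
    if max_gpus is not None and used_gpus + want_gpus > max_gpus:
        return False
    if max_vram_gib is not None and used_vram_gib + want_vram_gib > max_vram_gib:
        return False
    return True
```

A request that would push either total past its cap is rejected; a namespace with no GPUQuota is unconstrained in this sketch.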

The Model CR

Most users only ever write Model. Example from a live cluster:

apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: whisper-turbo
  namespace: default
spec:
  modelName: openai/whisper-large-v3-turbo
  modelType: asr                 # chat | tts | asr | embedding | reranker
  dtype: bfloat16
  loadFormat: gthread            # gthread (fast DMA) | auto (vLLM default)
  endpoints: [audio_transcriptions, audio_translations]
  parallelism: { tensor: 1, pipeline: 1 }
  replicas: 1
  fairness: { popular: true }    # pin to GPU; never preempted
  sleep: {}                      # defaults: idleTimeout=10m, evacuateContext=false
  product: NVIDIA-RTX-PRO-4000-Blackwell
  uuids:
    - GPU-f78cce19-fbed-7f66-d72f-a24cae367fc8
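
If you generate Model manifests programmatically, a small builder can catch obvious mistakes before the webhook does. This is a minimal sketch: the function name, its defaults, and the local checks are assumptions made for illustration; only the field names and the allowed modelType / loadFormat values come from the example above.

```python
import json

# Allowed values taken from the comments in the example manifest.
MODEL_TYPES = {"chat", "tts", "asr", "embedding", "reranker"}
LOAD_FORMATS = {"gthread", "auto"}


def build_model_manifest(name, model_name, model_type, *, namespace="default",
                         dtype="bfloat16", load_format="auto", endpoints=None,
                         tensor=1, pipeline=1, replicas=1, popular=False):
    """Build a Model manifest as a plain dict (hypothetical helper).

    The checks here are a local convenience only; the admission webhook
    remains the authority on what is valid.
    """
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown modelType: {model_type!r}")
    if load_format not in LOAD_FORMATS:
        raise ValueError(f"unknown loadFormat: {load_format!r}")
    spec = {
        "modelName": model_name,
        "modelType": model_type,
        "dtype": dtype,
        "loadFormat": load_format,
        "parallelism": {"tensor": tensor, "pipeline": pipeline},
        "replicas": replicas,
        "sleep": {},  # empty object: rely on the documented defaults
    }
    if endpoints:
        spec["endpoints"] = list(endpoints)
    if popular:
        spec["fairness"] = {"popular": True}
    return {
        "apiVersion": "greenthread.ai/v1alpha1",
        "kind": "Model",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = build_model_manifest(
        "whisper-turbo", "openai/whisper-large-v3-turbo", "asr",
        load_format="gthread",
        endpoints=["audio_transcriptions", "audio_translations"],
        popular=True,
    )
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed output can be piped to kubectl apply -f -.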

Status reflects live state:

status:
  phase: Ready
  readyReplicas: 1
  replicaStatus:
    - podName: whisper-turbo-0
      phase: Serving                # Serving | Sleeping | Loading | Waking | Failed
      message: sidecar state=serving
  memory:
    servingHuman: "20.0 GiB"
    sleepingHuman: "0 B"
  node: blackwell-0

See Model CRD Reference for the full field list and Lifecycle & States for the state machine.
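
For scripts that watch a Model, the status block above can be condensed into a few counters. The following is a sketch under assumptions: the helper name and the shape of its return value are invented here, and it reads only the status fields shown in the example.

```python
def summarize_status(status: dict) -> dict:
    """Condense a Model status dict into per-phase replica counts.

    Hypothetical helper; `status` is the parsed status object of a Model,
    however you choose to fetch it.
    """
    replicas = status.get("replicaStatus", [])
    counts = {}
    for replica in replicas:
        phase = replica.get("phase", "Unknown")
        counts[phase] = counts.get(phase, 0) + 1
    return {
        "phase": status.get("phase"),
        "ready": status.get("readyReplicas", 0),
        "replica_phases": counts,
        "failed_pods": [r.get("podName") for r in replicas
                        if r.get("phase") == "Failed"],
    }
```

Fed the status shown above, this reports one replica in the Serving phase and no failed pods.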

GPU sharing modes

The DRA driver places each physical GPU in one of the following modes:

| Mode | Behaviour | Used by |
| --- | --- | --- |
| any | Idle. Any sharing-class can land. | Free GPUs |
| off | Whole-GPU. One non-shared claim only. | SharingClass=off claims |
| inference | GreenThread Models (lockless coordination). | Engine Models |
| spatial | Co-tenant Models (MPS-style spatial partitioning, gated by libgreenthread). | SharingClass=spatial |
| temporal | Co-tenant Models with time-slicing. | SharingClass=temporal |

Once a GPU's mode is set, only claims of the matching sharing class are eligible; the ResourceSlice's CEL selectors enforce this at the scheduler. See Fairness Policy for how the per-GPU scheduler arbitrates between siblings.
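
The eligibility rule in the table can be restated as a small predicate. This is a sketch of the rule as described on this page, not the driver's code: in the engine the check is expressed as CEL selectors, and the function name and the existing_claims parameter are assumptions.

```python
# Sharing classes from the table; "any" is the idle GPU mode, not a class.
SHARING_CLASSES = {"off", "inference", "spatial", "temporal"}


def claim_is_eligible(gpu_mode: str, sharing_class: str,
                      existing_claims: int = 0) -> bool:
    """Return True if a claim of `sharing_class` may land on the GPU.

    Hypothetical restatement of the table:
    - an idle GPU (mode "any") accepts any sharing class;
    - otherwise the class must match the GPU's current mode;
    - mode "off" is whole-GPU, so it holds at most one claim.
    """
    if sharing_class not in SHARING_CLASSES:
        raise ValueError(f"unknown sharing class: {sharing_class!r}")
    if gpu_mode == "any":
        return True
    if gpu_mode != sharing_class:
        return False
    if gpu_mode == "off":
        return existing_claims == 0
    return True
```

For example, a temporal claim can land on an idle GPU or on one already in temporal mode, but not on a GPU in spatial mode.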

Supported APIs

Every Model exposes standard endpoints. By default the controller routes them at /<model-name>/v1/* on the cluster Ingress; layer LiquidCompute on top for a unified /v1/* endpoint that selects the Model from the request body.

OpenAI-compatible

| Endpoint | Method | Notes |
| --- | --- | --- |
| /v1/chat/completions | POST | Multi-turn, streaming, tool calling |
| /v1/completions | POST | Text completions |
| /v1/responses | POST | OpenAI Responses API |
| /v1/embeddings | POST | Text embeddings |
| /v1/audio/transcriptions | POST | Whisper-compatible |
| /v1/audio/translations | POST | |
| /v1/audio/speech | POST | TTS (modelType: tts) |
| /v1/models | GET | List models |

Anthropic-compatible

| Endpoint | Method | Notes |
| --- | --- | --- |
| /v1/messages | POST | Anthropic Messages |
| /v1/messages/count_tokens | POST | |

Scoring / reranking / tokenization

| Endpoint | Method | Notes |
| --- | --- | --- |
| /v1/score | POST | |
| /v1/rerank | POST | |
| /tokenize / /detokenize | POST | |
| /classify | POST | |

See API Reference for full details, request/response examples, and SDK usage.
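
To make the two URL layouts concrete, the sketch below builds (but does not send) a chat completions request for either the per-Model path or the unified path. The helper names and the example host are assumptions; which string belongs in the body's model field, and which name forms the path prefix, should be checked against the API Reference.

```python
import json
import urllib.request
from typing import Optional


def endpoint_url(base_url: str, endpoint: str,
                 model_route: Optional[str] = None) -> str:
    """Join base URL, optional per-Model prefix, and endpoint path.

    model_route=None gives the unified layout (/v1/...); a value gives
    the per-Model layout (/<model-name>/v1/...). Hypothetical helper.
    """
    if not endpoint.startswith("/"):
        raise ValueError("endpoint must start with '/'")
    prefix = f"/{model_route}" if model_route else ""
    return base_url.rstrip("/") + prefix + endpoint


def chat_completion_request(base_url: str, model: str, messages: list,
                            model_route: Optional[str] = None):
    """Build an OpenAI-style POST request object without sending it."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        endpoint_url(base_url, "/v1/chat/completions", model_route),
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; any authentication headers your Ingress or LiquidCompute requires would need to be added first.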

Key features

True zero VRAM sleep

When a Model sleeps, the engine releases all of its GPU memory: not 90%, not "most of it", but zero bytes (the Model status reports sleepingHuman: "0 B"). The GPU is fully available to other Models with no wasted memory.

Sub-second wake

A sleeping Model wakes from CPU RAM in roughly 600 ms. The first request after a wake sees slightly higher latency, but the response format is identical; subsequent requests run at full vLLM speed.

Fair GPU sharing

The fairness policy ensures no Model starves. Models that have been idle longest are put to sleep first; fairness.popular: true pins a Model to its GPU and exempts it from preemption.
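
The ordering described above can be illustrated in a few lines. This is a sketch of the stated rule only (longest-idle first, popular Models exempt); the Tenant type and function name are hypothetical, and the real per-GPU scheduler may weigh other factors covered in Fairness Policy.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tenant:
    """Hypothetical view of one Model sharing a GPU."""
    name: str
    idle_seconds: float
    popular: bool = False  # mirrors fairness.popular


def sleep_order(tenants: List[Tenant]) -> List[Tenant]:
    """Return sleep candidates, longest-idle first.

    Popular tenants are pinned, so they never appear in the result.
    """
    candidates = [t for t in tenants if not t.popular]
    return sorted(candidates, key=lambda t: t.idle_seconds, reverse=True)
```

With tenants idle for 30 s, 900 s, and 120 s plus one popular tenant idle for 5000 s, the order is the 900 s tenant, then 120 s, then 30 s; the popular tenant is never selected.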

Per-Model routing

Each Model gets its own URL path (/<model-name>/v1/...) by default. Alternatively, use LiquidCompute for a single /v1/* endpoint that selects the Model from the request's model field.

Monitoring built in

Prometheus metrics, Grafana dashboards, and alerting are available out of the box. See Monitoring for setup.

Storage server (legacy)

Older deployments ran a separate gthread-storage DaemonSet that staged weights over gRPC. Weight handling has since been folded into the per-Model sidecar, and the new install path skips the storage DaemonSet entirely. loadFormat: gthread still works against the legacy storage server when it is present (some dev clusters carry it); loadFormat: auto works everywhere with no extra components.

Deployment options

GreenThread deploys on any environment with NVIDIA GPUs. Pick your platform:

| Platform | Status | Guide |
| --- | --- | --- |
| AWS (EKS) | Available | Get started |
| Bare metal / kubeadm | Available | Prerequisites |
| Azure (AKS) | Coming soon | |
| GCP (GKE) | Coming soon | |

Next steps