GreenThread Docs

Before deploying the GreenThread engine, your Kubernetes cluster needs three things: NVIDIA drivers, a DRA driver to allocate GPUs, and an ingress controller for any HTTP traffic. The engine itself doesn't run an ingress; routing is owned by LiquidCompute and the cluster's standard Ingress controller (NGINX, Traefik, ALB, etc.).

| Component | Why | Required for |
|---|---|---|
| NVIDIA GPU Operator | Installs drivers, container toolkit, and feature discovery on GPU nodes. | Engine |
| NVIDIA DRA Driver | Dynamic Resource Allocation, GA from Kubernetes 1.34. The engine's agent registers as a DRA kubelet plugin and the controller renders DRA ResourceClaimTemplates. | Engine |
| Ingress controller | Any standard one (nginx, traefik, alb, etc.). | LiquidCompute + AI Console |
| cert-manager | TLS for the platform Ingress + custom domains. | LiquidCompute + AI Console |
| kube-prometheus-stack | Provides ServiceMonitor / PodMonitor CRDs that the engine emits. | Optional (only if you want metrics) |

The engine alone needs only the first two. The ingress controller and cert-manager become required when you layer LiquidCompute / AI Console on top; kube-prometheus-stack stays optional.
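
Because DRA is GA only from Kubernetes 1.34, it can be worth checking the API server version before installing anything. The following is a minimal sketch, not something shipped with GreenThread: `k8s_at_least` is a hypothetical helper name, and the usage line assumes `jq` is installed.

```shell
# Hypothetical helper (not part of GreenThread): succeeds when a Kubernetes
# gitVersion such as "v1.34.1-eks-abc123" is at least MAJOR.MINOR.
k8s_at_least() {
  ver="${1#v}"                 # drop the leading "v"
  major="${ver%%.*}"           # text before the first dot
  rest="${ver#*.}"
  minor="${rest%%.*}"          # text between the first and second dot
  minor="${minor%%[!0-9]*}"    # strip suffixes such as "34+"
  [ "$major" -gt "$2" ] || { [ "$major" -eq "$2" ] && [ "$minor" -ge "$3" ]; }
}

# Usage (assumes jq is available):
#   k8s_at_least "$(kubectl version -o json | jq -r .serverVersion.gitVersion)" 1 34 \
#     || echo "Kubernetes 1.34 or newer is needed for GA DRA"
```

Only the major and minor components are compared; patch levels and vendor suffixes are ignored.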

NVIDIA GPU Operator

The GPU Operator installs NVIDIA drivers, the container toolkit, CDI specs, and GPU feature discovery. The legacy device plugin is disabled in favour of DRA.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm upgrade --install gpu-operator nvidia/gpu-operator \
  --version=v25.10.1 \
  --create-namespace --namespace gpu-operator \
  --set devicePlugin.enabled=false \
  --set driver.enabled=true \
  --set toolkit.enabled=true

Wait for all pods to reach Running:

kubectl get pods -n gpu-operator -w

Driver compilation

The NVIDIA driver DaemonSet compiles kernel modules on each GPU node. On first boot this can take 5-10 minutes; subsequent restarts reuse the cached modules and start almost immediately.
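
Watching with -w works interactively; in a script, filtering the pod list may be easier. The sketch below assumes the default column order of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...) and treats Completed as acceptable, since one-shot validation pods in the operator namespace typically finish rather than stay Running. `not_ready_pods` is a hypothetical name.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints
# "<pod> <status>" for every pod that is neither Running nor Completed.
not_ready_pods() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}

# Usage:
#   kubectl get pods -n gpu-operator --no-headers | not_ready_pods
```

Empty output means nothing is still initialising or failing.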

NVIDIA DRA Driver

DRA replaces the legacy device-plugin model with claim-based GPU allocation. The engine's agent registers as a DRA kubelet plugin and publishes one ResourceSlice per node describing per-GPU slots; the engine's controller renders one ResourceClaimTemplate per GPUShareClaim against the cluster's DeviceClass.

Label nodes for the kubelet plugin

GPU_NODES=$(kubectl get nodes -l nvidia.com/gpu.present=true -o name)
for n in $GPU_NODES; do
  kubectl label "$n" nvidia.com/dra-kubelet-plugin=true --overwrite
done
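
To confirm the loop labelled every GPU node, the two node lists can be compared. This is a small sketch; `unlabelled_nodes` is a hypothetical helper, and the usage lines reuse the label names from the commands above.

```shell
# Prints every name in the first newline-separated list that is
# missing from the second newline-separated list.
unlabelled_nodes() {
  printf '%s\n' "$1" | while IFS= read -r node; do
    [ -n "$node" ] || continue
    # -x matches whole lines only, -F disables regex interpretation
    printf '%s\n' "$2" | grep -qxF -- "$node" || printf '%s\n' "$node"
  done
}

# Usage:
#   gpu=$(kubectl get nodes -l nvidia.com/gpu.present=true -o name)
#   labelled=$(kubectl get nodes -l nvidia.com/dra-kubelet-plugin=true -o name)
#   unlabelled_nodes "$gpu" "$labelled"   # no output: every GPU node is labelled
```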

Install the driver

Two important overrides:

  1. nvidiaDriverRoot=/run/nvidia/driver: the GPU Operator installs the driver under /run/nvidia/driver rather than the host root, so the DRA driver must be told where to find libcuda.
  2. Controller scheduling: managed Kubernetes services (EKS, GKE, AKS) have no schedulable control plane nodes, so the DRA controller's default affinity won't find a host. The command below clears the controller's nodeSelector and affinity and adds a CriticalAddonsOnly toleration.

helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version="25.12.0" \
  --create-namespace --namespace nvidia-dra-driver-gpu \
  --set resources.gpus.enabled=true \
  --set gpuResourcesEnabledOverride=true \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set controller.nodeSelector=null \
  --set "controller.affinity=null" \
  --set "controller.tolerations[0].key=CriticalAddonsOnly" \
  --set "controller.tolerations[0].operator=Exists"
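
The same overrides can be kept in a values file, which tends to be easier to review and version than a long list of --set flags. The sketch below is a direct translation of the flags above; the file name dra-values.yaml is an arbitrary choice.

```shell
# Write the overrides from the --set flags into a Helm values file.
# The file name is arbitrary.
cat > dra-values.yaml <<'EOF'
resources:
  gpus:
    enabled: true
gpuResourcesEnabledOverride: true
nvidiaDriverRoot: /run/nvidia/driver
controller:
  nodeSelector: null   # drop the chart's default node selector
  affinity: null       # drop the chart's default affinity
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
EOF

# Then install with:
#   helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
#     --version="25.12.0" \
#     --create-namespace --namespace nvidia-dra-driver-gpu \
#     -f dra-values.yaml
```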

Verify DRA

# Pods running
kubectl get pods -n nvidia-dra-driver-gpu

# DeviceClasses created
kubectl get deviceclasses
# Expected: gpu.nvidia.com, mig.nvidia.com (plus greenthread.ai once the engine installs)

# ResourceSlices showing GPUs (appears after the engine's agent boots)
kubectl get resourceslices
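
Since the agent is described as publishing one ResourceSlice per node, a per-node count is a quick sanity check once the engine is running. This is a sketch that assumes the ResourceSlice fields spec.nodeName and spec.driver; `slices_per_node` is a hypothetical helper.

```shell
# Reads "<node> <driver>" lines on stdin and prints
# "<node> <driver> <count>", sorted by node name.
slices_per_node() {
  awk '{ n[$1 " " $2]++ } END { for (k in n) print k, n[k] }' | sort
}

# Usage:
#   kubectl get resourceslices --no-headers \
#     -o custom-columns=NODE:.spec.nodeName,DRIVER:.spec.driver | slices_per_node
```

Other DRA drivers on the cluster may publish more than one slice per node, so a count above one is not necessarily an error for them.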

Ingress controller

Any one of these works. Pick whichever you already operate.

Option A: NGINX Ingress

helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace

Option B: cloud-provider ALB / Traefik / etc.

Use whatever's already running. You'll pass the IngressClass name (nginx, traefik, alb, …) to the LiquidCompute and AI Console charts as ingress.class / ingress.className.

cert-manager (for LiquidCompute + AI Console)

helm upgrade --install cert-manager cert-manager \
  --repo https://charts.jetstack.io \
  --namespace cert-manager --create-namespace \
  --set crds.enabled=true

Then create a ClusterIssuer — most installs use Let's Encrypt:

# letsencrypt.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
      - http01:
          ingress:
            class: nginx  # use the class of your ingress controller

kubectl apply -f letsencrypt.yaml
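
After applying the manifest, the issuer's Ready condition shows whether ACME account registration succeeded. A minimal sketch: `issuer_ready` is a hypothetical helper, and it assumes the ClusterIssuer exposes a standard `Ready` entry under status.conditions.

```shell
# Succeeds when the named ClusterIssuer reports a Ready condition
# whose status is "True".
issuer_ready() {
  status=$(kubectl get clusterissuer "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')
  [ "$status" = "True" ]
}

# Usage:
#   issuer_ready letsencrypt-production && echo "issuer ready"
```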

kube-prometheus-stack (optional)

Only required if you want Prometheus metrics + Grafana dashboards. The engine emits ServiceMonitors when serviceMonitor.enabled=true.

helm upgrade --install kube-prom kube-prometheus-stack \
  --repo https://prometheus-community.github.io/helm-charts \
  --namespace monitoring --create-namespace

See Monitoring for the full Prometheus + Grafana setup.

Checklist

Before installing the engine, confirm:

  • GPU Operator pods are all Running in gpu-operator
  • DRA Driver pods are Running and kubectl get deviceclasses shows gpu.nvidia.com
  • GPU nodes are labelled with nvidia.com/dra-kubelet-plugin=true
  • (LiquidCompute / AI Console) Ingress controller is up and kubectl get ingressclass shows a class
  • (LiquidCompute / AI Console) cert-manager is up and the ClusterIssuer is Ready
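
The checklist can also be run as a script. The sketch below is one possible shape, not an official preflight tool: `check` is a hypothetical helper that runs any command quietly and reports PASS or FAIL, and the usage lines assume the resource names used earlier on this page.

```shell
# Runs a command quietly and prints PASS or FAIL with a description.
# Returns non-zero on failure so callers can stop early if they wish.
check() {
  desc="$1"; shift
  if "$@" >/dev/null 2>&1; then
    echo "PASS  $desc"
  else
    echo "FAIL  $desc"
    return 1
  fi
}

# Usage:
#   check "DeviceClass gpu.nvidia.com exists" kubectl get deviceclass gpu.nvidia.com
#   check "an IngressClass exists" sh -c 'kubectl get ingressclass -o name | grep -q .'
#   check "ClusterIssuer is Ready" kubectl wait --for=condition=Ready \
#     clusterissuer/letsencrypt-production --timeout=30s
```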

Next steps

Continue to Install GreenThread for the Helm install.