Before deploying the GreenThread engine, your Kubernetes cluster needs three things: NVIDIA drivers, a DRA driver to allocate GPUs, and, for any HTTP traffic, an ingress controller. The engine itself doesn't run an ingress; routing is owned by LiquidCompute and the cluster's standard Ingress controller (NGINX, Traefik, ALB, etc.).
| Component | Why | Required for |
|---|---|---|
| NVIDIA GPU Operator | Installs drivers, container toolkit, and feature discovery on GPU nodes. | Engine |
| NVIDIA DRA Driver | Dynamic Resource Allocation — GA from Kubernetes 1.34. The engine's agent registers as a DRA kubelet plugin and the controller renders DRA ResourceClaimTemplates. | Engine |
| Ingress controller | Any standard one (nginx, traefik, alb, etc.). | LiquidCompute + AI Console |
| cert-manager | TLS for the platform Ingress + custom domains. | LiquidCompute + AI Console |
| kube-prometheus-stack | Provides ServiceMonitor / PodMonitor CRDs that the engine emits. | Optional — only if you want metrics |
The engine alone needs only the first two. The ingress controller and cert-manager become required when you layer LiquidCompute / AI Console on top; kube-prometheus-stack stays optional.
NVIDIA GPU Operator
The GPU Operator installs NVIDIA drivers, the container toolkit, CDI specs, and GPU feature discovery. The install below disables the legacy device plugin (devicePlugin.enabled=false) in favour of DRA.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm upgrade --install gpu-operator nvidia/gpu-operator \
  --version=v25.10.1 \
  --create-namespace --namespace gpu-operator \
  --set devicePlugin.enabled=false \
  --set driver.enabled=true \
  --set toolkit.enabled=true
Wait for all pods to reach Running:
kubectl get pods -n gpu-operator -w
The NVIDIA driver DaemonSet compiles kernel modules on the node. This can take 5-10 minutes on first boot. Subsequent restarts reuse the cached modules and are much faster.
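For scripted installs, the manual watch can be replaced by a polling loop. The following is a minimal sketch, not part of the GreenThread tooling: the function names pods_not_ready and wait_for_namespace are hypothetical, and the check reads only the STATUS column of kubectl get pods, so it does not inspect readiness probes. An empty namespace also counts as "nothing pending", so run it only once the operator has started creating pods.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print pods whose STATUS is neither Running nor
# Completed (the operator's one-shot validator pods finish as Completed).
pods_not_ready() {
  local ns="$1" listing
  listing="$(kubectl get pods -n "$ns" --no-headers)" || return 2
  printf '%s\n' "$listing" \
    | awk 'NF && $3 != "Running" && $3 != "Completed" { print $1 " (" $3 ")" }'
}

# Poll until no pod is pending, or give up after `timeout` seconds.
# Defaults (900 s timeout, 10 s interval) are assumptions, not product values.
wait_for_namespace() {
  local ns="$1" timeout="${2:-900}" interval="${3:-10}" waited=0 pending
  while :; do
    pending="$(pods_not_ready "$ns")" || return 2
    [ -z "$pending" ] && return 0
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out waiting for pods in $ns:" >&2
      echo "$pending" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

Usage: wait_for_namespace gpu-operator blocks for up to 15 minutes, which leaves headroom over the 5-10 minute driver build on a fresh node.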
NVIDIA DRA Driver
DRA replaces the legacy device-plugin model with claim-based GPU allocation. The engine's agent registers as a DRA kubelet plugin and publishes one ResourceSlice per node describing per-GPU slots; the engine's controller renders one ResourceClaimTemplate per GPUShareClaim against the cluster's DeviceClass.
Label nodes for the kubelet plugin
GPU_NODES=$(kubectl get nodes -l nvidia.com/gpu.present=true -o name)
for n in $GPU_NODES; do
  kubectl label "$n" nvidia.com/dra-kubelet-plugin=true --overwrite
done
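The loop labels only the GPU nodes that exist when it runs, so it can be worth confirming afterwards (and again after adding nodes) that none were missed. A small sketch, assuming the same two label keys used above; the function names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list GPU nodes that lack the DRA plugin label.
# The != selector also matches nodes where the label is absent entirely.
unlabelled_gpu_nodes() {
  kubectl get nodes \
    -l 'nvidia.com/gpu.present=true,nvidia.com/dra-kubelet-plugin!=true' \
    -o name
}

# Return 0 when every GPU node is labelled, 1 when some are missing,
# 2 when the kubectl query itself failed.
check_gpu_node_labels() {
  local missing
  missing="$(unlabelled_gpu_nodes)" || return 2
  if [ -n "$missing" ]; then
    echo "GPU nodes missing nvidia.com/dra-kubelet-plugin=true:" >&2
    echo "$missing" >&2
    return 1
  fi
  echo "All GPU nodes are labelled."
}
```

Usage: run check_gpu_node_labels after the labelling loop; a non-zero exit means the loop should be re-run.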
Install the driver
Two important overrides:
- nvidiaDriverRoot=/run/nvidia/driver: the GPU Operator installs to /run/nvidia/driver, not the host root, so the DRA driver needs to know where to find libcuda.
- Controller tolerations: managed Kubernetes services (EKS, GKE, AKS) have no schedulable control plane nodes, so the DRA controller's default affinity won't find a host.
helm upgrade --install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --version="25.12.0" \
  --create-namespace --namespace nvidia-dra-driver-gpu \
  --set resources.gpus.enabled=true \
  --set gpuResourcesEnabledOverride=true \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set controller.nodeSelector=null \
  --set "controller.affinity=null" \
  --set "controller.tolerations[0].key=CriticalAddonsOnly" \
  --set "controller.tolerations[0].operator=Exists"
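If you would rather keep these overrides in version control than on the command line, the same settings can be written to a values file. This is a sketch that should be equivalent to the --set flags above under Helm's usual rules (null removes a chart default, lists are replaced wholesale); the file name dra-values.yaml and the function name are arbitrary choices, and the result has not been checked against the chart.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the DRA driver overrides to a values file.
write_dra_values() {
  local out="${1:-dra-values.yaml}"
  cat > "$out" <<'EOF'
resources:
  gpus:
    enabled: true
gpuResourcesEnabledOverride: true
# The GPU Operator installs the driver here rather than on the host root.
nvidiaDriverRoot: /run/nvidia/driver
controller:
  # null clears the chart defaults that target control plane nodes.
  nodeSelector: null
  affinity: null
  tolerations:
    - key: CriticalAddonsOnly
      operator: Exists
EOF
}
```

Usage: call write_dra_values, then pass -f dra-values.yaml to the helm upgrade --install command in place of the --set flags.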
Verify DRA
# Pods running
kubectl get pods -n nvidia-dra-driver-gpu
# DeviceClasses created
kubectl get deviceclasses
# Expected: gpu.nvidia.com, mig.nvidia.com (plus greenthread.ai once the engine installs)
# ResourceSlices showing GPUs (appear after the engine's agent boots)
kubectl get resourceslices
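The DeviceClass check can also be scripted so that it fails loudly in CI. A minimal sketch; the function name require_deviceclasses is hypothetical, and it tests only that each named DeviceClass exists:

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every named DeviceClass exists.
require_deviceclasses() {
  local dc rc=0
  for dc in "$@"; do
    if kubectl get deviceclass "$dc" >/dev/null 2>&1; then
      echo "ok: $dc"
    else
      echo "missing: $dc" >&2
      rc=1
    fi
  done
  return "$rc"
}
```

Usage: require_deviceclasses gpu.nvidia.com mig.nvidia.com before installing the engine; add greenthread.ai to the list once the engine is installed.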
Ingress controller
Any one of these works. Pick whichever you already operate.
Option A: NGINX Ingress
helm upgrade --install ingress-nginx ingress-nginx \
  --repo https://kubernetes.github.io/ingress-nginx \
  --namespace ingress-nginx --create-namespace
Option B: cloud-provider ALB / Traefik / etc.
Use whatever's already running. You'll pass the IngressClass name (nginx, traefik, alb, …) to the LiquidCompute and AI Console charts as ingress.class / ingress.className.
cert-manager (for LiquidCompute + AI Console)
helm upgrade --install cert-manager cert-manager \
  --repo https://charts.jetstack.io \
  --namespace cert-manager --create-namespace \
  --set crds.enabled=true
Then create a ClusterIssuer — most installs use Let's Encrypt:
# letsencrypt.yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-production
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: ops@example.com
    privateKeySecretRef:
      name: letsencrypt-production
    solvers:
      - http01:
          ingress:
            class: nginx
kubectl apply -f letsencrypt.yaml
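Applying the manifest does not by itself mean the issuer has registered with Let's Encrypt. A sketch of a wait step, assuming the issuer name from the manifest above; the function name and the 120s default timeout are illustrative choices:

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the ClusterIssuer reports Ready,
# and print its details to stderr if it does not get there in time.
wait_for_clusterissuer() {
  local name="${1:-letsencrypt-production}" timeout="${2:-120s}"
  if ! kubectl wait --for=condition=Ready "clusterissuer/$name" \
      --timeout="$timeout"; then
    # ACME registration errors show up in the issuer's status and events.
    kubectl describe clusterissuer "$name" >&2
    return 1
  fi
}
```

Usage: wait_for_clusterissuer right after kubectl apply; pass a different name or timeout as the first and second arguments if needed.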
kube-prometheus-stack (optional)
Only required if you want Prometheus metrics + Grafana dashboards. The engine emits ServiceMonitors when serviceMonitor.enabled=true.
helm upgrade --install kube-prom kube-prometheus-stack \
  --repo https://prometheus-community.github.io/helm-charts \
  --namespace monitoring --create-namespace
See Monitoring for the full Prometheus + Grafana setup.
Checklist
Before installing the engine, confirm:
- GPU Operator pods are all Running in gpu-operator
- DRA Driver pods are Running and kubectl get deviceclasses shows gpu.nvidia.com
- GPU nodes are labelled with nvidia.com/dra-kubelet-plugin=true
- (LiquidCompute / AIC) Ingress controller is up and kubectl get ingressclass shows a class
- (LiquidCompute / AIC) cert-manager is up and the ClusterIssuer is Ready
Next steps
Continue to Install GreenThread for the Helm install.
