GreenThread is a GPU inference platform that serves dozens of large language models on shared GPUs by swapping them in and out of GPU memory on demand. Instead of dedicating GPUs to individual models, GreenThread keeps sleeping models on fast storage (CPU RAM or NVMe) and wakes them when a request arrives: in about 600 ms from CPU RAM, or about 2 s from NVMe, for a 20B-parameter model.
It ships as three composable layers — install only what you need:
| Layer | What it is | When to install |
|---|---|---|
| GreenThread | The engine. Kubernetes-native model lifecycle + GPU scheduling + sleep/wake. | Always. |
| LiquidCompute | "Render.com for GPUs": Projects, Services UI, custom domains, unified /v1/* proxy. | When you want a UI and one OpenAI-compatible endpoint for many Models. |
| AI Console | Customer-facing portal — API keys, usage dashboards, audit log, chat / STT / TTS playgrounds. | When you want to expose the cluster to end users / customers. |
See Platform Overview for how they snap together.
How the engine works
When a request arrives for a model:
- If the model is serving, the request is forwarded directly — no overhead.
- If the model is sleeping, GreenThread wakes it by restoring its weights to GPU memory (~600 ms from CPU RAM or ~2 s from NVMe for a 20B-parameter model), then forwards the request.
- If GPUs are full, the fairness policy preempts the least-recently-used model to make room.
All of this happens transparently behind OpenAI-compatible, Anthropic-compatible, and other standard inference APIs.
When a model sleeps, GreenThread releases all GPU memory — true zero VRAM — making the GPU fully available for other models. No dedicated GPU per model, no idle waste.
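The three cases above can be illustrated with a small toy model. This is only a sketch of the decision logic as described here, not GreenThread's implementation: the `ToyScheduler` class, the one-slot-per-model assumption, and the action strings are all hypothetical.

```python
from collections import OrderedDict


class ToyScheduler:
    """Toy model of the forward / wake / preempt decision (hypothetical)."""

    def __init__(self, gpu_slots: int) -> None:
        # Assumption: every model occupies exactly one GPU "slot".
        self.gpu_slots = gpu_slots
        # Serving models, ordered from least to most recently used.
        self.serving = OrderedDict()
        self.sleeping = set()

    def register(self, model: str) -> None:
        """New models start out sleeping (off the GPU)."""
        self.sleeping.add(model)

    def handle(self, model: str) -> list:
        """Return the list of actions taken to serve one request."""
        if model in self.serving:
            self.serving.move_to_end(model)  # mark as most recently used
            return ["forward:" + model]
        if model not in self.sleeping:
            raise KeyError(f"unknown model: {model}")
        actions = []
        if len(self.serving) >= self.gpu_slots:
            # GPUs are full: put the least-recently-used model to sleep.
            victim, _ = self.serving.popitem(last=False)
            self.sleeping.add(victim)
            actions.append("sleep:" + victim)
        self.sleeping.remove(model)
        self.serving[model] = None
        actions.append("wake:" + model)
        actions.append("forward:" + model)
        return actions


sched = ToyScheduler(gpu_slots=2)
for name in ("a", "b", "c"):
    sched.register(name)

print(sched.handle("a"))  # sleeping -> wake, then forward
print(sched.handle("b"))  # sleeping -> wake, then forward
print(sched.handle("a"))  # already serving -> forward only
print(sched.handle("c"))  # GPUs full -> "b" is least recently used and is preempted
```

An `OrderedDict` keeps the serving set in recency order, so picking the least-recently-used victim is a single `popitem(last=False)`. A real fairness policy would likely weigh more than recency alone.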
Performance
For a 20B-parameter model:
| Operation | Time |
|---|---|
| Traditional cold start (vLLM from auto) | ~40 s |
| GreenThread wake (from CPU RAM) | ~600 ms |
| GreenThread wake (from NVMe) | ~2 s |
| GreenThread sleep | ~90 ms |
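To put these figures in perspective, the short calculation below derives the speedup over a cold start and a rough cost for swapping one model out and another in. The numbers are copied from the table; the assumption that a preemption costs one sleep followed by one wake, run back to back, is ours and is not stated by the source.

```python
# Timings copied from the table above (20B-parameter model), in seconds.
COLD_START_S = 40.0
WAKE_RAM_S = 0.6
WAKE_NVME_S = 2.0
SLEEP_S = 0.09


def speedup(baseline_s: float, new_s: float) -> float:
    """How many times faster new_s is than baseline_s."""
    return baseline_s / new_s


# Assumption: a swap is the victim's sleep followed by the target's wake.
swap_ram_s = SLEEP_S + WAKE_RAM_S
swap_nvme_s = SLEEP_S + WAKE_NVME_S

print(f"wake from CPU RAM vs cold start: ~{speedup(COLD_START_S, WAKE_RAM_S):.0f}x faster")
print(f"wake from NVMe vs cold start:    ~{speedup(COLD_START_S, WAKE_NVME_S):.0f}x faster")
print(f"sleep + wake (CPU RAM): ~{swap_ram_s * 1000:.0f} ms")
print(f"sleep + wake (NVMe):    ~{swap_nvme_s * 1000:.0f} ms")
```

Under these assumptions a wake from CPU RAM is roughly 67x faster than a cold start, a wake from NVMe roughly 20x faster, and a full swap via CPU RAM takes about 690 ms.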
API compatibility
Every model exposes standard inference endpoints. Two ways to address them:
| Style | Path | When |
|---|---|---|
| Per-Model URL (engine only) | /<model-name>/v1/chat/completions | Direct integration; one URL per model. |
| Unified /v1/* (with LiquidCompute) | /v1/chat/completions + "model": "<name>" | OpenAI-shaped: exactly one endpoint regardless of how many models you serve. |
Either way, any SDK that works with OpenAI or Anthropic works with GreenThread.
See API Reference for the full endpoint list.
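As a rough illustration of the two addressing styles, the helpers below build the chat-completions URL for each one and the JSON body that carries the model name. The function names and the `example.com` hosts are placeholders introduced for this sketch. Whether the per-Model URL style also requires `"model"` in the body is not stated here, so the body is shown only for the unified style.

```python
import json


def per_model_url(engine_base: str, model_name: str,
                  endpoint: str = "chat/completions") -> str:
    """Per-Model style: the model name is part of the path."""
    return f"{engine_base.rstrip('/')}/{model_name}/v1/{endpoint}"


def unified_url(proxy_base: str, endpoint: str = "chat/completions") -> str:
    """Unified style: one path, the model is selected in the request body."""
    return f"{proxy_base.rstrip('/')}/v1/{endpoint}"


def unified_body(model_name: str, prompt: str) -> str:
    """OpenAI-shaped chat request body that names the target model."""
    return json.dumps({
        "model": model_name,
        "messages": [{"role": "user", "content": prompt}],
    })


# Placeholder hosts; substitute your own engine or LiquidCompute URL.
print(per_model_url("http://engine.example.com", "llama-3-8b"))
print(unified_url("https://app.example.com"))
print(unified_body("llama-3-8b", "Hello!"))
```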
Quickstart
Point your OpenAI SDK at your GreenThread instance:
```python
from openai import OpenAI

# Keep only one of the two clients below.

# Option 1: engine-only style (one base URL per model)
client = OpenAI(
    base_url="http://<engine-url>/llama-3-8b/v1",
    api_key="not-needed",
)

# Option 2: with LiquidCompute (unified /v1/*)
client = OpenAI(
    base_url="https://app.example.com/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # the upstream HF id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```
If the model is sleeping, GreenThread automatically wakes it, serves the request, and returns the response — all behind the standard API.
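Because a wake happens inside the request, the first call to a sleeping model can take noticeably longer than later ones. One way to observe this from the client side is to time each call. The sketch below uses only the standard library, with `fake_request` standing in for a real API call so that it runs without a server. Both `timed` and `fake_request` are hypothetical helpers, not part of any SDK.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_request(delay_s: float) -> str:
    """Stand-in for a real API call; sleeps to simulate latency."""
    time.sleep(delay_s)
    return "ok"


# With a real client, the call might look like:
#   response, seconds = timed(client.chat.completions.create, model=..., messages=[...])
_, first = timed(fake_request, 0.05)  # pretend this request included a wake
_, second = timed(fake_request, 0.0)  # pretend the model was already serving
print(f"first: {first * 1000:.0f} ms, second: {second * 1000:.0f} ms")
```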
Next steps
- Platform Overview — how the engine, LiquidCompute, and AI Console fit together
- Architecture — engine internals: controller, agent, sidecar, DRA
- Install GreenThread — get the engine running
- Deploying Models — deploy your first models
- Inference API — full endpoint reference
