GreenThread Docs

GreenThread is a GPU inference platform that serves dozens of large language models on shared GPUs by swapping them in and out of GPU memory. Instead of dedicating GPUs to individual models, GreenThread keeps sleeping models in CPU RAM or on NVMe and wakes them on demand, in under a second when the weights are held in CPU RAM.

It ships as three composable layers — install only what you need:

Layer	What it is	When to install
GreenThread	The engine. Kubernetes-native model lifecycle + GPU scheduling + sleep/wake.	Always.
LiquidCompute	"Render.com for GPUs" — Projects, Services UI, custom domains, unified `/v1/*` proxy.	When you want a UI and one OpenAI endpoint for many Models.
AI Console	Customer-facing portal — API keys, usage dashboards, audit log, chat / STT / TTS playgrounds.	When you want to expose the cluster to end users / customers.

See Platform Overview for how they snap together.

How the engine works

When a request arrives for a model:

If the model is serving, the request is forwarded directly, with no wake step.
If the model is sleeping, GreenThread wakes it (restores weights to GPU, ~600 ms from CPU RAM), then forwards the request.
If GPUs are full, the fairness policy preempts the least-recently-used model to make room.
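
The three cases above can be sketched as a small routing function. This is an illustrative model only, not GreenThread's implementation: the `Router` class, the idea of counting capacity in whole-model slots, and the action strings are all hypothetical, and the real fairness policy may weigh more than recency.

```python
from collections import OrderedDict


class Router:
    """Toy model of the serve / wake / preempt decision (hypothetical)."""

    def __init__(self, gpu_slots: int):
        # Assumption: GPU capacity is counted in whole-model slots.
        self.gpu_slots = gpu_slots
        # Models currently on GPU, ordered from least to most recently used.
        self.serving = OrderedDict()

    def handle(self, model: str) -> list:
        """Return the list of actions taken to serve one request for `model`."""
        if model in self.serving:
            # Case 1: already serving, so forward directly.
            self.serving.move_to_end(model)
            return ["forward"]

        actions = []
        if len(self.serving) >= self.gpu_slots:
            # Case 3: GPUs are full, so put the least-recently-used model to sleep.
            victim, _ = self.serving.popitem(last=False)
            actions.append(f"sleep:{victim}")

        # Case 2: wake the sleeping model, then forward the request.
        self.serving[model] = None
        actions += [f"wake:{model}", "forward"]
        return actions


router = Router(gpu_slots=2)
for name in ["a", "b", "a", "c"]:
    print(name, router.handle(name))
```

In this run the request for `c` arrives while both slots are taken, so the sketch sleeps `b` (the least recently used) before waking `c`.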

All of this happens transparently behind OpenAI-compatible, Anthropic-compatible, and other standard inference APIs.

When a model sleeps, GreenThread releases all GPU memory — true zero VRAM — making the GPU fully available for other models. No dedicated GPU per model, no idle waste.

Performance

For a 20B-parameter model:

Operation	Time
Traditional cold start (vLLM from `auto`)	~40 s
GreenThread wake (from CPU RAM)	~600 ms
GreenThread wake (from NVMe)	~2 s
GreenThread sleep	~90 ms
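
To put the table in perspective, the figures can be turned into ratios. The values below are the approximate ones from the table; the variable names are illustrative, and the combined sleep-plus-wake figure assumes the two operations simply run back to back.

```python
# Approximate timings from the table above, in seconds (20B-parameter model).
COLD_START_S = 40.0
WAKE_FROM_RAM_S = 0.6
WAKE_FROM_NVME_S = 2.0
SLEEP_S = 0.09

# How many times faster a wake is than a traditional cold start.
ram_speedup = COLD_START_S / WAKE_FROM_RAM_S
nvme_speedup = COLD_START_S / WAKE_FROM_NVME_S

# Swapping one model out and another in from CPU RAM
# (assumption: sleep and wake happen sequentially).
swap_s = SLEEP_S + WAKE_FROM_RAM_S

print(f"wake from CPU RAM: ~{ram_speedup:.0f}x faster than a cold start")
print(f"wake from NVMe:    ~{nvme_speedup:.0f}x faster than a cold start")
print(f"sleep + wake from CPU RAM: ~{swap_s * 1000:.0f} ms")
```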

API compatibility

Every model exposes standard inference endpoints. Two ways to address them:

Style	Path	When
Per-Model URL (engine only)	`/<model-name>/v1/chat/completions`	Direct integration; one URL per model.
Unified `/v1/*` (with LiquidCompute)	`/v1/chat/completions` + `"model": "<name>"`	OpenAI-shaped: exactly one endpoint regardless of how many models you serve.

Either way, any SDK that works with OpenAI or Anthropic works with GreenThread.
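
The difference between the two styles comes down to where the model name goes. The helper below is a sketch using only the Python standard library. `build_chat_request`, the host names, and the model name are hypothetical, and the function only constructs the request without sending it. What the per-Model endpoint expects in the `"model"` field is not specified here, so the sketch sends it in both styles as an assumption.

```python
import json
from typing import Optional
from urllib.request import Request


def build_chat_request(
    base: str, model: str, prompt: str, path_name: Optional[str] = None
) -> Request:
    """Build (but do not send) a chat completion request.

    If `path_name` is given, use the per-Model URL style;
    otherwise use the unified /v1/* style.
    """
    if path_name is not None:
        # Per-Model style: the model name is part of the path.
        url = f"{base}/{path_name}/v1/chat/completions"
    else:
        # Unified style: one endpoint, model selected by the "model" field.
        url = f"{base}/v1/chat/completions"

    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


per_model = build_chat_request(
    "http://engine.example", "my-model", "Hello!", path_name="my-model"
)
unified = build_chat_request("https://app.example.com", "my-model", "Hello!")
print(per_model.full_url)
print(unified.full_url)
```

Passing either `Request` to `urllib.request.urlopen` would send it; in practice an OpenAI-compatible SDK does the same work once its base URL is set.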

See API Reference for the full endpoint list.

Quickstart

Point your OpenAI SDK at your GreenThread instance:

from openai import OpenAI

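# Option 1: engine only (per-Model URL)
client = OpenAI(
    base_url="http://<engine-url>/llama-3-8b/v1",
    api_key="not-needed",
)

# Option 2: with LiquidCompute (unified /v1/*).
# Use one option or the other; if both are kept, this assignment replaces the client above.
client = OpenAI(
    base_url="https://app.example.com/v1",
    api_key="not-needed",
)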

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # the upstream HF id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

If the model is sleeping, GreenThread automatically wakes it, serves the request, and returns the response — all behind the standard API.

Next steps

Platform Overview — how the engine, LiquidCompute, and AI Console fit together
Architecture — engine internals: controller, agent, sidecar, DRA
Install GreenThread — get the engine running
Deploying Models — deploy your first models
Inference API — full endpoint reference