GreenThread Docs

GreenThread is a GPU inference platform that lets you serve dozens of large language models on shared GPUs by intelligently swapping them in and out of GPU memory. Instead of dedicating GPUs to individual models, GreenThread keeps sleeping models on fast storage and wakes them on demand — in under a second.

It ships as three composable layers — install only what you need:

| Layer | What it is | When to install |
| --- | --- | --- |
| GreenThread | The engine. Kubernetes-native model lifecycle + GPU scheduling + sleep/wake. | Always. |
| LiquidCompute | "Render.com for GPUs": Projects, Services UI, custom domains, unified /v1/* proxy. | When you want a UI and one OpenAI endpoint for many Models. |
| AI Console | Customer-facing portal: API keys, usage dashboards, audit log, chat / STT / TTS playgrounds. | When you want to expose the cluster to end users / customers. |

See Platform Overview for how they snap together.

How the engine works

When a request arrives for a model:

  1. If the model is serving, the request is forwarded directly — no overhead.
  2. If the model is sleeping, GreenThread wakes it (restores weights to GPU, ~600 ms from CPU RAM), then forwards the request.
  3. If GPUs are full, the fairness policy preempts the least-recently-used model to make room.

All of this happens transparently behind OpenAI-compatible, Anthropic-compatible, and other standard inference APIs.
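The three routing cases above can be sketched as a toy simulation. This is an illustration only, not GreenThread's implementation: the `ToyScheduler` class, its `gpu_slots` parameter, and the event names are hypothetical, and GPU capacity is simplified to a fixed number of model slots rather than actual memory accounting.

```python
from collections import OrderedDict


class ToyScheduler:
    """Toy model of the routing steps; capacity is a slot count (a simplification)."""

    def __init__(self, gpu_slots):
        self.gpu_slots = gpu_slots
        # Models currently on the GPU, ordered from least to most recently used.
        self.serving = OrderedDict()
        self.events = []

    def handle(self, model):
        if model in self.serving:
            # Case 1: the model is serving, so forward directly.
            self.serving.move_to_end(model)
            self.events.append(("forward", model))
            return
        if len(self.serving) >= self.gpu_slots:
            # Case 3: GPUs are full, so put the least-recently-used model to sleep.
            victim, _ = self.serving.popitem(last=False)
            self.events.append(("sleep", victim))
        # Case 2: wake the sleeping model, then forward the request.
        self.serving[model] = True
        self.events.append(("wake", model))
        self.events.append(("forward", model))


sched = ToyScheduler(gpu_slots=2)
for name in ["a", "b", "a", "c"]:
    sched.handle(name)
print(sched.events)
```

In this run, the request for `c` arrives while both slots are taken. Model `b` is put to sleep because `a` was used more recently, and `c` is woken before the request is forwarded.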

When a model sleeps, GreenThread releases all GPU memory — true zero VRAM — making the GPU fully available for other models. No dedicated GPU per model, no idle waste.

Performance

For a 20B-parameter model:

| Operation | Time |
| --- | --- |
| Traditional cold start (vLLM from auto) | ~40 s |
| GreenThread wake (from CPU RAM) | ~600 ms |
| GreenThread wake (from NVMe) | ~2 s |
| GreenThread sleep | ~90 ms |
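To put these figures in proportion, the short calculation below compares each wake path with a cold start and estimates the time to swap one model for another. The timings are the approximate values from the table. The helper names are hypothetical, and the swap estimate assumes that sleep and wake run back to back with no overlap, which the table does not state.

```python
# Approximate timings for a 20B-parameter model, copied from the table (seconds).
TIMINGS_S = {
    "cold_start": 40.0,
    "wake_cpu_ram": 0.6,
    "wake_nvme": 2.0,
    "sleep": 0.09,
}


def speedup_vs_cold_start(wake_path):
    """How many times faster than a cold start the given wake path is."""
    return TIMINGS_S["cold_start"] / TIMINGS_S[wake_path]


def swap_time_s(wake_path):
    """Rough time to sleep one model and wake another (assumes sequential steps)."""
    return TIMINGS_S["sleep"] + TIMINGS_S[wake_path]


for path in ("wake_cpu_ram", "wake_nvme"):
    print(
        f"{path}: ~{speedup_vs_cold_start(path):.0f}x faster than a cold start, "
        f"swap in ~{swap_time_s(path):.2f} s"
    )
```

Under these assumptions, waking from CPU RAM is roughly 67 times faster than a cold start and waking from NVMe is roughly 20 times faster. A full swap stays under a second only when the incoming model is held in CPU RAM.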

API compatibility

Every model exposes standard inference endpoints. Two ways to address them:

| Style | Path | When |
| --- | --- | --- |
| Per-Model URL (engine only) | /<model-name>/v1/chat/completions | Direct integration; one URL per model. |
| Unified /v1/* (with LiquidCompute) | /v1/chat/completions + "model": "<name>" | OpenAI-shaped: exactly one endpoint regardless of how many models you serve. |

Either way, any SDK that works with OpenAI or Anthropic works with GreenThread.
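For callers that do not use an SDK, the two addressing styles differ only in the URL. The sketch below builds (but does not send) a chat-completions request with Python's standard library. The helper `build_chat_request`, its `per_model_name` parameter, and the host `engine.example.com` are assumptions for illustration, and real deployments may also require an authorization header.

```python
import json
import urllib.request


def build_chat_request(base_url, model, messages, per_model_name=None):
    """Build a POST request for either addressing style without sending it.

    per_model_name given -> per-Model URL: <base>/<model-name>/v1/chat/completions
    per_model_name None  -> unified URL:   <base>/v1/chat/completions
    """
    base = base_url.rstrip("/")
    if per_model_name is not None:
        url = f"{base}/{per_model_name}/v1/chat/completions"
    else:
        url = f"{base}/v1/chat/completions"
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


messages = [{"role": "user", "content": "Hello!"}]
model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Hypothetical engine host, per-Model URL style.
per_model = build_chat_request(
    "http://engine.example.com", model_id, messages, per_model_name="llama-3-8b"
)
# Unified style through LiquidCompute.
unified = build_chat_request("https://app.example.com", model_id, messages)

print(per_model.full_url)
print(unified.full_url)
```

Passing either request to `urllib.request.urlopen` would send it. The JSON body is the same in both styles, so switching between them is a matter of changing the base URL.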

See API Reference for the full endpoint list.

Quickstart

Point your OpenAI SDK at your GreenThread instance:

from openai import OpenAI

# Engine-only style
client = OpenAI(
    base_url="http://<engine-url>/llama-3-8b/v1",
    api_key="not-needed",
)

# Or, with LiquidCompute (unified /v1/*).
# Keep only one of the two clients: this assignment replaces the one above.
client = OpenAI(
    base_url="https://app.example.com/v1",
    api_key="not-needed",
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",   # the upstream HF id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)

If the model is sleeping, GreenThread automatically wakes it, serves the request, and returns the response — all behind the standard API.

Next steps