GreenThread exposes inference endpoints through a per-model sidecar proxy. Each model gets its own URL path through the Kubernetes Gateway, and the sidecar handles wake-on-request, request queuing, and proxying to the vLLM inference engine.
Endpoint routing
Each model is accessible at its own path prefix through the gateway:
http://<gateway-address>/<model-name>/v1/chat/completions
The <model-name> is the Kubernetes Model CRD name (e.g. llama-3-8b), not the HuggingFace model ID. The gateway strips the model name prefix before forwarding to the sidecar.
Get your gateway address:
GATEWAY_URL=$(kubectl get gateway -n greenthread-system greenthread-gateway \
  -o jsonpath='{.status.addresses[0].value}')
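The routing rule above can be sketched as a pair of small helpers: one builds the client-facing URL, the other mimics the prefix stripping that the gateway performs before forwarding. The function names `model_url` and `strip_model_prefix` and the address `10.0.0.5` are hypothetical and used only for illustration; the real rewrite happens in the gateway, not in client code.

```python
def model_url(gateway: str, model_name: str, path: str) -> str:
    """Build the client-facing URL: http://<gateway-address>/<model-name>/<path>."""
    return f"http://{gateway.strip('/')}/{model_name}/{path.lstrip('/')}"


def strip_model_prefix(request_path: str, model_name: str) -> str:
    """Return the path the sidecar would see once the model prefix is removed."""
    prefix = f"/{model_name}"
    if request_path == prefix or request_path.startswith(prefix + "/"):
        return request_path[len(prefix):] or "/"
    return request_path  # not addressed to this model; left unchanged


# 10.0.0.5 is a made-up gateway address.
print(model_url("10.0.0.5", "llama-3-8b", "/v1/chat/completions"))
print(strip_model_prefix("/llama-3-8b/v1/chat/completions", "llama-3-8b"))
```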
Chat completions
/<model>/v1/chat/completions
Generate a chat completion. Supports multi-turn conversations and streaming.
curl http://$GATEWAY_URL/llama-3-8b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is GreenThread?"}
    ],
    "temperature": 0.7,
    "max_tokens": 512,
    "stream": false
  }'
Response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1700000000,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "GreenThread is a GPU inference platform..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 25,
    "completion_tokens": 50,
    "total_tokens": 75
  }
}
The model field in the request body uses the full HuggingFace model ID (e.g. meta-llama/Llama-3.1-8B-Instruct). The URL path uses the Model CRD name (e.g. llama-3-8b).
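Because the two identifiers are easy to mix up, here is a minimal sketch that prepares (without sending) a chat completion request and shows where each name goes. The helper `build_chat_request` and the host `gateway.example` are hypothetical; only the standard library is used.

```python
import json
import urllib.request


def build_chat_request(gateway: str, crd_name: str, hf_model_id: str,
                       prompt: str) -> urllib.request.Request:
    """Prepare, but do not send, a chat completion request."""
    body = {
        "model": hf_model_id,  # request body: full HuggingFace model ID
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"http://{gateway}/{crd_name}/v1/chat/completions",  # URL path: Model CRD name
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# gateway.example is a placeholder host.
req = build_chat_request(
    "gateway.example",
    "llama-3-8b",
    "meta-llama/Llama-3.1-8B-Instruct",
    "What is GreenThread?",
)
print(req.full_url)
print(json.loads(req.data)["model"])
```

Passing the prepared request to `urllib.request.urlopen(req)` would send it; that step is left out so the sketch runs without a gateway.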
Text completions
/<model>/v1/completions
Generate a text completion from a prompt string.
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "prompt": "The capital of France is",
  "max_tokens": 32,
  "temperature": 0
}
Responses API
/<model>/v1/responses
OpenAI Responses API. Supports tool use, multi-turn agents, and structured outputs.
Messages API (Anthropic-compatible)
/<model>/v1/messages
Anthropic Messages API format. Use this with models that support the Anthropic API schema.
Token counting
/<model>/v1/messages/count_tokens
Count tokens for a messages request without running inference.
Embeddings
/<model>/v1/embeddings
Generate text embeddings. Requires a model that supports embeddings (e.g. sentence-transformers).
{
  "model": "sentence-transformers/all-MiniLM-L6-v2",
  "input": "The quick brown fox jumps over the lazy dog"
}
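The response body for this endpoint is not shown here. Assuming it follows the OpenAI embeddings shape (a `data` list whose items carry `index` and `embedding`), the sketch below extracts the vectors and compares two of them by cosine similarity. The sample response and its two-dimensional vectors are made up for illustration, and `extract_vectors` and `cosine_similarity` are hypothetical helper names.

```python
import math


def extract_vectors(response: dict) -> list:
    """Pull vectors out of an OpenAI-style embeddings response (assumed shape)."""
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Made-up response for two inputs; real embeddings have hundreds of dimensions.
sample = {
    "object": "list",
    "data": [
        {"index": 1, "embedding": [1.0, 1.0]},
        {"index": 0, "embedding": [1.0, 0.0]},
    ],
}
v0, v1 = extract_vectors(sample)
print(round(cosine_similarity(v0, v1), 4))
```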
Audio
Transcriptions
/<model>/v1/audio/transcriptions
Transcribe audio to text. Requires a Whisper-compatible model.
Translations
/<model>/v1/audio/translations
Translate audio to English text.
Scoring and reranking
Scoring
/<model>/v1/score
Score text pairs. Also available at the root-level alias /score.
Reranking
/<model>/v1/rerank
Rerank documents by relevance. Also available at /v2/rerank and at the root-level alias /rerank.
Tokenization
Tokenize
/<model>/tokenize
Tokenize text into token IDs.
Detokenize
/<model>/detokenize
Convert token IDs back to text.
Classification
/<model>/classify
Text classification.
SageMaker compatibility
/<model>/invocations
SageMaker-compatible inference endpoint.
Streaming
Set "stream": true to receive Server-Sent Events as the model generates tokens:
curl http://$GATEWAY_URL/llama-3-8b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello"}],
    "stream": true
  }'
Each SSE event contains a chunk:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
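A client typically reassembles the reply by concatenating the `delta.content` fragments until it sees `[DONE]`. The following is a minimal sketch of that loop, fed with the events shown above rather than a live connection; `collect_stream_text` is a hypothetical helper name.

```python
import json


def collect_stream_text(lines) -> str:
    """Concatenate delta.content from chat.completion.chunk events."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


events = [
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    '',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"!"},"finish_reason":null}]}',
    '',
    'data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    '',
    'data: [DONE]',
]
print(collect_stream_text(events))  # Hello!
```

With a real connection, `lines` would be the decoded lines of the HTTP response body, read as they arrive.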
Model status
/<model>/status
Returns the current sidecar state for a model:
{
  "model": "llama-3-8b",
  "state": "serving",
  "bootReady": true,
  "queue": {
    "inFlight": 3,
    "barriered": false
  }
}
| Field | Description |
|---|---|
| state | Current state: sleeping, pending, waking, serving, deactivating |
| bootReady | true once the sidecar has completed initial boot |
| queue.inFlight | Number of requests currently being processed |
| queue.barriered | true during sleep drain; new requests are rejected |
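A client can use these fields to guess how a new inference request would be treated. The sketch below is one possible reading, not the sidecar's actual logic: the helper `request_outlook` is hypothetical, and the handling of `pending`, `deactivating`, and a sidecar that has not finished booting is an assumption.

```python
def request_outlook(status: dict) -> str:
    """Guess how a new inference request would be treated, given /status output."""
    if status.get("queue", {}).get("barriered"):
        return "rejected"  # sleep drain: new requests are rejected
    if not status.get("bootReady", False):
        return "rejected"  # assumption: a sidecar still booting cannot accept work
    state = status.get("state")
    if state == "serving":
        return "served"
    if state in ("sleeping", "pending", "waking"):
        return "queued"    # assumption: held until the model is serving
    return "rejected"      # deactivating or unknown state (assumption)


status = {
    "model": "llama-3-8b",
    "state": "serving",
    "bootReady": True,
    "queue": {"inFlight": 3, "barriered": False},
}
print(request_outlook(status))
```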
List models
/<model>/v1/models
Returns the loaded model even when sleeping.
{
  "object": "list",
  "data": [{
    "id": "meta-llama/Llama-3.1-8B-Instruct",
    "object": "model",
    "created": 1773032509,
    "owned_by": "vllm",
    "root": "meta-llama/Llama-3.1-8B-Instruct",
    "max_model_len": 131072
  }]
}
Wake-on-request
If a model is sleeping when an inference request arrives, the sidecar automatically wakes it:
- The sidecar triggers a wake — the storage agent reloads weights from NVMe into GPU memory
- If the GPU is full, the fairness policy preempts the least-recently-used model first
- Once serving, the request is proxied to vLLM and the response streams back
The client sees higher latency on the first request (~600 ms when weights are restored from CPU RAM, ~2 s from disk with GDS), but the response format is identical. Requests that arrive while the model is already serving incur no wake overhead.
During a wake transition, incoming requests are queued (not rejected). During a sleep drain, the barrier activates — new requests receive 503 while in-flight requests complete.
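To make the preemption rule concrete, here is a toy model of least-recently-used eviction. It is only an illustration of the LRU idea under simplifying assumptions: the class `GpuResidency` is hypothetical, capacity is counted in whole model slots rather than GPU memory, and GreenThread's actual fairness policy may weigh other factors.

```python
from collections import OrderedDict


class GpuResidency:
    """Toy model of GPU slots with least-recently-used preemption."""

    def __init__(self, capacity: int):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._resident = OrderedDict()  # model name -> None, least recent first

    def touch(self, model: str) -> list:
        """Record a request for `model`; return the models preempted to fit it."""
        preempted = []
        if model in self._resident:
            self._resident.move_to_end(model)  # already serving: refresh recency
            return preempted
        while len(self._resident) >= self.capacity:
            victim, _ = self._resident.popitem(last=False)  # least recently used
            preempted.append(victim)
        self._resident[model] = None  # wake: the model now occupies a slot
        return preempted

    def resident(self) -> list:
        return list(self._resident)


gpu = GpuResidency(capacity=2)
gpu.touch("llama-3-8b")
gpu.touch("qwen-3-5-35b-a3b")
gpu.touch("llama-3-8b")              # refreshes llama-3-8b
evicted = gpu.touch("my-model")      # GPU full: the stalest model is preempted
print(evicted)                       # ['qwen-3-5-35b-a3b']
print(gpu.resident())                # ['llama-3-8b', 'my-model']
```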
Error responses
| Status | Meaning |
|---|---|
| 502 | Backend error (vLLM process failed) |
| 503 | Model booting, draining for sleep, or wake failed |
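Since a 503 can be transient (the model is booting or draining), a client may choose to retry it with backoff while treating 502 as a hard failure. The sketch below shows one such policy; it is a suggestion rather than documented GreenThread behavior, and the helper `call_with_retry`, its defaults, and the `send` callable are all hypothetical.

```python
import time


def call_with_retry(send, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call `send()` until it returns a non-503 status or attempts run out.

    `send` is any zero-argument callable returning (status_code, body).
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status != 503:
            break  # success, or an error (such as 502) that is not retried here
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff: 0.5s, 1s, 2s
        status, body = send()
    return status, body


# Simulated responses stand in for real HTTP calls.
responses = iter([(503, "draining"), (503, "booting"), (200, "ok")])
delays = []
result = call_with_retry(lambda: next(responses), sleep=delays.append)
print(result, delays)  # (200, 'ok') [0.5, 1.0]
```

Injecting `sleep` keeps the helper testable without real waiting; in production the default `time.sleep` applies.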
SDK usage
Python (OpenAI SDK)
from openai import OpenAI

client = OpenAI(
    base_url="http://<gateway-url>/llama-3-8b/v1",
    api_key="not-needed",  # unless auth is configured
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
Python (Anthropic SDK)
import anthropic

client = anthropic.Anthropic(
    # The Anthropic SDK appends /v1/messages itself, so the base URL stops at the model path.
    base_url="http://<gateway-url>/my-model",
    api_key="not-needed",  # unless auth is configured
)

message = client.messages.create(
    model="my-model-name",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello!"}],
)
print(message.content[0].text)
curl (tool calling)
curl http://$GATEWAY_URL/qwen-3-5-35b-a3b/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [{"role": "user", "content": "What is the weather in London?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": {"type": "string"}
          },
          "required": ["location"]
        }
      }
    }],
    "max_tokens": 256
  }'
Tool calling requires --enable-auto-tool-choice and --tool-call-parser in the Model CRD extraArgs. See Recipes for examples.
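When the model decides to call a tool, an OpenAI-style response carries the call in `choices[0].message.tool_calls`, with the arguments encoded as a JSON string. The sketch below dispatches such calls to local functions and builds the `tool` messages for the next turn. The sample response, the call ID `call_1`, and the stand-in `get_weather` implementation are made up for illustration.

```python
import json


def get_weather(location: str) -> str:
    """Stand-in tool; a real one would query a weather service."""
    return f"Weather lookup requested for {location}"


TOOLS = {"get_weather": get_weather}  # tool name -> local callable


def run_tool_calls(response: dict) -> list:
    """Execute each tool call in a chat completion response and return
    the `tool` role messages to send back in the next turn."""
    message = response["choices"][0]["message"]
    results = []
    for call in message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": fn(**args),
        })
    return results


# Illustrative response, not captured from a real server.
sample_response = {
    "choices": [{
        "index": 0,
        "finish_reason": "tool_calls",
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "get_weather",
                    "arguments": "{\"location\": \"London\"}",
                },
            }],
        },
    }]
}
tool_messages = run_tool_calls(sample_response)
print(tool_messages)
```

To continue the conversation, append the assistant message and these `tool` messages to `messages` and send another chat completion request.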
