GreenThread supports serving LoRA adapters on top of base models with on-demand loading and automatic caching. Adapters are loaded lazily on first request and cached on the node for subsequent use.
Enable LoRA in the Model CRD
LoRA support is configured in the lora section of the Model CRD:
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  dtype: bfloat16
  gpuMemoryUtilization: "0.90"
  tensorParallelSize: 1
  replicas: 1
  extraArgs:
    - "--enforce-eager"
  lora:
    enabled: true
    maxRank: 64
    cacheDir: /tmp/lora-cache
    downloadTimeout: 5m
  fairness: {}
  kvCache: {}
  sleep: {}
  staging: {}
  resources: {}
| Field | Type | Default | Description |
|---|---|---|---|
| lora.enabled | bool | false | Enable LoRA adapter loading. Maps to vLLM --enable-lora. |
| lora.maxRank | int | 64 | Maximum LoRA rank supported. Must be >= the rank of any adapter you serve. Maps to --max-lora-rank. |
| lora.cacheDir | string | "" | Local path for downloaded LoRA adapter files. |
| lora.downloadTimeout | duration | 5m | Maximum time to wait for adapter download from object storage. |
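
Because lora.maxRank must be at least the rank of every adapter you serve, it can help to check an adapter before uploading it. The sketch below is one way to do that, assuming the adapter was exported by PEFT so that its adapter_config.json stores the rank in the `r` field. The function names are illustrative and not part of GreenThread.

```python
import json
from pathlib import Path


def adapter_rank(adapter_dir):
    """Read the LoRA rank (the `r` field) from a PEFT adapter_config.json."""
    config_path = Path(adapter_dir) / "adapter_config.json"
    config = json.loads(config_path.read_text())
    return int(config["r"])


def fits_max_rank(adapter_dir, max_rank=64):
    """Return True if the adapter's rank does not exceed lora.maxRank."""
    return adapter_rank(adapter_dir) <= max_rank
```

Running a check like this in your training or upload pipeline surfaces a rank mismatch before the adapter is ever requested.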
How it works
LoRA adapter loading uses a cache-on-demand pattern:
- You send an inference request with lora_adapter_metadata specifying the adapter.
- If the adapter is not cached on the node, the API returns a 404.
- Your proxy catches the 404, generates signed URLs for the adapter files, and retries with the files dict.
- GreenThread downloads the adapter, caches it, and serves the request.
- Subsequent requests for the same adapter and hash are served from cache instantly.
This means no upfront provisioning, no unnecessary S3 operations, and no per-request latency penalty once cached.
Request format
Add the lora_adapter_metadata field alongside your normal request body on the /v1/chat/completions and /v1/completions endpoints. Requests are routed through the gateway as usual:
POST http://<gateway-url>/<model-name>/v1/chat/completions
lora_adapter_metadata schema
{
  "lora_adapter_metadata": {
    "lora_model": {
      "id": "string (required)",
      "name": "string (required)",
      "hash": "string (recommended)",
      "files": {
        "adapter_config.json": "https://...",
        "adapter_model.safetensors": "https://..."
      }
    }
  }
}
| Field | Type | Required | Description |
|---|---|---|---|
| id | string | Yes | Unique identifier, used as the cache key |
| name | string | Yes | Display name, returned as the model field in responses |
| hash | string | Recommended | Version identifier for cache invalidation |
| files | object or string[] | Conditional | Adapter files to download. Required on retry after a 404. Preferred format is a dict mapping filename to URL. A flat list of URLs is also accepted (filenames are inferred from URL paths). |
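
To make the two accepted shapes of files concrete, here is a small client-side sketch that builds the metadata block and normalizes a flat URL list into the preferred dict form. How GreenThread infers filenames internally is not specified here; taking the last path segment of each URL is an assumption, and both helper names are hypothetical.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def files_as_dict(files):
    """Return a filename -> URL dict from either accepted `files` form."""
    if isinstance(files, dict):
        return dict(files)
    result = {}
    for url in files:
        # Assumption: the filename is the last path segment of the URL.
        # The query string (for example a signature) is not part of the path.
        name = PurePosixPath(urlparse(url).path).name
        if not name:
            raise ValueError(f"cannot infer a filename from {url!r}")
        result[name] = url
    return result


def build_lora_metadata(adapter_id, name, version_hash=None, files=None):
    """Build the lora_adapter_metadata block to merge into a request body."""
    lora_model = {"id": adapter_id, "name": name}
    if version_hash is not None:
        lora_model["hash"] = version_hash
    if files is not None:
        lora_model["files"] = files_as_dict(files)
    return {"lora_adapter_metadata": {"lora_model": lora_model}}
```

Sending the dict form avoids any ambiguity when a signed URL's path does not end in the expected filename.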
Integration pattern
The recommended pattern is to check for the 404 in your proxy layer and retry with signed URLs:
import requests

# Chat completions endpoint for the model, routed through the gateway.
CHAT_URL = "http://<gateway-url>/llama-3-1-8b/v1/chat/completions"

def inference_with_lora(prompt, adapter):
    # First attempt: send only the adapter identity, without file URLs.
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "lora_adapter_metadata": {
            "lora_model": {
                "id": adapter.id,
                "name": adapter.name,
                "hash": str(adapter.updated_at),
            }
        },
    }
    response = requests.post(CHAT_URL, json=payload)
    if response.status_code == 404:
        # Adapter not cached: add signed file URLs and retry.
        # generate_signed_url is your own helper for signing object-storage keys.
        payload["lora_adapter_metadata"]["lora_model"]["files"] = {
            filename: generate_signed_url(key)
            for filename, key in adapter.file_keys.items()
        }
        response = requests.post(CHAT_URL, json=payload)
    return response.json()
S3 access with IRSA
If your LoRA adapters are stored in S3, configure a Kubernetes ServiceAccount with IAM Roles for Service Accounts (IRSA) and reference it in the Model CRD:
spec:
  serviceAccountName: lora-adapter-sa
  lora:
    enabled: true
The model pod will use this service account's credentials to access S3.
Cache invalidation
The hash field controls cache invalidation. When GreenThread receives a request where the hash differs from what is cached:
- The existing cached adapter is invalidated.
- If files are provided, the new version is downloaded and cached.
- If files are not provided, a 404 is returned (triggering your retry flow).
Set hash to your adapter's updated_at timestamp. Whenever you retrain, the new timestamp naturally invalidates the old cache.
If hash is omitted, the cached adapter is used indefinitely.
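
The rules above can be summarized as a toy in-memory model. This is only a sketch of the documented behavior, not GreenThread's implementation: the class and method names are hypothetical, the "download" is simulated, and cases the text does not cover (such as an adapter cached without a hash and later requested with one) are resolved by assumption.

```python
from dataclasses import dataclass, field


@dataclass
class AdapterCacheModel:
    """Toy model: maps adapter id -> cached hash (None if no hash was sent)."""

    cached: dict = field(default_factory=dict)

    def handle(self, adapter_id, hash_=None, files=None):
        """Return the HTTP status the documented rules imply for one request."""
        is_cached = adapter_id in self.cached
        # A differing hash invalidates the cached copy; an omitted hash never does.
        if is_cached and hash_ is not None and self.cached[adapter_id] != hash_:
            del self.cached[adapter_id]
            is_cached = False
        if is_cached:
            return 200
        if files:
            # Simulated download: record the new version as cached.
            self.cached[adapter_id] = hash_
            return 200
        return 404


cache = AdapterCacheModel()
files = {"adapter_config.json": "https://...", "adapter_model.safetensors": "https://..."}
print(cache.handle("a1", "v1"))         # first request, nothing cached
print(cache.handle("a1", "v1", files))  # retry with files
print(cache.handle("a1", "v2"))         # retrained adapter, new hash, no files
```

The three calls walk through the miss, the retry with files, and the invalidation that a new hash triggers.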
Adapter file requirements
LoRA adapters must include:
- adapter_config.json: Hugging Face PEFT configuration
- adapter_model.safetensors: adapter weights in safetensors format
Any LoRA adapter trained with PEFT and exported in this format is compatible.
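
A quick pre-upload check for the two required files might look like the following. This is a minimal sketch: the function name is hypothetical, and it only checks that the files exist, not that their contents are valid.

```python
from pathlib import Path

# The two files every adapter must include.
REQUIRED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def missing_adapter_files(adapter_dir):
    """Return the required adapter files that are absent from adapter_dir."""
    root = Path(adapter_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means both files are present and the directory is ready to upload.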
Sleep/wake with LoRA
LoRA adapters are compatible with GreenThread's sleep/wake system. When a model with LoRA enabled sleeps and wakes:
- The base model weights are checkpointed and restored via the normal sleep/wake flow
- Cached LoRA adapters persist on the node's local filesystem (cacheDir)
- After wake, adapters are re-loaded on first request from the local cache, so no re-download is needed
Error responses
| HTTP Code | Error Type | Meaning |
|---|---|---|
| 404 | not_found | Adapter not in cache, no files provided. Retry with URLs. |
| 400 | bad_request | Missing required fields (id or name). |
| 502 | bad_gateway | Failed to download adapter files from provided URLs. |
| 503 | service_unavailable | Model is booting or draining for sleep. |
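
One way a proxy could act on these codes is sketched below. The action names and the policy itself are assumptions for illustration: in particular, treating 503 as "wait and retry" and 502 as a hard failure is a design choice for your proxy, not something GreenThread prescribes.

```python
def next_action(status_code, sent_files):
    """Map a response status to a proxy action (hypothetical policy)."""
    if status_code == 200:
        return "done"
    if status_code == 404 and not sent_files:
        # Adapter not cached: retry with signed URLs in `files`.
        return "retry_with_files"
    if status_code == 503:
        # Model is booting or draining for sleep: back off, then retry.
        return "retry_later"
    # 400 (missing id or name), 502 (download failed), or an unexpected
    # 404 after files were sent: surface the error to the caller.
    return "fail"
```

Keeping the decision in a pure function like this makes the retry policy easy to unit test separately from the HTTP calls.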
