GreenThread Docs

GreenThread supports serving LoRA adapters on top of base models with on-demand loading and automatic caching. Adapters are loaded lazily on first request and cached on the node for subsequent use.

Enable LoRA in the Model CRD

LoRA support is configured in the lora section of the Model CRD:

apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  dtype: bfloat16
  gpuMemoryUtilization: "0.90"
  tensorParallelSize: 1
  replicas: 1
  extraArgs:
    - "--enforce-eager"
  lora:
    enabled: true
    maxRank: 64
    cacheDir: /tmp/lora-cache
    downloadTimeout: 5m
  fairness: {}
  kvCache: {}
  sleep: {}
  staging: {}
  resources: {}

Field	Type	Default	Description
`lora.enabled`	bool	`false`	Enable LoRA adapter loading. Maps to vLLM `--enable-lora`.
`lora.maxRank`	int	`64`	Maximum LoRA rank supported. Must be >= the rank of any adapter you serve. Maps to `--max-lora-rank`.
`lora.cacheDir`	string	`""`	Local path for downloaded LoRA adapter files.
`lora.downloadTimeout`	duration	`5m`	Maximum time to wait for adapter download from object storage.

How it works

LoRA adapter loading uses a cache-on-demand pattern:

You send an inference request with lora_adapter_metadata specifying the adapter.
If the adapter is not cached on the node and the request carries no files, the API returns a 404.
Your proxy catches the 404, generates signed URLs for the adapter files, and retries the request with the files dict added.
GreenThread downloads the adapter, caches it, and serves the request.
Subsequent requests for the same adapter and hash are served from the local cache with no further download.

This means no upfront provisioning, no unnecessary S3 operations, and no per-request latency penalty once cached.

Request format

Add the lora_adapter_metadata field alongside your normal request body on the /v1/chat/completions and /v1/completions endpoints. Requests are routed through the gateway as usual:

POST http://<gateway-url>/<model-name>/v1/chat/completions

`lora_adapter_metadata` schema

{
  "lora_adapter_metadata": {
    "lora_model": {
      "id": "string (required)",
      "name": "string (required)",
      "hash": "string (recommended)",
      "files": {
        "adapter_config.json": "https://...",
        "adapter_model.safetensors": "https://..."
      }
    }
  }
}

Field	Type	Required	Description
`id`	string	Yes	Unique identifier, used as the cache key
`name`	string	Yes	Display name, returned as the `model` field in responses
`hash`	string	Recommended	Version identifier for cache invalidation
`files`	object or string[]	Conditional	Adapter files to download. Required on retry after a 404. Preferred format is a dict mapping filename to URL. A flat list of URLs is also accepted (filenames are inferred from URL paths).
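As a minimal sketch of the schema above, the helper below assembles the lora_adapter_metadata object on the client side and converts a flat list of URLs into the preferred filename-to-URL dict. The function names build_lora_metadata and normalize_files are hypothetical, and taking the last path segment of each URL as the filename is an assumption about how inference from URL paths works, not a description of GreenThread's own logic.

```python
from posixpath import basename
from urllib.parse import urlparse


def normalize_files(files):
    """Return a filename -> URL dict from either accepted `files` shape."""
    if isinstance(files, dict):
        return dict(files)
    # Flat list: take the last segment of the URL path as the filename.
    # The query string (where signed-URL parameters usually live) is ignored.
    return {basename(urlparse(url).path): url for url in files}


def build_lora_metadata(adapter_id, name, hash_value=None, files=None):
    """Assemble the `lora_adapter_metadata` object for a request body."""
    if not adapter_id or not name:
        raise ValueError("`id` and `name` are required")
    lora_model = {"id": adapter_id, "name": name}
    if hash_value is not None:
        lora_model["hash"] = str(hash_value)
    if files:
        lora_model["files"] = normalize_files(files)
    return {"lora_adapter_metadata": {"lora_model": lora_model}}
```

The returned dict can be merged into a normal chat or completion request body. Sending the dict form of files avoids relying on filename inference at all.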

Integration pattern

The recommended pattern is to send the request without files first and retry on a 404 in your proxy layer:

import requests

# Chat completions endpoint for this model, routed through the gateway.
CHAT_URL = "http://<gateway-url>/llama-3-1-8b/v1/chat/completions"

def inference_with_lora(prompt, adapter):
    payload = {
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "lora_adapter_metadata": {
            "lora_model": {
                "id": adapter.id,
                "name": adapter.name,
                "hash": str(adapter.updated_at),
            }
        },
    }

    # First attempt: no files, so a cached adapter is served directly.
    response = requests.post(CHAT_URL, json=payload)

    if response.status_code == 404:
        # Adapter not cached: add signed file URLs and retry once.
        # generate_signed_url is your own helper for your object store.
        payload["lora_adapter_metadata"]["lora_model"]["files"] = {
            filename: generate_signed_url(key)
            for filename, key in adapter.file_keys.items()
        }
        response = requests.post(CHAT_URL, json=payload)

    return response.json()

S3 access with IRSA

If your LoRA adapters are stored in S3, configure a Kubernetes ServiceAccount with IAM Roles for Service Accounts (IRSA) and reference it in the Model CRD:

spec:
  serviceAccountName: lora-adapter-sa
  lora:
    enabled: true

The model pod will use this service account's credentials to access S3.

Cache invalidation

The hash field controls cache invalidation. When GreenThread receives a request where the hash differs from what is cached:

The existing cached adapter is invalidated.
If files are provided, the new version is downloaded and cached.
If files are not provided, a 404 is returned (triggering your retry flow).

Use timestamps as hashes

Set hash to your adapter's updated_at timestamp. Whenever you retrain, the new timestamp naturally invalidates the old cache.

If hash is omitted, the cached adapter is used indefinitely.
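The invalidation rules in this section can be modeled with a small in-memory simulation. This is only an illustrative sketch: the AdapterCache class and its resolve method are hypothetical, and the real cache lives on the node's filesystem rather than in a dict.

```python
from dataclasses import dataclass, field


@dataclass
class AdapterCache:
    """Toy in-memory stand-in for the node-local adapter cache."""
    entries: dict = field(default_factory=dict)  # adapter id -> cached hash

    def resolve(self, adapter_id, hash_value=None, files=None):
        """Return 'hit', 'download', or 'not_found' for one request."""
        cached = adapter_id in self.entries
        if cached and hash_value is not None and self.entries[adapter_id] != hash_value:
            # Hash mismatch: the cached version is invalidated first.
            del self.entries[adapter_id]
            cached = False
        if cached:
            # Same hash, or no hash supplied: serve the cached adapter.
            return "hit"
        if files:
            # Files provided: download and cache under the new hash.
            self.entries[adapter_id] = hash_value
            return "download"
        # Nothing cached and nothing to download: the caller sees a 404.
        return "not_found"
```

For example, after an adapter is cached with hash "v1", a request with hash "v2" and no files resolves to "not_found", and the same request with files resolves to "download".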

Adapter file requirements

LoRA adapters must include:

adapter_config.json — HuggingFace PEFT configuration
adapter_model.safetensors — Adapter weights in safetensors format

Any LoRA adapter trained with PEFT and exported in this format is compatible.
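Before uploading an adapter, a quick local check can catch missing files or a rank above lora.maxRank. The sketch below assumes the adapter sits in a local directory and that adapter_config.json stores the rank under the PEFT key "r". The check_adapter_dir name is hypothetical and not part of GreenThread.

```python
import json
from pathlib import Path

REQUIRED_FILES = ("adapter_config.json", "adapter_model.safetensors")


def check_adapter_dir(path, max_rank=64):
    """Return a list of problems found in a local adapter directory."""
    root = Path(path)
    problems = [
        f"missing {name}" for name in REQUIRED_FILES if not (root / name).is_file()
    ]
    config_path = root / "adapter_config.json"
    if config_path.is_file():
        # PEFT writes the LoRA rank as "r" in adapter_config.json.
        rank = json.loads(config_path.read_text()).get("r")
        if isinstance(rank, int) and rank > max_rank:
            problems.append(f"rank {rank} exceeds maxRank {max_rank}")
    return problems
```

An empty list means the directory has both required files and a rank within the configured limit.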

Sleep/wake with LoRA

LoRA adapters are compatible with GreenThread's sleep/wake system. When a model with LoRA enabled sleeps and wakes:

The base model weights are checkpointed and restored via the normal sleep/wake flow
Cached LoRA adapters persist on the node's local filesystem (cacheDir)
After wake, adapters are re-loaded on first request from the local cache — no re-download needed

Error responses

HTTP Code	Error Type	Meaning
`404`	`not_found`	Adapter not in cache, no `files` provided. Retry with URLs.
`400`	`bad_request`	Missing required fields (`id` or `name`).
`502`	`bad_gateway`	Failed to download adapter files from provided URLs.
`503`	`service_unavailable`	Model is booting or draining for sleep.
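One way a proxy might act on these codes is sketched below. The status codes come from the table above, but the next_action function and the action names it returns are hypothetical, and the suggested reactions (for example, regenerating signed URLs after a 502) are assumptions rather than documented GreenThread guidance.

```python
def next_action(status_code, files_sent):
    """Map a response status to a suggested client-side action."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code == 404 and not files_sent:
        # Adapter not cached: retry with signed URLs in `files`.
        return "retry_with_files"
    if status_code == 400:
        # Missing `id` or `name`: retrying the same body will not help.
        return "fix_request"
    if status_code == 502:
        # Download failed: the URLs may be wrong or expired (assumption).
        return "regenerate_urls"
    if status_code == 503:
        # Model is booting or draining for sleep: back off and retry.
        return "retry_later"
    return "fail"
```

Passing files_sent lets the caller avoid looping on a 404 that persists after the files were already supplied.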