GreenThread Docs

KV cache persistence lets vLLM store computed KV cache entries in Redis via LMCache. When a prompt shares a prefix with a previous request, vLLM reuses the cached KV entries instead of recomputing them — reducing time-to-first-token and GPU compute.

This is particularly useful for:

System prompts — long system prompts are computed once and reused across all requests
Multi-turn conversations — prior turns are already cached
Shared context — multiple users querying with the same document/context prefix

Setup

Setup has two parts: deploying the Redis backend and enabling KV cache on your models.

1. Deploy Redis

Enable the LMCache Redis StatefulSet in your Helm values:

helm upgrade greenthread \
  oci://licence.greenthread.ai/greenthread/charts/greenthread \
  --namespace greenthread-system \
  --set lmcache.enabled=true

This deploys a Redis StatefulSet (lmcache-redis) with a headless service. All models with KV cache enabled share this Redis instance.

Redis configuration

Value	Default	Description
`lmcache.enabled`	`false`	Deploy the Redis StatefulSet
`lmcache.replicas`	`3`	Number of Redis replicas
`lmcache.maxMemory`	`80gb`	Redis `maxmemory` setting. Must use Redis-compatible units (`kb`, `mb`, `gb`) — not Kubernetes units (`Gi`)
`lmcache.resources.requests.memory`	`64Gi`	Kubernetes memory request for each Redis pod
`lmcache.resources.limits.memory`	`80Gi`	Kubernetes memory limit for each Redis pod
`lmcache.storage.storageClassName`	`fast-ssd`	StorageClass for Redis PVCs
`lmcache.storage.size`	`100Gi`	PVC size per replica

Memory units

The maxMemory field is passed directly to Redis as --maxmemory. Redis uses its own unit format (kb, mb, gb) — not Kubernetes-style units (Ki, Mi, Gi). Using Kubernetes units will cause Redis to fail on startup.
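
As a pre-deploy sanity check, a value can be validated before it reaches Redis. The sketch below is a hypothetical helper (`parse_redis_memory` is not part of GreenThread or Redis). It assumes the suffix rules documented in redis.conf: `k`, `m`, `g` are powers of 1000, and `kb`, `mb`, `gb` are powers of 1024, all case-insensitive.

```python
import re

# Suffix multipliers as documented in redis.conf (matched case-insensitively).
_REDIS_UNITS = {
    "": 1, "b": 1,
    "k": 1000, "kb": 1024,
    "m": 1000**2, "mb": 1024**2,
    "g": 1000**3, "gb": 1024**3,
}

_PATTERN = re.compile(r"^(\d+)([a-z]*)$")


def parse_redis_memory(value: str) -> int:
    """Convert a Redis-style memory string to bytes.

    Raises ValueError for anything Redis would not understand,
    including Kubernetes quantities such as "80Gi".
    """
    match = _PATTERN.match(value.strip().lower())
    if match is None or match.group(2) not in _REDIS_UNITS:
        raise ValueError(
            f"{value!r} is not a Redis memory size; use kb, mb, or gb"
        )
    return int(match.group(1)) * _REDIS_UNITS[match.group(2)]
```

Under these rules `80gb` is 80 × 1024³ bytes, the same byte count that Kubernetes writes as `80Gi`; only the spelling differs between the two systems.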

2. Enable on a model

Add kvCache.enabled: true to your Model spec:

apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  tensorParallelSize: 1
  replicas: 1
  kvCache:
    enabled: true

When enabled, the controller automatically configures the vLLM pod with:

Environment variables:

Variable	Value	Description
`LMCACHE_LOCAL_CPU`	`False`	Disable CPU-local caching (use Redis only)
`LMCACHE_CHUNK_SIZE`	`256`	KV cache chunk size in tokens
`LMCACHE_REMOTE_URL`	`redis://lmcache-redis.greenthread-system.svc.cluster.local:6379`	Redis connection URL
`LMCACHE_REMOTE_SERDE`	`naive`	Serialization format for KV cache entries
`LMCACHE_USE_EXPERIMENTAL`	`True`	Enable experimental LMCache features (required for vLLM v1 connector)

vLLM argument:

--kv-transfer-config={"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}

This tells vLLM to use the LMCache connector for both reading and writing KV cache entries.
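
The argument value is a JSON object, so if you ever need to reproduce it by hand (for example, when launching vLLM outside the controller), generating it with a JSON serializer avoids quoting mistakes. This is a minimal sketch; `kv_transfer_arg` is a hypothetical helper name, not something the controller exposes.

```python
import json


def kv_transfer_arg(connector: str = "LMCacheConnectorV1",
                    role: str = "kv_both") -> str:
    """Build the --kv-transfer-config argument as compact JSON."""
    config = {"kv_connector": connector, "kv_role": role}
    # Compact separators reproduce the exact spacing-free form shown above.
    return "--kv-transfer-config=" + json.dumps(config, separators=(",", ":"))


print(kv_transfer_arg())
```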

How it works

On the first request, vLLM computes the KV cache for the full prompt and writes it to Redis in chunks
On subsequent requests that share a prefix, vLLM loads the matching KV cache chunks from Redis and only computes the new tokens
Redis uses an LRU eviction policy (allkeys-lru) — when memory is full, the least recently used cache entries are evicted automatically
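
The chunked prefix lookup described above can be illustrated with a toy model. This is a sketch under stated assumptions, not LMCache's actual implementation: the key scheme (a running SHA-256 over the prefix), the rule that only full chunks are stored, and the names `chunk_keys` and `serve` are all hypothetical, and a plain dict stands in for Redis.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 256  # same value as LMCACHE_CHUNK_SIZE


def chunk_keys(tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """Return one key per full chunk; each key depends on the entire prefix."""
    keys = []
    running = hashlib.sha256()
    full_length = len(tokens) - len(tokens) % chunk_size
    for start in range(0, full_length, chunk_size):
        chunk = tokens[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode())
        running.update(b"|")  # chunk boundary marker
        keys.append(running.hexdigest())
    return keys


def serve(tokens: List[int], store: Dict[str, bytes]) -> Tuple[int, int]:
    """Return (tokens reused from the store, tokens that must be computed)."""
    keys = chunk_keys(tokens)
    hits = 0
    for key in keys:
        if key not in store:
            break  # the first miss ends the shared prefix
        hits += 1
    for key in keys[hits:]:
        store[key] = b"<kv bytes>"  # placeholder for the serialized KV chunk
    reused = hits * CHUNK_SIZE
    return reused, len(tokens) - reused


store: Dict[str, bytes] = {}
first = list(range(1000))
second = list(range(600)) + [7] * 400  # shares the first 600 tokens
print(serve(first, store))   # (0, 1000)
print(serve(second, store))  # (512, 488)
```

In this model the second request shares 600 tokens with the first, but only the two full 256-token chunks (512 tokens) are reused, because the third chunk contains tokens that differ.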

Verifying it works

Check that the LMCache env vars are set on a model pod:

kubectl exec -n greenthread-system <model-pod> -c vllm -- env | grep LMCACHE

Expected output:

LMCACHE_LOCAL_CPU=False
LMCACHE_CHUNK_SIZE=256
LMCACHE_REMOTE_URL=redis://lmcache-redis.greenthread-system.svc.cluster.local:6379
LMCACHE_REMOTE_SERDE=naive
LMCACHE_USE_EXPERIMENTAL=True
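
If you check many pods, the comparison can be scripted. The following is a rough sketch: `check_lmcache_env` is a hypothetical helper that takes the text printed by the command above and reports any variable that is missing or differs from the values listed in this section.

```python
from typing import Dict, Optional, Tuple

# Expected values, copied from the table in this section.
EXPECTED = {
    "LMCACHE_LOCAL_CPU": "False",
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_REMOTE_URL":
        "redis://lmcache-redis.greenthread-system.svc.cluster.local:6379",
    "LMCACHE_REMOTE_SERDE": "naive",
    "LMCACHE_USE_EXPERIMENTAL": "True",
}


def check_lmcache_env(env_output: str) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {name: (expected, actual)} for each mismatch; actual is None if unset."""
    actual = {}
    for line in env_output.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name.startswith("LMCACHE_"):
            actual[name] = value
    return {
        name: (expected, actual.get(name))
        for name, expected in EXPECTED.items()
        if actual.get(name) != expected
    }
```

An empty result means the pod matches the expected configuration.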

Verify Redis is running and accessible:

kubectl exec -n greenthread-system lmcache-redis-0 -- redis-cli ping

Expected output: PONG

Troubleshooting

Redis fails on startup

Check that maxMemory uses Redis units, not Kubernetes units:

lmcache:
  maxMemory: 80gb   # correct: Redis format
  # maxMemory: 80Gi  # wrong: Kubernetes format

Redis not reachable from vLLM pods

Verify the headless service exists and has endpoints:

kubectl get svc lmcache-redis -n greenthread-system
kubectl get endpoints lmcache-redis -n greenthread-system
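
If the service and endpoints look correct, a plain TCP check from inside a model pod can separate DNS or network problems from Redis problems. This is a minimal sketch using only the Python standard library; `redis_host_port` and `can_connect` are hypothetical helper names, and the check only confirms that the port accepts connections, not that Redis answers commands.

```python
import socket
from typing import Tuple
from urllib.parse import urlparse


def redis_host_port(url: str) -> Tuple[str, int]:
    """Extract (host, port) from a redis:// URL such as LMCACHE_REMOTE_URL."""
    parsed = urlparse(url)
    if parsed.scheme != "redis" or parsed.hostname is None:
        raise ValueError(f"not a redis:// URL: {url!r}")
    return parsed.hostname, parsed.port or 6379


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False
```

Run inside a model pod, `can_connect(*redis_host_port(os.environ["LMCACHE_REMOTE_URL"]))` returning `False` points at name resolution or network policy rather than at vLLM or LMCache.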