KV cache persistence lets vLLM store computed KV cache entries in Redis via LMCache. When a prompt shares a prefix with a previous request, vLLM reuses the cached KV entries instead of recomputing them — reducing time-to-first-token and GPU compute.
This is particularly useful for:
- System prompts — long system prompts are computed once and reused across all requests
- Multi-turn conversations — prior turns are already cached
- Shared context — multiple users querying with the same document/context prefix
Setup
Setup has two parts: deploying the Redis backend and enabling KV cache on your models.
1. Deploy Redis
Enable the LMCache Redis StatefulSet in your Helm values:
helm upgrade greenthread \
  oci://licence.greenthread.ai/greenthread/charts/greenthread \
  --namespace greenthread-system \
  --set lmcache.enabled=true
This deploys a Redis StatefulSet (lmcache-redis) with a headless service. All models with KV cache enabled share this Redis instance.
Redis configuration
| Value | Default | Description |
|---|---|---|
| lmcache.enabled | false | Deploy the Redis StatefulSet |
| lmcache.replicas | 3 | Number of Redis replicas |
| lmcache.maxMemory | 80gb | Redis maxmemory setting. Must use Redis-compatible units (kb, mb, gb), not Kubernetes units (Gi) |
| lmcache.resources.requests.memory | 64Gi | Kubernetes memory request for each Redis pod |
| lmcache.resources.limits.memory | 80Gi | Kubernetes memory limit for each Redis pod |
| lmcache.storage.storageClassName | fast-ssd | StorageClass for Redis PVCs |
| lmcache.storage.size | 100Gi | PVC size per replica |
The maxMemory field is passed directly to Redis as --maxmemory. Redis uses its own unit format (kb, mb, gb) — not Kubernetes-style units (Ki, Mi, Gi). Using Kubernetes units will cause Redis to fail on startup.
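To catch a bad unit before it reaches the cluster, the value can be checked locally. The following is a minimal sketch, not part of the chart: `parse_redis_memory` is a hypothetical helper, and the multipliers follow the unit table documented in `redis.conf` (suffixes are case-insensitive, `kb`/`mb`/`gb` are powers of 1024, bare `k`/`m`/`g` are powers of 1000).

```python
import re

# Multipliers as documented in redis.conf. Suffixes are case-insensitive.
_REDIS_UNITS = {
    "": 1, "b": 1,
    "k": 1000, "kb": 1024,
    "m": 1000**2, "mb": 1024**2,
    "g": 1000**3, "gb": 1024**3,
}

def parse_redis_memory(value: str) -> int:
    """Return the size in bytes, or raise ValueError for non-Redis units such as 'Gi'."""
    match = re.fullmatch(r"(\d+)([a-zA-Z]*)", value.strip())
    if match is None or match.group(2).lower() not in _REDIS_UNITS:
        raise ValueError(f"{value!r} is not a Redis memory value")
    return int(match.group(1)) * _REDIS_UNITS[match.group(2).lower()]
```

With this helper, `parse_redis_memory("80gb")` returns 80 × 1024³ bytes, while `parse_redis_memory("80Gi")` raises `ValueError`. Note that 80gb in Redis units is the same number of bytes as the 80Gi Kubernetes memory limit in the table above.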
2. Enable on a model
Add kvCache.enabled: true to your Model spec:
apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  tensorParallelSize: 1
  replicas: 1
  kvCache:
    enabled: true
When enabled, the controller automatically configures the vLLM pod with:
Environment variables:
| Variable | Value | Description |
|---|---|---|
| LMCACHE_LOCAL_CPU | False | Disable CPU-local caching (use Redis only) |
| LMCACHE_CHUNK_SIZE | 256 | KV cache chunk size in tokens |
| LMCACHE_REMOTE_URL | redis://lmcache-redis.greenthread-system.svc.cluster.local:6379 | Redis connection URL |
| LMCACHE_REMOTE_SERDE | naive | Serialization format for KV cache entries |
| LMCACHE_USE_EXPERIMENTAL | True | Enable experimental LMCache features (required for vLLM v1 connector) |
vLLM argument:
--kv-transfer-config={"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}
This tells vLLM to use the LMCache connector for both reading and writing KV cache entries.
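The controller sets this argument for you. If you need to reproduce it by hand, for example to compare against a pod's command line, the JSON must be compact and correctly quoted. A minimal sketch, assuming a hypothetical helper named `kv_transfer_arg`:

```python
import json

def kv_transfer_arg(connector: str = "LMCacheConnectorV1", role: str = "kv_both") -> str:
    """Build the --kv-transfer-config argument as a single compact JSON string."""
    config = {"kv_connector": connector, "kv_role": role}
    # separators=(",", ":") drops the spaces json.dumps inserts by default
    return "--kv-transfer-config=" + json.dumps(config, separators=(",", ":"))
```

Calling `kv_transfer_arg()` with its defaults produces the same string shown above.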
How it works
- On the first request, vLLM computes the KV cache for the full prompt and writes it to Redis in chunks
- On subsequent requests that share a prefix, vLLM loads the matching KV cache chunks from Redis and only computes the new tokens
- Redis uses an LRU eviction policy (allkeys-lru): when memory is full, the least recently used cache entries are evicted automatically
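The chunked prefix reuse described above can be illustrated with a toy model. This is a sketch under stated assumptions, not LMCache's actual implementation: a plain dict stands in for Redis, the key scheme (a running SHA-256 over all tokens so far) is invented for illustration, and only full chunks are cached. The function names are hypothetical.

```python
import hashlib

CHUNK_SIZE = 256  # mirrors LMCACHE_CHUNK_SIZE

def chunk_keys(token_ids, chunk_size=CHUNK_SIZE):
    """Yield one key per full chunk; each key depends on every token before it."""
    digest = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        digest.update(",".join(map(str, chunk)).encode() + b";")
        yield digest.hexdigest()  # running hash, so equal keys imply equal prefixes

def write_chunks(token_ids, store, chunk_size=CHUNK_SIZE):
    """First request: store a placeholder KV blob for every full chunk."""
    for key in chunk_keys(token_ids, chunk_size):
        store.setdefault(key, b"<kv bytes>")

def cached_prefix_tokens(token_ids, store, chunk_size=CHUNK_SIZE):
    """Later request: count leading tokens whose chunks are already stored."""
    hits = 0
    for key in chunk_keys(token_ids, chunk_size):
        if key not in store:
            break  # the first miss ends the reusable prefix
        hits += chunk_size
    return hits
```

In this model, a 600-token prompt writes two chunks (512 tokens). A later prompt that shares those first 512 tokens reuses both chunks and only computes the remainder, while a prompt that differs in its first token reuses nothing.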
Verifying it works
Check that the LMCache env vars are set on a model pod:
kubectl exec -n greenthread-system <model-pod> -c vllm -- env | grep LMCACHE
Expected output:
LMCACHE_LOCAL_CPU=False
LMCACHE_CHUNK_SIZE=256
LMCACHE_REMOTE_URL=redis://lmcache-redis.greenthread-system.svc.cluster.local:6379
LMCACHE_REMOTE_SERDE=naive
LMCACHE_USE_EXPERIMENTAL=True
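To compare the output against these expected values automatically, a small script can diff the two. A minimal sketch: `diff_lmcache_env` is a hypothetical helper, and the expected values are copied from the listing above.

```python
EXPECTED = {
    "LMCACHE_LOCAL_CPU": "False",
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_REMOTE_URL": "redis://lmcache-redis.greenthread-system.svc.cluster.local:6379",
    "LMCACHE_REMOTE_SERDE": "naive",
    "LMCACHE_USE_EXPERIMENTAL": "True",
}

def diff_lmcache_env(env_output: str, expected=EXPECTED) -> dict:
    """Map each missing or mismatched variable to (expected, actual)."""
    actual = dict(
        line.split("=", 1) for line in env_output.splitlines() if "=" in line
    )
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }
```

An empty result means the pod is configured as expected. To use it, pipe the `kubectl exec ... env` output into a script that calls `diff_lmcache_env(sys.stdin.read())`.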
Verify Redis is running and accessible:
kubectl exec -n greenthread-system lmcache-redis-0 -- redis-cli ping
Expected output: PONG
Troubleshooting
ModuleNotFoundError: No module named 'lmcache'
Your vLLM image doesn't include the lmcache Python package. Ensure you're running vLLM image tag v0.17.0-0.0.72 or later, which includes lmcache pre-installed.
Redis crash: argument must be a memory value
The lmcache.maxMemory value is using Kubernetes-style units (Gi) instead of Redis units (gb). Update your Helm values:
lmcache:
  maxMemory: 80gb   # correct: Redis format
  # maxMemory: 80Gi # wrong: Kubernetes format
Redis not reachable from vLLM pods
Verify the headless service exists and has endpoints:
kubectl get svc lmcache-redis -n greenthread-system
kubectl get endpoints lmcache-redis -n greenthread-system
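If the service and endpoints look healthy, it can help to test connectivity from inside a vLLM pod, where `redis-cli` may not be installed but Python is. The following is a minimal sketch using only the standard library. `redis_ping` is a hypothetical helper, and it assumes Redis has no password set, as the credential-free `LMCACHE_REMOTE_URL` suggests. It relies on Redis accepting inline commands, so a bare `PING` line is answered with `+PONG`.

```python
import socket

def redis_ping(host: str, port: int = 6379, timeout: float = 3.0) -> bool:
    """Open a TCP connection, send an inline PING, and report whether Redis answered PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False
```

From a model pod, `redis_ping("lmcache-redis.greenthread-system.svc.cluster.local")` returning `False` points to a DNS or network problem rather than an LMCache configuration issue.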