GreenThread Docs

KV cache persistence lets vLLM store computed KV cache entries in Redis via LMCache. When a prompt shares a prefix with a previous request, vLLM reuses the cached KV entries instead of recomputing them — reducing time-to-first-token and GPU compute.

This is particularly useful for:

  • System prompts — long system prompts are computed once and reused across all requests
  • Multi-turn conversations — prior turns are already cached
  • Shared context — multiple users querying with the same document/context prefix

Setup

There are two parts: deploying the Redis backend, and enabling KV cache on your models.

1. Deploy Redis

Enable the LMCache Redis StatefulSet by setting lmcache.enabled=true on your Helm release:

helm upgrade greenthread \
  oci://licence.greenthread.ai/greenthread/charts/greenthread \
  --namespace greenthread-system \
  --set lmcache.enabled=true

This deploys a Redis StatefulSet (lmcache-redis) with a headless service. All models with KV cache enabled share this Redis instance.

Redis configuration

Value                              Default   Description
lmcache.enabled                    false     Deploy the Redis StatefulSet
lmcache.replicas                   3         Number of Redis replicas
lmcache.maxMemory                  80gb      Redis maxmemory setting. Must use Redis-compatible units (kb, mb, gb), not Kubernetes units (Gi)
lmcache.resources.requests.memory  64Gi      Kubernetes memory request for each Redis pod
lmcache.resources.limits.memory    80Gi      Kubernetes memory limit for each Redis pod
lmcache.storage.storageClassName   fast-ssd  StorageClass for Redis PVCs
lmcache.storage.size               100Gi     PVC size per replica

Memory units

The maxMemory field is passed directly to Redis as --maxmemory. Redis uses its own unit format (kb, mb, gb) — not Kubernetes-style units (Ki, Mi, Gi). Using Kubernetes units will cause Redis to fail on startup.
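For a rough sense of what a given maxMemory can hold, the arithmetic below may help. It is a back-of-the-envelope sketch, not a measurement. The architecture numbers (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions taken from the public Llama-3.1-8B model configuration. The estimate also assumes entries are stored uncompressed and ignores all Redis overhead, so real capacity will be lower. The helper names kv_bytes_per_token and estimate_capacity are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV cache per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def estimate_capacity(max_memory_bytes, chunk_tokens, bytes_per_token):
    """Return (whole chunks, tokens) that fit, ignoring all Redis overhead."""
    chunk_bytes = chunk_tokens * bytes_per_token
    chunks = max_memory_bytes // chunk_bytes
    return chunks, chunks * chunk_tokens


# Assumed architecture values for Llama-3.1-8B in 16-bit precision.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)

# Redis reads "80gb" as 80 * 1024**3 bytes.
max_memory = 80 * 1024**3

# 256 is the chunk size (in tokens) that the controller configures.
chunks, tokens = estimate_capacity(max_memory, chunk_tokens=256, bytes_per_token=per_token)
print(f"{per_token} bytes/token, {chunks} chunks, about {tokens} tokens")
# prints: 131072 bytes/token, 2560 chunks, about 655360 tokens
```

Under these assumptions each 256-token chunk occupies 32 MiB, so the default 80gb holds on the order of a few thousand chunks per Redis instance before LRU eviction starts.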

2. Enable on a model

Add kvCache.enabled: true to your Model spec:

apiVersion: greenthread.ai/v1alpha1
kind: Model
metadata:
  name: llama-3-1-8b
  namespace: greenthread-system
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  tensorParallelSize: 1
  replicas: 1
  kvCache:
    enabled: true

When enabled, the controller automatically configures the vLLM pod with:

Environment variables:

Variable                  Value                                                            Description
LMCACHE_LOCAL_CPU         False                                                            Disable CPU-local caching (use Redis only)
LMCACHE_CHUNK_SIZE        256                                                              KV cache chunk size in tokens
LMCACHE_REMOTE_URL        redis://lmcache-redis.greenthread-system.svc.cluster.local:6379  Redis connection URL
LMCACHE_REMOTE_SERDE      naive                                                            Serialization format for KV cache entries
LMCACHE_USE_EXPERIMENTAL  True                                                             Enable experimental LMCache features (required for vLLM v1 connector)

vLLM argument:

--kv-transfer-config={"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}

This tells vLLM to use the LMCache connector for both reading and writing KV cache entries.
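The value of this flag is a JSON object. The controller sets it for you, but if you ever need to reproduce it by hand (for example, when debugging a vLLM process outside the controller), a sketch like the following builds the same string and shows how to quote it for a shell. Only Python standard-library calls are used. The variable names are illustrative.

```python
import json
import shlex

kv_transfer_config = {
    "kv_connector": "LMCacheConnectorV1",  # use the LMCache connector
    "kv_role": "kv_both",                  # both read and write KV cache entries
}

# Compact separators reproduce the exact form shown above (no spaces).
arg = "--kv-transfer-config=" + json.dumps(kv_transfer_config, separators=(",", ":"))
print(arg)

# The JSON contains braces and double quotes, so quote it before pasting into a shell.
print(shlex.quote(arg))
```

Without the quoting step, most shells would strip the double quotes from the JSON before vLLM sees it.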

How it works

  1. On the first request, vLLM computes the KV cache for the full prompt and writes it to Redis in chunks
  2. On subsequent requests that share a prefix, vLLM loads the matching KV cache chunks from Redis and only computes the new tokens
  3. Redis uses an LRU eviction policy (allkeys-lru) — when memory is full, the least recently used cache entries are evicted automatically
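The first two steps can be illustrated with a toy model of chunked prefix matching. This is a conceptual sketch only, not LMCache's actual key scheme or storage format. A plain dict stands in for Redis, the chunk size is shrunk to 4 tokens for readability, and the function names chunk_keys and serve are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4  # tiny for illustration; the controller sets LMCACHE_CHUNK_SIZE=256


def chunk_keys(token_ids, chunk_size=CHUNK_SIZE):
    """Return one key per full chunk; each key depends on all tokens before it."""
    keys = []
    running = hashlib.sha256()
    full = len(token_ids) - len(token_ids) % chunk_size
    for start in range(0, full, chunk_size):
        chunk = token_ids[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


def serve(token_ids, store):
    """Return (tokens loaded from the store, tokens computed) for one request."""
    keys = chunk_keys(token_ids)
    hits = 0
    for key in keys:
        if key not in store:
            break  # a prefix match ends at the first missing chunk
        hits += 1
    for key in keys[hits:]:
        store[key] = "kv-bytes-placeholder"  # write newly computed chunks
    loaded = hits * CHUNK_SIZE
    return loaded, len(token_ids) - loaded


store = {}  # a plain dict standing in for Redis
system_prompt = list(range(100, 108))  # 8 tokens = 2 full chunks
first = serve(system_prompt + [1, 2, 3, 4, 5], store)
second = serve(system_prompt + [9, 8, 7, 6, 5], store)
print(first, second)
# prints: (0, 13) (8, 5)
```

The first request computes all 13 tokens. The second shares the 8-token system prompt, so it loads those two chunks and computes only the 5 tokens that differ. Because each key covers everything before it, a chunk is reusable only when the entire preceding prefix is identical.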

Verifying it works

Check that the LMCache env vars are set on a model pod:

kubectl exec -n greenthread-system <model-pod> -c vllm -- env | grep LMCACHE

Expected output:

LMCACHE_LOCAL_CPU=False
LMCACHE_CHUNK_SIZE=256
LMCACHE_REMOTE_URL=redis://lmcache-redis.greenthread-system.svc.cluster.local:6379
LMCACHE_REMOTE_SERDE=naive
LMCACHE_USE_EXPERIMENTAL=True
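
If you check many pods, comparing the output by eye gets tedious. The helper below is a small sketch that takes the text printed by the command above and reports any variable that is missing or differs from the expected values listed on this page. The function name check_env is hypothetical, and you would need to capture the kubectl output yourself.

```python
EXPECTED = {
    "LMCACHE_LOCAL_CPU": "False",
    "LMCACHE_CHUNK_SIZE": "256",
    "LMCACHE_REMOTE_URL": "redis://lmcache-redis.greenthread-system.svc.cluster.local:6379",
    "LMCACHE_REMOTE_SERDE": "naive",
    "LMCACHE_USE_EXPERIMENTAL": "True",
}


def check_env(env_output, expected=EXPECTED):
    """Return {name: (expected, actual)} for every missing or different variable."""
    actual = dict(
        line.split("=", 1) for line in env_output.splitlines() if "=" in line
    )
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Sample input: a pod where the chunk size differs and the URL is missing.
sample = "LMCACHE_LOCAL_CPU=False\nLMCACHE_CHUNK_SIZE=128\nLMCACHE_REMOTE_SERDE=naive\nLMCACHE_USE_EXPERIMENTAL=True"
for name, (want, got) in check_env(sample).items():
    print(f"{name}: expected {want!r}, got {got!r}")
```

An empty result means every expected variable is present with the expected value.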

Verify Redis is running and accessible:

kubectl exec -n greenthread-system lmcache-redis-0 -- redis-cli ping

Expected output: PONG

Troubleshooting

ModuleNotFoundError: No module named 'lmcache'

Your vLLM image doesn't include the lmcache Python package. Ensure you're running vLLM image tag v0.17.0-0.0.72 or later, which includes lmcache pre-installed.

Redis crash: argument must be a memory value

The lmcache.maxMemory value is using Kubernetes-style units (Gi) instead of Redis units (gb). Update your Helm values:

lmcache:
  maxMemory: 80gb   # correct — Redis format
  # maxMemory: 80Gi  # wrong — Kubernetes format
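
To catch this before deploying, you could validate the value with a check along these lines. It is a sketch that mirrors the unit suffixes Redis documents for memory values (b, k, kb, m, mb, g, gb, case-insensitive), not Redis's own parser. The function name parse_redis_memory is hypothetical.

```python
import re

# Multipliers Redis applies to memory values; suffixes are case-insensitive.
REDIS_UNITS = {
    "": 1, "b": 1,
    "k": 1000, "kb": 1024,
    "m": 1000**2, "mb": 1024**2,
    "g": 1000**3, "gb": 1024**3,
}


def parse_redis_memory(value):
    """Convert a Redis memory string such as '80gb' to bytes, or raise ValueError."""
    match = re.fullmatch(r"(\d+)([a-zA-Z]*)", value.strip())
    if not match or match.group(2).lower() not in REDIS_UNITS:
        raise ValueError(f"{value!r} is not a Redis memory value")
    return int(match.group(1)) * REDIS_UNITS[match.group(2).lower()]


print(parse_redis_memory("80gb"))  # prints: 85899345920
try:
    parse_redis_memory("80Gi")
except ValueError as err:
    print(err)  # prints: '80Gi' is not a Redis memory value
```

Note that Redis treats gb as a power-of-two unit (1024³ bytes), so 80gb in Redis is the same number of bytes as 80Gi in Kubernetes. Only the spelling differs.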

Redis not reachable from vLLM pods

Verify the headless service exists and has endpoints:

kubectl get svc lmcache-redis -n greenthread-system
kubectl get endpoints lmcache-redis -n greenthread-system