The Batch API lets you submit large sets of inference requests for asynchronous processing. It follows the OpenAI Batch API format: upload a JSONL file of requests, create a batch, and retrieve the results when the batch completes.
Batch requests are processed opportunistically when models are already serving and have spare capacity, so they don't impact interactive latency. Optionally, you can configure quiet windows to process batch work during off-peak hours.
Quick start
# 1. Upload your input file
curl -s -X POST https://$INGRESS_URL/v1/files \
  -F "file=@batch_input.jsonl" \
  -F "purpose=batch" | jq .
# 2. Create a batch (use the file "id" returned by step 1)
curl -s -X POST https://$INGRESS_URL/v1/batches \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc123def456",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }' | jq .
# 3. Poll for completion (use the batch "id" returned by step 2)
curl -s https://$INGRESS_URL/v1/batches/batch-xyz789abc012 | jq .
# 4. Download results (set OUTPUT_FILE_ID to the batch's "output_file_id")
curl -s https://$INGRESS_URL/v1/files/$OUTPUT_FILE_ID/content
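Batch API endpoints are served by the GreenThread ingress, not the per-model gateway. Use the ingress service URL (e.g. https://models.example.com), not the /<model>/v1/... gateway paths. The curl examples prepend https:// themselves, so set INGRESS_URL to the bare hostname (e.g. models.example.com).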
Input file format
The input file is a JSONL file where each line is a request object:
{"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "What is Kubernetes?"}], "max_tokens": 256}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Write a haiku."}], "max_tokens": 128}}
Each line must contain:
| Field | Type | Description |
|---|---|---|
| custom_id | string | Your unique identifier for this request |
| method | string | Always "POST" |
| url | string | The endpoint path, e.g. "/v1/chat/completions" |
| body | object | The request payload (same as the corresponding API endpoint) |
A single batch can contain requests for different models. The batch worker groups requests by model and processes them as each model becomes available.
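As a rough illustration of the format above, the following Python sketch builds request lines and checks them before upload. The helper names (make_request_line, validate_lines) are hypothetical and not part of GreenThread; the uniqueness check on custom_id is an assumption drawn from the field description, not a documented server-side rule.

```python
import json

REQUIRED_FIELDS = ("custom_id", "method", "url", "body")


def make_request_line(custom_id, model, prompt, max_tokens=256,
                      url="/v1/chat/completions"):
    """Return one JSONL line for a chat completion request (hypothetical helper)."""
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": url,
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
    }
    return json.dumps(request)


def validate_lines(lines):
    """Return a list of problems found in the given JSONL lines (empty if none)."""
    problems = []
    seen_ids = set()
    for number, line in enumerate(lines, start=1):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {number}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(obj, dict):
            problems.append(f"line {number}: expected a JSON object")
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in obj]
        if missing:
            problems.append(f"line {number}: missing {', '.join(missing)}")
            continue
        if obj["method"] != "POST":
            problems.append(f"line {number}: method must be POST")
        # Assumption: custom_id values should be unique within one file.
        if obj["custom_id"] in seen_ids:
            problems.append(f"line {number}: duplicate custom_id {obj['custom_id']!r}")
        seen_ids.add(obj["custom_id"])
    return problems


lines = [
    make_request_line("req-1", "meta-llama/Llama-3.1-8B-Instruct", "What is Kubernetes?"),
    make_request_line("req-2", "Qwen/Qwen3-4B", "Write a haiku.", max_tokens=128),
]
```

Writing "\n".join(lines) + "\n" to batch_input.jsonl then gives a file in the shape the upload step expects.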
API reference
Upload input file
POST /v1/files
Upload a JSONL file as a multipart form.
| Field | Type | Description |
|---|---|---|
| file | file | The JSONL file to upload |
| purpose | string | Must be "batch" |
curl -X POST https://$INGRESS_URL/v1/files \
  -F "file=@batch_input.jsonl" \
  -F "purpose=batch"
Response:
{
  "id": "file-abc123def456",
  "object": "file",
  "bytes": 1234,
  "created_at": 1700000000,
  "filename": "batch_input.jsonl",
  "purpose": "batch"
}
Get file metadata
GET /v1/files/:file_id
Returns metadata about an uploaded or output file.
Download file content
GET /v1/files/:file_id/content
Streams the raw file content. Use this to download both input files and output/error result files.
Create batch
POST /v1/batches
| Field | Type | Required | Description |
|---|---|---|---|
| input_file_id | string | Yes | File ID from the upload step |
| endpoint | string | Yes | /v1/chat/completions, /v1/completions, or /v1/embeddings |
| completion_window | string | Yes | Time limit, e.g. "24h" |
| metadata | object | No | Key-value pairs for your own tracking |
curl -X POST https://$INGRESS_URL/v1/batches \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc123def456",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h",
    "metadata": {
      "description": "nightly eval run"
    }
  }'
Response:
{
  "id": "batch-xyz789abc012",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "input_file_id": "file-abc123def456",
  "status": "validating",
  "created_at": 1700000000,
  "request_counts": {
    "total": 0,
    "completed": 0,
    "failed": 0
  }
}
Check batch status
GET /v1/batches/:batch_id
Poll this endpoint to track progress. The request_counts values update as requests complete.
Response:
{
  "id": "batch-xyz789abc012",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "status": "in_progress",
  "input_file_id": "file-abc123def456",
  "output_file_id": null,
  "error_file_id": null,
  "created_at": 1700000000,
  "request_counts": {
    "total": 100,
    "completed": 42,
    "failed": 1
  }
}
Batch lifecycle
| Status | Description |
|---|---|
| validating | Input file being parsed and validated |
| in_progress | Requests being processed through the ingress |
| finalizing | Writing output and error files to storage |
| completed | All requests processed, output file ready |
| failed | Batch failed (e.g. invalid input file) |
| expired | Batch exceeded its completion window |
| cancelling | Cancellation in progress |
| cancelled | Batch was cancelled, partial results available |
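When polling, it can help to separate terminal statuses from transitional ones. The sketch below treats completed, failed, expired, and cancelled as terminal, and summarises progress from request_counts. It assumes that completed and failed are disjoint counters, which the status example suggests but this page does not state; the function names are illustrative only.

```python
# Statuses after which the batch will not change any further.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def is_terminal(status):
    """Return True once polling can stop."""
    return status in TERMINAL_STATUSES


def progress(batch):
    """Return (finished, total, fraction) from a batch status object.

    Assumption: "completed" and "failed" count different requests,
    so their sum is the number of requests that are finished.
    """
    counts = batch.get("request_counts") or {}
    total = counts.get("total", 0)
    finished = counts.get("completed", 0) + counts.get("failed", 0)
    fraction = finished / total if total else 0.0
    return finished, total, fraction


example = {
    "status": "in_progress",
    "request_counts": {"total": 100, "completed": 42, "failed": 1},
}
finished, total, fraction = progress(example)
```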
List batches
GET /v1/batches
Returns all batches. Supports ?limit= and ?after= for pagination.
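A pagination loop over ?limit= and ?after= might look like the sketch below. The response shape is an assumption borrowed from the OpenAI list format (a data array plus a has_more flag, with after set to the last ID seen); this page does not document it. fetch_page is a hypothetical callable standing in for the HTTP request, and fake_fetch exists only to exercise the loop without a server.

```python
def iter_batches(fetch_page, limit=20):
    """Yield every batch, following limit/after until the listing ends.

    fetch_page(params) is a hypothetical callable that requests /v1/batches
    with the given query parameters and returns the parsed JSON body.
    """
    after = None
    while True:
        params = {"limit": limit}
        if after is not None:
            params["after"] = after
        page = fetch_page(params)
        items = page.get("data", [])
        yield from items
        # Assumption: the listing signals further pages with "has_more".
        if not items or not page.get("has_more"):
            return
        after = items[-1]["id"]


# In-memory stand-in for the HTTP call, used only to exercise the loop.
_FAKE_IDS = ["batch-1", "batch-2", "batch-3"]


def fake_fetch(params):
    start = _FAKE_IDS.index(params["after"]) + 1 if "after" in params else 0
    end = start + params["limit"]
    return {
        "data": [{"id": batch_id} for batch_id in _FAKE_IDS[start:end]],
        "has_more": end < len(_FAKE_IDS),
    }


all_ids = [batch["id"] for batch in iter_batches(fake_fetch, limit=2)]
```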
Cancel a batch
POST /v1/batches/:batch_id/cancel
Cancels a batch that is validating or in_progress. Requests already completed are preserved in the output file.
Output format
When status is completed (or cancelled with partial results), the output_file_id field contains the ID of the results file. Download it with:
curl -s https://$INGRESS_URL/v1/files/$OUTPUT_FILE_ID/content
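Each line of the output file is a JSON response object. The example below is pretty-printed for readability; in the file, each object occupies a single line: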
{
  "id": "resp-001",
  "custom_id": "req-1",
  "response": {
    "status_code": 200,
    "request_id": "req-abc",
    "body": {
      "id": "chatcmpl-xyz",
      "object": "chat.completion",
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "choices": [
        {
          "index": 0,
          "message": {"role": "assistant", "content": "Kubernetes is..."},
          "finish_reason": "stop"
        }
      ],
      "usage": {"prompt_tokens": 12, "completion_tokens": 50, "total_tokens": 62}
    }
  }
}
If a request failed, the line contains an error field instead:
{
  "id": "resp-002",
  "custom_id": "req-2",
  "error": {
    "code": "model_unavailable",
    "message": "model not found: nonexistent-model"
  }
}
If any requests produced errors, the error_file_id field on the batch points to a separate JSONL file containing only the error responses.
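Because results are keyed by custom_id, a downloaded output file can be split into successes and failures in a few lines. This is a minimal sketch that relies only on the fields shown in the examples above; split_results is a hypothetical helper name, and the sample text is a shortened version of those examples.

```python
import json


def split_results(jsonl_text):
    """Split output-file text into ({custom_id: body}, {custom_id: error})."""
    successes, errors = {}, {}
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(line)
        custom_id = record["custom_id"]
        if record.get("error"):
            errors[custom_id] = record["error"]
        else:
            successes[custom_id] = record["response"]["body"]
    return successes, errors


sample = "\n".join([
    json.dumps({
        "id": "resp-001",
        "custom_id": "req-1",
        "response": {
            "status_code": 200,
            "request_id": "req-abc",
            "body": {"choices": [{"index": 0, "message": {
                "role": "assistant", "content": "Kubernetes is..."}}]},
        },
    }),
    json.dumps({
        "id": "resp-002",
        "custom_id": "req-2",
        "error": {"code": "model_unavailable",
                  "message": "model not found: nonexistent-model"},
    }),
]) + "\n"

successes, errors = split_results(sample)
```

Matching on custom_id rather than on line position avoids depending on the order in which results are written, which this page does not specify.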
Scheduling
The batch worker processes requests opportunistically — it sends requests through the ingress when models are already serving and have spare capacity. This means batch work never impacts interactive request latency.
How it works
- The worker polls for active batches and groups requests by model
- For each model, it checks: is the model currently serving? Is the sidecar queue depth below the threshold?
- If yes, it sends requests through the ingress (reusing all routing, load balancing, and wake/sleep logic)
- If a model is sleeping, the worker skips it (unless in a quiet window)
- Progress is tracked per-request — the batch status updates as each request completes
Concurrency
The worker limits how many batch requests are in-flight per model, controlled by concurrencyPerModel (default: 4). It also checks the sidecar's queue depth against maxQueueDepth (default: 2) to avoid flooding models that are busy with interactive traffic.
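The dispatch decision described above can be sketched as a small function. This illustrates the documented rules and is not the worker's actual code: ModelState and dispatch_slots are hypothetical names, and the queue-depth comparison uses "exceeds" (strictly greater than), following the wording of the configuration comment for maxQueueDepth.

```python
from dataclasses import dataclass


@dataclass
class ModelState:
    serving: bool      # is the model currently awake and serving?
    queue_depth: int   # sidecar queue depth reported for the model
    in_flight: int     # batch requests already in flight for the model


def dispatch_slots(state, in_quiet_window=False,
                   concurrency_per_model=4, max_queue_depth=2):
    """Return how many new batch requests may be sent to this model now."""
    # Sleeping models are skipped unless a quiet window is active.
    if not state.serving and not in_quiet_window:
        return 0
    # Back off while the model is busy with interactive traffic.
    if state.queue_depth > max_queue_depth:
        return 0
    # Otherwise fill the remaining per-model concurrency budget.
    return max(0, concurrency_per_model - state.in_flight)


idle_model = dispatch_slots(ModelState(serving=True, queue_depth=0, in_flight=1))
busy_model = dispatch_slots(ModelState(serving=True, queue_depth=3, in_flight=0))
```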
Quiet windows
By default, the batch worker only processes requests for models that are already awake. To allow proactive processing during off-peak hours, configure quiet windows:
batch:
  scheduling:
    quietWindows:
      - start: "02:00"
        end: "06:00"
        timezone: "UTC"
      - start: "22:00"
        end: "23:00"
        timezone: "US/Pacific"
During a quiet window, the batch worker will send requests even for sleeping models — the ingress will wake them on demand. Outside quiet windows, only models already in serving state receive batch requests.
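To make the window semantics concrete, here is one way a quiet-window check could be evaluated. Several details are assumptions rather than documented behavior: start is treated as inclusive and end as exclusive, a window whose end is earlier than its start is taken to wrap past midnight, and the function names are hypothetical. in_quiet_window needs time zone data on the host in order to resolve names such as "US/Pacific".

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo


def parse_hhmm(value):
    """Parse an "HH:MM" string into a datetime.time."""
    hour, minute = value.split(":")
    return time(int(hour), int(minute))


def time_in_window(local, start, end):
    """Return True if local falls in [start, end), wrapping past midnight."""
    if start <= end:
        return start <= local < end
    # Assumption: a window such as 22:00-02:00 spans midnight.
    return local >= start or local < end


def in_quiet_window(now, windows):
    """Check a timezone-aware datetime against a list of window dicts."""
    for window in windows:
        local = now.astimezone(ZoneInfo(window["timezone"])).time()
        if time_in_window(local, parse_hhmm(window["start"]),
                          parse_hhmm(window["end"])):
            return True
    return False


windows = [
    {"start": "02:00", "end": "06:00", "timezone": "UTC"},
    {"start": "22:00", "end": "23:00", "timezone": "US/Pacific"},
]
```

Calling in_quiet_window(datetime.now(timezone.utc), windows) would then report whether either configured window is currently active.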
Enabling batch processing
Batch processing is disabled by default. Enable it in your Helm values:
batch:
  enabled: true
This deploys:
- Batch worker — a Deployment that watches for Batch CRDs and processes requests
- MinIO — an S3-compatible object store for file storage (can be replaced with external S3)
Using external S3 storage
To use your own S3-compatible storage instead of the built-in MinIO:
batch:
  enabled: true
  minio:
    enabled: false
  s3:
    endpoint: "https://s3.amazonaws.com"
    bucket: "my-batch-bucket"
    region: "us-east-1"
    accessKey: "AKIA..."
    secretKey: "..."
    forcePathStyle: false # true for MinIO, false for AWS S3
Full configuration reference
batch:
  enabled: false
  scheduling:
    # Max concurrent batch requests per model
    concurrencyPerModel: 4
    # Skip model if sidecar queue depth exceeds this
    maxQueueDepth: 2
    # How often the worker polls for batch work
    pollInterval: 10s
    # Time windows when the worker may wake sleeping models
    quietWindows: []
  s3:
    endpoint: ""
    bucket: "gthread-batch"
    region: "us-east-1"
    accessKey: ""
    secretKey: ""
    forcePathStyle: true
  minio:
    enabled: true
    storage:
      storageClassName: ""
      size: 50Gi
Python SDK usage
The Batch API is compatible with the OpenAI Python SDK:
import time

from openai import OpenAI

client = OpenAI(
    base_url="https://models.example.com/v1",
    api_key="not-needed",
)

# Upload input file
with open("batch_input.jsonl", "rb") as f:
    input_file = client.files.create(file=f, purpose="batch")

# Create batch
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Poll until the batch reaches a terminal status
while batch.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(10)
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status} "
          f"Progress: {batch.request_counts.completed}/{batch.request_counts.total}")

# Download results
if batch.output_file_id:
    content = client.files.content(batch.output_file_id)
    with open("results.jsonl", "wb") as f:
        f.write(content.read())
Dashboard
Batch status is visible in the GreenThread dashboard under the Batches tab. Click any batch to see its full details, progress, file links, and errors.
Grafana (provisioned when monitoring.enabled: true) includes a dedicated Batch Processing dashboard with request throughput, success rates, in-flight counts, and worker health metrics.
