GreenThread Docs

The Batch API lets you submit large sets of inference requests for asynchronous processing. It follows the OpenAI Batch API format: upload a JSONL file of requests, create a batch, and retrieve the results once the batch completes.

Batch requests are processed opportunistically when models are already serving and have spare capacity, so they don't impact interactive latency. Optionally, you can configure quiet windows to process batch work during off-peak hours.

Quick start

# 1. Upload your input file
curl -s -X POST https://$INGRESS_URL/v1/files \
  -F "file=@batch_input.jsonl" \
  -F "purpose=batch" | jq .

# 2. Create a batch
curl -s -X POST https://$INGRESS_URL/v1/batches \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc123def456",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h"
  }' | jq .

# 3. Poll for completion
curl -s https://$INGRESS_URL/v1/batches/batch-xyz789abc012 | jq .

# 4. Download results
curl -s https://$INGRESS_URL/v1/files/$OUTPUT_FILE_ID/content

Ingress URL

Batch API endpoints are served by the GreenThread ingress, not the per-model gateway. Use the ingress service URL (e.g. https://models.example.com), not the /<model>/v1/... gateway paths.

Input file format

The input file is a JSONL file where each line is a request object:

{"custom_id": "req-1", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "What is Kubernetes?"}], "max_tokens": 256}}
{"custom_id": "req-2", "method": "POST", "url": "/v1/chat/completions", "body": {"model": "Qwen/Qwen3-4B", "messages": [{"role": "user", "content": "Write a haiku."}], "max_tokens": 128}}

Each line must contain:

| Field | Type | Description |
| --- | --- | --- |
| custom_id | string | Your unique identifier for this request |
| method | string | Always "POST" |
| url | string | The endpoint path, e.g. "/v1/chat/completions" |
| body | object | The request payload (same as the corresponding API endpoint) |

Mixed models

A single batch can contain requests for different models. The batch worker groups requests by model and processes them as each model becomes available.
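
An input file in this format can be generated programmatically. The following is a minimal sketch, not part of GreenThread itself: the helper names build_batch_lines and write_batch_file are hypothetical, and rejecting duplicate custom_id values on the client side is an assumption based on the field being described as a unique identifier.

```python
import json


def build_batch_lines(prompts, url="/v1/chat/completions", max_tokens=256):
    """Return one JSON string per (custom_id, model, prompt) tuple.

    Hypothetical helper: it only mirrors the four documented fields
    (custom_id, method, url, body) for chat completion requests.
    """
    seen = set()
    lines = []
    for custom_id, model, prompt in prompts:
        # Assumption: enforce uniqueness locally so results can be matched later.
        if custom_id in seen:
            raise ValueError(f"duplicate custom_id: {custom_id}")
        seen.add(custom_id)
        request = {
            "custom_id": custom_id,
            "method": "POST",
            "url": url,
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        # json.dumps emits a single line, which is what JSONL requires.
        lines.append(json.dumps(request))
    return lines


def write_batch_file(path, prompts):
    """Write the requests to a JSONL file, one request per line."""
    with open(path, "w", encoding="utf-8") as f:
        for line in build_batch_lines(prompts):
            f.write(line + "\n")
```

Because each tuple carries its own model name, the same call can produce a mixed-model file like the two-line example above.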

API reference

Upload input file

POST /v1/files

Upload a JSONL file as a multipart form.

| Field | Type | Description |
| --- | --- | --- |
| file | file | The JSONL file to upload |
| purpose | string | Must be "batch" |

curl -X POST https://$INGRESS_URL/v1/files \
  -F "file=@batch_input.jsonl" \
  -F "purpose=batch"

Response:

{
  "id": "file-abc123def456",
  "object": "file",
  "bytes": 1234,
  "created_at": 1700000000,
  "filename": "batch_input.jsonl",
  "purpose": "batch"
}

Get file metadata

GET /v1/files/:file_id

Returns metadata about an uploaded or output file.

Download file content

GET /v1/files/:file_id/content

Stream the raw file content. Use this to download both input files and output/error result files.

Create batch

POST /v1/batches

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| input_file_id | string | Yes | File ID from the upload step |
| endpoint | string | Yes | /v1/chat/completions, /v1/completions, or /v1/embeddings |
| completion_window | string | Yes | Time limit, e.g. "24h" |
| metadata | object | No | Key-value pairs for your own tracking |

curl -X POST https://$INGRESS_URL/v1/batches \
  -H "Content-Type: application/json" \
  -d '{
    "input_file_id": "file-abc123def456",
    "endpoint": "/v1/chat/completions",
    "completion_window": "24h",
    "metadata": {
      "description": "nightly eval run"
    }
  }'

Response:

{
  "id": "batch-xyz789abc012",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "input_file_id": "file-abc123def456",
  "status": "validating",
  "created_at": 1700000000,
  "request_counts": {
    "total": 0,
    "completed": 0,
    "failed": 0
  }
}

Check batch status

GET /v1/batches/:batch_id

Poll this endpoint to track progress. The request_counts update as requests complete.

{
  "id": "batch-xyz789abc012",
  "object": "batch",
  "endpoint": "/v1/chat/completions",
  "status": "in_progress",
  "input_file_id": "file-abc123def456",
  "output_file_id": null,
  "error_file_id": null,
  "created_at": 1700000000,
  "request_counts": {
    "total": 100,
    "completed": 42,
    "failed": 1
  }
}

Batch lifecycle

| Status | Description |
| --- | --- |
| validating | Input file being parsed and validated |
| in_progress | Requests being processed through the ingress |
| finalizing | Writing output and error files to storage |
| completed | All requests processed, output file ready |
| failed | Batch failed (e.g. invalid input file) |
| expired | Batch exceeded its completion window |
| cancelling | Cancellation in progress |
| cancelled | Batch was cancelled, partial results available |
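
When polling, it helps to separate statuses that can still change from those that are final. The sketch below encodes the table above; the constant and function names are hypothetical, and treating completed, failed, expired, and cancelled as final is an interpretation of the descriptions rather than a documented guarantee.

```python
# Statuses after which the batch is not expected to change any further.
TERMINAL_STATUSES = frozenset({"completed", "failed", "expired", "cancelled"})

# Statuses that indicate work (or cancellation) is still under way.
ACTIVE_STATUSES = frozenset({"validating", "in_progress", "finalizing", "cancelling"})


def is_terminal(status: str) -> bool:
    """Return True if polling can stop for a batch with this status."""
    if status in TERMINAL_STATUSES:
        return True
    if status in ACTIVE_STATUSES:
        return False
    # Fail loudly on a status this sketch does not know about.
    raise ValueError(f"unknown batch status: {status}")


def may_have_output(status: str) -> bool:
    """Return True for statuses the table describes as having (partial) results."""
    return status in {"completed", "cancelled"}
```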

List batches

GET /v1/batches

Returns all batches. Supports ?limit= and ?after= for pagination.
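
Paging through the list can be sketched as a generator. This example rests on assumptions the page does not confirm: that the response body follows the OpenAI list convention (a data array plus a has_more flag) and that after takes the ID of the last batch already seen. The fetch_page callable is a hypothetical stand-in for whatever HTTP client you use.

```python
from typing import Callable, Dict, Iterator, Optional


def iter_batches(
    fetch_page: Callable[[int, Optional[str]], Dict],
    limit: int = 20,
) -> Iterator[Dict]:
    """Yield every batch by following ?limit= and ?after= cursors.

    fetch_page(limit, after) is expected to perform
    GET /v1/batches?limit=<limit>&after=<after> and return the parsed JSON.
    """
    after = None
    while True:
        page = fetch_page(limit, after)
        items = page.get("data", [])  # assumed field name
        yield from items
        # Stop on an empty page or when the server reports no more results.
        if not items or not page.get("has_more", False):  # assumed field name
            return
        after = items[-1]["id"]
```

Keeping the HTTP call behind a callable makes the cursor logic easy to test without a running ingress.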

Cancel a batch

POST /v1/batches/:batch_id/cancel

Cancels a batch that is validating or in_progress. Requests already completed are preserved in the output file.

Output format

When status is completed (or cancelled with partial results), the output_file_id field contains the ID of the results file. Download it with:

curl -s https://$INGRESS_URL/v1/files/$OUTPUT_FILE_ID/content

Each line of the output file is a JSON response object (pretty-printed here for readability):

{
  "id": "resp-001",
  "custom_id": "req-1",
  "response": {
    "status_code": 200,
    "request_id": "req-abc",
    "body": {
      "id": "chatcmpl-xyz",
      "object": "chat.completion",
      "model": "meta-llama/Llama-3.1-8B-Instruct",
      "choices": [
        {
          "index": 0,
          "message": {"role": "assistant", "content": "Kubernetes is..."},
          "finish_reason": "stop"
        }
      ],
      "usage": {"prompt_tokens": 12, "completion_tokens": 50, "total_tokens": 62}
    }
  }
}

If a request failed, the line contains an error field instead:

{
  "id": "resp-002",
  "custom_id": "req-2",
  "error": {
    "code": "model_unavailable",
    "message": "model not found: nonexistent-model"
  }
}

If any requests produced errors, the error_file_id field on the batch points to a separate JSONL file containing only the error responses.
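
Once downloaded, the output can be matched back to the original requests by custom_id. The following is a minimal sketch assuming only the two record shapes shown above; split_results is a hypothetical helper name, and it does not inspect status_code inside successful records.

```python
import json


def split_results(jsonl_text):
    """Split output file content into successes and errors, keyed by custom_id.

    Returns (successes, errors): successes maps custom_id to the response body,
    errors maps custom_id to the error object.
    """
    successes, errors = {}, {}
    for raw in jsonl_text.splitlines():
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines, e.g. a trailing newline
        record = json.loads(raw)
        custom_id = record["custom_id"]
        if record.get("error"):
            errors[custom_id] = record["error"]
        else:
            successes[custom_id] = record["response"]["body"]
    return successes, errors
```

For example, successes["req-1"]["choices"][0]["message"]["content"] would hold the assistant reply for the first request in the sample above.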

Scheduling

The batch worker processes requests opportunistically: it sends requests through the ingress only when models are already serving and have spare capacity, so batch work does not compete with interactive requests for latency.

How it works

  1. The worker polls for active batches and groups requests by model
  2. For each model, it checks: is the model currently serving? Is the sidecar queue depth below the threshold?
  3. If yes, it sends requests through the ingress (reusing all routing, load balancing, and wake/sleep logic)
  4. If a model is sleeping, the worker skips it (unless in a quiet window)
  5. Progress is tracked per-request — the batch status updates as each request completes
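
The decision in steps 1 to 4 can be illustrated with a small sketch. This is not the worker's actual code: ModelState, group_by_model, and should_dispatch are hypothetical names, and the comparison used here (skip only when the queue depth exceeds the threshold) is an assumption about how the threshold is applied.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModelState:
    """Hypothetical snapshot of what the worker knows about one model."""
    serving: bool
    queue_depth: int


def group_by_model(requests):
    """Group request objects by the model named in their body (step 1)."""
    groups = defaultdict(list)
    for request in requests:
        groups[request["body"]["model"]].append(request)
    return dict(groups)


def should_dispatch(state, in_quiet_window, max_queue_depth=2):
    """Decide whether batch requests may be sent for a model (steps 2 to 4)."""
    if not state.serving:
        # Sleeping models are skipped unless a quiet window is active.
        return in_quiet_window
    # Assumption: the model is skipped only when the depth exceeds the threshold.
    return state.queue_depth <= max_queue_depth
```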

Concurrency

The worker limits how many batch requests are in-flight per model, controlled by concurrencyPerModel (default: 4). It also checks the sidecar's queue depth against maxQueueDepth (default: 2) to avoid flooding models that are busy with interactive traffic.
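
A per-model in-flight limit of this kind is commonly built from one semaphore per model. The sketch below shows that general technique with asyncio; PerModelLimiter is a hypothetical class, and nothing here is taken from the worker's implementation.

```python
import asyncio


class PerModelLimiter:
    """Cap the number of concurrently running sends for each model."""

    def __init__(self, concurrency_per_model=4):
        self._limit = concurrency_per_model
        self._semaphores = {}

    def _semaphore_for(self, model):
        # Create semaphores lazily, from inside the running event loop.
        if model not in self._semaphores:
            self._semaphores[model] = asyncio.Semaphore(self._limit)
        return self._semaphores[model]

    async def run(self, model, send):
        """Await send() once a slot for this model is free."""
        async with self._semaphore_for(model):
            return await send()
```

With the default of 4, a fifth call to run for the same model waits until one of the first four finishes, while calls for other models are unaffected.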

Quiet windows

By default, the batch worker only processes requests for models that are already awake. To allow proactive processing during off-peak hours, configure quiet windows:

batch:
  scheduling:
    quietWindows:
      - start: "02:00"
        end: "06:00"
        timezone: "UTC"
      - start: "22:00"
        end: "23:00"
        timezone: "US/Pacific"

During a quiet window, the batch worker will send requests even for sleeping models — the ingress will wake them on demand. Outside quiet windows, only models already in serving state receive batch requests.
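
To reason about whether a given moment falls inside a configured window, the check can be sketched as follows. The function names are hypothetical, and two behaviors are assumptions not stated above: the end time is treated as exclusive, and a window whose end is earlier than its start is treated as wrapping past midnight.

```python
from datetime import datetime, time
from zoneinfo import ZoneInfo


def parse_hhmm(value: str) -> time:
    """Parse a start or end value such as "02:00"."""
    hour, minute = value.split(":")
    return time(int(hour), int(minute))


def time_in_window(local: time, start: time, end: time) -> bool:
    """Check a local wall-clock time against one window."""
    if start <= end:
        return start <= local < end  # assumption: end is exclusive
    # Assumption: start later than end means the window wraps past midnight.
    return local >= start or local < end


def in_quiet_window(now: datetime, windows) -> bool:
    """Return True if a timezone-aware datetime falls inside any window.

    windows is a list of dicts with start, end, and timezone keys,
    mirroring the quietWindows entries in the Helm values.
    """
    for window in windows:
        local = now.astimezone(ZoneInfo(window["timezone"])).time()
        if time_in_window(local, parse_hhmm(window["start"]), parse_hhmm(window["end"])):
            return True
    return False
```

Converting the current time into each window's own timezone before comparing is what lets the two windows in the example above use different zones.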

Enabling batch processing

Batch processing is disabled by default. Enable it in your Helm values:

batch:
  enabled: true

This deploys:

  • Batch worker — a Deployment that watches for Batch CRDs and processes requests
  • MinIO — an S3-compatible object store for file storage (can be replaced with external S3)

Using external S3 storage

To use your own S3-compatible storage instead of the built-in MinIO:

batch:
  enabled: true
  minio:
    enabled: false
  s3:
    endpoint: "https://s3.amazonaws.com"
    bucket: "my-batch-bucket"
    region: "us-east-1"
    accessKey: "AKIA..."
    secretKey: "..."
    forcePathStyle: false  # true for MinIO, false for AWS S3

Full configuration reference

batch:
  enabled: false
  scheduling:
    # Max concurrent batch requests per model
    concurrencyPerModel: 4
    # Skip model if sidecar queue depth exceeds this
    maxQueueDepth: 2
    # How often the worker polls for batch work
    pollInterval: 10s
    # Time windows when the worker may wake sleeping models
    quietWindows: []
  s3:
    endpoint: ""
    bucket: "gthread-batch"
    region: "us-east-1"
    accessKey: ""
    secretKey: ""
    forcePathStyle: true
  minio:
    enabled: true
    storage:
      storageClassName: ""
      size: 50Gi

Python SDK usage

The Batch API is compatible with the OpenAI Python SDK:

import time

from openai import OpenAI

client = OpenAI(
    base_url="https://models.example.com/v1",
    api_key="not-needed",
)

# Upload input file
with open("batch_input.jsonl", "rb") as f:
    input_file = client.files.create(file=f, purpose="batch")

# Create batch
batch = client.batches.create(
    input_file_id=input_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Poll until the batch reaches a terminal status
while batch.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(10)
    batch = client.batches.retrieve(batch.id)
    print(f"Status: {batch.status}  "
          f"Progress: {batch.request_counts.completed}/{batch.request_counts.total}")

# Download results
if batch.output_file_id:
    content = client.files.content(batch.output_file_id)
    with open("results.jsonl", "wb") as f:
        f.write(content.read())

Dashboard

Batch status is visible in the GreenThread dashboard under the Batches tab. Click any batch to see its full details, progress, file links, and errors.

Grafana (provisioned when monitoring.enabled: true) includes a dedicated Batch Processing dashboard with request throughput, success rates, in-flight counts, and worker health metrics.