What is prompt caching?

It lets a provider store a large, stable prompt prefix (a system prompt, instructions, a document) and charge a sharply reduced rate when you reuse it, instead of paying full input price on every call.

How much do batch APIs save?

Many providers offer a batch endpoint at roughly half the synchronous price in exchange for asynchronous processing within a window. It is ideal for evals, backfills, and bulk extraction or classification.

Should I cache or switch to a cheaper model first?

Try caching and batching first, since they cut cost without changing quality. Then compare your model's price across providers in the rankings and downgrade only where quality allows.

Prompt Caching and Batch APIs: Cut Your LLM Bill in Half

Q: Can I use prompt caching and batch together?

Yes, and they compound. A background batch job over documents can use a cached prefix and the batch discount at the same time, multiplying the savings.

Two underused features, prompt caching and batch APIs, can cut LLM costs dramatically without changing your model. Here is how each works and when to use them in 2026.

Most teams try to cut their LLM bill by switching to a cheaper model. Two features can save just as much without changing the model at all: prompt caching and batch APIs. Here is how each works, when to use it, and how much it can save in 2026.

For per-model prices, see the rankings; for the broader playbook, see the cheapest way to run LLMs.

Prompt caching: stop paying for the same context

Many applications send a large, stable block of context on every call: a system prompt, instructions, a document, few-shot examples. Without caching, you pay full input price for that block every single time. Prompt caching lets the provider store that prefix and charge a sharply reduced rate when you reuse it.

It helps most when:

You have a large, unchanging prefix reused across many calls (RAG context, long instructions, a knowledge base).
Your traffic is bursty enough that cached prefixes stay warm.

The mechanics differ by provider (cache lifetime, minimum size, discount), so check the provider's docs, but the pattern is the same: structure your prompt so the stable part comes first and is cacheable.

Batch APIs: trade latency for a big discount

A batch API lets you submit a large set of requests to be processed asynchronously, usually within a window, in exchange for a substantial discount off the synchronous price (commonly around half). It is ideal for work that does not need an immediate response:

Evals and test runs.
Bulk extraction, classification, summarization and tagging.
Embeddings backfills.

Many major providers offer a batch endpoint at roughly half the standard rate; our per-model trackers note where a Batch tier exists, for example the GPT-5.4 tracker.

Use them together

The two stack. A nightly batch job over documents can use a cached prefix and the batch discount at once, compounding the savings. A simple rule:

Interactive, latency-sensitive traffic: synchronous, with prompt caching on the stable prefix.
Background, non-urgent traffic: the batch API, also with caching where it applies.

A short checklist

Identify your largest reused prompt prefix and make it cacheable.
Move every non-interactive job (evals, backfills, bulk processing) to a batch endpoint.
Cap output length and trim prompts; cheaper tokens still beat expensive ones.
Re-compare your model's price per provider in the rankings.

The bottom line

Before you downgrade a model, cache your stable context and batch your background work. Together they can halve a bill without touching quality. Compare providers in the rankings and create an account to track them.