Most teams try to cut their LLM bill by switching to a cheaper model. Two features can save just as much without changing the model at all: prompt caching and batch APIs. Here is how each works, when to use it, and how much it can save in 2026.
For per-model prices, see the rankings; for the broader playbook, see the cheapest way to run LLMs.
Prompt caching: stop paying for the same context
Many applications send a large, stable block of context on every call: a system prompt, instructions, a document, few-shot examples. Without caching, you pay full input price for that block every single time. Prompt caching lets the provider store that prefix and charge a sharply reduced rate when you reuse it.
It helps most when:
- You have a large, unchanging prefix reused across many calls (RAG context, long instructions, a knowledge base).
- Your traffic is bursty enough that cached prefixes stay warm.
The mechanics differ by provider (cache lifetime, minimum size, discount), so check the provider's docs, but the pattern is the same: structure your prompt so the stable part comes first and is cacheable.
Batch APIs: trade latency for a big discount
A batch API lets you submit a large set of requests to be processed asynchronously, usually within a window, in exchange for a substantial discount off the synchronous price (commonly around half). It is ideal for work that does not need an immediate response:
- Evals and test runs.
- Bulk extraction, classification, summarization and tagging.
- Embeddings backfills.
Many major providers offer a batch endpoint at roughly half the standard rate; our per-model trackers note where a Batch tier exists, for example the GPT-5.4 tracker.
Use them together
The two stack. A nightly batch job over documents can use a cached prefix and the batch discount at once, compounding the savings. A simple rule:
- Interactive, latency-sensitive traffic: synchronous, with prompt caching on the stable prefix.
- Background, non-urgent traffic: the batch API, also with caching where it applies.
A short checklist
- Identify your largest reused prompt prefix and make it cacheable.
- Move every non-interactive job (evals, backfills, bulk processing) to a batch endpoint.
- Cap output length and trim prompts; cheaper tokens still beat expensive ones.
- Re-compare your model's price per provider in the rankings.
The bottom line
Before you downgrade a model, cache your stable context and batch your background work. Together they can halve a bill without touching quality. Compare providers in the rankings and create a free account to track them.
Related: the cheapest way to run LLMs and OpenRouter vs going direct.