What is the cheapest text-to-speech API?

For most workloads it is an open TTS model on a low-cost inference host, which usually costs far less per minute of audio than a premium hosted voice. Reserve premium expressive or cloned voices for content that needs them. Because rates move, compare current per-model prices in the Perkstack rankings before committing.

Is there a free TTS API?

Yes. Several voice-AI providers offer a monthly free text-to-speech allowance, and others give signup credits you can spend on TTS. A free tier can cover a low-volume app indefinitely. The current, dated list is in the Perkstack catalog.

Why do TTS prices look so different between providers?

Because providers meter different things. Some bill per character of input text, some per second or minute of output audio, and some use abstract credits. Normalize every quote to one unit (a rough guide is fifteen to twenty characters per second of speech) before comparing.

How do I lower my text-to-speech costs?

Cache repeated audio instead of re-synthesizing it, use a standard or open voice for the bulk of your volume, trim and clean the input text you pay for, batch non-interactive jobs instead of streaming, and match the audio format to the channel.

The Cheapest Text-to-Speech (TTS) API in 2026

Compare the cheapest text-to-speech APIs in 2026: how per-character and per-second billing differ, and which free TTS tiers and credits to start with.

Text-to-speech is one of the cheapest AI features to add and one of the easiest to overpay for at scale. The same minute of generated audio can cost very different amounts depending on the model you pick, how the provider meters output, and whether you use a premium hosted voice or an open model. Here is how to find the cheapest text-to-speech API in 2026, and how to keep the bill down as your volume grows.

For the live, per-model price comparison across providers, see the Perkstack rankings. This guide is the playbook behind it. (If you also need transcription, the companion piece is the cheapest speech-to-text API.)

How TTS pricing actually works

TTS pricing looks simple until you compare two providers and find they meter different things. There are three common billing units, and confusing them is the easiest way to misjudge cost:

Per character. You pay for the length of the input text, usually per thousand or per million characters. This is the most widespread model for hosted neural voices.
Per second or per minute of audio. You pay for the length of the output you generate. Two providers can quote wildly different numbers simply because one bills input characters and the other bills output seconds.
Per credit or per request. Some platforms abstract usage into credits, where a credit maps to some amount of characters or audio. These are the hardest to compare directly.

Before you compare any two providers, normalize everything to one unit. A rough rule of thumb is that natural speech runs somewhere around fifteen to twenty characters per second of audio, so you can convert a per-character price into an approximate per-minute price (and vice versa) to put quotes side by side. Exact rates move, so treat published numbers as a starting point and defer to the current figures in the rankings and the catalog.
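The conversion above can be sketched in a few lines of Python. This is a minimal illustration, not a pricing tool: the function names are made up for this example, the quote of 10 per million characters is a hypothetical figure, and the speaking rate is the fifteen-to-twenty characters per second rule of thumb rather than a measured value.

```python
# Assumed speaking-rate range, taken from the rule of thumb in the text.
CHARS_PER_SECOND_LOW = 15
CHARS_PER_SECOND_HIGH = 20


def per_minute_from_per_million_chars(price_per_million_chars: float,
                                      chars_per_second: float) -> float:
    """Convert a per-character quote into an approximate per-minute price."""
    chars_per_minute = chars_per_second * 60
    return price_per_million_chars * chars_per_minute / 1_000_000


def per_million_chars_from_per_minute(price_per_minute: float,
                                      chars_per_second: float) -> float:
    """Convert a per-minute quote into an approximate per-million-characters price."""
    chars_per_minute = chars_per_second * 60
    return price_per_minute * 1_000_000 / chars_per_minute


if __name__ == "__main__":
    quote = 10.0  # hypothetical quote: 10 currency units per million characters
    low = per_minute_from_per_million_chars(quote, CHARS_PER_SECOND_LOW)
    high = per_minute_from_per_million_chars(quote, CHARS_PER_SECOND_HIGH)
    print(f"Roughly {low:.4f} to {high:.4f} per minute of audio")
```

Because the speaking rate is only an estimate, it is safer to carry the low and high figures through the comparison than to collapse them into a single number.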

What drives the price

Two levers decide what you pay per minute of speech:

The model and voice tier. Premium, highly expressive voices (the kind used for audiobooks, characters, or branded assistants) cost the most per character. Standard neural voices are noticeably cheaper, and they are good enough for notifications, IVR prompts, and most product narration.
Hosted voice versus open model. Frontier hosted voices from dedicated voice-AI vendors carry a premium for quality and features. Open TTS models served on general inference hosts are usually far cheaper per minute, and the quality gap has narrowed considerably.

Open TTS models are the cost lever

The biggest single saving, once you are past free credits, is usually moving suitable workloads to an open TTS model on a cheap inference host. A small, fast open model is often more than good enough for IVR and UI prompts, alerts, and straightforward narration, and it can cost a fraction of a premium hosted voice per minute.

Reserve the premium hosted voices for the cases that genuinely need them: emotional range, fine control over pacing and emphasis, or a specific cloned or branded voice. For everything else, an open model on a low-cost host is frequently the cheapest text-to-speech API for the job. Because most TTS endpoints are a simple HTTP call, switching a workload from one provider to another is usually a few lines of code, not a rewrite.
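One way to keep that switch cheap is to hide each provider behind the same small function signature and route by tier. The sketch below is an assumption about how such routing might look, not any vendor's API: the two backends are stand-in functions that return placeholder bytes where a real HTTP call would go, and the tier names are hypothetical.

```python
from typing import Callable, Dict

# A synthesizer is any callable that turns (text, voice) into audio bytes.
Synthesizer = Callable[[str, str], bytes]


def premium_backend(text: str, voice: str) -> bytes:
    # Placeholder for an HTTP call to a premium hosted voice (hypothetical).
    return f"premium:{voice}:{text}".encode("utf-8")


def open_model_backend(text: str, voice: str) -> bytes:
    # Placeholder for an HTTP call to an open model on a low-cost host (hypothetical).
    return f"open:{voice}:{text}".encode("utf-8")


# Routing table: changing providers means editing this mapping, not the callers.
ROUTES: Dict[str, Synthesizer] = {
    "expressive": premium_backend,
    "standard": open_model_backend,
}


def synthesize(text: str, voice: str, tier: str = "standard") -> bytes:
    """Send the job to the backend for its tier, defaulting to the cheap one."""
    backend = ROUTES.get(tier, ROUTES["standard"])
    return backend(text, voice)


if __name__ == "__main__":
    print(synthesize("Your order has shipped.", "voice-a"))
    print(synthesize("Once upon a time...", "voice-b", tier="expressive"))
```

Unknown tiers fall back to the standard backend, so a mislabelled job lands on the cheaper route rather than the premium one.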

Our per-model price trackers show the cheapest verified endpoint for the voices we follow, normalized to a common unit and re-pulled regularly, so you can see at a glance where to route each job.

Watch the features that quietly raise the bill

TTS has a few cost multipliers that are easy to miss when you only compare the headline rate:

Voice cloning and custom voices. Creating or hosting a custom or cloned voice often carries a separate fee or a higher per-character rate than the stock voices.
Streaming versus batch. Low-latency streaming synthesis (for live agents) can be priced higher than generating a finished file in one batch call. If you do not need real-time playback, batch is usually cheaper.
Audio format and quality. Higher sample rates and lossless formats produce larger files. The synthesis price may be the same, but your storage and bandwidth costs are not.
Re-synthesis. Every time you regenerate the same line, you pay again. For static or repeated phrases, this is pure waste.
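To put the format point in numbers, file size for uncompressed PCM audio is sample rate times bytes per sample times channels times duration, and for a constant-bitrate compressed format it is bitrate times duration. The sketch below applies that arithmetic; the specific sample rates and the 64 kbps bitrate are illustrative choices, and container header overhead is ignored.

```python
def pcm_bytes(seconds: float, sample_rate_hz: int,
              bit_depth: int = 16, channels: int = 1) -> int:
    """Approximate size of uncompressed PCM audio, ignoring the file header."""
    bytes_per_sample = bit_depth // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


def compressed_bytes(seconds: float, bitrate_kbps: int) -> int:
    """Approximate size of constant-bitrate compressed audio."""
    return int(seconds * bitrate_kbps * 1000 / 8)


if __name__ == "__main__":
    one_minute = 60
    print("48 kHz 16-bit mono PCM:", pcm_bytes(one_minute, 48_000), "bytes")
    print(" 8 kHz 16-bit mono PCM:", pcm_bytes(one_minute, 8_000), "bytes")
    print("64 kbps compressed:    ", compressed_bytes(one_minute, 64), "bytes")
```

Under these assumptions one minute of 48 kHz PCM is 5,760,000 bytes against 480,000 bytes at 64 kbps, a twelvefold difference in storage and bandwidth for the same synthesis call.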

Cut the cost per minute

Price per minute of audio is not fixed. You can lower it without changing providers:

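Cache aggressively. The same text in the same voice and settings yields audio that is interchangeable for practical purposes, even if a model's output is not bit-for-bit identical between runs. Store generated clips for anything repeated (UI prompts, standard responses, common phrases) instead of re-synthesizing on every request.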
Right-size the voice. Use a standard neural voice where a premium one adds nothing the user will notice, and reserve the expensive expressive voices for content that benefits.
Trim and clean input. You pay for characters or output length, so strip markup, normalize whitespace, and avoid synthesizing text the user will not hear.
Batch where you can. Generate non-interactive audio (digests, summaries, scheduled content) in batches rather than as real-time streaming calls when latency does not matter.
Pick the right format. Match the audio format to the delivery channel rather than defaulting to the largest one.
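The caching and input-cleaning steps above can be combined in one small wrapper. This is a minimal sketch under stated assumptions: the class and function names are invented for illustration, the cache is an in-memory dictionary where a real service would more likely use object storage, and the injected synthesize callable stands in for whatever provider call you actually make. Note that the tag-stripping regex would also remove SSML, so skip it if you send SSML on purpose.

```python
import hashlib
import re
from typing import Callable, Dict


def clean_text(text: str) -> str:
    """Strip simple markup tags and collapse whitespace before synthesis."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


def cache_key(text: str, voice: str, audio_format: str) -> str:
    """Key on everything that changes the audio: voice, format, and text."""
    payload = "\x1f".join((voice, audio_format, text))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class CachedTTS:
    """Wraps a paid synthesize call so repeated lines are only billed once."""

    def __init__(self, synthesize: Callable[[str, str, str], bytes]) -> None:
        self._synthesize = synthesize      # hypothetical provider call
        self._store: Dict[str, bytes] = {}  # in-memory cache for illustration
        self.paid_calls = 0

    def speak(self, text: str, voice: str, audio_format: str = "mp3") -> bytes:
        cleaned = clean_text(text)
        key = cache_key(cleaned, voice, audio_format)
        if key not in self._store:
            self._store[key] = self._synthesize(cleaned, voice, audio_format)
            self.paid_calls += 1
        return self._store[key]
```

Cleaning before hashing means that two requests differing only in markup or whitespace share one cache entry, which raises the hit rate as well as cutting billed characters.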

Start on free TTS credits and tiers

Before you pay anything, several providers let you generate speech for free, either through a monthly free allowance or through signup credits you can spend on text-to-speech. A monthly free TTS tier can cover a low-volume app indefinitely, and a one-time signup credit is enough to build and evaluate a feature before you commit. The always-current, dated list of voice-AI credits, with the per-provider details, is in the catalog. For the broader picture of free voice-AI options alongside transcription, see the cheapest speech-to-text API and free AI API credits.

Bottom line

The cheapest text-to-speech API in 2026 is rarely the first vendor you reach for. Normalize every quote to one unit, start on a free tier or signup credit, use a standard or open voice for the bulk of your volume, cache repeated audio, and reserve premium hosted voices for the lines that need them. Compare providers per model in the rankings, grab the current credits from the catalog, and create a free Perkstack account to unlock the apply links.