Originally published on AI School — free AI & ML courses, no signup. Full guide: What Does the Claude API Actually Cost?
Per-token prices are public, but your bill is determined by three multipliers most teams ignore: caching, batching, and model routing. Here is the real math, with four fully worked scenarios.
Prices verified June 2026 — always confirm at anthropic.com/pricing.
The List Prices (June 2026)
Claude is billed per million tokens (MTok), with separate rates for input (what you send) and output (what the model generates). A token is roughly ¾ of an English word.
| Model | Input / MTok | Output / MTok | Context | Sweet spot |
|---|---|---|---|---|
| Claude Opus 4.8 | $5.00 | $25.00 | 1M tokens | Agents, hard reasoning, long-horizon coding |
| Claude Sonnet 4.6 | $3.00 | $15.00 | 1M tokens | Most production workloads |
| Claude Haiku 4.5 | $1.00 | $5.00 | 200K tokens | Classification, routing, real-time chat |
Two structural facts shape everything below:
- Output costs 5× input. An app that generates long answers pays mostly for output; an app that reads long documents pays mostly for input. Know which one you are.
- Input is re-billed every call. In a 20-turn conversation, turn 20 re-sends (and re-pays for) everything from turns 1–19 — unless you cache it.
The Three Multipliers
1. Prompt caching: reads at 0.1×, writes at 1.25×
Any stable prefix of your request (system prompt, documents, conversation history) can be cached. Cached tokens are re-read at 10% of the input price; writing them to cache costs a one-time 25% premium (or 2× for the 1-hour cache lifetime instead of the default 5 minutes).
| Model | Base input | Cache write (5-min) | Cache read |
|---|---|---|---|
| Opus 4.8 | $5.00 | $6.25 | $0.50 |
| Sonnet 4.6 | $3.00 | $3.75 | $0.30 |
| Haiku 4.5 | $1.00 | $1.25 | $0.10 |
Break-even is fast: with the 5-minute cache, the second request already saves money (1.25× + 0.1× = 1.35× vs 2× uncached).
⚠️ The silent minimum-size gotcha. Prefixes below a model-specific minimum silently refuse to cache — no error, you just pay full price forever. The minimum is 4,096 tokens on Opus 4.8 and Haiku 4.5 and 2,048 on Sonnet 4.6. A tidy 3,000-token system prompt on Haiku never caches. Check
usage.cache_read_input_tokensin responses: if it stays 0, your "cached" prompt isn't.
2. Batch API: everything at 50% off
Jobs that can wait up to an hour (most finish faster) can run through the Message Batches API at half price on all tokens — and batching stacks with caching.
3. Model routing: a 5× lever before you optimize anything
Haiku input is 5× cheaper than Opus, output 5× cheaper. The standard production pattern is a cascade: Haiku handles the easy 80%, escalates the hard 20% to Sonnet or Opus. Optimize cost per successful task, not cost per token — a cheap model that fails and retries can out-spend an expensive one that succeeds first try.
Scenario 1 — Support Chatbot (Haiku 4.5)
Assumptions: 100,000 messages/month; 5,000-token system prompt (instructions + few-shot examples — deliberately above Haiku's 4,096 caching minimum); average 1,500 tokens of conversation history + 100-token user message per call; 300-token replies.
| Per message | Per month (100K msgs) | |
|---|---|---|
| No caching: 6,600 in × $1 + 300 out × $5 | $0.0081 | $810 |
| System prompt cached: 5,000 read × $0.10 + 1,600 in × $1 + 300 out × $5 | $0.0036 | $360 |
One cache_control breakpoint cuts the bill by 56%. Caching the conversation history too (the standard multi-turn pattern) pushes savings further on longer chats.
Scenario 2 — RAG Document Q&A (Sonnet 4.6)
Assumptions: a 50,000-token document loaded into context; users ask 20 questions per document session; 500-token questions, 800-token answers.
| Cost per 20-question session | |
|---|---|
| No caching: every question re-sends the document at $3/MTok | $3.27 |
| Document cached: one $0.19 write, then 19 reads at $0.30/MTok | $0.74 |
That is 77% off, and the cached version also responds faster — the model doesn't reprocess 50K tokens per question. At 1,000 document sessions a month, caching is the difference between $3,270 and $740.
Scenario 3 — Autonomous Coding Agent (Opus 4.8)
Agents are where costs explode, because the context is re-sent on every tool call. Assumptions: one task = 40 model calls; context grows from 20K to 150K tokens across the run (average 85K per call); ~500 output tokens per call.
| Per task | 50 tasks/day | |
|---|---|---|
| No caching: 40 calls × ~85K input at $5/MTok + 20K output | $17.50 | $875/day |
| Incremental caching: each call re-reads the prefix at $0.50/MTok, only the ~3K new tokens pay the write premium | ≈$2.95 | ≈$148/day |
~83% off. For agentic workloads, prompt caching is not an optimization — it's the difference between a viable product and an impossible one. (Anthropic's own agent products rely on exactly this pattern.)
Scenario 4 — Nightly Classification Job (Haiku + Batch)
Assumptions: 100,000 records classified overnight; 400 input + 10 output tokens each.
| Per night | Per year | |
|---|---|---|
| Real-time API: 40M in × $1 + 1M out × $5 | $45.00 | $16,425 |
| Batch API (50% off everything) | $22.50 | $8,213 |
If those records share a cacheable instruction prefix, batch + caching stack — many classification jobs land under $15/night.
Estimate Your Own Workload
Token counting is free — you can price a workload before spending anything:
# pip install anthropic
import anthropic
client = anthropic.Anthropic()
count = client.messages.count_tokens(
model="claude-sonnet-4-6",
system=MY_SYSTEM_PROMPT,
messages=[{"role": "user", "content": SAMPLE_REQUEST}],
)
IN_PRICE, OUT_PRICE = 3.00, 15.00 # Sonnet 4.6, $/MTok
est_output = 600 # your average reply length
per_call = (count.input_tokens * IN_PRICE + est_output * OUT_PRICE) / 1_000_000
print(f"{count.input_tokens} input tokens -> ${per_call:.4f} per call")
print(f"At 10K calls/day: ${per_call * 10_000:.2f}/day")
Then check the usage object on real responses — input_tokens, output_tokens, cache_read_input_tokens — to verify your assumptions against reality.
The Checklist
- Know your shape: input-heavy (RAG, agents) or output-heavy (generation)? Optimize the expensive side first.
-
Cache anything stable over the minimum size (4,096 tokens on Opus/Haiku, 2,048 on Sonnet) that's reused within 5 minutes — and verify with
cache_read_input_tokens. - Batch anything that can wait an hour — flat 50% off, stacks with caching.
- Route by difficulty: Haiku first, escalate on failure. Measure cost per successful task.
-
Cap output: set
max_tokensdeliberately and prompt for concise answers — output is the 5×-priced direction. - Re-price quarterly: model prices and caching mechanics change; the math here is June 2026.
Sources: Anthropic pricing · Prompt caching docs · Batch API docs. All scenario math uses list prices as of June 4, 2026; assumptions are stated inline so you can re-run them with your own numbers.
I write these as part of AI School, a free learning platform (no signup, no paywall). The cost-control techniques above are covered in depth in the free Token Optimization course — context engineering, output control, and cost governance.
Top comments (0)