DEV Community

toshanthi-stack
toshanthi-stack

Posted on • Originally published at lillytechsystems.com

What Does the Claude API Actually Cost? (June 2026)

Originally published on AI School — free AI & ML courses, no signup. Full guide: What Does the Claude API Actually Cost?

Per-token prices are public, but your bill is determined by three multipliers most teams ignore: caching, batching, and model routing. Here is the real math, with four fully worked scenarios.

Prices verified June 2026 — always confirm at anthropic.com/pricing.

The List Prices (June 2026)

Claude is billed per million tokens (MTok), with separate rates for input (what you send) and output (what the model generates). A token is roughly ¾ of an English word.

Model Input / MTok Output / MTok Context Sweet spot
Claude Opus 4.8 $5.00 $25.00 1M tokens Agents, hard reasoning, long-horizon coding
Claude Sonnet 4.6 $3.00 $15.00 1M tokens Most production workloads
Claude Haiku 4.5 $1.00 $5.00 200K tokens Classification, routing, real-time chat

Two structural facts shape everything below:

  • Output costs 5× input. An app that generates long answers pays mostly for output; an app that reads long documents pays mostly for input. Know which one you are.
  • Input is re-billed every call. In a 20-turn conversation, turn 20 re-sends (and re-pays for) everything from turns 1–19 — unless you cache it.

The Three Multipliers

1. Prompt caching: reads at 0.1×, writes at 1.25×

Any stable prefix of your request (system prompt, documents, conversation history) can be cached. Cached tokens are re-read at 10% of the input price; writing them to cache costs a one-time 25% premium (or 2× for the 1-hour cache lifetime instead of the default 5 minutes).

Model Base input Cache write (5-min) Cache read
Opus 4.8 $5.00 $6.25 $0.50
Sonnet 4.6 $3.00 $3.75 $0.30
Haiku 4.5 $1.00 $1.25 $0.10

Break-even is fast: with the 5-minute cache, the second request already saves money (1.25× + 0.1× = 1.35× vs 2× uncached).

⚠️ The silent minimum-size gotcha. Prefixes below a model-specific minimum silently refuse to cache — no error, you just pay full price forever. The minimum is 4,096 tokens on Opus 4.8 and Haiku 4.5 and 2,048 on Sonnet 4.6. A tidy 3,000-token system prompt on Haiku never caches. Check usage.cache_read_input_tokens in responses: if it stays 0, your "cached" prompt isn't.

2. Batch API: everything at 50% off

Jobs that can wait up to an hour (most finish faster) can run through the Message Batches API at half price on all tokens — and batching stacks with caching.

3. Model routing: a 5× lever before you optimize anything

Haiku input is 5× cheaper than Opus, output 5× cheaper. The standard production pattern is a cascade: Haiku handles the easy 80%, escalates the hard 20% to Sonnet or Opus. Optimize cost per successful task, not cost per token — a cheap model that fails and retries can out-spend an expensive one that succeeds first try.

Scenario 1 — Support Chatbot (Haiku 4.5)

Assumptions: 100,000 messages/month; 5,000-token system prompt (instructions + few-shot examples — deliberately above Haiku's 4,096 caching minimum); average 1,500 tokens of conversation history + 100-token user message per call; 300-token replies.

Per message Per month (100K msgs)
No caching: 6,600 in × $1 + 300 out × $5 $0.0081 $810
System prompt cached: 5,000 read × $0.10 + 1,600 in × $1 + 300 out × $5 $0.0036 $360

One cache_control breakpoint cuts the bill by 56%. Caching the conversation history too (the standard multi-turn pattern) pushes savings further on longer chats.

Scenario 2 — RAG Document Q&A (Sonnet 4.6)

Assumptions: a 50,000-token document loaded into context; users ask 20 questions per document session; 500-token questions, 800-token answers.

Cost per 20-question session
No caching: every question re-sends the document at $3/MTok $3.27
Document cached: one $0.19 write, then 19 reads at $0.30/MTok $0.74

That is 77% off, and the cached version also responds faster — the model doesn't reprocess 50K tokens per question. At 1,000 document sessions a month, caching is the difference between $3,270 and $740.

Scenario 3 — Autonomous Coding Agent (Opus 4.8)

Agents are where costs explode, because the context is re-sent on every tool call. Assumptions: one task = 40 model calls; context grows from 20K to 150K tokens across the run (average 85K per call); ~500 output tokens per call.

Per task 50 tasks/day
No caching: 40 calls × ~85K input at $5/MTok + 20K output $17.50 $875/day
Incremental caching: each call re-reads the prefix at $0.50/MTok, only the ~3K new tokens pay the write premium ≈$2.95 ≈$148/day

~83% off. For agentic workloads, prompt caching is not an optimization — it's the difference between a viable product and an impossible one. (Anthropic's own agent products rely on exactly this pattern.)

Scenario 4 — Nightly Classification Job (Haiku + Batch)

Assumptions: 100,000 records classified overnight; 400 input + 10 output tokens each.

Per night Per year
Real-time API: 40M in × $1 + 1M out × $5 $45.00 $16,425
Batch API (50% off everything) $22.50 $8,213

If those records share a cacheable instruction prefix, batch + caching stack — many classification jobs land under $15/night.

Estimate Your Own Workload

Token counting is free — you can price a workload before spending anything:

# pip install anthropic
import anthropic

client = anthropic.Anthropic()

count = client.messages.count_tokens(
    model="claude-sonnet-4-6",
    system=MY_SYSTEM_PROMPT,
    messages=[{"role": "user", "content": SAMPLE_REQUEST}],
)

IN_PRICE, OUT_PRICE = 3.00, 15.00   # Sonnet 4.6, $/MTok
est_output = 600                     # your average reply length

per_call = (count.input_tokens * IN_PRICE + est_output * OUT_PRICE) / 1_000_000
print(f"{count.input_tokens} input tokens -> ${per_call:.4f} per call")
print(f"At 10K calls/day: ${per_call * 10_000:.2f}/day")
Enter fullscreen mode Exit fullscreen mode

Then check the usage object on real responses — input_tokens, output_tokens, cache_read_input_tokens — to verify your assumptions against reality.

The Checklist

  • Know your shape: input-heavy (RAG, agents) or output-heavy (generation)? Optimize the expensive side first.
  • Cache anything stable over the minimum size (4,096 tokens on Opus/Haiku, 2,048 on Sonnet) that's reused within 5 minutes — and verify with cache_read_input_tokens.
  • Batch anything that can wait an hour — flat 50% off, stacks with caching.
  • Route by difficulty: Haiku first, escalate on failure. Measure cost per successful task.
  • Cap output: set max_tokens deliberately and prompt for concise answers — output is the 5×-priced direction.
  • Re-price quarterly: model prices and caching mechanics change; the math here is June 2026.

Sources: Anthropic pricing · Prompt caching docs · Batch API docs. All scenario math uses list prices as of June 4, 2026; assumptions are stated inline so you can re-run them with your own numbers.


I write these as part of AI School, a free learning platform (no signup, no paywall). The cost-control techniques above are covered in depth in the free Token Optimization course — context engineering, output control, and cost governance.

Top comments (0)