LLM Cost Calculator
Compare what your AI workload would cost across 12 current models from OpenAI, Anthropic, Google, xAI, Mistral, and open-model hosts. Enter average input and output tokens per request and your requests per day — or pick a preset scenario — and see daily, monthly, and annual cost for every model side by side, with the cheapest option badged. Sort any column, drag the volume slider to see the cost curve, and copy or export the comparison as CSV. Free, in your browser, no signup.
How to Use This Tool
- Enter typical input and output tokens for a single request — or click a preset (Chatbot, Content Gen, Summarization) to fill realistic values instantly. Not sure of your token counts? Use our AI Token Counter first.
- Set your requests per day by typing a number or dragging the slider to watch the cost scale in real time.
- Read the comparison table. Twelve models are ranked with input/output prices and daily, monthly, and annual cost. The cheapest is badged.
- Sort to compare — click any column heading (Daily, Monthly, Annual, or the price columns) to re-rank; click again to reverse.
- Copy or export. Copy the table as text for a doc or ticket, or export a CSV for your budget spreadsheet.
- Apply the savings tips below to cut the numbers you see — output capping, model routing, caching, and batch APIs are the big levers.
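The ranking the table performs can be approximated in a few lines. This is a minimal sketch, not the calculator's actual code: it uses only the eight prices quoted in the FAQ on this page, and the workload (500 input tokens, 200 output tokens, 1,000 requests per day) is a hypothetical example rather than one of the presets.

```python
# Prices in USD per 1M tokens as (input, output), copied from the FAQ on this page.
# They are a snapshot and will drift; confirm current rates before relying on them.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "GPT-4.1": (2.00, 8.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Haiku 4.5": (1.00, 5.00),
    "GPT-5.4 mini": (0.75, 4.50),
    "Gemini 3.1 Flash-Lite": (0.25, 1.50),
    "DeepSeek V4-Flash": (0.14, 0.28),
}


def daily_cost(input_tokens, output_tokens, requests_per_day, price):
    """Daily cost in USD for one model at the given per-request token counts."""
    in_price, out_price = price
    per_request = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return per_request * requests_per_day


def rank_models(input_tokens, output_tokens, requests_per_day, prices=PRICES):
    """Return (model, daily_cost) rows sorted cheapest first."""
    rows = [
        (name, daily_cost(input_tokens, output_tokens, requests_per_day, price))
        for name, price in prices.items()
    ]
    return sorted(rows, key=lambda row: row[1])


# Hypothetical workload: 500 input tokens, 200 output tokens, 1,000 requests/day.
ranked = rank_models(500, 200, 1_000)
for position, (name, cost) in enumerate(ranked):
    badge = "  <- cheapest" if position == 0 else ""
    print(f"{name:<24} ${cost:>8.3f}/day{badge}")
```

Sorting by the cost column is the whole ranking step; re-sorting by a price column would only change the `key` function.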
About LLM Pricing & Cost Comparison
AI API costs are deceptively simple to price and surprisingly easy to get wrong. Every major provider charges per token, with a separate rate for the tokens you send (input) and the tokens the model generates (output). Multiply by your request volume and you have a bill — but the spread between models is enormous. The same workload can cost thirty times more on a flagship model than on a small one, and the “obvious” choice is rarely the cheapest one that actually does the job. This calculator removes the guesswork: put in your real token counts and volume, and it ranks twelve models — from GPT-5.5 and Claude Opus 4.8 at the premium end to Gemini Flash-Lite, Claude Haiku 4.5, and DeepSeek at the budget end — by daily, monthly, and annual cost.
The most important thing to understand about LLM pricing is the input/output asymmetry. Output tokens usually cost several times more than input tokens, four to six times for most of the models quoted on this page, because generating text is more expensive than reading it. GPT-5.5 is $5 per million input tokens but $30 per million output; Claude Opus 4.8 is $5 and $25. This means your cost is often driven less by how much you send and more by how much the model writes back. A summarization tool (long input, short output) has a very different cost profile from a content generator (short input, long output), even at the same request count. Because this calculator prices the two sides separately at each model's real rates, it captures that asymmetry, which is exactly why the preset scenarios produce such different rankings.
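To make the asymmetry concrete, here is a small worked example at the GPT-5.5 rates quoted above ($5 input, $30 output per 1M tokens). The two workload shapes are hypothetical and deliberately mirror each other, so both requests use the same 4,300 total tokens.

```python
def request_cost(input_tokens, output_tokens, in_price, out_price):
    """Return (input_cost, output_cost) in USD; prices are USD per 1M tokens."""
    input_cost = input_tokens * in_price / 1_000_000
    output_cost = output_tokens * out_price / 1_000_000
    return input_cost, output_cost


GPT_5_5 = (5.00, 30.00)  # (input, output) price quoted on this page

# Hypothetical workload shapes: (input tokens, output tokens) per request.
WORKLOADS = {
    "summarization": (4_000, 300),
    "content generation": (300, 4_000),
}

for name, (tokens_in, tokens_out) in WORKLOADS.items():
    cost_in, cost_out = request_cost(tokens_in, tokens_out, *GPT_5_5)
    total = cost_in + cost_out
    print(f"{name}: ${total:.4f} per request, {cost_out / total:.0%} from output")
```

Under these assumptions the content-generation request costs roughly four times as much as the summarization request, even though the total token count is identical.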
Scale is the other half of the story. A cost that's trivial per request (a fraction of a cent) becomes a real budget line when multiplied by thousands of requests a day and then by 365 days. That's why the table projects daily, monthly, and annual figures and why the volume slider matters: dragging it reveals the cost curve and the point at which a model that's fine for a prototype becomes unaffordable in production. It's also where model choice compounds: a 10× per-token difference becomes a 10× difference in your annual API bill, so the few minutes spent comparing here can be worth tens of thousands of dollars at scale.
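The projection itself is simple multiplication. The sketch below assumes a 30-day month and a 365-day year; the page does not state which month convention the calculator uses, so treat `days_per_month` as an assumption. The per-request cost is a hypothetical figure (500 input and 200 output tokens at $5/$30 per 1M tokens).

```python
def project(cost_per_request, requests_per_day, days_per_month=30, days_per_year=365):
    """Return (daily, monthly, annual) cost in USD for a steady request volume."""
    daily = cost_per_request * requests_per_day
    return daily, daily * days_per_month, daily * days_per_year


# Hypothetical per-request cost: 500 input + 200 output tokens at $5/$30 per 1M.
COST_PER_REQUEST = 0.0085

# Walk the volume "slider" across four orders of magnitude.
for volume in (100, 1_000, 10_000, 100_000):
    daily, monthly, annual = project(COST_PER_REQUEST, volume)
    print(f"{volume:>7,} req/day: ${daily:,.2f}/day  ${monthly:,.2f}/mo  ${annual:,.2f}/yr")
```

Because cost is linear in volume, the curve is a straight line; what changes with scale is whether the annual figure is a rounding error or a budget line.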
The figures here are a starting point, not a final invoice. They use standard real-time list prices and deliberately exclude the things that move real bills in both directions: prompt caching and batch APIs can each cut costs by 50% or more, while retries, conversation history, RAG context, and agentic multi-call loops can push them well above a naive per-request estimate. Prices also change frequently as providers compete, so always confirm current rates before committing a budget. Treat the comparison as a reliable relative ranking — which model is cheapest for your shape of workload — and as a solid order-of-magnitude absolute estimate, then layer in your own caching, batching, and overhead assumptions.
Choosing the right model is the single biggest lever on AI cost, but a fully optimized system — caching, model routing, output controls, and batching working together — is where the real savings live. Our AI-Powered Marketing team architects cost-efficient LLM systems end to end, routinely cutting AI bills by 50% or more without sacrificing quality. Pair this calculator with the AI Token Counter to measure your real token usage, the AI Prompt Builder to write leaner prompts, and the ROI Calculator to weigh the cost against the value it generates.
Frequently Asked Questions
Is GPT-4 or Claude cheaper?
It depends entirely on which versions you compare. At the top end they are similar: GPT-5.5 ($5/$30 per 1M tokens) and Claude Opus 4.8 ($5/$25) sit near each other, while mid-tiers like GPT-4.1 ($2/$8) and Claude Sonnet 4.6 ($3/$15) trade blows. At the cheap end, Claude Haiku 4.5 ($1/$5) and GPT-5.4 mini ($0.75/$4.50) are inexpensive, and budget options like Google's Gemini 3.1 Flash-Lite ($0.25/$1.50) and DeepSeek V4-Flash ($0.14/$0.28) undercut almost everything. The honest answer is that 'GPT-4 vs Claude' is the wrong comparison — you should compare the specific model tiers that can actually do your task, at your real input/output token mix. This calculator does exactly that: enter your tokens and volume and it ranks all twelve models by daily, monthly, and annual cost so the cheapest capable option is obvious.
Why are input and output tokens priced differently?
Generating text is more computationally expensive than reading it, so providers charge more for output tokens than input tokens, typically four to six times more for the models quoted on this page. For example, GPT-5.5 is $5 per million input tokens but $30 per million output; Claude Opus 4.8 is $5 input and $25 output. This has a big practical consequence: a workload that produces long answers from short prompts can cost far more than one with long prompts and short answers, even at the same total token count. The biggest cost lever for most applications is therefore controlling output length: setting max_tokens, asking for concise responses, and discouraging padded answers. This calculator prices input and output separately at each model's real rates so your comparison reflects your actual input/output ratio, not a blended average.
What are prompt caching discounts?
Prompt caching lets you reuse a large, unchanging chunk of context — a long system prompt, a document, a set of few-shot examples — across many requests at a steep discount. The first call pays to process and cache it; subsequent calls that reuse the cached prefix pay a fraction of the normal input rate for those tokens (often 75–90% off). Both OpenAI and Anthropic offer it, with somewhat different mechanics and cache lifetimes. For chatbots and RAG systems that send the same instructions or context on every turn, caching can cut input costs dramatically. This calculator shows uncached list prices, so treat its input-cost figures as an upper bound — if a big share of your input is a stable cached prefix, your real input bill will be meaningfully lower.
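A rough way to layer caching onto the calculator's input figure is sketched below. It assumes cached tokens are billed at a flat discount off the normal input rate and ignores cache-write surcharges and cache expiry, both of which vary by provider. The token counts and the 90% discount are hypothetical; the $3 input rate is the Claude Sonnet 4.6 price quoted on this page.

```python
def input_cost(input_tokens, cached_tokens, in_price, cache_discount):
    """Input cost in USD for one request when part of the prompt is a cache hit.

    in_price is USD per 1M tokens; cache_discount is the fraction taken off
    the input rate for cached tokens (0.9 means 90% off). Simplification:
    cache writes and cache misses are not modeled.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    billed_tokens = fresh_tokens + cached_tokens * (1 - cache_discount)
    return billed_tokens * in_price / 1_000_000


# Hypothetical request: 3,000 input tokens, of which 2,500 are a stable prefix.
uncached = input_cost(3_000, 0, in_price=3.00, cache_discount=0.90)
cached = input_cost(3_000, 2_500, in_price=3.00, cache_discount=0.90)
print(f"uncached ${uncached:.5f}, cached ${cached:.5f}, saving {1 - cached / uncached:.0%}")
```

Only the input side shrinks; output tokens are unaffected, so the overall saving depends on how input-heavy the workload is.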
Are there bulk or volume API discounts?
Yes, in a few forms. Batch APIs (OpenAI Batch, Anthropic Message Batches) give roughly 50% off standard rates for work you can wait on — you submit a job and get results within a window (often up to 24 hours) instead of in real time, which is perfect for bulk summarization, evals, or content generation. Beyond that, large customers can negotiate committed-use or enterprise pricing with providers, and the rates can be materially better than list. Some models also have cheaper tiers for smaller context. This calculator uses standard real-time list prices, so if a meaningful share of your volume could run through a batch endpoint, your effective cost could be roughly half of what the table shows — a saving worth modeling separately.
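To model the batch saving separately, one option is to blend the two rates by the share of volume that can wait. This sketch takes the roughly 50% discount mentioned above as its default; the monthly list cost and the 60% batch share are hypothetical inputs.

```python
def blended_cost(list_cost, batch_share, batch_discount=0.5):
    """Cost after sending batch_share of the volume through a discounted batch API.

    list_cost is the all-real-time cost (any period); batch_share and
    batch_discount are fractions between 0 and 1.
    """
    if not 0 <= batch_share <= 1:
        raise ValueError("batch_share must be between 0 and 1")
    realtime_part = list_cost * (1 - batch_share)
    batched_part = list_cost * batch_share * (1 - batch_discount)
    return realtime_part + batched_part


# Hypothetical: $2,550/month at list price, 60% of requests can run as batch jobs.
print(f"${blended_cost(2_550, 0.6):,.2f} per month")
```

With every request batched, the result is half the list cost, which matches the "roughly half" upper bound described above.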
Is self-hosting an open model cheaper than an API?
Sometimes — but the break-even is higher than people expect. Open models like Llama and DeepSeek are free to license, but self-hosting means paying for GPUs (rented or owned), engineering time to deploy and maintain serving infrastructure, scaling, monitoring, and idle capacity when traffic is low. For low or spiky volume, a pay-per-token API is almost always cheaper and far less work. Self-hosting starts to win at high, steady volume where you can keep expensive GPUs busy, or when data residency, latency, or customization requirements rule out APIs. A middle path is hosted open-model inference (Groq, Together, Fireworks), which gives you Llama-class models at low per-token prices with no infrastructure — that's the Llama 3.3 70B row in this calculator. Model the API cost here first; it's the number self-hosting has to beat after all-in overhead.
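A back-of-the-envelope break-even can be sketched as fixed self-hosting cost divided by what the API would charge per request. Every number below is hypothetical (GPU rate, engineering overhead, API cost per request), and the model is deliberately crude: it ignores GPU throughput limits, so it says nothing about how many GPUs that volume would actually need.

```python
def break_even_requests_per_day(monthly_fixed_cost, api_cost_per_request, days_per_month=30):
    """Daily request volume at which API spend equals the fixed self-hosting cost."""
    if api_cost_per_request <= 0:
        raise ValueError("api_cost_per_request must be positive")
    return monthly_fixed_cost / (api_cost_per_request * days_per_month)


# All inputs are hypothetical placeholders; substitute your own quotes.
GPU_HOURLY = 2.50                    # rented GPU, USD per hour, running 24/7
ENGINEERING_MONTHLY = 3_000          # deployment, monitoring, on-call time
API_COST_PER_REQUEST = 0.001         # what a hosted API would charge per request

gpu_monthly = GPU_HOURLY * 24 * 30
fixed = gpu_monthly + ENGINEERING_MONTHLY
volume = break_even_requests_per_day(fixed, API_COST_PER_REQUEST)
print(f"fixed ${fixed:,.0f}/month -> break-even near {volume:,.0f} requests/day")
```

Below that volume the API is cheaper under these assumptions; above it, self-hosting only wins if the hardware can actually serve the load.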
When should I switch to a cheaper model?
Switch when a cheaper model passes your quality bar on your actual task — not before, and not never. The disciplined approach is to define a small evaluation set of real inputs and expected outputs, run your candidate models against it, and pick the cheapest one that meets your quality threshold. Many teams over-pay by routing everything to a flagship model when a mid-tier or small model handles 80% of requests perfectly well. A common pattern is model routing: use a cheap, fast model (Haiku 4.5, Gemini Flash-Lite, GPT-5.4 mini, DeepSeek) as the default and escalate only the hard cases to a premium model, which can cut costs by half or more with little quality loss. Use this calculator to quantify the prize — see exactly what each model would cost at your volume — then validate quality before committing.
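Both ideas, picking the cheapest model that clears a quality bar and routing with escalation, reduce to a few lines of arithmetic. In this sketch the per-request costs follow from the prices on this page for a hypothetical 500-input, 200-output request, while the evaluation scores, the 0.90 threshold, and the 20% escalation rate are invented for illustration. The routing formula assumes every request pays for the cheap model first and escalated requests pay for the premium model on top.

```python
def cheapest_passing(candidates, threshold):
    """Return the cheapest (name, cost, score) whose score meets the threshold, or None."""
    passing = [c for c in candidates if c[2] >= threshold]
    return min(passing, key=lambda c: c[1]) if passing else None


def routed_cost(cheap_cost, premium_cost, escalation_rate):
    """Average cost per request when a cheap model runs first and hard cases escalate."""
    return cheap_cost + escalation_rate * premium_cost


# (model, cost per request in USD, eval score). Scores are hypothetical.
CANDIDATES = [
    ("GPT-5.5", 0.0085, 0.96),
    ("Claude Sonnet 4.6", 0.0045, 0.93),
    ("Claude Haiku 4.5", 0.0015, 0.88),
]

choice = cheapest_passing(CANDIDATES, threshold=0.90)
print("cheapest model meeting the bar:", choice[0])

blended = routed_cost(cheap_cost=0.0015, premium_cost=0.0085, escalation_rate=0.20)
print(f"routed: ${blended:.4f} per request vs ${0.0085:.4f} all-premium")
```

The escalation rate is the number to measure in production: the saving disappears quickly if the cheap model hands off most requests.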
What are the best token optimization tips?
Cut output first, since it's the priciest: set max_tokens, ask explicitly for brevity, and request structured formats (JSON, bullets) that don't ramble. Then trim input: remove boilerplate and redundant instructions, compress or reduce few-shot examples once the model behaves, summarize long documents before sending them, and avoid resending full conversation history when a running summary will do. Use prompt caching for any large stable prefix, and cache or memoize outputs for repeated identical requests so you don't pay twice for the same answer. Finally, right-size the model — don't use a flagship for classification a small model nails. Each of these is a direct multiplier on the numbers in this calculator: halve your output tokens and the output-cost column halves with it.
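Memoizing repeated identical requests can be as simple as wrapping the call in a cache. The sketch below uses Python's standard `functools.lru_cache`; `call_model` is a hypothetical stand-in for a real billed API call, and the counter exists only to show how many calls would have been paid for. This only makes sense when reusing an earlier answer verbatim is acceptable for your use case.

```python
from functools import lru_cache

paid_calls = 0  # counts how many times the (stand-in) billed call actually runs


def call_model(prompt):
    """Hypothetical stand-in for a real, billed API call."""
    global paid_calls
    paid_calls += 1
    return f"answer to: {prompt}"


@lru_cache(maxsize=1024)
def cached_answer(prompt):
    """Return a stored answer for an identical prompt instead of paying again."""
    return call_model(prompt)


for prompt in ["refund policy?", "refund policy?", "shipping time?", "refund policy?"]:
    cached_answer(prompt)

print(f"4 requests served, {paid_calls} paid calls")
```

An in-process cache like this is lost on restart and not shared across servers; a production system would more likely key a shared store on a hash of the full request.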
What are the hidden costs of AI APIs?
The per-token price is only part of the bill. Watch for: retries and failures (timeouts and rate-limit retries that still consume tokens), conversation history (resending the whole thread each turn multiplies input tokens as a chat grows), system prompts and RAG context (large fixed prefixes added to every call), agentic loops (multi-step agents that make many model calls per user action), and embeddings and other auxiliary models you call alongside the main one. There's also engineering and monitoring time, and the cost of evals and experimentation during development. None of these show up in a naive per-request estimate, so real bills often run higher than a first calculation suggests. Use this calculator for the core per-request cost, then add a buffer for history, retries, and multi-call workflows when you budget.
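The conversation-history effect is easy to underestimate, so here is a small illustration. It assumes the full thread (system prompt plus all earlier user and assistant messages) is resent on every turn, with fixed message sizes; the token counts are hypothetical.

```python
def chat_input_tokens(turns, system_tokens, user_tokens, assistant_tokens):
    """Total input tokens billed over a chat that resends the full history each turn."""
    total = 0
    history = 0  # tokens from earlier user + assistant messages
    for _ in range(turns):
        total += system_tokens + history + user_tokens
        history += user_tokens + assistant_tokens
    return total


def naive_input_tokens(turns, system_tokens, user_tokens):
    """What a per-request estimate assumes: no history, same input every turn."""
    return turns * (system_tokens + user_tokens)


# Hypothetical chat: 500-token system prompt, 100-token questions, 200-token answers.
actual = chat_input_tokens(10, 500, 100, 200)
naive = naive_input_tokens(10, 500, 100)
print(f"naive {naive:,} vs actual {actual:,} input tokens ({actual / naive:.2f}x)")
```

The gap grows with the square of the turn count, which is why long chats are a common source of bills that exceed the first estimate.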