LLM API Cost Calculator
Last updated: April 2026 — Prices change frequently. Verify at provider pricing pages.
Assumes 1,000 input + 500 output tokens per request and 100 requests per day (30-day month).
| Model | Input / 1M tokens | Output / 1M tokens | Per Request | Daily | Monthly |
|---|---|---|---|---|---|
| Gemini 1.5 Flash | $0.075 | $0.300 | $0.000225 | $0.0225 | $0.68 |
| GPT-4o mini | $0.150 | $0.600 | $0.000450 | $0.0450 | $1.35 |
| Llama 3.3 70B (Groq) | $0.590 | $0.790 | $0.000985 | $0.0985 | $2.96 |
| Claude 3.5 Haiku | $0.800 | $4.000 | $0.002800 | $0.2800 | $8.40 |
| Gemini 1.5 Pro | $1.250 | $5.000 | $0.003750 | $0.3750 | $11.25 |
| Mistral Large | $2.000 | $6.000 | $0.005000 | $0.5000 | $15.00 |
| GPT-4o | $2.500 | $10.000 | $0.007500 | $0.7500 | $22.50 |
| o1-mini | $3.000 | $12.000 | $0.009000 | $0.9000 | $27.00 |
| Claude 3.5 Sonnet | $3.000 | $15.000 | $0.010500 | $1.0500 | $31.50 |
| GPT-4 Turbo | $10.000 | $30.000 | $0.025000 | $2.5000 | $75.00 |
| o1 | $15.000 | $60.000 | $0.045000 | $4.5000 | $135.00 |
| Claude 3 Opus | $15.000 | $75.000 | $0.052500 | $5.2500 | $157.50 |
How LLM API Pricing Works
Large language model APIs are priced per token — the fundamental unit of text that models process. Every provider splits billing into two categories: input tokens (what you send) and output tokens (what the model generates). Prices are quoted in dollars per million tokens, so a typical request costs fractions of a cent even on expensive models.
As of April 2026, input pricing spans more than two orders of magnitude — from $0.075 per million tokens (Gemini 1.5 Flash) to $15.00 per million (Claude 3 Opus, OpenAI o1), a 200x range. The spread reflects real differences in capability, context window size, and reasoning depth. Choosing the right model for your workload is the single most impactful cost optimization.
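The billing model above reduces to a one-line formula. A minimal Python sketch (the function name is illustrative; prices come from the table):

```python
def request_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """USD cost of one request, given prices quoted per million tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# GPT-4o at $2.50 input / $10.00 output, for 1,000 input + 500 output tokens:
cost = request_cost(1_000, 500, 2.50, 10.00)
print(f"${cost:.6f}")  # $0.007500 — matches the Per Request column above
```

With 1,000 input and 500 output tokens, this function reproduces every Per Request figure in the table.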
Input Tokens vs. Output Tokens
The input/output distinction is the most important concept in LLM cost modeling:
| Token Type | What it includes | Relative cost |
|---|---|---|
| Input | System prompt, conversation history, user message, retrieved documents (RAG) | 1x (baseline) |
| Output | Model's response text, tool call arguments, structured output JSON | 3–5x input price |
For most chat applications, outputs are 20–40% of total tokens but 60–80% of total cost because of this price differential. Applications with long outputs (code generation, document drafting) are dramatically more expensive than classification or extraction tasks where outputs are short.
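A quick sketch of that skew, assuming a representative chat turn of 1,500 input and 500 output tokens at Claude 3.5 Sonnet's listed prices:

```python
# Token counts here are illustrative; prices are Claude 3.5 Sonnet's
# $3.00 input / $15.00 output per million tokens from the table above.
input_tok, output_tok = 1_500, 500
in_cost = input_tok * 3.00 / 1e6      # $0.0045
out_cost = output_tok * 15.00 / 1e6   # $0.0075
token_share = output_tok / (input_tok + output_tok)  # 0.25  — 25% of tokens
cost_share = out_cost / (in_cost + out_cost)         # 0.625 — 62.5% of cost
print(f"outputs are {token_share:.0%} of tokens but {cost_share:.1%} of cost")
```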
Context Window and Its Cost Implications
Every LLM has a context window — the maximum number of tokens it can process in a single request (input + output combined). Larger context windows cost more per request when filled: sending a 100,000-token document to Claude 3 Opus costs $1.50 in input tokens alone (100,000 × $15 / 1,000,000). Context management is therefore a core engineering concern in production applications.
Strategies to reduce context costs: summarize conversation history instead of replaying it verbatim, use RAG (Retrieval-Augmented Generation) to send only the relevant document chunks, and leverage prompt caching for repeated prefixes. A well-engineered RAG pipeline can reduce per-request token counts by 70–90% compared to naive full-document approaches.
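To make the RAG saving concrete, here is a sketch comparing a naive 100,000-token document upload against an assumed 8,000-token retrieved-chunk budget, at Claude 3 Opus input pricing:

```python
OPUS_INPUT_PER_M = 15.00  # USD per million input tokens (from the table)

def input_cost(tokens):
    return tokens * OPUS_INPUT_PER_M / 1_000_000

full_doc = input_cost(100_000)  # naive: replay the whole document each request
rag = input_cost(8_000)         # RAG: only the relevant chunks (assumed budget)
print(f"full document: ${full_doc:.2f}  RAG: ${rag:.2f}  "
      f"saving: {1 - rag / full_doc:.0%}")
# full document: $1.50  RAG: $0.12  saving: 92%
```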
Prompt Caching — The Hidden Cost Saver
Prompt caching is a mechanism where the API stores a processed representation of a repeated prompt prefix. If you always start requests with the same 10,000-token system prompt, you can cache it and pay only a fraction (typically 10% of normal input price) on subsequent requests that hit the cache.
Anthropic charges $3.75 per million tokens to write to cache and $0.30 per million for cache reads (vs $3.00 for standard input). OpenAI offers automatic prompt caching with similar economics. For applications with large system prompts — customer support bots with product documentation, coding assistants with codebase context — caching can cut input costs by 80–90%.
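The cache economics can be sketched with the Anthropic rates quoted above, assuming a simple pattern of one cache write followed by reads on every later request (real hit rates and cache lifetimes vary):

```python
PREFIX_TOKENS = 10_000                     # repeated system-prompt prefix
STANDARD, WRITE, READ = 3.00, 3.75, 0.30   # USD per million tokens (Anthropic)

def uncached(n_requests):
    return n_requests * PREFIX_TOKENS * STANDARD / 1e6

def cached(n_requests):
    # One cache write on the first request, cache reads on the rest.
    return (PREFIX_TOKENS * WRITE
            + (n_requests - 1) * PREFIX_TOKENS * READ) / 1e6

for n in (1, 2, 100):
    print(f"{n:>3} requests  uncached ${uncached(n):.4f}  cached ${cached(n):.4f}")
# Caching is a slight loss on a single request ($0.0375 vs $0.0300) but wins
# from the second request on; at 100 requests it is $0.3345 vs $3.0000 (~89% off).
```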
Cost Optimization Strategies
The most effective cost levers in LLM applications, roughly in order of impact:
| Strategy | Potential Saving | Complexity |
|---|---|---|
| Right-size the model (use Flash/mini for simple tasks) | 50–95% | Low |
| Prompt caching for repeated prefixes | 40–80% | Medium |
| RAG instead of full-document context | 50–90% | High |
| Shorten system prompts | 10–30% | Low |
| Batch API (async, lower priority) | 50% | Low |
| Response streaming with early termination | 5–20% | Medium |
The most common mistake is defaulting to a frontier model for all tasks. A simple classification or intent-detection task that costs $0.002 per request on GPT-4o costs about $0.00012 on GPT-4o mini — roughly 17x cheaper, with nearly identical accuracy for that task class. Always evaluate which model tier a given task actually requires.
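The arithmetic behind right-sizing, using assumed token counts for a short classification request (600 input, 20 output) and the table's GPT-4o and GPT-4o mini prices:

```python
def cost(inp_tokens, out_tokens, price_in, price_out):
    """Per-request USD cost from per-million-token prices."""
    return (inp_tokens * price_in + out_tokens * price_out) / 1e6

gpt4o = cost(600, 20, 2.50, 10.00)   # ≈ $0.001700 per request
mini  = cost(600, 20, 0.15, 0.60)    # ≈ $0.000102 per request
print(f"GPT-4o ${gpt4o:.6f}  mini ${mini:.6f}  ratio {gpt4o / mini:.1f}x")
```

Because mini's input and output prices are both about 16.7x lower, the ratio holds for any input/output mix.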
Model Tiers and When to Use Each
Providers now offer distinct capability tiers. Matching the tier to the task is the fundamental cost optimization:
| Tier | Examples | Best for |
|---|---|---|
| Economy | GPT-4o mini, Gemini Flash, Claude Haiku, Llama 70B | Classification, extraction, summarization, simple Q&A, high-volume pipelines |
| Mid-range | GPT-4o, Claude Sonnet, Gemini Pro | Chat, code assistance, analysis, most production workloads |
| Frontier | o1, Claude Opus, GPT-4 Turbo | Complex reasoning, long-horizon planning, specialized research tasks |
Frequently Asked Questions
What is the difference between input tokens and output tokens?
Input tokens are the tokens in your prompt — everything you send to the model including system prompts, conversation history, and the current user message. Output tokens are the tokens in the model's response. Output tokens are consistently more expensive than input tokens because generating text is computationally more intensive than reading it. For example, GPT-4o charges $2.50 per million input tokens but $10.00 per million output tokens — a 4x difference.
Why do output tokens cost more than input tokens?
Generating tokens requires the model to run sequentially — each token is produced one at a time, with the model attending to all previous context at every step. Input processing can be done in parallel across the sequence. This autoregressive generation process is computationally intensive, which is why providers charge 3–5x more for output tokens than input tokens.
What is prompt caching and how does it reduce costs?
Prompt caching allows providers to store a computed representation of repeated prompt prefixes (like a long system prompt or document) so they don't need to reprocess them on every request. Anthropic charges $0.30 per million tokens for cache reads (vs $3.00 for standard input), a 90% reduction. OpenAI offers similar discounts. For applications with large, repeated system prompts, caching can reduce costs by 60–80%.
How do I estimate my token usage before building?
A rough rule of thumb: 1 token ≈ 0.75 English words, or about 4 characters. A typical paragraph (75 words) is about 100 tokens. A 2,000-word article is roughly 2,700 tokens. For structured outputs like JSON, token counts run a bit higher. Most providers offer tokenizer tools — OpenAI's tiktoken library and Anthropic's token-counting tools let you measure exact token counts for your specific prompts.
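Those rules of thumb can be wrapped in a rough estimator (a heuristic only — for exact counts use the tokenizer tools mentioned above; the function name is illustrative):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate from the ~4 chars and ~0.75 words per token rules."""
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)  # average the two heuristics

paragraph = "word " * 75   # a 75-word sample paragraph
print(estimate_tokens(paragraph))  # 97 — close to the ~100-token estimate above
```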
Which LLM is most cost-effective for high-volume applications?
For high-volume, cost-sensitive applications: Gemini 1.5 Flash ($0.075 input / $0.30 output per million tokens) and GPT-4o mini ($0.15 / $0.60) are the most economical while still offering strong performance. For tasks requiring more capability, GPT-4o ($2.50 / $10.00) and Claude 3.5 Sonnet ($3.00 / $15.00) offer the best quality-to-cost ratio in the mid-range tier. Claude 3 Opus ($15.00 / $75.00) and o1 ($15.00 / $60.00) are best reserved for tasks where accuracy is paramount and cost is secondary.
Related Calculators
- YouTube Earnings Calculator — Estimate channel ad revenue by views and niche
- Meeting Cost Calculator — Calculate the real cost of meetings in salary time
- ROI Calculator — Calculate return on investment for any cost
- Break-Even Calculator — Find the point where AI costs are covered by value generated