The Cost of Context: How to Optimize LLM Usage Without Breaking the Bank
Large Language Models are incredible at reasoning with context: emails, docs, notes, knowledge bases. But that context is not free—every token you send (and receive) is billed. For teams and individuals experimenting with AI assistants, costs can rise quietly and then all at once.
This guide shows how to control spend without sacrificing quality. It’s the pragmatic playbook we use when building and operating AI assistants.
The Three Levers of LLM Cost
- Input tokens (prompt/context)
  - Everything you send to the model: system prompt, instructions, history, files.
- Output tokens (response)
  - The model’s reply length; often overlooked but meaningful in summaries and drafts.
- Model choice
  - Larger models often cost more per token. Smaller or distilled models are cheaper and can be just as effective with the right context strategy.
Aim to reduce tokens before you change models. Great context discipline multiplies the savings of every other change.
A Cost-Aware Prompt Stack
- Right-size the system prompt
  - Keep permanent rules compact. Move situational instructions into the user prompt.
  - Externalize long policies (e.g., privacy, tone) as short references or IDs.
- Structured inputs over storytelling
  - Use bullet lists, headings, and key-value fields. Fewer tokens, clearer intent.
- Response constraints
  - Set explicit limits: word counts, JSON schemas, or “return only keys present” (an enforcement sketch follows the example below).
Example constraint:
{
  "summary": "<=120 words",
  "tasks": [ {"title": "string", "due": "ISO8601?"} ],
  "next_action": "one_of: [email, schedule, draft]"
}
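A constraint only saves tokens if it is enforced. Below is a minimal Python sketch that trims and validates a reply against the contract above; the function name, field handling, and truncation strategy are illustrative assumptions, not a library API.

import json

MAX_SUMMARY_WORDS = 120            # mirrors "summary": "<=120 words"
ALLOWED_ACTIONS = {"email", "schedule", "draft"}

def enforce_constraints(raw_reply: str) -> dict:
    """Parse the model's JSON reply and trim or drop fields that break the contract."""
    data = json.loads(raw_reply)
    # Trim an overlong summary instead of paying for a re-prompt.
    words = data.get("summary", "").split()
    if len(words) > MAX_SUMMARY_WORDS:
        data["summary"] = " ".join(words[:MAX_SUMMARY_WORDS])
    # Keep only tasks that carry the expected keys.
    data["tasks"] = [t for t in data.get("tasks", []) if "title" in t]
    # Reject unexpected next actions rather than passing them downstream.
    if data.get("next_action") not in ALLOWED_ACTIONS:
        data["next_action"] = None
    return data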
Context Budgeting: Send Only What Matters
- Truncate conversation history
  - Keep the last N turns or compress older turns into a short state summary.
- Semantic retrieval (RAG)
  - Index knowledge and fetch only the top-k relevant chunks (e.g., k=3–6) instead of entire docs.
- Task-specific reducers
  - For emails, keep the subject plus the last thread turn; for docs, keep headings plus the relevant section.
- Window caps
  - Set a hard token ceiling per request (e.g., 8k). Drop or summarize excess before sending (see the sketch after this list).
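A minimal sketch of a window cap with history truncation, assuming a rough 4-characters-per-token estimate (swap in your provider’s tokenizer for real counts) and a compressed state summary standing in for older turns:

MAX_CONTEXT_TOKENS = 8000   # hard ceiling per request, matching the 8k example above

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token; replace with a real tokenizer.
    return max(1, len(text) // 4)

def fit_history(turns: list[str], state_summary: str) -> list[str]:
    """Keep the newest turns that fit under the ceiling; older turns are
    represented only by the compressed state summary."""
    budget = MAX_CONTEXT_TOKENS - estimate_tokens(state_summary)
    kept = []
    for turn in reversed(turns):        # walk newest -> oldest
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return [state_summary] + list(reversed(kept))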
Summarize to Save (Without Losing Fidelity)
- Rolling summaries
  - Every few turns, ask the model to produce a brief “state” object capturing decisions, entities, and pending tasks, then use that instead of raw history (sketched after this list).
- Hierarchical notes
  - Summarize a folder into file summaries, and files into section summaries. Retrieve at the smallest useful level.
- Loss-aware compression
  - Tag what cannot be compressed (numbers, quotes, legal text). Summarize everything else.
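A sketch of the rolling-summary loop; call_model here is a hypothetical wrapper around whichever provider SDK you use, and the state fields mirror the ones described above:

STATE_PROMPT = (
    "Summarize the conversation so far as compact JSON with keys "
    "'decisions', 'entities', and 'pending_tasks'."
)

def roll_summary(call_model, prior_state: str, recent_turns: list[str]) -> str:
    """Fold the last few turns into a short state object that replaces raw history."""
    prompt = "\n".join([STATE_PROMPT, "Previous state:", prior_state, *recent_turns])
    return call_model(prompt, max_tokens=200)   # cap the summary's own output cost

# Usage: carry `state` forward instead of the transcript.
# state = roll_summary(call_model, state, turns[-4:])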
Caching: Your Cost Multiplier
- Prompt+context cache
  - If inputs are identical or near-identical, reuse the last response. Hash normalized prompts (see the sketch below).
- Tool result cache
  - Cache expensive tool calls (search, vector queries, calendar reads) with TTLs.
- Intermediate steps
  - For multi-step agents, cache sub-results (e.g., entity extraction) and reuse across runs.
Tip: Track cache hit rate per route. A 30–60% hit rate is common and translates directly into savings.
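A minimal sketch of a prompt-and-context cache keyed on a normalized hash, with a TTL that also suits tool results; the normalization rules, the one-hour TTL, and the call_model wrapper are assumptions to adapt:

import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}   # key -> (expiry_timestamp, value)

def _key(prompt: str, context: str) -> str:
    # Normalize whitespace and case so near-identical inputs collide on purpose.
    normalized = " ".join((prompt + "\n" + context).lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def cached_call(call_model, prompt: str, context: str, ttl_seconds: int = 3600) -> str:
    key = _key(prompt, context)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                       # cache hit: zero tokens spent
    result = call_model(prompt + "\n" + context)
    _cache[key] = (time.time() + ttl_seconds, result)
    return result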
Choose the Right Model for Each Step
- Decompose the workflow
  - Not every step needs the largest model. A typical split (routing sketched after this list):
    - Routing/classification: small model
    - Retrieval/query drafting: small/medium
    - Final synthesis/checks: medium/large when needed
- Calibration runs
  - Measure quality vs. cost across candidates using a fixed eval set (20–50 tasks is enough to compare).
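A sketch of step-level routing; the model names are placeholders for whatever your provider offers, and the savings come from the table, not the code:

# Hypothetical model identifiers; substitute your provider's actual model names.
MODEL_FOR_STEP = {
    "route":      "small-model",
    "retrieve":   "small-model",
    "draft":      "medium-model",
    "synthesize": "large-model",
}

def pick_model(step: str) -> str:
    """Return the cheapest model judged adequate for this step."""
    return MODEL_FOR_STEP.get(step, "medium-model")   # safe default for unknown steps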
Output Discipline
- Short, useful answers
  - Ask for “bullets, not paragraphs”, or a strict JSON schema. Reward brevity.
- Stop sequences
  - Use stop sequences to cut rambling (see the sketch after this list).
- Two-pass draft → prune
  - Generate a long draft once, then prune it to under 30% of its length for the final user-facing output.
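A sketch of output discipline at the call site; call_model is again a hypothetical wrapper, and max_tokens and stop stand in for whatever equivalent parameters your provider’s SDK exposes:

def short_answer(call_model, question: str, context: str) -> str:
    """Request a bounded answer and cut it off if the model keeps going."""
    prompt = (
        "Answer in at most 5 bullets. No preamble, no closing remarks.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nBullets:"
    )
    return call_model(
        prompt,
        max_tokens=200,                       # hard cap on output spend
        stop=["\n\n\n", "In conclusion"],     # stop sequences to cut rambling
    )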
Guardrails Against Runaway Costs
- Max token quotas per user/team/day; reject or degrade gracefully.
- Backpressure when the queue grows; defer non-urgent tasks.
- Timeouts & retries tuned to your provider; avoid infinite loops in agents.
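A sketch of a per-user daily token quota, held in memory purely for illustration; a real deployment would back this with whatever rate-limit store you already run:

from collections import defaultdict
from datetime import date

DAILY_TOKEN_QUOTA = 200_000   # example ceiling; tune per user or team
_usage: dict[tuple[str, date], int] = defaultdict(int)

def charge(user_id: str, tokens: int) -> bool:
    """Record usage and report whether the request may proceed."""
    key = (user_id, date.today())
    if _usage[key] + tokens > DAILY_TOKEN_QUOTA:
        return False          # reject, or degrade to a cheaper path
    _usage[key] += tokens
    return True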
A Simple Cost Calculator (Back-of-the-Envelope)
Let:
- p = input tokens per request
- r = output tokens per request
- c_in, c_out = $/1K tokens for your model
- n = requests per day
Daily cost ≈ n × [(p/1000) × c_in + (r/1000) × c_out]
Cut p and r by 30% each and token spend drops about 30%; layer in a 30–60% cache hit rate and total spend often falls by roughly half.
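The same formula as a small function, with a cache hit rate folded in (cached requests are assumed to cost nothing here, which ignores storage and lookup overhead); the prices in the example are made up:

def daily_cost(p: int, r: int, c_in: float, c_out: float,
               n: int, cache_hit_rate: float = 0.0) -> float:
    """Estimated $/day. p and r are tokens per request; c_in and c_out are $ per 1K tokens."""
    per_request = (p / 1000) * c_in + (r / 1000) * c_out
    return n * per_request * (1 - cache_hit_rate)

# Example: 3K input tokens, 500 output tokens, $0.50/$1.50 per 1K tokens,
# 2,000 requests/day, 40% cache hit rate:
# daily_cost(3000, 500, 0.50, 1.50, 2000, 0.4)  ->  2700.0 ($/day)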
Real-World Playbook (Order of Operations)
1. Baseline your metrics: average p, r, model mix, n, cache hit rate.
2. Add response constraints and JSON outputs.
3. Introduce retrieval (k=4) and drop conversation history beyond the last 2 turns.
4. Add rolling summaries; cap the window at 8k.
5. Implement prompt and tool-result caching.
6. Split the workflow across small/medium/large models.
7. Tune k, ceilings, and stop sequences; iterate.
What This Looks Like in an AI Assistant
- Email triage: keep subject + last turn; cache templates; generate 120-word drafts max.
- Meeting notes: retrieve agenda + action items schema; summarize recording → 5 bullets.
- Document Q&A: retrieve 4 chunks (512–1024 tokens each); answer in 150 words + citations.
Key Takeaways
- Context is a superpower—and a line item. Treat it like a budget.
- Retrieval and summaries beat giant prompts.
- Caching and right-sized models multiply savings.
- Guardrails prevent surprise bills.
If you want a cost-aware, reliable personal AI assistant, try Astro AI and download the iOS app: Download on iOS.
Want more articles like this? Explore the rest of our posts on the Astro AI Blog.