The Cost of Context: How to Optimize LLM Usage Without Breaking the Bank
Large Language Models are incredible at reasoning with context: emails, docs, notes, knowledge bases. But that context is not free—every token you send (and receive) is billed. For teams and individuals experimenting with AI assistants, costs can rise quietly and then all at once.
This guide shows how to control spend without sacrificing quality. It’s the pragmatic playbook we use when building and operating AI assistants.
The Three Levers of LLM Cost
- Input tokens (prompt/context)
  - Everything you send to the model: system prompt, instructions, history, files.
- Output tokens (response)
  - The model’s reply length; often overlooked but meaningful in summaries and drafts.
- Model choice
  - Larger models often cost more per token. Smaller or distilled models are cheaper and can be just as effective with the right context strategy.
Aim to reduce tokens before you change models. Great context discipline multiplies the savings of every other change.
A Cost-Aware Prompt Stack
- Right-size the system prompt
  - Keep permanent rules compact. Move situational instructions into the user prompt.
  - Externalize long policies (e.g., privacy, tone) as short references or IDs.
- Structured inputs over storytelling
  - Use bullet lists, headings, and key-value fields. Fewer tokens, clearer intent.
- Response constraints
  - Set explicit limits: word counts, JSON schemas, or “return only keys present” (an enforcement sketch follows the example below).
Example constraint:
{
  "summary": "<=120 words",
  "tasks": [ {"title": "string", "due": "ISO8601?"} ],
  "next_action": "one_of: [email, schedule, draft]"
}
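A constraint only saves tokens if it is enforced. Below is a minimal Python sketch that trims and validates a reply against the contract above; the function name, field handling, and truncation strategy are illustrative assumptions, not a library API.

import json

MAX_SUMMARY_WORDS = 120            # mirrors "summary": "<=120 words"
ALLOWED_ACTIONS = {"email", "schedule", "draft"}

def enforce_constraints(raw_reply: str) -> dict:
    """Parse the model's JSON reply and trim or drop fields that break the contract."""
    data = json.loads(raw_reply)
    # Trim an overlong summary instead of paying for a re-prompt.
    words = data.get("summary", "").split()
    if len(words) > MAX_SUMMARY_WORDS:
        data["summary"] = " ".join(words[:MAX_SUMMARY_WORDS])
    # Keep only tasks that carry the expected keys.
    data["tasks"] = [t for t in data.get("tasks", []) if "title" in t]
    # Reject unexpected next actions rather than passing them downstream.
    if data.get("next_action") not in ALLOWED_ACTIONS:
        data["next_action"] = None
    return data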
Context Budgeting: Send Only What Matters
- Truncate conversation history
  - Keep the last N turns or compress older turns into a short state summary.
- Semantic retrieval (RAG)
  - Index knowledge and fetch only the top-k relevant chunks (e.g., k=3–6) instead of entire docs.
- Task-specific reducers
  - For emails, keep the subject plus the last thread turn; for docs, keep headings plus the relevant section.
- Window caps
  - Set a hard token ceiling per request (e.g., 8k). Drop or summarize excess before sending (see the sketch after this list).
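A minimal sketch of a window cap with history truncation, assuming a rough 4-characters-per-token estimate (swap in your provider’s tokenizer for real counts) and a compressed state summary standing in for older turns:

MAX_CONTEXT_TOKENS = 8000   # hard ceiling per request, matching the 8k example above

def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token; replace with a real tokenizer.
    return max(1, len(text) // 4)

def fit_history(turns: list[str], state_summary: str) -> list[str]:
    """Keep the newest turns that fit under the ceiling; older turns are
    represented only by the compressed state summary."""
    budget = MAX_CONTEXT_TOKENS - estimate_tokens(state_summary)
    kept = []
    for turn in reversed(turns):        # walk newest -> oldest
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    return [state_summary] + list(reversed(kept))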
Summarize to Save (Without Losing Fidelity)
- Rolling summaries
  - Every few turns, ask the model to produce a brief “state” object capturing decisions, entities, and pending tasks, then use that instead of raw history (sketched after this list).
- Hierarchical notes
  - Summarize a folder into file summaries, and files into section summaries. Retrieve at the smallest useful level.
- Loss-aware compression
  - Tag what cannot be compressed (numbers, quotes, legal text). Summarize everything else.
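A sketch of the rolling-summary loop; call_model here is a hypothetical wrapper around whichever provider SDK you use, and the state fields mirror the ones described above:

STATE_PROMPT = (
    "Summarize the conversation so far as compact JSON with keys "
    "'decisions', 'entities', and 'pending_tasks'."
)

def roll_summary(call_model, prior_state: str, recent_turns: list[str]) -> str:
    """Fold the last few turns into a short state object that replaces raw history."""
    prompt = "\n".join([STATE_PROMPT, "Previous state:", prior_state, *recent_turns])
    return call_model(prompt, max_tokens=200)   # cap the summary's own output cost

# Usage: carry `state` forward instead of the transcript.
# state = roll_summary(call_model, state, turns[-4:])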
Caching: Your Cost Multiplier
- Prompt+context cache
  - If inputs are identical or near-identical, reuse the last response. Hash normalized prompts (see the sketch below).
- Tool result cache
  - Cache expensive tool calls (search, vector queries, calendar reads) with TTLs.
- Intermediate steps
  - For multi-step agents, cache sub-results (e.g., entity extraction) and reuse across runs.
Tip: Track cache hit rate per route. A 30–60% hit rate is common and translates directly into savings.
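A minimal sketch of a prompt-and-context cache keyed on a normalized hash, with a TTL that also suits tool results; the normalization rules, the one-hour TTL, and the call_model wrapper are assumptions to adapt:

import hashlib
import time

_cache: dict[str, tuple[float, str]] = {}   # key -> (expiry_timestamp, value)

def _key(prompt: str, context: str) -> str:
    # Normalize whitespace and case so near-identical inputs collide on purpose.
    normalized = " ".join((prompt + "\n" + context).lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def cached_call(call_model, prompt: str, context: str, ttl_seconds: int = 3600) -> str:
    key = _key(prompt, context)
    hit = _cache.get(key)
    if hit and hit[0] > time.time():
        return hit[1]                       # cache hit: zero tokens spent
    result = call_model(prompt + "\n" + context)
    _cache[key] = (time.time() + ttl_seconds, result)
    return result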
Choose the Right Model for Each Step
- Decompose the workflow
  - Not every step needs the largest model. A typical split (routing sketched after this list):
    - Routing/classification: small model
    - Retrieval/query drafting: small/medium
    - Final synthesis/checks: medium/large when needed
- Calibration runs
  - Measure quality vs. cost across candidates using a fixed eval set (20–50 tasks is enough to compare).
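A sketch of step-level routing; the model names are placeholders for whatever your provider offers, and the savings come from the table, not the code:

# Hypothetical model identifiers; substitute your provider's actual model names.
MODEL_FOR_STEP = {
    "route":      "small-model",
    "retrieve":   "small-model",
    "draft":      "medium-model",
    "synthesize": "large-model",
}

def pick_model(step: str) -> str:
    """Return the cheapest model judged adequate for this step."""
    return MODEL_FOR_STEP.get(step, "medium-model")   # safe default for unknown steps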
Output Discipline
- Short, useful answers
  - Ask for “bullets, not paragraphs”, or a strict JSON schema. Reward brevity.
- Stop sequences
  - Use stop sequences to cut rambling (see the sketch after this list).
- Two-pass draft → prune
  - Generate a long draft once, then prune it to under 30% of its length for the final user-facing output.
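A sketch of output discipline at the call site; call_model is again a hypothetical wrapper, and max_tokens and stop stand in for whatever equivalent parameters your provider’s SDK exposes:

def short_answer(call_model, question: str, context: str) -> str:
    """Request a bounded answer and cut it off if the model keeps going."""
    prompt = (
        "Answer in at most 5 bullets. No preamble, no closing remarks.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nBullets:"
    )
    return call_model(
        prompt,
        max_tokens=200,                       # hard cap on output spend
        stop=["\n\n\n", "In conclusion"],     # stop sequences to cut rambling
    )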
Guardrails Against Runaway Costs
- Max token quotas per user/team/day; reject or degrade gracefully.
- Backpressure when the queue grows; defer non-urgent tasks.
- Timeouts & retries tuned to your provider; avoid infinite loops in agents.
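A sketch of a per-user daily token quota, held in memory purely for illustration; a real deployment would back this with whatever rate-limit store you already run:

from collections import defaultdict
from datetime import date

DAILY_TOKEN_QUOTA = 200_000   # example ceiling; tune per user or team
_usage: dict[tuple[str, date], int] = defaultdict(int)

def charge(user_id: str, tokens: int) -> bool:
    """Record usage and report whether the request may proceed."""
    key = (user_id, date.today())
    if _usage[key] + tokens > DAILY_TOKEN_QUOTA:
        return False          # reject, or degrade to a cheaper path
    _usage[key] += tokens
    return True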
A Simple Cost Calculator (Back-of-the-Envelope)
Let:
- p = input tokens per request
- r = output tokens per request
- c_in, c_out = $/1K tokens for your model
- n = requests per day
Daily cost ≈ n × [(p/1000) × c_in + (r/1000) × c_out]
Cut p and r by 30% each and token spend drops about 30%; layer in a 30–60% cache hit rate and total spend often falls by roughly half.
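The same formula as a small function, with a cache hit rate folded in (cached requests are assumed to cost nothing here, which ignores storage and lookup overhead); the prices in the example are made up:

def daily_cost(p: int, r: int, c_in: float, c_out: float,
               n: int, cache_hit_rate: float = 0.0) -> float:
    """Estimated $/day. p and r are tokens per request; c_in and c_out are $ per 1K tokens."""
    per_request = (p / 1000) * c_in + (r / 1000) * c_out
    return n * per_request * (1 - cache_hit_rate)

# Example: 3K input tokens, 500 output tokens, $0.50/$1.50 per 1K tokens,
# 2,000 requests/day, 40% cache hit rate:
# daily_cost(3000, 500, 0.50, 1.50, 2000, 0.4)  ->  2700.0 ($/day)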
Real-World Playbook (Order of Operations)
1. Baseline your metrics: average p, r, model mix, n, cache hit rate.
2. Add response constraints and JSON outputs.
3. Introduce retrieval (k=4) and drop conversation history beyond the last 2 turns.
4. Add rolling summaries; cap the window at 8k.
5. Implement prompt and tool-result caching.
6. Split the workflow across small/medium/large models.
7. Tune k, ceilings, and stop sequences; iterate.
What This Looks Like in an AI Assistant
- Email triage: keep subject + last turn; cache templates; generate 120-word drafts max.
- Meeting notes: retrieve agenda + action items schema; summarize recording → 5 bullets.
- Document Q&A: retrieve 4 chunks (512–1024 tokens each); answer in 150 words + citations.
Key Takeaways
- Context is a superpower—and a line item. Treat it like a budget.
- Retrieval and summaries beat giant prompts.
- Caching and right-sized models multiply savings.
- Guardrails prevent surprise bills.
If you want a cost-aware, reliable personal AI assistant, try Astro AI and download the iOS app: Download on iOS.
Want more articles like this? Explore the rest of our posts on the Astro AI Blog.