Cost-Engineering an LLM App: From $400/day to $60/day
Same product. Same quality bar. 85% cheaper. Here are the five changes that did it.
The starting point
Customer support assistant. ~12,000 user messages a day. Built on the most capable model the team had access to, with no caching, no model cascade, and naive context management. Daily spend: roughly $400 on the model API.
Change 1 — small-model-first cascade
About 60% of incoming messages are simple intent classifications: "is this a refund request, a status check, or something else?" That's a small-model task. All incoming messages were routed to a fine-tuned 1B-3B model first; the frontier model handled only the ~40% that needed reasoning.
Impact: ~45% drop in frontier-model calls.
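A minimal sketch of the routing logic. The intent labels, the 0.9 confidence threshold, and the three callables are placeholders for whatever classifier and answer paths you actually run, not the team's code:

```python
from typing import Callable

SIMPLE_INTENTS = {"refund_request", "status_check"}
CONFIDENCE_THRESHOLD = 0.9  # placeholder; tune against your eval set

def route(
    message: str,
    history: list[dict],
    classify: Callable[[str], tuple[str, float]],       # small fine-tuned model
    answer_simple: Callable[[str, str], str],            # templated / tool-backed path
    answer_frontier: Callable[[str, list[dict]], str],   # frontier model call
) -> str:
    """Send each message down the cheapest path that can handle it."""
    intent, confidence = classify(message)
    if intent in SIMPLE_INTENTS and confidence >= CONFIDENCE_THRESHOLD:
        # Most simple-intent traffic resolves here without touching the frontier model.
        return answer_simple(intent, message)
    # Low confidence or a complex intent: escalate.
    return answer_frontier(message, history)
```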
Change 2 — prompt caching
The system prompt was 1,800 tokens. It was identical across calls. Anthropic prompt caching applied to the system prompt cuts repeat-input cost by ~85% on the cached portion.
Impact: ~30% drop in remaining frontier-model spend.
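With the Anthropic SDK, the change amounts to marking the system prompt block as cacheable. A sketch, with the model name and user message handling as placeholders:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."  # the ~1,800-token system prompt, identical across calls

def reply(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder for whichever frontier model you use
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Cache this block; later calls read it at a discounted rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```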
Change 3 — context trimming
The team was sending the last 20 messages of conversation history every turn. For most turns, the last 4-6 messages were enough. Built a relevance filter that picks the top-N most relevant prior turns by embedding similarity.
Impact: average input token count down ~55%, no measured quality drop.
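A sketch of that filter, assuming you already have an embedding function. The `embed` callable and the default top_n of 6 are illustrative:

```python
from typing import Callable

import numpy as np

def select_relevant_turns(
    current_message: str,
    history: list[str],
    embed: Callable[[list[str]], np.ndarray],  # returns one row vector per input string
    top_n: int = 6,
) -> list[str]:
    """Keep only the top_n prior turns most similar to the current message."""
    if len(history) <= top_n:
        return history
    vectors = embed([current_message] + history)
    query, turns = vectors[0], vectors[1:]
    # Cosine similarity between the current message and each prior turn.
    sims = turns @ query / (np.linalg.norm(turns, axis=1) * np.linalg.norm(query) + 1e-9)
    keep = np.argsort(sims)[-top_n:]
    # Send the kept turns in their original conversational order.
    return [history[i] for i in sorted(keep)]
```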
Change 4 — batch where possible
Async, non-realtime workloads (overnight quality scoring of conversations, embedding refresh) were moved to batch APIs at half the per-token rate.
Impact: small but consistent — ~$15/day.
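For the overnight scoring job, that means submitting the day's conversations through a batch endpoint (here the Anthropic Message Batches API) instead of looping over synchronous calls. `SCORING_PROMPT`, the example conversation data, and the model name are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

SCORING_PROMPT = "Score this support conversation from 1-5 for resolution quality:\n\n"
yesterdays_conversations = [  # placeholder data; load from your store
    {"id": "c1", "transcript": "user: where is my order?\nagent: ..."},
]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"quality-score-{conv['id']}",
            "params": {
                "model": "claude-sonnet-4-20250514",  # placeholder model name
                "max_tokens": 256,
                "messages": [
                    {"role": "user", "content": SCORING_PROMPT + conv["transcript"]}
                ],
            },
        }
        for conv in yesterdays_conversations
    ]
)
# The batch processes asynchronously; poll for completion and fetch results later.
status = client.messages.batches.retrieve(batch.id)
```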
Change 5 — eval gates on every change
This is the meta-change. Every cost optimisation above was validated against the same eval set. We blocked any change that dropped quality by >2% on the eval. Two changes got rejected this way and reworked.
Impact: nothing shipped that hurt the product.
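The gate itself is small. The shape of the eval set (one grading function per case) and reading the 2% threshold as an absolute drop in pass rate are assumptions about how your evals are structured:

```python
def run_eval(answer_fn, eval_set) -> float:
    """Fraction of eval cases a pipeline passes, using each case's grading function."""
    passed = sum(1 for case in eval_set if case["grade"](answer_fn(case["input"])))
    return passed / len(eval_set)

def may_ship(baseline: float, candidate: float, max_drop: float = 0.02) -> bool:
    """Gate: allow the change only if quality dropped by at most max_drop."""
    return baseline - candidate <= max_drop

# Before shipping any cost change, run both pipelines on the same eval set:
#   baseline = run_eval(current_pipeline, eval_set)
#   candidate = run_eval(cheaper_pipeline, eval_set)
#   assert may_ship(baseline, candidate), "rejected: quality dropped by more than 2%"
```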
Where the money landed
| Stage | Daily spend |
|---|---|
| Starting | ~$400 |
| + small-model cascade | ~$220 |
| + prompt caching | ~$140 |
| + context trimming | ~$80 |
| + batch where possible | ~$60 |
Numbers are illustrative — your model choice, traffic shape and prompt size will land you somewhere different on each line — but the ordering of impact has been consistent across three projects I've done this on.
What didn't help
- Switching to a "cheaper" frontier model. Quality dropped on edge cases enough that we needed retries, which ate the savings.
- Aggressive context compression. Past a point, summarisation lost critical detail. Trim by relevance, don't compress to summary.
- Self-hosting the small-model layer too early. Below ~500k requests/month, a hosted API works out cheaper once you account for ops cost.