Cost-Engineering an LLM App: From $400/day to $60/day
Same product. Same quality bar. 85% cheaper. Here are the five changes that did it.
The starting point
Customer support assistant. ~12,000 user messages a day. Built on the most capable model the team had access to, with no caching, no model cascade, and naive context management. Daily spend: roughly $400 on the model API.
Change 1 — small-model-first cascade
About 60% of incoming messages are simple intent classifications: "is this a refund request, a status check, or something else?" That's a small-model task. All incoming messages were routed to a fine-tuned 1B-3B model first; the frontier model handled only the ~40% that needed reasoning.
Impact: ~45% drop in frontier-model calls.
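A minimal sketch of the routing logic. The intent labels, the 0.9 confidence threshold, and the three callables are placeholders for whatever classifier and answer paths you actually run, not the team's code:

```python
from typing import Callable

SIMPLE_INTENTS = {"refund_request", "status_check"}
CONFIDENCE_THRESHOLD = 0.9  # placeholder; tune against your eval set

def route(
    message: str,
    history: list[dict],
    classify: Callable[[str], tuple[str, float]],       # small fine-tuned model
    answer_simple: Callable[[str, str], str],            # templated / tool-backed path
    answer_frontier: Callable[[str, list[dict]], str],   # frontier model call
) -> str:
    """Send each message down the cheapest path that can handle it."""
    intent, confidence = classify(message)
    if intent in SIMPLE_INTENTS and confidence >= CONFIDENCE_THRESHOLD:
        # Most simple-intent traffic resolves here without touching the frontier model.
        return answer_simple(intent, message)
    # Low confidence or a complex intent: escalate.
    return answer_frontier(message, history)
```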
Change 2 — prompt caching
The system prompt was 1,800 tokens. It was identical across calls. Anthropic prompt caching applied to the system prompt cuts repeat-input cost by ~85% on the cached portion.
Impact: ~30% drop in remaining frontier-model spend.
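With the Anthropic SDK, the change amounts to marking the system prompt block as cacheable. A sketch, with the model name and user message handling as placeholders:

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM_PROMPT = "..."  # the ~1,800-token system prompt, identical across calls

def reply(user_message: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder for whichever frontier model you use
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Cache this block; later calls read it at a discounted rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": user_message}],
    )
    return response.content[0].text
```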
Change 3 — context trimming
The team was sending the last 20 messages of conversation history every turn. For most turns, the last 4-6 messages were enough. Built a relevance filter that picks the top-N most relevant prior turns by embedding similarity.
Impact: average input token count down ~55%, no measured quality drop.
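A sketch of that filter, assuming you already have an embedding function. The `embed` callable and the default top_n of 6 are illustrative:

```python
from typing import Callable

import numpy as np

def select_relevant_turns(
    current_message: str,
    history: list[str],
    embed: Callable[[list[str]], np.ndarray],  # returns one row vector per input string
    top_n: int = 6,
) -> list[str]:
    """Keep only the top_n prior turns most similar to the current message."""
    if len(history) <= top_n:
        return history
    vectors = embed([current_message] + history)
    query, turns = vectors[0], vectors[1:]
    # Cosine similarity between the current message and each prior turn.
    sims = turns @ query / (np.linalg.norm(turns, axis=1) * np.linalg.norm(query) + 1e-9)
    keep = np.argsort(sims)[-top_n:]
    # Send the kept turns in their original conversational order.
    return [history[i] for i in sorted(keep)]
```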
Change 4 — batch where possible
Async, non-realtime workloads (overnight quality scoring of conversations, embedding refresh) were moved to batch APIs at half the per-token rate.
Impact: small but consistent — ~$15/day.
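For the overnight scoring job, that means submitting the day's conversations through a batch endpoint (here the Anthropic Message Batches API) instead of looping over synchronous calls. `SCORING_PROMPT`, the example conversation data, and the model name are placeholders:

```python
import anthropic

client = anthropic.Anthropic()

SCORING_PROMPT = "Score this support conversation from 1-5 for resolution quality:\n\n"
yesterdays_conversations = [  # placeholder data; load from your store
    {"id": "c1", "transcript": "user: where is my order?\nagent: ..."},
]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"quality-score-{conv['id']}",
            "params": {
                "model": "claude-sonnet-4-20250514",  # placeholder model name
                "max_tokens": 256,
                "messages": [
                    {"role": "user", "content": SCORING_PROMPT + conv["transcript"]}
                ],
            },
        }
        for conv in yesterdays_conversations
    ]
)
# The batch processes asynchronously; poll for completion and fetch results later.
status = client.messages.batches.retrieve(batch.id)
```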
Change 5 — eval gates on every change
This is the meta-change. Every cost optimisation above was validated against the same eval set. We blocked any change that dropped quality by >2% on the eval. Two changes got rejected this way and reworked.
Impact: nothing shipped that hurt the product.
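The gate itself is small. The shape of the eval set (one grading function per case) and reading the 2% threshold as an absolute drop in pass rate are assumptions about how your evals are structured:

```python
def run_eval(answer_fn, eval_set) -> float:
    """Fraction of eval cases a pipeline passes, using each case's grading function."""
    passed = sum(1 for case in eval_set if case["grade"](answer_fn(case["input"])))
    return passed / len(eval_set)

def may_ship(baseline: float, candidate: float, max_drop: float = 0.02) -> bool:
    """Gate: allow the change only if quality dropped by at most max_drop."""
    return baseline - candidate <= max_drop

# Before shipping any cost change, run both pipelines on the same eval set:
#   baseline = run_eval(current_pipeline, eval_set)
#   candidate = run_eval(cheaper_pipeline, eval_set)
#   assert may_ship(baseline, candidate), "rejected: quality dropped by more than 2%"
```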
Where the money landed
| Stage | Daily spend |
|---|---|
| Starting | ~$400 |
| + small-model cascade | ~$220 |
| + prompt caching | ~$140 |
| + context trimming | ~$80 |
| + batch where possible | ~$60 |
Numbers are illustrative — your model choice, traffic shape and prompt size will land you somewhere different on each line — but the ordering of impact has been consistent across three projects I've done this on.
What didn't help
- Switching to a "cheaper" frontier model. Quality dropped on edge cases enough that we needed retries, which ate the savings.
- Aggressive context compression. Past a point, summarisation lost critical detail. Trim by relevance, don't compress to summary.
- Self-hosting the small-model layer too early. Below ~500k requests/month, a hosted API works out cheaper once you account for ops cost.