Practical AI Dev

Economics

Cost and latency as product requirements

Non-functional requirements that belong in the same backlog as UX.

Engineers optimize models after launch; product managers anchor on experience. The gap shows up as surprise invoices, churn, and emergency refactors. The fix is to write non-functional requirements alongside user stories: p95 latency ceilings, maximum cost per successful task, timeout budgets, and explicit fallback behavior when limits are exceeded—connected to telemetry you can actually graph.
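One way to keep those requirements enforceable is to encode them as data next to the feature spec. A minimal sketch, with illustrative names and numbers (the `TaskBudget` fields and the `SUMMARIZE` values are assumptions, not real product limits):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaskBudget:
    """Non-functional requirements for one AI feature, kept beside its user stories."""
    p95_latency_ms: int           # latency ceiling at the 95th percentile
    max_cost_per_task_usd: float  # cost per *successful* task, not per request
    timeout_ms: int               # hard cutoff before fallback behavior kicks in
    fallback: str                 # the documented behavior when a limit is exceeded

# Example budget for a hypothetical summarization feature.
SUMMARIZE = TaskBudget(
    p95_latency_ms=2500,
    max_cost_per_task_usd=0.04,
    timeout_ms=6000,
    fallback="smaller_model",
)

def violates_budget(observed_p95_ms: float, observed_cost_usd: float,
                    budget: TaskBudget) -> bool:
    """True when telemetry shows the feature breaching its spec."""
    return (observed_p95_ms > budget.p95_latency_ms
            or observed_cost_usd > budget.max_cost_per_task_usd)
```

Because the budget is a plain object, the same definition can drive dashboards, alerts, and release checklists instead of living in a slide deck.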

Tiered intelligence

Not every query needs the largest model. Route simple intents to smaller models or cached responses; escalate when confidence drops or complexity thresholds are crossed. Document the routing rules so support can explain outcomes, a theme in fluency vs. truth and hybrid edge designs.
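A routing policy like that can be as small as one function. A sketch under assumed thresholds (the intent set, confidence cutoff, and token limit are placeholders to tune per product):

```python
# Intents simple and stable enough to answer from cache (illustrative set).
CACHEABLE_INTENTS = {"greeting", "opening_hours"}

def route(intent: str, confidence: float, tokens_in: int) -> str:
    """Pick a model tier for one request; thresholds are examples, not recommendations."""
    if intent in CACHEABLE_INTENTS:
        return "cache"
    # High-confidence, short requests stay on the cheap tier.
    if confidence >= 0.9 and tokens_in < 1000:
        return "small_model"
    # Everything else escalates to the expensive tier.
    return "large_model"
```

Keeping the decision in one auditable function is what lets support trace why a given query got a given answer.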

Capacity and bursts

LLM traffic is bursty and context-length heavy. Model peak tokens per minute with marketing spikes in mind. Self-hosted pools need queueing and cold-start analysis—topics that surface during ownership and on-call.
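The capacity question reduces to arithmetic that is easy to get wrong by provisioning for the average instead of the spike. A back-of-envelope helper, with the spike multiplier and headroom factor as assumptions to replace with your own traffic data:

```python
def peak_tpm(baseline_rpm: float, avg_tokens_per_request: float,
             spike_multiplier: float = 5.0, headroom: float = 1.2) -> float:
    """Tokens per minute to provision for: plan for the marketing spike, not the mean.

    spike_multiplier and headroom are illustrative defaults, not measured values.
    """
    return baseline_rpm * avg_tokens_per_request * spike_multiplier * headroom
```

For example, 100 requests/minute at 800 tokens each, with a 5x spike and 20% headroom, already needs 480,000 tokens per minute of capacity.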

Cheap offline evals are still not free at scale—budget judge runs like any other compute workload.

Unit economics that product can read

Translate tokens and GPU hours into cost per successful user task—not only per request—so roadmap debates compare apples to apples. When a feature doubles context length, show the impact on margin and on device feasibility in the same slide.
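The difference between per-request and per-successful-task cost is worth making explicit, since retries and failures burn tokens without creating value. A minimal sketch with made-up numbers:

```python
def cost_per_request(total_spend_usd: float, requests: int) -> float:
    """The flattering number: spend divided by raw request volume."""
    return total_spend_usd / requests

def cost_per_successful_task(total_spend_usd: float, tasks_succeeded: int) -> float:
    """The honest number: failed and retried tasks still cost money."""
    return total_spend_usd / tasks_succeeded

# Hypothetical month: $120 of tokens, 10,000 requests, 8,000 successful tasks.
per_request = cost_per_request(120.0, 10_000)        # 0.012
per_success = cost_per_successful_task(120.0, 8_000)  # 0.015
```

When success rate drops, per-request cost stays flat while per-success cost climbs, which is exactly the signal a roadmap debate needs.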

Degradation as a designed behavior

When budgets are exceeded, define what happens: truncate context, switch to a smaller model, queue requests, or show a transparent wait state. Undocumented degradation trains users to distrust the assistant—another angle on fluency vs. honesty. Document choices alongside on-call playbooks so incidents have predictable mitigations.
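The mapping from breach to behavior can itself be a small, reviewable table. A sketch, assuming one documented action per kind of limit (the action names are placeholders for whatever the playbook specifies):

```python
def on_budget_exceeded(kind: str) -> str:
    """Map each limit breach to one documented, user-visible behavior.

    The keys and actions here are illustrative; the point is that the mapping
    is explicit and lives in code review, not in an incident channel.
    """
    actions = {
        "latency":  "switch_to_smaller_model",
        "cost":     "truncate_context",
        "capacity": "queue_with_wait_state",
    }
    # Anything unanticipated degrades transparently rather than silently.
    return actions.get(kind, "show_transparent_error")
```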

Caching and staleness risk

Response caching cuts cost dramatically until answers go stale after a corpus update. Key caches by retrieval fingerprint + model version, and define TTLs per intent—financial data may need minutes; internal FAQs may tolerate hours. Surface staleness in UI when user-facing freshness matters, tying back to contracts.
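The keying scheme above can be sketched in a few lines. Assumes the caller already has a retrieval fingerprint (e.g. a hash of the retrieved chunk IDs) and a model version string; the per-intent TTL values are illustrative:

```python
import hashlib

# Illustrative TTLs per intent: fresh data expires fast, stable FAQs linger.
TTL_SECONDS = {
    "financial_quote": 60,        # minutes-level freshness
    "internal_faq": 6 * 3600,     # hours are tolerable
}

def cache_key(retrieval_fingerprint: str, model_version: str, prompt: str) -> str:
    """Corpus updates change the fingerprint and model rollouts change the
    version, so both invalidate cached answers naturally."""
    raw = f"{retrieval_fingerprint}:{model_version}:{prompt}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

A cache keyed this way never serves an answer generated against a stale corpus or a retired model, at the cost of a lower hit rate right after each update.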

Batch vs. interactive workloads

Overnight summarization jobs can soak spare capacity; chat must stay within tight SLAs. Split pools or quotas so marketing batch jobs cannot starve interactive users—metrics should separate the two paths clearly.
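One simple way to enforce that split is an admission rule that caps batch work at a fixed slice of the pool. A sketch under assumed numbers (the 80% interactive reservation is a placeholder):

```python
INTERACTIVE_SHARE = 0.8  # illustrative: share of the pool reserved for chat

def admit(batch_in_flight: int, interactive_in_flight: int,
          pool_size: int, is_batch: bool) -> bool:
    """Batch jobs may only consume the slice not reserved for interactive traffic;
    interactive requests are admitted while the pool has any room."""
    if is_batch:
        return batch_in_flight < pool_size * (1 - INTERACTIVE_SHARE)
    return interactive_in_flight + batch_in_flight < pool_size
```

With separate counters per path, the same two numbers feed the metrics split the section calls for.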

FinOps conversations that stick

Translate experiments into dollars per thousand active users, not only raw token spend. When a PM proposes a larger context window, show the marginal cost and the quality delta on the same slide—finance and product stay aligned on trade-offs instead of debating abstract “AI costs.”
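The normalization itself is trivial, which is the point: the number finance remembers should take one line to compute. Figures below are invented for illustration:

```python
def cost_per_1k_active_users(monthly_token_spend_usd: float,
                             monthly_active_users: int) -> float:
    """Dollars per thousand active users: the unit finance can compare
    across features and across months."""
    return monthly_token_spend_usd / monthly_active_users * 1000

# Hypothetical: $12,000 of monthly token spend across 40,000 active users.
# -> $300 per thousand active users.
```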

If finance learns about LLM costs from the invoice, product has already lost the narrative—price belongs in the spec.

Related notes