Running models locally removes per-token API bills and can keep sensitive text off third-party networks. It also makes you responsible for GPU drivers, quantization accuracy, numerical drift between chip vendors, and crash reports from devices you have never held. This is the same trade space as centralized vs. edge—choose hybrid routing deliberately, not by accident.
Tiering and fallbacks
Document which intents stay on-device, which call cloud, and what users see when a tier is unavailable. That clarity is part of your truth contract narrative: users should not infer a capability because the model sounds confident on-device.
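One way to make that documentation executable is a routing table keyed by intent, with the user-facing unavailability message stored next to the tier. The intent names and messages below are hypothetical, just a minimal sketch of the idea:

```python
from dataclasses import dataclass
from enum import Enum

class Tier(Enum):
    ON_DEVICE = "on_device"
    CLOUD = "cloud"

@dataclass(frozen=True)
class Route:
    tier: Tier
    fallback_message: str  # exactly what the user sees when the tier is down

# Illustrative intent routing table -- not a real product's intents.
ROUTES = {
    "summarize_note": Route(Tier.ON_DEVICE, "Local summarization is restarting."),
    "web_research":   Route(Tier.CLOUD, "Connect to the internet to research this."),
}

def route(intent: str, device_ok: bool, network_ok: bool) -> str:
    r = ROUTES[intent]
    available = device_ok if r.tier is Tier.ON_DEVICE else network_ok
    return r.tier.value if available else f"unavailable: {r.fallback_message}"
```

Because the fallback message lives in the same table as the tier assignment, a review of the routing table is also a review of what users are told.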
Updates and thermals
Edge means you ship binaries—see ownership for rollback and release coordination. Mobile teams must profile latency and battery together; a model that fits in RAM can still destroy UX if it runs hot.
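One pattern for profiling latency and heat together is a generation loop that consults the platform's thermal state between tokens. The `get_thermal_state` callable and the backoff values here are stand-ins for a real platform API (iOS and Android both expose coarse thermal levels), not a definitive implementation:

```python
import time

def generate_with_thermal_guard(generate_token, get_thermal_state, max_tokens=256):
    """Stream tokens, backing off as the OS reports thermal pressure.
    get_thermal_state is assumed to return one of:
    "nominal", "fair", "serious", "critical" (illustrative levels)."""
    backoff_s = {"nominal": 0.0, "fair": 0.01, "serious": 0.05}
    tokens = []
    for _ in range(max_tokens):
        state = get_thermal_state()
        if state == "critical":
            break  # stop generating; hand off to a cloud tier or cached answer
        time.sleep(backoff_s.get(state, 0.0))
        tok = generate_token()
        if tok is None:
            break
        tokens.append(tok)
    return tokens
```

The point is that thermal handling is part of the inference loop, not an afterthought: when the device heats up, you trade latency for battery and skin temperature deliberately.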
Observability is harder: use privacy-preserving crash metadata and on-device counters, linked where possible to the tracing ideas in structured logging.
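A minimal sketch of such a counter, assuming coarse latency buckets and a k-anonymity-style export threshold (both choices are illustrative): raw prompts and error strings never enter the structure, and rare cells that could identify a user are dropped before anything leaves the device.

```python
from collections import Counter

class PrivateCounters:
    """Aggregate on-device events into coarse counters; only totals are
    exported, never prompts or raw error payloads."""

    def __init__(self):
        self._counts = Counter()

    def record(self, event: str, latency_ms: float):
        bucket = "<100ms" if latency_ms < 100 else "<1s" if latency_ms < 1000 else ">=1s"
        self._counts[(event, bucket)] += 1

    def export(self, min_count: int = 5):
        # Drop rare cells: a count of 1 in an unusual bucket can fingerprint a user.
        return {k: v for k, v in self._counts.items() if v >= min_count}
```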
Quantization and quality cliffs
Moving from fp16 to int8 (or lower) saves memory until it does not: certain prompts or languages fall off a cliff first. Characterize quality per locale and per task type before you lock a quantization profile—then lock it in evals the same way you would a model version bump in the cloud.
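That characterization can be automated as a per-(locale, task) comparison against the fp16 baseline, flagging cells that fall off the cliff. The 2-point drop threshold and the score values are illustrative assumptions:

```python
def quantization_regressions(baseline: dict, quantized: dict, max_drop: float = 0.02):
    """baseline and quantized map (locale, task) -> quality score in [0, 1].
    Returns the cells where the quantized profile drops more than max_drop
    below baseline -- the 'cliff' cells to investigate before locking."""
    return sorted(
        cell
        for cell, base_score in baseline.items()
        if base_score - quantized.get(cell, 0.0) > max_drop
    )
```

Run this as a gate in the same eval pipeline that guards cloud model version bumps, so a quantization profile change cannot ship without the matrix passing.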
Network fallbacks users understand
When on-device inference fails, the product should degrade gracefully: shorter answers, cached responses, or a clear “connect to continue” path. Those behaviors belong in your truth contract so users are not misled about freshness—especially when hybrid routing is opaque in the UI.
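The degradation order above can be sketched as an explicit fallback chain, where every response is labeled with its source so the UI can disclose staleness honestly. `local_generate` and `cache_lookup` are stand-ins for real components:

```python
def answer(query, local_generate, cache_lookup):
    """Try on-device inference, then the cache, then an honest 'connect' path.
    The returned 'source' field lets the UI label where the answer came from."""
    try:
        result = local_generate(query)
        if result:
            return {"source": "on_device", "text": result}
    except RuntimeError:
        pass  # local model crashed or is still loading; fall through
    cached = cache_lookup(query)
    if cached:
        return {"source": "cache", "text": cached, "stale": True}
    return {"source": "none", "text": "Connect to the internet to continue."}
```

Tagging the source in the response object keeps the hybrid routing legible to the product layer even when it is invisible to the user.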
On-device RAG and storage budgets
Embedding corpora locally competes with photos and apps for disk. Prune aggressively, version indexes with app builds, and surface “searching your files” progress so users know why latency spiked. Pair with latency budgets that include embedding and rerank steps—not only generation.
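A latency budget that covers the whole pipeline can be enforced with a check like the following; the stage names and the 1500 ms total are illustrative assumptions, and the check refuses to pass if any stage went unmeasured:

```python
def check_latency_budget(stage_ms: dict, budget_ms: float = 1500.0):
    """stage_ms maps pipeline stage -> measured milliseconds. Embedding and
    rerank count against the budget, not only generation."""
    required = {"embed", "retrieve", "rerank", "generate"}
    missing = required - stage_ms.keys()
    if missing:
        raise ValueError(f"unmeasured stages: {sorted(missing)}")
    total = sum(stage_ms[s] for s in required)
    return {"total_ms": total, "within_budget": total <= budget_ms}
```

Requiring every stage to be present is the point: a budget that silently omits embedding time will look healthy right up until the index grows.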
Security: model files as attack surface
Signed binaries, integrity checks on load, and protection against tampered weights reduce supply-chain risk. Document how you verify model artifacts in CI—similar to how you verify golden outputs—so incident response knows whether a bad answer came from poisoned weights or user input.
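A minimal integrity check on load, assuming a JSON manifest that carries the expected SHA-256 (the manifest format is illustrative, and in production the manifest itself should be signature-verified before it is trusted):

```python
import hashlib
import json
import pathlib

def verify_model_artifact(weights_path: str, manifest_path: str) -> bool:
    """Compare the weights file's SHA-256 against the manifest before loading.
    Refuse to load on mismatch rather than serving possibly poisoned weights."""
    manifest = json.loads(pathlib.Path(manifest_path).read_text())
    digest = hashlib.sha256(pathlib.Path(weights_path).read_bytes()).hexdigest()
    return digest == manifest["sha256"]
```

Recording the verified digest alongside crash and quality telemetry is what lets incident response rule tampered weights in or out quickly.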
Accessibility and on-device UX
Screen readers and voice input change how users phrase requests; local models may see different error rates for those modalities. Test with assistive tech early—failures here are both UX and policy exposure when regulated industries require equitable access.
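Those per-modality error rates can be tracked with a simple comparison against the best-performing modality; the modality names and the 5-point gap threshold are illustrative assumptions:

```python
from collections import defaultdict

def lagging_modalities(results, max_gap: float = 0.05):
    """results: iterable of (modality, success) pairs from eval runs, e.g.
    modality in {"typed", "voice", "screen_reader"}. Returns modalities whose
    error rate exceeds the best modality's by more than max_gap."""
    totals, fails = defaultdict(int), defaultdict(int)
    for modality, ok in results:
        totals[modality] += 1
        fails[modality] += 0 if ok else 1
    rates = {m: fails[m] / totals[m] for m in totals}
    best = min(rates.values())
    return {m: r for m, r in rates.items() if r - best > max_gap}
```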
Edge inference is logistics: binaries, thermals, and device test matrices—glamorous only in slide decks, expensive in real life.