Practical AI Dev

Field note

Shipping weekly without shipping blind

The velocity vs. verification tension—and how thin vertical slices beat big-bang QA.

Roadmaps reward visible features: new tools, new prompts, new models. Regression harnesses are invisible until the night a silent behavior change costs money. The squeeze is that both sides are right—shipping learns faster, and verification prevents learning the same incident twice.

Thin slices that survive rewrites

Instead of a monolithic “eval project,” attach small gates to the paths that already change often: a handful of golden tasks per critical flow, schema checks for tool JSON, and a single trace ID requirement documented alongside your logging plan. That keeps cost proportional to velocity—you pay a little tax on every merge, not a lump sum before launch.
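A schema check for tool JSON can be a few lines of stdlib Python run on every merge. This is a minimal sketch; the field names in TOOL_CALL_SCHEMA (including the trace_id requirement mentioned above) are illustrative assumptions, not a standard.

```python
import json

# Hypothetical tool-call contract: field name -> required Python type.
# The trace_id field enforces the "single trace ID requirement" per call.
TOOL_CALL_SCHEMA = {"tool_name": str, "arguments": dict, "trace_id": str}

def validate_tool_call(raw: str) -> list[str]:
    """Return a list of schema violations for one tool-call JSON payload."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    errors = []
    for field, expected_type in TOOL_CALL_SCHEMA.items():
        if field not in payload:
            errors.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}")
    return errors

good = '{"tool_name": "search", "arguments": {"q": "refund policy"}, "trace_id": "abc-123"}'
bad = '{"tool_name": "search", "arguments": "not-a-dict"}'
```

Because the check is cheap and runs per merge, the cost stays proportional to velocity rather than accumulating into a pre-launch lump sum.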

When speed wins anyway

Early exploratory work can tolerate fuzzier gates if the audience is internal and the blast radius is bounded. The handoff to hardening is where verification becomes non-negotiable: you are about to widen trust or traffic.

Cost and latency budgets also interact with velocity: if every experiment ignores per-session envelopes, verification becomes a firefight after the invoice arrives—another form of debt.

Branching strategies for model artifacts

Treat prompts, adapters, and retrieval configs like code: version them, diff them, and require review when they touch high-risk flows. Pair every merge with a minimal slice of layer-one checks so “small” changes cannot silently widen what the model is allowed to assert—especially where fluency could mask policy drift.
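One way to make prompts and retrieval configs diffable like code is to fingerprint the whole artifact and gate review on high-risk flows. A minimal sketch; the flow names and helper functions are hypothetical.

```python
import hashlib
import json

def artifact_fingerprint(prompt: str, model: str, retrieval_config: dict) -> str:
    """Content-hash a prompt artifact so any change produces a visible diff."""
    blob = json.dumps(
        {"prompt": prompt, "model": model, "retrieval": retrieval_config},
        sort_keys=True,  # stable serialization -> stable hash
    )
    return hashlib.sha256(blob.encode()).hexdigest()[:12]

# Illustrative set of flows where fluency could mask policy drift.
HIGH_RISK_FLOWS = {"refunds", "medical_advice"}

def requires_review(flow: str, old_fp: str, new_fp: str) -> bool:
    """Any fingerprint change on a high-risk flow must go through review."""
    return flow in HIGH_RISK_FLOWS and old_fp != new_fp
```

Pinning the fingerprint in the repo means a "small" prompt tweak shows up as a changed hash in the diff, which is exactly what the layer-one check can key on.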

When verification becomes the product

In regulated or safety-critical surfaces, the harness is not overhead—it is the user experience. That shifts resourcing: you staff for continuous suite maintenance, not a one-time audit. Connect that mindset to human oversight so operators are not left as the last line of defense for every ambiguous case.

Flaky tests are worse than no tests

Non-deterministic evals teach teams to ignore red builds. Stabilize sampling temperature, pin model versions in CI, and separate “flaky” from “real drift” using retries and quarantine queues—patterns that align with layered eval design. A suite that cries wolf burns trust faster than no suite at all.
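The retry-and-quarantine split can be expressed in a few lines. This is a sketch under one assumption: a failure that reproduces on every retry is real drift, while an intermittent one goes to a quarantine queue for investigation rather than blocking CI.

```python
def classify_failure(run_eval, retries: int = 3) -> str:
    """Rerun an eval and separate flakiness from real drift.

    run_eval: zero-arg callable returning True (pass) or False (fail).
    """
    results = [run_eval() for _ in range(retries)]
    if all(results):
        return "pass"
    if not any(results):
        return "real_drift"   # fails every time: block the merge
    return "quarantine"       # intermittent: route to the quarantine queue
```

Combined with pinned model versions and fixed sampling temperature, quarantined cases become a backlog to debug instead of a reason to ignore red builds.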

Cross-team contracts for merge velocity

ML, backend, and design each change prompts, tools, and UI copy. Define interfaces: who owns the golden set for a flow, who approves corpus updates, and how long a breaking change can sit unreleased. Without those agreements, verification becomes a bottleneck owned by whoever shouts loudest in chat rather than a shared contract artifact.
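Those agreements are most useful when they live in the repo as a checkable artifact. A minimal sketch; the team names, flow name, and the five-day limit are invented for illustration.

```python
# Hypothetical ownership contract for one flow, kept in-repo so CI can enforce it.
CONTRACTS = {
    "checkout_flow": {
        "golden_set_owner": "ml-team",
        "corpus_approver": "backend-team",
        "max_unreleased_breaking_days": 5,
    },
}

def can_merge_breaking_change(flow: str, days_unreleased: int) -> bool:
    """A breaking change may merge only inside the agreed unreleased window."""
    return days_unreleased <= CONTRACTS[flow]["max_unreleased_breaking_days"]
```

The point of the file is less the enforcement than the fact that changing the agreement now requires a reviewed diff.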

What to measure weekly

Track eval build pass rate, median time to green after a failure, and the count of production incidents that no existing test would have caught. A rising incident-without-test count signals that ownership is reactive. Pair these with cost dashboards so verification spend stays visible to leadership rather than buried in engineering-only spreadsheets.
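All three weekly metrics fall out of two small record streams. The record shapes and sample values below are assumptions made up for illustration.

```python
from statistics import median

# Illustrative weekly records; field names are assumptions.
builds = [
    {"passed": True,  "time_to_green_min": 0},
    {"passed": False, "time_to_green_min": 45},
    {"passed": True,  "time_to_green_min": 0},
    {"passed": False, "time_to_green_min": 90},
]
incidents = [
    {"id": "INC-1", "has_test": True},
    {"id": "INC-2", "has_test": False},
]

# Build pass rate across the week's eval runs.
pass_rate = sum(b["passed"] for b in builds) / len(builds)

# Median time to green, measured only over the builds that failed.
time_to_green = median(b["time_to_green_min"] for b in builds if not b["passed"])

# Incidents still lacking a corresponding test: the reactive-ownership signal.
incidents_without_test = sum(1 for i in incidents if not i["has_test"])
```

Keeping the computation this simple makes it easy to replicate in whatever dashboard leadership already reads.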

Velocity without gates is motion; verification turns motion into progress you can repeat.
