HTTP status codes and stack traces were never enough for LLM pipelines. A single user turn may touch embedding services, vector stores, rerankers, tool executors, and safety filters—each with its own latency distribution and failure mode. You need a single trace ID propagated through the chain, aligned with the accountability story in truth vs. fluency: when something reads well but is wrong, you must know whether retrieval or generation failed.
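A minimal sketch of that propagation, using only the standard library: a `contextvars.ContextVar` carries one trace ID through every stage, so the embed, retrieve, and generate logs all share it. The stage names and fields here are illustrative assumptions, not a prescribed schema.

```python
import contextvars
import uuid

# One trace ID per user turn, visible to every stage that logs.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_trace() -> str:
    """Mint a fresh trace ID at the start of a user turn."""
    tid = uuid.uuid4().hex
    trace_id_var.set(tid)
    return tid

def log_event(stage: str, **fields) -> dict:
    """Build a log record that always carries the current trace ID."""
    return {"trace_id": trace_id_var.get(), "stage": stage, **fields}

tid = start_trace()
retrieval = log_event("retrieve", top_k=5)
generation = log_event("generate", model_version="v3.2")  # hypothetical version tag
```

Because the ID lives in a context variable, it survives across async tasks in the same request without threading it through every function signature.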
What to log (and what not to)
Prefer structured fields: model version, retrieval query hash, top chunk IDs, tool names, policy outcomes, token timings. Full prompt logging often violates compliance—use sampling and redaction policies, and replay synthetic fixtures for debugging. Connect this discipline to index versioning so you can correlate regressions with corpus changes.
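One way such a record might look, as a sketch: the raw query never enters the log, only a stable hash, and an index version field makes corpus regressions correlatable. Field names here are assumptions for illustration.

```python
import hashlib
import json

def query_hash(query: str) -> str:
    # Hash the retrieval query so records are joinable without storing raw text.
    return hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]

def build_log_record(model_version, index_version, query,
                     chunk_ids, tool_names, policy_outcome, token_timings_ms):
    """Assemble a structured record with no raw prompt content."""
    return {
        "model_version": model_version,
        "index_version": index_version,   # correlate regressions with corpus changes
        "query_hash": query_hash(query),  # no raw query text in the record
        "chunk_ids": chunk_ids,
        "tools": tool_names,
        "policy": policy_outcome,
        "token_timings_ms": token_timings_ms,
    }

record = build_log_record("v3.2", "idx-2024-06", "how do I reset my password?",
                          ["c-101", "c-204"], ["kb_search"], "allowed", [12, 9, 11])
print(json.dumps(record))
```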
Dashboards that matter
Track p95 latency by model tier, retrieval empty-rate, tool error rate, human escalation rate, and cost per successful task—themes that overlap with economic requirements and oversight metrics. Raw token counts are a vanity metric and rarely explain incidents.
During pilots, align dashboards with the product questions you actually want answered—otherwise you will stare at green graphs while users churn.
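Cost per successful task is worth spelling out, since it is the metric teams most often skip. A sketch, assuming you can estimate a success rate from human review or task-completion signals:

```python
def cost_per_successful_task(total_cost_usd: float,
                             tasks_attempted: int,
                             success_rate: float) -> float:
    """Normalize spend by outcomes, not by requests served."""
    successes = tasks_attempted * success_rate
    if successes == 0:
        return float("inf")  # all spend, no value delivered
    return total_cost_usd / successes

# e.g. $420 across 10,000 attempts at a 70% success rate
cpst = cost_per_successful_task(420.0, 10_000, 0.70)
```

Two deployments with identical per-request cost can differ several-fold on this number, which is usually the one the business actually cares about.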
Sampling and privacy budgets
Full prompt logging is rarely compliant; stratified sampling with role-based access is the usual compromise. Define who can replay which sessions, how long you retain them, and how redaction works—then audit access the same way you audit model changes. This ties directly to policy and to trust in high-stakes answers.
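A stratified-sampling-plus-audit sketch under assumed policy numbers (the strata, rates, and role allow-list are all hypothetical): sampling decisions are deterministic per session so replays are reproducible, and every replay attempt lands in an audit log whether or not it was granted.

```python
import hashlib

# Assumed policy: sample 1% of low-risk and 10% of high-risk sessions.
SAMPLE_RATES = {"low_risk": 0.01, "high_risk": 0.10}
AUDIT_LOG = []

def should_sample(session_id: str, stratum: str) -> bool:
    """Deterministic hash bucketing: same session, same decision, every time."""
    bucket = int(hashlib.sha256(session_id.encode()).hexdigest(), 16) % 10_000
    return bucket < SAMPLE_RATES.get(stratum, 0.0) * 10_000

def replay_session(session_id: str, role: str) -> bool:
    """Gate raw-session replay behind a role check; log every attempt."""
    allowed = role in {"incident_responder"}  # assumed allow-list
    AUDIT_LOG.append({"session": session_id, "role": role, "granted": allowed})
    return allowed
```

Deterministic bucketing matters: a random coin flip per lookup would make the sampled set unstable, so the same incident could appear and vanish between queries.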
Alerting on semantic drift
Sudden spikes in retrieval empty-rate, tool errors, or mean output length often precede user-visible quality issues. Set thresholds with product context, not only infra defaults, and pair them with offline evaluation suites so you know whether to roll back a model, an index, or a client build.
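A rolling-window alert over one such signal might be sketched like this; the window size and threshold are placeholder values a product team would tune, not defaults from any monitoring system:

```python
from collections import deque

class DriftAlert:
    """Fire when the rolling mean of a metric (e.g. retrieval empty-rate)
    exceeds a product-defined threshold."""

    def __init__(self, threshold: float, window: int = 100):
        self.threshold = threshold
        self.values = deque(maxlen=window)  # oldest samples fall off automatically

    def observe(self, value: float) -> bool:
        self.values.append(value)
        mean = sum(self.values) / len(self.values)
        return mean > self.threshold

empty_rate = DriftAlert(threshold=0.05, window=50)
# Feed 0/1 observations (1 = retrieval returned nothing for the query).
fired = any(empty_rate.observe(v) for v in [0] * 40 + [1] * 10)
```

The same class works for tool error rate or mean output length; only the threshold and the meaning of each observation change.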
OpenTelemetry-style spans for LLM chains
Model one span per stage: embed, retrieve, rerank, generate, tool call, safety filter. Propagate baggage with user/session IDs where policy allows. That structure makes distributed traces readable in standard backends—no bespoke “AI only” viewer required unless you want richer token visualizations on top.
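A stdlib-only sketch of that span shape (this mimics OpenTelemetry's trace/span/baggage structure for illustration; a real deployment would use the OpenTelemetry SDK so standard backends can ingest the traces):

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """One span per pipeline stage, all sharing a trace_id."""
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    baggage: dict = field(default_factory=dict)  # e.g. session IDs, policy permitting
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self):
        self.end = time.monotonic()

trace_id = uuid.uuid4().hex
root = Span("handle_turn", trace_id, baggage={"session_id": "s-123"})
for stage in ["embed", "retrieve", "rerank", "generate", "tool_call", "safety_filter"]:
    child = Span(stage, trace_id, parent_id=root.span_id, baggage=root.baggage)
    child.finish()  # in practice: wrap the stage's actual work
root.finish()
```

Because every child carries the root's trace ID and parent ID, an ordinary trace viewer can reconstruct the chain without knowing anything about LLMs.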
Debugging without reproducing PII
Engineers need enough context to fix bugs; compliance needs minimization. Use hashed query fingerprints, synthetic replays with scrubbed fixtures, and environment-specific log levels. Align with oversight teams on who may unlock sampled full prompts and under what audit trail.
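Fingerprinting and scrubbing can be sketched together; the email regex here is deliberately simple and illustrative, since real redaction would cover more PII classes:

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def fingerprint(text: str) -> str:
    """Stable fingerprint for grouping identical queries without storing them."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]

def scrub(text: str) -> str:
    """Replace email addresses with a placeholder so fixtures replay safely."""
    return EMAIL.sub("<EMAIL>", text)

raw = "reset password for alice@example.com please"
fixture = scrub(raw)
fp = fingerprint(raw)  # engineers see fp + fixture, never raw
```

The fingerprint lets engineers confirm that two bug reports concern the same underlying query while the scrubbed fixture is what actually gets replayed in test environments.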
SLOs for model-backed APIs
Error rate, latency percentiles, and saturation on GPU pools should sit beside traditional HTTP SLOs. When GPUs are throttled, cascading timeouts often look like “bad model quality”—correlate queue depth with user complaints to avoid misdiagnosing a capacity issue as a prompt bug.
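The queue-depth correlation check is simple enough to sketch directly; the hourly samples below are invented numbers purely to show the shape of the analysis:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / denom

# Hypothetical hourly samples: GPU queue depth vs. user complaint counts.
queue_depth = [2, 3, 2, 8, 15, 14, 4, 3]
complaints  = [0, 1, 0, 3,  6,  5, 1, 0]

r = pearson(queue_depth, complaints)
# A strong positive r points at capacity, not at the prompt or the model.
```

If r is near zero while complaints spike, look at the model or index instead; the point is to rule capacity in or out before anyone starts rewriting prompts.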
Good LLM observability answers “which stage lied” in one trace—not “the model was slow.”