Phase 4: Hardening · Step 11 of 14 · Advanced · 1-2 weeks

Monitoring & Evaluation

Observe, measure, and continuously improve agent performance in production.

LangSmith/Langfuse · Tracing · Metrics · A/B testing · Human feedback loops · Benchmarking

Getting Started

An agent in production without monitoring is a black box. You cannot improve what you cannot measure, and you cannot debug what you cannot observe. Monitoring for agents goes beyond traditional application metrics because agent behavior is non-deterministic — the same input can produce different reasoning paths, tool calls, and outputs.

The three pillars of agent observability are tracing (what happened), metrics (how well it performed), and evaluation (how correct the output was). Tools like Langfuse, LangSmith, and Arize Phoenix provide purpose-built platforms, but you can start with simple instrumentation.

Key Concepts

Tracing records the full execution path of an agent request. Every LLM call, tool invocation, and decision point gets a span with timing, inputs, and outputs. This is your primary debugging tool:

import time
from dataclasses import dataclass, field

@dataclass
class Span:
    """One timed step in a trace: an LLM call, tool invocation, etc."""
    name: str
    start_time: float = field(default_factory=time.time)
    end_time: float = 0.0
    metadata: dict = field(default_factory=dict)

    def end(self):
        self.end_time = time.time()

    @property
    def duration_ms(self) -> float:
        return (self.end_time - self.start_time) * 1000

class Tracer:
    """Collects the spans for a single agent request."""
    def __init__(self):
        self.spans: list[Span] = []

    def start_span(self, name: str, **metadata) -> Span:
        # The caller is responsible for calling .end() when the step finishes
        span = Span(name=name, metadata=metadata)
        self.spans.append(span)
        return span

    def summary(self) -> dict:
        # Aggregate view: total latency plus a per-step breakdown
        return {
            "total_ms": sum(s.duration_ms for s in self.spans),
            "steps": [
                {"name": s.name, "ms": s.duration_ms}
                for s in self.spans
            ],
        }
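A common ergonomic variant (a sketch, not part of any particular library) wraps each step in a context manager, so a span can never be left open if a step raises. The step names and metadata below (`llm_call`, `web_search`, the model name) are illustrative placeholders:

```python
import time
from contextlib import contextmanager

class ContextTracer:
    """Like the Tracer above, but spans end automatically on block exit."""
    def __init__(self):
        self.spans: list[dict] = []

    @contextmanager
    def span(self, name: str, **metadata):
        start = time.time()
        try:
            yield
        finally:
            # Record the span even if the wrapped step raised
            self.spans.append({
                "name": name,
                "ms": (time.time() - start) * 1000,
                "metadata": metadata,
            })

tracer = ContextTracer()
with tracer.span("llm_call", model="gpt-4o"):  # model name is illustrative
    time.sleep(0.01)  # stand-in for a real LLM request
with tracer.span("tool_call", tool="web_search"):
    time.sleep(0.005)  # stand-in for a real tool invocation

print([(s["name"], round(s["ms"], 1)) for s in tracer.spans])
```

Because the `finally` block always runs, failed steps still produce spans with accurate timings, which is exactly when you need the trace most.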

Metrics dashboards aggregate traces into actionable numbers: P50/P95 latency, token cost per request, tool call success rates, and task completion rates. Set alerts on regressions — if your P95 latency doubles or your success rate drops below a threshold, you need to know immediately.
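A sketch of how those aggregates can be computed from collected trace data, using plain Python with no dashboard dependency (the latency and tool-outcome samples are illustrative):

```python
import math

def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: pct=95 returns the P95 value."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[max(rank - 1, 0)]

# Per-request end-to-end latencies pulled from traces (illustrative data)
latencies_ms = [120, 95, 300, 110, 2500, 115, 105, 98, 130, 140]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)

# Tool call outcomes from the same window (illustrative data)
tool_ok = [True, True, False, True, True]
success_rate = sum(tool_ok) / len(tool_ok)

print(f"P50={p50}ms P95={p95}ms tool_success={success_rate:.0%}")
```

Note how the single 2500 ms outlier dominates P95 while barely moving P50; that gap between the two percentiles is why alerting on tail latency catches regressions that averages hide.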

A/B testing lets you compare prompt variations, model versions, or architecture changes with statistical rigor. Route a percentage of traffic to the variant, collect metrics for both, and measure whether the change actually improves outcomes rather than relying on intuition.
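A sketch of the traffic-routing half, assuming each request carries a stable id; deciding whether the variant actually wins is then a standard two-proportion test over the metrics collected per arm:

```python
import hashlib

def assign_variant(request_id: str, treatment_pct: float = 0.10) -> str:
    """Deterministically route a fixed share of traffic to the variant.

    Hashing the request id (rather than calling random.random) keeps a
    request's assignment stable across retries and log replays.
    """
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "variant" if bucket < treatment_pct * 100 else "control"

# Simulate 1000 requests; roughly 10% should land in the variant arm
counts = {"control": 0, "variant": 0}
for i in range(1000):
    counts[assign_variant(f"req-{i}")] += 1
print(counts)
```

Tag every trace with its assigned arm so latency, cost, and success-rate comparisons can be sliced by variant, and only ship the change once the difference clears your significance threshold.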

Human feedback loops close the gap between automated metrics and actual quality. Build a simple thumbs-up/thumbs-down mechanism into your agent's interface, store feedback alongside the trace, and use it to identify patterns in failures that automated evaluation misses.
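A minimal sketch of the storage side using SQLite from the standard library; the `trace_id` column linking feedback back to a trace is an assumption about your tracing setup, and the example rows are illustrative:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # swap for a file path in production
conn.execute("""CREATE TABLE feedback (
    trace_id TEXT, score INTEGER, comment TEXT, ts REAL)""")

def record_feedback(trace_id: str, thumbs_up: bool, comment: str = ""):
    # Store feedback keyed by trace id so each rating can be joined
    # back to the full execution path that produced the response
    conn.execute("INSERT INTO feedback VALUES (?, ?, ?, ?)",
                 (trace_id, 1 if thumbs_up else -1, comment, time.time()))

record_feedback("trace-123", True)
record_feedback("trace-456", False, "hallucinated a citation")

# Surface the traces users flagged, so they can be replayed and inspected
rows = conn.execute(
    "SELECT trace_id, comment FROM feedback WHERE score < 0").fetchall()
print(rows)
```

Reviewing the flagged traces as a batch is where patterns emerge: the same tool failing, the same class of question going wrong, failure modes no automated metric was watching for.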

Hands-On Practice

Instrument an existing agent with the Tracer class above. Wrap each step — LLM calls, tool invocations, result processing — in a span. After running 20 or more queries through the agent, analyze the trace data: which steps are slowest, which fail most often, and where token usage is highest. Use these insights to identify your first optimization target.

Exercises

Build an Agent Metrics Dashboard

Instrument an existing agent with tracing and metrics collection. Track latency per step, token usage, tool call success rates, and end-to-end task completion. Export metrics and build a simple dashboard that shows trends over the last 24 hours.

Knowledge Check

What is the primary purpose of tracing in an agent system?

Milestone Project

Dashboard tracking agent performance metrics with alerting on degradation