Spans & the Latency Breakdown
Goal: measure where a turn spends its time, not just that it was slow. “The turn took 4 seconds” is not an actionable signal — a latency or cost loop needs to know which phase burned the time. You will build a small span timer that records each phase of a turn on a monotonic clock, classifies spans into phases (setup, context, routing, inference, tools, synthesis), and emits a latency breakdown as one joinable record.
Where this fits: the last of the sensing units (1–3). It uses Unit 1’s trace_id (one timer
per trace) and Unit 2’s event vocabulary (the breakdown is emitted as a request_timing event).
Together, Units 1–3 are the lens — the observability substrate every loop from Unit 4 onward
reads. After this, you start closing loops.
A duration is not a diagnosis
A single number — total latency — tells you a turn was slow but not why, and “why” is the only thing you can act on. Was it the model? A slow tool? Context assembly? Each points to a different fix and a different feedback loop. To separate them you record spans: named, timed segments of the turn.
Two details make spans trustworthy:
- Use a monotonic clock. Wall-clock time (
time.time()) can jump backward when the system clock is adjusted, producing negative or absurd durations.time.monotonic_ns()only moves forward — it exists for exactly this. - Time inline, as the turn runs. You could reconstruct timing afterward by scanning logs,
but some phases — a context-window assembly, a DB lookup — emit no log of their own, so they would
be invisible.
personal_agent’srequest_timer.pysays this directly: unlike “the post-hoc log-scanning approach in metrics.py, this captures precise monotonic-clock measurements including phases that may not emit their own log events.”
A context manager makes spans painless:
@contextmanager
def span(self, name):
start = time.monotonic_ns()
try:
yield
finally:
end = time.monotonic_ns()
self._seq += 1
self._spans.append(Span(name, self._seq, classify(name),
offset_ms=(start - self._start) / 1e6,
duration_ms=(end - start) / 1e6))
This is the shape of the harness’s RequestTimer.span() — a monotonic clock, a sequence counter,
and a span recorded on exit. (Reference:
examples/03/spans_and_latency.py
.)
Classify spans into phases
Individual span names (llm_call:primary, tool_execution:search) are precise but too granular to
aggregate. So each span name is classified into a phase — a coarse category you can sum over.
A small prefix map does it, specific prefixes before generic ones:
PHASE_MAP = [
("llm_call", "llm_inference"),
("tool_execution", "tool_execution"),
("context", "context"),
("routing", "routing"),
("synthesis", "synthesis"),
("session", "setup"),
]
The harness’s _PHASE_MAP carries the full production pipeline — setup → context → routing →
llm_inference → tool_execution → synthesis → persistence — and the comment “Order matters: more
specific prefixes … before generic ones” is load-bearing: llm_call:router must classify as
routing before the generic llm_call: catches it.
Read the breakdown
Sort the spans by offset and append a total, and the turn tells you where it went. Running the example:
span phase offset_ms duration_ms
session_lookup setup 0.0 7.5
context_window context 7.6 15.0
routing_decision routing 22.6 3.0
llm_call:primary llm_inference 25.6 86.4
tool_execution:search tool_execution 112.0 30.0
synthesis synthesis 142.1 6.0
TOTAL total 0.0 148.1
slowest phase: llm_inference (86.4 ms of 148.2 ms)
In this sample, inference dominates and tools are second — a common shape, but not a law: a retrieval-heavy or tool-heavy turn can put the time somewhere else entirely, which is exactly why you measure per turn instead of assuming. Now the signal is actionable — a latency budget loop watches the phase breakdown, not the bare total, and here it knows that shaving context assembly would be wasted effort:
flowchart LR
SET["setup<br/>7.5 ms"] --> CTX["context<br/>15 ms"] --> RT["routing<br/>3 ms"]
RT --> LLM["llm_inference<br/>86 ms ◄ dominant"] --> TOOL["tool_execution<br/>30 ms"] --> SYN["synthesis<br/>6 ms"]
LLM --> DEC["latency loop acts on the<br/>dominant phase, not the total"]The harness exports this two ways: to_breakdown() (every span, for drill-down) and
to_trace_summary() (per-phase totals, for dashboards) — the same data at two zoom levels.
Emit it as one joinable record
The breakdown is only useful if it joins back to the run, so emit it as one request_timing event
(Unit 2’s vocabulary) stamped with the trace’s tuple (Unit 1):
log_event(trace, "request_timing", total_ms=timer.total_ms(), phases=by_phase)
Now the three sensing units compose: a turn has a trace_id (Unit 1), its operations are named
events (Unit 2), and its timing is emitted as a named request_timing event whose payload carries
the per-phase totals (Unit 3). That is the full lens. Everything after this acts on it.
Security: span names and durations are safe, low-cardinality signal; resist the urge to attach raw inputs to them. A timing breakdown is also a side channel — durations can leak whether a cache hit, whether a record existed, whether a branch ran — so treat externally-visible timing as information disclosure when the phase boundaries map to secret-dependent work.
Observe: this unit emits a per-phase latency breakdown as a joinable
request_timingrecord. The loop it closes is “where is the time (or cost) going, and is that acceptable?” — the signal a latency budget or a cache-aware scheduler acts on. A bare total can be logged; only a phase breakdown can be acted on, because only it names the phase a loop should target.
Challenges
- Catch the wall-clock bug. Build the timer twice — once with
time.time(), once withtime.monotonic_ns(). Success: you can explain a scenario (a clock adjustment) where the wall-clock version reports a negative duration and the monotonic one cannot. - Find the dominant phase. Instrument a real §23 agent turn (or the example) and print the per-phase totals. Success: you can name the phase that dominates and one change that would — and one that would not — reduce total latency.
- Summarize for a dashboard. Add a
to_trace_summary()that returns per-phase{duration_ms, steps}plus the total. Success: two views of the same turn — full spans for drill-down, phase totals for a dashboard — from one timer.
Recap
- A total duration is not a diagnosis; a feedback loop needs to know which phase spent the time, because each phase points to a different fix.
- Record spans on a monotonic clock, inline as the turn runs — so you also capture phases
that emit no log of their own (the point of
personal_agent’sRequestTimerover log-scanning). - Classify spans into phases (setup → context → routing → inference → tools → synthesis → persistence) so you can aggregate; order specific prefixes before generic ones.
- Emit the breakdown as one joinable
request_timingrecord. Units 1–3 now compose into the full lens: joined, named, and timed. Loops start in Unit 4.
Next
Unit 4 — The First Closed Loop: a Runtime Gate: you have the lens; now close a loop. You will build a finite-state gate that watches an agent’s tool calls and blocks a runaway — the reflex-tier loop that, in Unit 0’s war story, was the one thing that worked. Sense → decide → act, in a single turn, emitting its own verdict.