Measuring the Window
Goal: before you can manage a budget, you have to read it. In this unit you build a context meter: a small tool that counts the tokens in your prompt and attributes them to where they came from — system prompt, tool definitions, history, and tool outputs — so you can see what is filling the window, not just that it is full. You will count without a tokenizer, check that count against the server, and emit the first joinable telemetry line of this course.
Where this fits: Unit 0 said the window is a budget you already overspend. This unit measures
it. It leans on §4 (tokens, and the course’s rule: ask the server, do not guess), §10 (the
joinable session_id/trace_id/step log line), §13 (the messages list), and §23 (tool
schemas). It is also the first unit with an Observe note that builds something, not just
points forward — the meter you write here is the baseline every later compaction is measured
against.
Count without a tokenizer
You need a token count to know how full the window is. The exact count depends on the model’s
tokenizer — but this course takes no Hugging Face downloads and no tiktoken (§4). For
budgeting that is fine, because budgeting does not need the exact number; it needs a number
good enough to answer “are we getting close?”
A simple heuristic does the job: English text runs roughly four characters per token, and each message costs a few extra tokens of structure (its role, and the delimiters the chat template wraps around it) that the content itself does not show. Sum that over the messages:
_CHARS_PER_TOKEN = 4
_PER_MESSAGE_OVERHEAD = 4 # role + chat-template framing, per message
def estimate_tokens(messages):
return sum(len(m["content"]) // _CHARS_PER_TOKEN + _PER_MESSAGE_OVERHEAD for m in messages)
This is a rule of thumb, not a tokenizer — it drifts with code, other languages, and unusual
whitespace, and it is typically within 10–20% of the truth. That is close enough to decide when
to act, and cheap enough to run on every turn without a network call. (Reference:
examples/common_context.py
, which also handles structured
message content.)
Ask the server when it matters
Sometimes 10–20% is not good enough — you are right at the limit, or a decision turns on the
exact number. Then do what §4 taught: ask the server. Every completion already reports the
exact input count in usage.prompt_tokens, so the smallest possible call returns the truth:
def server_prompt_tokens(client, messages, model):
r = client.chat.completions.create(model=model, messages=messages, max_tokens=1)
return r.usage.prompt_tokens # exactly how many input tokens the model saw
(If your endpoint exposes vLLM’s /tokenize, that returns the count without generating
anything.) The division of labour is the point: the heuristic for budgeting — fast, free,
every turn — and the server for precision — exact, but a real call. The course leans on the
first and reaches for the second only when a choice depends on it.
Where the budget goes
A single total tells you the window is filling. It does not tell you what to cut. For that, attribute the tokens to their source — and the sources are exactly the parts Unit 0 listed. Two of them are mostly fixed each turn; two grow without limit, which is where the meter earns its keep:
flowchart LR
subgraph FIXED["Mostly fixed each turn"]
SP["System prompt — small"]
TD["Tool definitions (§23)"]
RC["Retrieved context (§20)"]
end
subgraph GROWS["Grows without limit"]
H["Conversation history (§13)"]
TO["Tool outputs — grow fastest;<br/>one file read can be larger<br/>than the whole conversation"]
end
FIXED --> BUD["The context budget<br/>(one window, resent every turn)"]
GROWS --> BUD
BUD --> METER["Context meter: attribute tokens to<br/>each source, to see what to cut first"]breakdown = {
"system": estimate_tokens([m for m in messages if m["role"] == "system"]),
"tools": estimate_tokens(json.dumps(tools)), # schemas, resent every turn (§23)
"history": estimate_tokens([m for m in messages if m["role"] in ("user", "assistant")]),
"tool_outputs": estimate_tokens([m for m in messages if m["role"] == "tool"]),
}
Print that as a budget bar and the shape of the problem appears immediately. Run the meter on a
short session whose agent has read one file (Reference:
examples/01/meter.py
):
context meter (budget 8000 tokens)
system 32 1%
tools 97 3% #
history 53 2% #
tool_outputs 2704 94% #####################################
TOTAL 2886 36% of budget
Six messages, and 94% of the tokens are one tool output — a single file the agent read. The system prompt, the tools, and the entire conversation together are a rounding error beside it. This is the lesson Unit 0 promised, now measured: history and tool outputs grow without limit, and tool outputs grow fastest. When you later decide what to compress, the meter has already told you where to look first — and it tells you per turn, so you watch the shape change as the session runs.
Instrument it: the meter is the first telemetry line
A meter you read once and forget is not observability. The course’s discipline is to emit every measurement as a structured, joinable line, so the whole run reconstructs from a shared key (§10). The meter is where that through-line begins:
session_id, trace_id = uuid.uuid4().hex[:8], uuid.uuid4().hex[:8]
log_event(session_id, trace_id, 0, "context_meter",
budget=BUDGET, total=total, fraction=round(total / BUDGET, 3), **breakdown)
{"operation": "context_meter", "session_id": "88194aab", "trace_id": "3317d685", "step": 0,
"budget": 8000, "total": 2886, "fraction": 0.361, "system": 32, "tools": 97,
"history": 53, "tool_outputs": 2704}
That one line is the baseline. Every later unit performs a compaction and logs another line
with the same tuple — so you can replay a session and see the window fill, a compaction fire,
and the total drop, all joined by trace_id. You cannot manage a budget you cannot see; from
here on, you can see it. (log_event lives in
examples/common_context.py
and is reused by every unit.)
Security: log the shape of the context, never its content. The meter records token counts and percentages — safe, high-value metadata — but a message body can carry secrets or personal data, so it must not land in a log without a policy (§10). The meter also doubles as a tripwire: the context-overflow attack from Unit 0 — a hostile tool result or pasted document padding the window to push your system prompt out — shows up here as a sudden spike in
tool_outputs. You will not catch that by eye in a 100-turn run; you will catch it in the telemetry.
Observe: this unit is the start of the through-line. It emits one joinable
context_meterline per measurement —total,fractionof budget, and the per-source breakdown — using the foundations §10 tuple. The loop it closes is the most basic one in the course: you now have a number for “how full, and with what?”, which every later unit needs to decide whether to compress and to prove the compaction actually helped.
Challenges
- Build the meter. Write
estimate_tokensand the four-part breakdown in your ownwork/folder, and run it on a conversation you build. Success: a budget bar that names which source is largest — and, for the example session, it is the tool output by a wide margin. - Trust, then verify. If you have an endpoint, compare your heuristic total to
server_prompt_tokens. Success: you can state your heuristic’s error as a percentage, and say whether it over- or under-counts for your text. - Watch it climb. Append turns (or a second big tool output) and re-run the meter each
time, capturing the
context_meterlines with2>> run.jsonl. Success: a JSONL file whosetotalrises turn by turn — the budget filling, recorded.
Recap
- For budgeting you do not need an exact token count, so the course uses a heuristic
(~4 chars/token + a small per-message overhead) — no
tiktoken, no downloads (§4). - When precision matters, ask the server:
usage.prompt_tokensis the exact input count. - A total is not enough; attribute tokens to their source (system / tools / history / tool outputs). Tool outputs are usually by far the largest — the meter makes that visible.
- Emit every measurement as a joinable
context_meterline (§10 tuple). This is the baseline the rest of the course measures compaction against — observability starts in Unit 1, not at the end. - Log the context’s shape, not its content — and let the meter double as an overflow tripwire.
Next
Unit 2 — The Cheapest Compression Is None: now that you can read the budget, the first question is when not to spend effort. We look at the real cost of compressing too early — to answer quality and to the prompt cache — and set the course’s opening rule: under budget, do nothing.