Context Compression on FrenchForet

The Context Problem

Mon, 01 Jan 0001 00:00:00 +0000

Goal: understand the problem this course solves, and why it is harder than “the window is full.” A long-running agent keeps adding to the message list it resends every turn (§13), and eventually that list will not fit in the model’s context window. That is the obvious ceiling. The harder truth is a second, softer ceiling: a model uses a full window worse than a short one. So you compress not only to fit, but to keep the model accurate — and every choice about what to drop, summarize, or keep has a cost you must measure. This course is about managing that budget without losing the things the agent still needs.

Measuring the Window

Goal: before you can manage a budget, you have to read it. In this unit you build a context meter: a small tool that counts the tokens in your prompt and attributes them to where they came from — system prompt, tool definitions, history, and tool outputs — so you can see what is filling the window, not just that it is full. You will count without a tokenizer, check that count against the server, and emit the first joinable telemetry line of this course.
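A context meter of this kind can be sketched in a few lines. The version below is a minimal illustration, not the course's implementation: the four-characters-per-token rule is a common rough heuristic, and the names `estimate_tokens` and `meter`, as well as the `role`/`content` message shape, are assumptions made for the example.

```python
def estimate_tokens(text: str) -> int:
    """Tokenizer-free estimate: roughly 4 characters per token (a heuristic, not exact)."""
    return (len(text) + 3) // 4


def meter(system: str, tools: list, messages: list) -> dict:
    """Attribute estimated tokens to system prompt, tool definitions, history, tool outputs."""
    report = {
        "system": estimate_tokens(system),
        "tools": sum(estimate_tokens(t) for t in tools),
        "history": 0,
        "tool_outputs": 0,
    }
    for m in messages:
        # Assumption: tool results carry role "tool"; everything else counts as history.
        bucket = "tool_outputs" if m["role"] == "tool" else "history"
        report[bucket] += estimate_tokens(m["content"])
    report["total"] = sum(report.values())
    return report
```

Comparing `report["total"]` with the token usage the server reports for the same request shows how far the heuristic drifts for your traffic.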

The Cheapest Compression Is None

Goal: learn when not to compress. Now that the meter (Unit 1) tells you how full the window is, the first question is not “how do I compress?” but “should I compress at all?” The answer, most of the time, is no. Compressing early costs you twice — it throws away answer quality you did not need to spend, and it throws away your prompt cache — to solve a problem you do not have yet. This unit sets the opening rule of the whole course: under budget, do nothing, and it puts numbers on what you lose when you ignore that rule.

Drop & Window: The Safe Baseline

Goal: build the cheapest compaction that actually frees tokens. Unit 2 said do nothing while you are under budget; this unit is what to do the moment you cross the line. The answer is the oldest and simplest method, and still the right first move: drop the oldest turns. You will build a sliding window that anchors the parts you must never lose, drops the stale middle until the prompt fits again, and records every drop — the safe baseline that every smarter mechanism later in the course has to beat.
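As a rough sketch of the sliding window described here: anchor the first `anchor_head` messages, drop the oldest remaining turns until the estimate fits, and log each drop. The function names, the default token estimate, and the shape of the drop record are illustrative assumptions.

```python
def estimate(message: dict) -> int:
    """Crude per-message token estimate (assumption: about 4 characters per token)."""
    return len(message["content"]) // 4 + 1


def drop_oldest(messages, budget, count=estimate, anchor_head=1):
    """Drop the oldest non-anchored turns until the estimated total fits the budget."""
    kept = list(messages)
    drops = []
    total = sum(count(m) for m in kept)
    # Always keep the anchored head and the newest message.
    while total > budget and len(kept) > anchor_head + 1:
        victim = kept.pop(anchor_head)
        tokens = count(victim)
        total -= tokens
        # Record the drop so it can be joined with other telemetry later.
        drops.append({
            "index": anchor_head + len(drops),  # position in the original list
            "role": victim["role"],
            "tokens": tokens,
        })
    return kept, drops
```

If the prompt still does not fit after everything droppable is gone, the function returns what is left; deciding what to do then is a separate policy question.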

Summarizing Evicted Turns

Goal: stop throwing evicted turns away. Unit 3 dropped the oldest middle turns outright; that frees tokens but deletes whatever those turns held. This unit keeps a lossy, structured trace of them instead: a short recap that follows a fixed schema, costs a fraction of the tokens, and keeps the identifiers a later turn may still need. You will build a cheap compressor with a graceful fallback, learn where the recap is allowed to live in the transcript, and see why production does not re-insert it every turn.
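One way to picture a recap with a graceful fallback is sketched below. Everything here is an assumption for illustration: the recap schema (`turns`, `identifiers`, `summary`), the regular expression used to spot identifiers, and the `summarizer` callable, which stands in for whatever model call you use.

```python
import re

# Illustrative pattern: file-like names (src/app.py) and issue-style refs (#42).
ID_PATTERN = re.compile(r"[\w./-]+\.\w+|#\d+")


def fallback_recap(turns):
    """Model-free recap: keep only the identifiers found in the evicted turns."""
    ids = sorted({m for t in turns for m in ID_PATTERN.findall(t["content"])})
    return {"turns": len(turns), "identifiers": ids, "summary": None}


def recap(turns, summarizer=None):
    """Try the (hypothetical) summarizer; fall back if it fails or breaks the schema."""
    if summarizer is not None:
        try:
            out = summarizer(turns)
            if isinstance(out, dict) and {"summary", "identifiers"} <= out.keys():
                return {"turns": len(turns), **out}
        except Exception:
            pass  # any summarizer failure degrades to the deterministic recap
    return fallback_recap(turns)
```

The fallback loses the narrative but keeps the identifiers, which is the part a later turn is most likely to need verbatim.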

Head, Middle, Tail

Goal: stop compressing the turns the model is still using. Unit 3 anchored the head and dropped from the front; Unit 4 summarized what it evicted. Both worked from one end. But the recent tail — the last few turns, the file open right now — is as load-bearing as the task at the head, and a front-only window will eventually reach it. This unit makes the rule explicit and symmetric: keep the head and the tail verbatim, and only ever compress the middle.
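The head/middle/tail rule can be sketched as a split plus a compressor that only ever sees the middle. The default sizes (`head=1`, `tail=4`) and the function names are assumptions chosen for the example.

```python
def split_hmt(messages, head=1, tail=4):
    """Split into (head, middle, tail); the middle is empty when there is nothing to spare."""
    n = len(messages)
    if n <= head + tail:
        return list(messages), [], []  # everything is load-bearing, compress nothing
    cut = n - tail
    return list(messages[:head]), list(messages[head:cut]), list(messages[cut:])


def compress_middle(messages, compress, head=1, tail=4):
    """Apply `compress` to the middle only; head and tail pass through verbatim."""
    h, m, t = split_hmt(messages, head, tail)
    return h + (compress(m) if m else []) + t
```

Any of the earlier mechanisms (drop, recap) can be passed in as `compress`, since the split guarantees they never touch the task at the head or the recent turns at the tail.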

Cheap Before Smart: The Deterministic Pre-Pass

Goal: shrink the middle for free before you pay to summarize it. Unit 5 isolated the middle as the only region you compress; Unit 4 handed it to an LLM summarizer. But most of the time the middle is not a subtle conversation that needs an intelligent summary — it is one or two enormous tool outputs (a file read, a search dump) surrounded by a few short messages. Those you can collapse deterministically, with no model call at all. This unit builds that pre-pass and states the rule it teaches: cheap before smart — do the free, mechanical compression first, and only use the paid, intelligent one if you still need it.
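A deterministic pre-pass might look like the sketch below: any tool output longer than `max_lines` is collapsed to its first and last few lines with a marker in between, and no model is called. The thresholds, the marker text, and the assumption that tool results carry role `"tool"` are all illustrative.

```python
def collapse_tool_output(text: str, max_lines: int = 20, keep: int = 5) -> str:
    """Keep the first and last `keep` lines of an oversized output; mark what was cut."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    omitted = len(lines) - 2 * keep
    marker = f"[... {omitted} lines omitted ...]"
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])


def pre_pass(middle, max_lines: int = 20, keep: int = 5):
    """Collapse large tool outputs in the middle; leave every other message untouched."""
    return [
        {**m, "content": collapse_tool_output(m["content"], max_lines, keep)}
        if m["role"] == "tool" else m
        for m in middle
    ]
```

Because the result depends only on the input text, running the pre-pass twice gives the same output, which makes it cheap to test and safe to run before any paid summarizer.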

When to Fire: Triggers & Async Compression

Goal: decide when compaction runs, and get it off the critical path. Units 3–6 built the what — drop, summarize, head/middle/tail, the deterministic pre-pass — but left the timing open. This unit builds the when: a soft threshold that fires compaction in the background while the turn keeps going, a hard threshold that blocks because the window is genuinely tight, and a re-fire cursor that stops the soft trigger from firing again every single turn. The theme is latency: a user should not wait on a summarizer they did not ask for.
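The two thresholds and the re-fire cursor can be sketched as a small decision object. The 70% and 90% thresholds, the `refire_gap` parameter, and the reading of the cursor as "token count at the last soft fire" are assumptions for illustration; the unit itself may define them differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trigger:
    window: int                      # model context window, in tokens
    soft: float = 0.70               # fraction at which background compaction starts
    hard: float = 0.90               # fraction at which the turn must block
    refire_gap: int = 2000           # growth required before the soft trigger fires again
    last_soft_at: Optional[int] = None  # re-fire cursor: usage at the last soft fire

    def decide(self, used: int) -> str:
        """Return 'block', 'background', or 'none' for the current token usage."""
        if used >= self.hard * self.window:
            return "block"
        if used >= self.soft * self.window:
            cursor = self.last_soft_at
            if cursor is None or used - cursor >= self.refire_gap:
                self.last_soft_at = used
                return "background"
        return "none"
```

On `"background"` the caller would schedule compaction off the critical path (a worker thread or task) and answer the user immediately; on `"block"` it compacts before sending the request.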

Offloading & Paging: Gist Memory

Goal: keep a giant artifact without keeping it in the window. Some things are too big to leave in context and too important to summarize — the 2,000-line file the agent is about to edit, the full API response it will need three turns from now. Dropping it (Unit 3) deletes it; summarizing it (Unit 4) loses the exact bytes. This unit builds the third option: offload the bytes to storage, leave a compact reference in the window, and page the exact bytes back on demand. Unlike every mechanism so far, this one is lossless — and that is the whole point.
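A minimal sketch of offload-and-page, assuming an in-memory dictionary as the storage and a content hash as the reference key; the class name, the stub format, and the 16-character key length are illustrative choices, not the unit's actual design.

```python
import hashlib


class BlobStore:
    """Hold exact bytes outside the window; leave a short stub in their place."""

    def __init__(self):
        self._blobs = {}

    def offload(self, content: str, label: str):
        """Store the content and return (key, stub) where the stub goes in the prompt."""
        data = content.encode("utf-8")
        key = hashlib.sha256(data).hexdigest()[:16]
        self._blobs[key] = data
        stub = f"[offloaded {label} ref={key} bytes={len(data)}]"
        return key, stub

    def page_in(self, key: str) -> str:
        """Return the exact original content for a reference key."""
        return self._blobs[key].decode("utf-8")
```

The round trip is byte-exact, which is what separates this from a recap: `page_in` returns precisely what `offload` received.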

Cache-Aware Compaction

Goal: make the prompt cache the subject, not a side note. Every unit so far has treated the cache as a warning — Unit 2 named it, Unit 4 showed re-inserting a recap breaks it, Units 5 and 7 kept deferring the rewritten layout to “a scheduled reset.” This is that unit. You will see why compaction breaks the cache, the byte-identity invariant the cache depends on entirely, the frozen append-only layout that keeps it alive, and the cost-optimal schedule (L* = √(2R/c)) that decides how often to pay for a rebuild. This is the best-measured win in this course’s production reference — and, tellingly, it is not “compress harder,” it is “stop touching the prefix.”
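The square-root form of that schedule is what you get from a standard amortization argument. The reading below is an assumption, since this outline does not define the symbols: take R as the one-off cost of a rebuild, c as the amount by which the per-turn cost grows for each turn since the last rebuild, and L as the number of turns between rebuilds. Then the average cost per turn is the amortized rebuild cost plus the average accumulated overhead, and minimizing it gives the stated L*.

```latex
% Assumed meanings (not defined in this outline):
%   R = cost of one rebuild, c = per-turn growth in cost, L = turns between rebuilds
\[
  f(L) = \frac{R}{L} + \frac{cL}{2},
  \qquad
  f'(L) = -\frac{R}{L^{2}} + \frac{c}{2} = 0
  \;\Longrightarrow\;
  L^{*} = \sqrt{\frac{2R}{c}}
\]
```

Under this reading, a more expensive rebuild pushes resets further apart, while faster-growing overhead pulls them closer together.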

Prompt-Level Compression

Goal: compress inside the text, not just at the level of whole messages. Every mechanism so far has worked on messages — keep one, drop one, summarize a slice, offload a blob. This unit goes a level down: shrink the tokens within a prompt by removing the ones that carry the least information. That is what perplexity-based methods like LLMLingua do, and what trimming a bloated system prompt does by hand. It is real savings — and the unit that most needs the course’s honesty rule, because aggressive token-dropping can quietly cost you the answer.
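To make the idea of dropping low-information tokens concrete, here is a toy stand-in. It is not LLMLingua and uses no language model: it scores each word by its surprisal under word counts from a background text (an assumption made to keep the example self-contained) and keeps the most surprising fraction in their original order.

```python
import math
from collections import Counter


def prune_tokens(text: str, background: str, keep_ratio: float = 0.7) -> str:
    """Keep the highest-surprisal words; surprisal comes from background word counts."""
    words = text.split()
    freq = Counter(background.lower().split())
    total = sum(freq.values())
    vocab = len(freq) + 1

    def surprisal(word: str) -> float:
        # Add-one smoothing so unseen words get a finite, high score.
        return -math.log((freq[word.lower()] + 1) / (total + vocab))

    n_keep = max(1, math.ceil(len(words) * keep_ratio))
    ranked = sorted(range(len(words)), key=lambda i: surprisal(words[i]), reverse=True)
    keep = set(ranked[:n_keep])
    return " ".join(w for i, w in enumerate(words) if i in keep)
```

Even this toy shows the risk the unit warns about: at a low `keep_ratio` the function will happily delete a negation or a unit, so any real use needs the answer quality measured before and after.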

Measuring Compression Quality

Goal: turn the through-line into a tool. Every unit since Unit 1 has emitted a joinable record — a meter reading, a compaction with its before/after tokens, a decision, a page-in, a ratio. On their own they are a pile of log lines. This unit reads them back as a timeline and answers the question every record was secretly for: did the compression cost us anything we needed? You will build a quality harness that measures the feedback loop — did a compaction drop something a later turn referenced? — draws the before/after token curve, and exposes a no-regression gate you can run in CI.
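The feedback-loop check and the gate could be sketched as follows. The record shapes are assumptions for the example: a `compaction` record lists the identifiers it dropped, a `turn` record carries the text of a later turn, and both carry a `turn` number so they can be joined.

```python
def referenced_after_drop(records):
    """Find identifiers that a compaction dropped and a later turn still mentioned."""
    dropped = {}       # identifier -> turn number at which it was first dropped
    regressions = []
    for rec in records:  # records are assumed to be in chronological order
        if rec["type"] == "compaction":
            for ident in rec["dropped_ids"]:
                dropped.setdefault(ident, rec["turn"])
        elif rec["type"] == "turn":
            for ident, at in dropped.items():
                if ident in rec["text"]:
                    regressions.append(
                        {"id": ident, "dropped_at": at, "referenced_at": rec["turn"]}
                    )
    return regressions


def gate(records, max_regressions: int = 0) -> bool:
    """No-regression gate: pass only if at most `max_regressions` were found."""
    return len(referenced_after_drop(records)) <= max_regressions
```

In CI, a replayed session log would be fed to `gate` and the build would fail when it returns `False`.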

The Measured Default

Goal: assemble the whole course into one defensible default. You have built every branch of the decision tree from Unit 0 — measuring, doing nothing, dropping, summarizing, head/tail, pre-pass, triggers, offloading, cache-aware scheduling, and a quality gate. This unit wires them into a single policy that does the least that works each turn, surfaces what it did to the user with a session meter, and ends on the honest move the whole course has been circling: the cheapest tokens are the ones you never generate, so when a turn is too big to compress, decompose the task instead.
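The "least that works" policy can be pictured as an escalation ladder: do nothing under budget, otherwise apply steps from cheapest to most expensive and stop as soon as the prompt fits. The sketch below assumes the caller supplies the ordered steps and the token counter; the function name and the `"decompose"` signal are illustrative.

```python
def least_that_works(messages, budget, count, steps):
    """Apply (name, step) pairs in order until the prompt fits; report what was done."""
    if count(messages) <= budget:
        return messages, ["none"]          # under budget: do nothing
    applied = []
    for name, step in steps:
        messages = step(messages)
        applied.append(name)
        if count(messages) <= budget:
            return messages, applied
    # Nothing fit: signal that the task should be decomposed instead.
    return messages, applied + ["decompose"]
```

The returned list of step names is what a session meter could surface to the user, so each turn shows which mechanisms actually ran.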