Conversation State & Memory
Goal: understand that the API is stateless — it remembers nothing between calls — and build a multi-turn conversation yourself by keeping the history. Then learn to keep that history inside the token budget with windowing and summarization.
Where this fits: every call so far was one-shot. Real assistants hold a conversation, and you are responsible for the memory. This underpins tools (Section 13), RAG (Section 19), and agents (Section 22) — they all manage a growing message list.
The server remembers nothing
This surprises people: there is no “conversation” on the server. Each
chat.completions.create(...) is independent. To make the model aware of earlier turns,
you resend the entire history every time — as a growing messages list.
The loop is always the same:
- Append the new user message to your history.
- Send the whole history.
- Append the assistant reply to your history.
- Repeat.
Build it. Create work/chat.py (scripted turns, so it runs without typing):
from common import get_client, MODEL

client = get_client()
history = [{"role": "system", "content": "You are a concise travel assistant."}]
user_turns = [
    "I'm planning a trip to Japan.",
    "What's the best season to visit?",         # 'visit' relies on remembering Japan
    "And what should I pack for that season?",  # relies on remembering the season
]
for turn in user_turns:
    history.append({"role": "user", "content": turn})
    response = client.chat.completions.create(model=MODEL, messages=history)
    reply = response.choices[0].message.content
    history.append({"role": "assistant", "content": reply})  # <-- don't forget this
    print(f"USER: {turn}\nASSISTANT: {reply}")
    print(f" (prompt_tokens: {response.usage.prompt_tokens})\n")
python work/chat.py
Two things to notice:
- The later turns (“visit”, “that season”) only make sense because the earlier turns are in the history. Delete the line that appends the assistant reply and the model goes amnesiac about its own answers: the user turns are still resent, but it no longer knows which season it recommended.
- prompt_tokens climbs every turn: you resend (and pay for) the whole conversation each time. That’s the problem the rest of this lesson solves. (Reference: examples/12/chat_loop.py.)
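The growth in prompt_tokens is easy to work out by hand. The sketch below assumes fixed, made-up sizes for the system message, each user turn, and each reply; the function name and the numbers are illustrative only, and real counts also include a small per-message overhead that is ignored here.

```python
def prompt_tokens_per_turn(system_tokens, user_tokens, reply_tokens, turns):
    """Estimate the prompt size of each call when the full history is resent.

    Assumes every user message and every reply has the same size, which is
    a simplification for illustration.
    """
    sizes = []
    for i in range(1, turns + 1):
        # Call i resends the system message, i user messages (including the
        # new one), and the i - 1 assistant replies received so far.
        sizes.append(system_tokens + i * user_tokens + (i - 1) * reply_tokens)
    return sizes


sizes = prompt_tokens_per_turn(system_tokens=10, user_tokens=20, reply_tokens=80, turns=3)
print(sizes)       # [30, 130, 230]
print(sum(sizes))  # 390 prompt tokens across the three calls
```

Under these assumptions the per-call prompt grows linearly with the number of turns, so the total paid over a conversation grows roughly with the square of its length.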
Want it interactive? Replace the for turn in user_turns loop with while True: turn = input("you: "). The mechanics are identical: append, send, append.
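One possible shape for the interactive version is sketched here. The names run_chat, send, and read_line are hypothetical; send stands for any function that takes the history and returns the reply text, so the loop can be tried without a network call. The quit/exit words and the empty-line handling are assumptions added for convenience.

```python
def run_chat(send, read_line=input, system_prompt="You are a concise travel assistant."):
    """Interactive append -> send -> append loop. Returns the final history."""
    history = [{"role": "system", "content": system_prompt}]
    while True:
        try:
            turn = read_line("you: ").strip()
        except EOFError:
            break  # Ctrl-D / end of input ends the chat
        if turn.lower() in {"quit", "exit"}:
            break
        if not turn:
            continue  # ignore empty lines rather than sending them
        history.append({"role": "user", "content": turn})
        reply = send(history)  # the whole history goes out every time
        history.append({"role": "assistant", "content": reply})
        print(f"assistant: {reply}")
    return history
```

With the lesson's client, send could be lambda h: client.chat.completions.create(model=MODEL, messages=h).choices[0].message.content.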
Drop the reasoning. When you append the assistant turn, keep only its content, not its reasoning_content (Section 5). The thinking was scratch work; resending it just wastes context and money.
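A small helper makes this hard to get wrong. This is a sketch: assistant_turn and strip_reasoning are hypothetical names, and the SimpleNamespace object only stands in for response.choices[0].message so the snippet runs offline.

```python
from types import SimpleNamespace


def assistant_turn(message):
    """Build the history entry for a reply, copying only its content."""
    return {"role": "assistant", "content": message.content or ""}


def strip_reasoning(history):
    """Return a copy of the history with any reasoning_content keys removed."""
    return [{k: v for k, v in m.items() if k != "reasoning_content"} for m in history]


# Stand-in for response.choices[0].message (assumed shape, for illustration).
fake = SimpleNamespace(content="Spring is ideal.", reasoning_content="(long scratch work)")
print(assistant_turn(fake))  # {'role': 'assistant', 'content': 'Spring is ideal.'}
```

strip_reasoning is only needed if reasoning was already stored in an existing history; building each entry with assistant_turn avoids storing it in the first place.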
Keeping history inside the budget
Because input + output ≤ the context window (Section 3), an ever-growing history
eventually crowds out the answer — and gets expensive (Section 10). Two standard fixes,
both in work/trim.py:
1. Sliding window — keep the system message plus the last N turns, drop the rest:
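from common import get_client, MODEL

client = get_client()

def sliding_window(history, keep_turns=4):
    # keep_turns counts individual messages, so one user/assistant exchange uses two.
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    # Guard the zero case: rest[-0:] would return the whole list.
    recent = rest[-keep_turns:] if keep_turns > 0 else []
    return system + recent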
Simple and cheap, but the model truly forgets anything older than the window.
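To see the forgetting concretely, here is the window applied to a short hand-written history. The sliding-window logic is repeated so the snippet runs on its own, and the "(reply N)" strings are placeholders rather than real model output.

```python
def sliding_window(history, keep_turns=4):
    # The lesson's sliding-window logic, repeated so this runs standalone.
    system = [m for m in history if m["role"] == "system"]
    rest = [m for m in history if m["role"] != "system"]
    recent = rest[-keep_turns:] if keep_turns > 0 else []
    return system + recent


history = [
    {"role": "system", "content": "You are a concise travel assistant."},
    {"role": "user", "content": "I'm planning a trip to Japan."},
    {"role": "assistant", "content": "(reply 1)"},
    {"role": "user", "content": "What's the best season to visit?"},
    {"role": "assistant", "content": "(reply 2)"},
    {"role": "user", "content": "And what should I pack for that season?"},
]

trimmed = sliding_window(history, keep_turns=2)
for m in trimmed:
    print(m["role"], "|", m["content"])
# system | You are a concise travel assistant.
# assistant | (reply 2)
# user | And what should I pack for that season?
```

The message that mentioned Japan is no longer sent, so the model has no way to know the destination.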
2. Summarize — replace old turns with a short model-written summary, so the facts survive even though the verbatim turns don’t:
def summarize(messages):
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content":
                   "Summarize this conversation in 2 sentences, keeping key facts:\n"
                   + transcript}],
        max_tokens=120,
    )
    return r.choices[0].message.content.strip()
A common production pattern combines them: keep the last few turns verbatim, and replace
everything older with a running summary stored as a system message.
python work/trim.py
(Reference: examples/12/trim_history.py.)
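The combined pattern might look something like the sketch below. It is not the reference implementation: compact, SUMMARY_PREFIX, and the thresholds are hypothetical, and summarize is passed in as a parameter so the logic can be exercised with a stand-in instead of a model call.

```python
SUMMARY_PREFIX = "Summary of earlier conversation: "  # hypothetical marker text


def compact(history, summarize, max_messages=8, keep_last=2):
    """Fold old turns into one running summary once the history grows too long."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")

    def is_summary(m):
        return m["role"] == "system" and m["content"].startswith(SUMMARY_PREFIX)

    system = [m for m in history if m["role"] == "system" and not is_summary(m)]
    summaries = [m for m in history if is_summary(m)]
    rest = [m for m in history if m["role"] != "system"]
    if len(rest) <= max_messages:
        return history  # still small enough, nothing to do

    old, recent = rest[:-keep_last], rest[-keep_last:]
    # Feed any previous summary back in so earlier facts carry forward.
    note = {"role": "system", "content": SUMMARY_PREFIX + summarize(summaries + old)}
    return system + [note] + recent
```

Calling history = compact(history, summarize) before each request keeps the original system prompt, a single summary note, and the most recent turns. Each compaction costs one extra model call, which is the price of keeping the facts.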
“Memory” beyond one conversation. Persisting facts across sessions (a user’s preferences, past orders) is a database problem, not a model one: store the facts, then retrieve and inject the relevant ones — which is exactly the retrieval pattern of Section 19.
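As a rough illustration of "store, then retrieve and inject", the sketch below uses SQLite from the standard library. The table layout and the function names (open_store, remember, recall, start_history) are assumptions made for this example. Retrieval here simply returns a user's most recently stored facts; choosing the relevant ones is the subject of Section 19.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a tiny fact store. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        " user_id TEXT NOT NULL,"
        " fact TEXT NOT NULL,"
        " PRIMARY KEY (user_id, fact))"
    )
    return conn


def remember(conn, user_id, fact):
    # INSERT OR IGNORE keeps a repeated fact from being stored twice.
    conn.execute("INSERT OR IGNORE INTO facts (user_id, fact) VALUES (?, ?)", (user_id, fact))
    conn.commit()


def recall(conn, user_id, limit=5):
    """Return up to `limit` of the user's most recently stored facts."""
    rows = conn.execute(
        "SELECT fact FROM facts WHERE user_id = ? ORDER BY rowid DESC LIMIT ?",
        (user_id, limit),
    )
    return [row[0] for row in rows]


def start_history(conn, user_id, system_prompt):
    """Begin a new session with the stored facts injected into the system message."""
    content = system_prompt
    facts = recall(conn, user_id)
    if facts:
        content += "\nKnown facts about this user:\n" + "\n".join(f"- {f}" for f in facts)
    return [{"role": "system", "content": content}]
```

The model never sees the database. It only sees whatever text start_history places in the first message, which is why this is a storage and retrieval problem rather than a model feature.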
Security: Conversation history is attacker-reachable: a user can put anything into it, and you replay it every turn. Sanitize what you persist, and don’t trust an earlier turn just because it’s “in the history.”
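A minimal sketch of what sanitizing before persisting could mean is given below. The function name, the allowed roles, and the length cap are all assumptions. A check like this only keeps stored history well-formed and stops a client from planting a system message; it does not detect or neutralize prompt injection in the text itself.

```python
ALLOWED_ROLES = {"user", "assistant"}  # assumption: clients may never store "system"


def sanitize_turn(role, content, max_chars=4000):
    """Validate one turn before it is written to stored history."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"refusing to persist role {role!r}")
    if not isinstance(content, str):
        raise TypeError("content must be a string")
    # Drop control characters (keeping newline and tab), then cap the length.
    cleaned = "".join(ch for ch in content if ch in "\n\t" or ch.isprintable())
    return {"role": role, "content": cleaned[:max_chars]}


print(sanitize_turn("user", "hi\x00 there"))  # {'role': 'user', 'content': 'hi there'}
```

The system prompt is then always supplied by your own code at request time rather than read back from anything a user could have written.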
Challenges
- Prove statelessness. In work/chat.py, comment out the line that appends the assistant reply to history. Success: the model can no longer answer the packing follow-up sensibly, because it has no record of the season it recommended. (The user turns are still resent, so it still sees Japan; send only the system message and the current turn to erase that too.)
- Bound the cost. Apply sliding_window(history, keep_turns=2) before each call in a longer conversation. Success: prompt_tokens stops growing without breaking recent context.
- Summarize-and-continue. Build a loop that, once the history passes ~8 messages, summarizes the oldest turns into a system note and keeps only the last 2. Success: the assistant still “remembers” an early fact via the summary after many turns.
Recap
- The API is stateless: to hold a conversation you keep a messages list and resend it whole every turn (append user → send → append assistant).
- Resending the history makes prompt_tokens, and cost, grow each turn.
- Keep it bounded with a sliding window (drop old turns) and/or summarization (compress old turns into facts).
- Don’t resend reasoning_content; cross-session memory is a retrieval problem (Section 19).
Next
Section 13 — Tool / Function Calling: you’ll let the model call your code — defining a tool, watching the model ask to use it, running it, and feeding the result back.