Reasoning / "Thinking" Models
Goal: open up the thing gpt-oss-120b has been doing since Section 1 — thinking
before it answers. You’ll write scripts that reveal the model’s private reasoning, count
the reasoning tokens you pay for, and turn the reasoning_effort dial up and down to
see the trade-off.
Where this fits: this ties together the harmony template (Section 1), the usage
block (Section 2), and the token budget (Section 3). Reasoning tokens are why your
completion_tokens were sometimes bigger than the visible answer.
What a reasoning model does
A normal model maps your prompt straight to an answer. A reasoning model first generates a private chain of thought — working the problem, weighing options, checking itself — and then writes the final answer. You usually see only the answer, but the thinking happened, token by token.
Recall the harmony format from Section 1: that’s exactly what its channels are for.
gpt-oss-120b writes its reasoning into an analysis channel and its reply into a
final channel. The server separates them, usually surfacing the reasoning as
response.choices[0].message.reasoning_content (str) alongside the normal
response.choices[0].message.content (str).
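To make the channel split concrete, here is a toy sketch of what a reasoning-aware server does with the raw generated text. It assumes a simplified harmony-style transcript in which each message is tagged <|channel|>NAME<|message|>TEXT and ends with <|end|> or <|return|>. The function name split_channels and the sample string are illustrative only; real servers such as vLLM use their own parsers, not a regex like this one.

```python
import re

# Simplified harmony-style pattern: channel name, then message text up to a terminator.
# Assumption: real harmony output carries more header fields; this is a teaching sketch.
_CHANNEL_RE = re.compile(
    r"<\|channel\|>(\w+)<\|message\|>(.*?)(?:<\|end\|>|<\|return\|>|$)",
    re.DOTALL,
)


def split_channels(raw: str) -> dict[str, str]:
    """Group message text by channel name (e.g. 'analysis', 'final')."""
    channels: dict[str, str] = {}
    for name, text in _CHANNEL_RE.findall(raw):
        channels[name] = channels.get(name, "") + text
    return channels


# Hand-written sample transcript (not captured from a real model).
raw = (
    "<|channel|>analysis<|message|>Ball = x, bat = x + 1.00, so 2x = 0.10.<|end|>"
    "<|start|>assistant<|channel|>final<|message|>The ball costs $0.05.<|return|>"
)
parts = split_channels(raw)
print("reasoning_content:", parts.get("analysis"))
print("content:", parts.get("final"))
```

The analysis text is what an endpoint would typically hand back as reasoning_content, and the final text is what lands in content.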
This section needs a reasoning model on a reasoning-aware endpoint. It’s written for
gpt-oss-120b served by vLLM. If you point the course at a plain (non-reasoning) chat model, or at an endpoint that doesn’t surface the reasoning channel, the scripts still run, but reasoning_content comes back empty, reasoning_tokens reads 0 or n/a, and reasoning_effort may be ignored or rejected. That’s expected, not a bug: the code handles all of these (note the getattr(...) and try/except below). To see the behavior this lesson describes, you need an actual reasoning model. Everything else in the course works the same on any OpenAI-compatible endpoint.
See the thinking
Create work/reasoning.py with a classic trick question (the intuitive answer is
wrong — it’s $0.05, not $0.10):
from common import get_client, MODEL

client = get_client()

response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content":
        "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. "
        "How much does the ball cost?"}],
)

message = response.choices[0].message
# Not every endpoint exposes reasoning_content, so read it defensively.
reasoning = getattr(message, "reasoning_content", None)

print("=== reasoning (private thinking) ===")
print(reasoning or "(this endpoint did not expose reasoning_content)")
print("\n=== final answer ===")
print(message.content)
print("\n=== usage ===", response.usage)
Run it:
python work/reasoning.py
You’ll see the model talk itself through “if the ball is x, the bat is x + 1.00, so 2x + 1.00 = 1.10…” and then give the correct $0.05. That self-correction is the whole point of a reasoning model.
Whether you can see the reasoning depends on the endpoint: some expose
reasoning_content, some return only the final answer. The getattr(...) handles both. Either way, the thinking still happens, and you still pay for it. (Reference: examples/05/reasoning.py.)
Reasoning tokens cost real money and budget
The reasoning is generated text, so:
- It counts toward response.usage.completion_tokens (Section 2). Where the endpoint breaks it out, you’ll find it as response.usage.completion_tokens_details.reasoning_tokens (int).
- It spends the output side of the context budget (Section 3).
- It is billed like output (Section 10).
So a reasoning model is a trade: better answers on hard problems, at the cost of more tokens, latency, and money. The skill is spending that effort where it pays off.
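The accounting above can be sketched as a small helper that splits completion_tokens into thinking and visible answer. It relies on the point made in the list, that reasoning tokens are counted inside completion_tokens. The helper name split_completion_tokens and the numbers in fake_usage are made up for illustration, and the stand-in object only mimics the shape of response.usage so the sketch runs without an endpoint.

```python
from types import SimpleNamespace


def split_completion_tokens(usage):
    """Return (reasoning_tokens, visible_tokens); either is None if not reported."""
    total = getattr(usage, "completion_tokens", None)
    details = getattr(usage, "completion_tokens_details", None)
    reasoning = getattr(details, "reasoning_tokens", None) if details else None
    if total is None or reasoning is None:
        return reasoning, None
    # Assumption: reasoning tokens are included in completion_tokens.
    return reasoning, total - reasoning


# Stand-in for response.usage with invented numbers, so this runs offline.
fake_usage = SimpleNamespace(
    prompt_tokens=40,
    completion_tokens=350,
    completion_tokens_details=SimpleNamespace(reasoning_tokens=300),
)
reasoning, visible = split_completion_tokens(fake_usage)
print(f"reasoning={reasoning} visible={visible}")  # reasoning=300 visible=50
```

On a real response you would pass response.usage instead of fake_usage. When the endpoint does not break reasoning out, both values may come back as None.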
The reasoning_effort dial
gpt-oss lets you tune how hard it thinks with reasoning_effort (str): "low",
"medium", or "high". Lower = fewer reasoning tokens (faster, cheaper); higher = more
thorough (better on hard problems). Create work/effort.py:
from common import get_client, MODEL

client = get_client()

problem = ("Three friends split a bill. Ana pays twice what Ben pays, and Ben pays $4 "
           "less than Cara. If the total is $59, how much did each pay?")


def ask(effort: str):
    r = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": problem}],
        extra_body={"reasoning_effort": effort},  # forward an extra server field
    )
    # reasoning_tokens is optional: fall back to "n/a" when the endpoint omits it.
    details = getattr(r.usage, "completion_tokens_details", None)
    rtoks = getattr(details, "reasoning_tokens", "n/a") if details else "n/a"
    # content can be None on some responses, so guard before calling strip().
    return rtoks, (r.choices[0].message.content or "").strip()


for effort in ("low", "high"):
    try:
        rtoks, answer = ask(effort)
        print(f"=== {effort} === reasoning_tokens={rtoks}\n{answer[:160]}\n")
    except Exception as err:
        print(f"reasoning_effort={effort} not supported here: {err}\n")
Run it:
python work/effort.py
Expect "high" to spend more reasoning tokens than "low". Match the dial to the task:
low for simple lookups, high for multi-step math, logic, and tricky code.
extra_body is a useful escape hatch. The SDK only has named parameters for fields it knows. vLLM accepts extras such as reasoning_effort, top_k, and min_p, and extra_body passes them straight through. An endpoint that doesn’t support a field may reject it with a server error, which is why the example wraps the call in try/except. (Reference: examples/05/reasoning_effort.py.)
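One way to build on that escape hatch is to retry without the extras when the server rejects them, so the script still returns an answer. The sketch below is an assumption-laden illustration, not part of the course code: create_with_optional_extras is a hypothetical helper, and fake_client is an offline stand-in that rejects any extra_body so the fallback path can be exercised without a server. It does not detect endpoints that silently ignore a field.

```python
from types import SimpleNamespace


def create_with_optional_extras(client, extras: dict, **kwargs):
    """Try the call with extra server fields; retry without them if it fails.

    Returns (response, extras_used) so the caller knows which path ran.
    """
    try:
        return client.chat.completions.create(extra_body=extras, **kwargs), True
    except Exception as err:  # broad on purpose for the sketch; narrow it in real code
        print(f"extras rejected ({err}); retrying without them")
        return client.chat.completions.create(**kwargs), False


def _fake_create(**kwargs):
    # Stand-in server that rejects any extra_body, to exercise the fallback offline.
    if kwargs.get("extra_body"):
        raise ValueError("unknown field: reasoning_effort")
    message = SimpleNamespace(content="ok")
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

response, used = create_with_optional_extras(
    fake_client, {"reasoning_effort": "low"}, model="demo-model", messages=[],
)
print(used, response.choices[0].message.content)
```

With a real client you would pass get_client() and MODEL in place of the stand-ins. Because the except clause is broad, a network failure would also trigger the retry, so production code should catch only the SDK's request-rejected errors.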
Don’t feed the thinking back in
A rule for multi-turn conversations (Section 12): when you continue a conversation, send
back the final answer, not the reasoning. The reasoning was scratch work — it’s
large, it’s not meant to be re-consumed, and resending it just burns context and money.
Keep the assistant turn’s content; drop its reasoning_content.
Security: Reasoning tokens can contain intermediate guesses, scratch work, even quoted secrets. Don’t render raw reasoning to end users — show the final answer.
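The multi-turn rule above can be sketched as two small helpers: one builds the assistant turn from a response message using only its content, the other scrubs reasoning_content from a history that already contains it. Both function names are hypothetical, and the SimpleNamespace message is a stand-in for response.choices[0].message so the example runs offline.

```python
from types import SimpleNamespace


def assistant_turn_for_history(message) -> dict:
    """Build the assistant turn to append to `messages`: final answer only."""
    return {"role": "assistant", "content": getattr(message, "content", None) or ""}


def strip_reasoning(history: list[dict]) -> list[dict]:
    """Return a copy of a message history with any reasoning_content keys removed."""
    return [
        {key: value for key, value in turn.items() if key != "reasoning_content"}
        for turn in history
    ]


# Stand-in for response.choices[0].message (illustrative values).
msg = SimpleNamespace(
    content="The ball costs $0.05.",
    reasoning_content="Let x be the ball, then the bat is x + 1.00...",
)
history = [{"role": "user", "content": "How much does the ball cost?"}]
history.append(assistant_turn_for_history(msg))
print(history[-1])
```

The same history list is what you would send as messages on the next turn. The reasoning never enters it, so it costs no context and cannot leak into a later display.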
Challenges
- Catch a self-correction. Run work/reasoning.py on a different trick question and find where the reasoning considers and rejects a wrong answer. Success: you can quote the moment it course-corrects.
- Measure the tax. Print completion_tokens next to len(message.content) for a hard prompt. Success: you can state roughly how many tokens were thinking vs answer.
- Match effort to task. Add "medium" to work/effort.py, and also run a trivial prompt (“What’s 2+2?”) at "high". Success: you can argue which effort each task should use, and what the wrong choice cost.
Recap
- A reasoning model thinks first (the harmony analysis channel from Section 1), then answers; the thinking often surfaces as response.choices[0].message.reasoning_content.
- Reasoning is generated text: it’s in completion_tokens, spends the context budget (Section 3), and is billed (Section 10).
- reasoning_effort (str: low/medium/high) trades answer quality against cost and latency.
- In multi-turn chats, send back the final answer, not the reasoning.
Next
Section 6 — Handling & Validating Responses: text you can read is not data you can trust. You’ll make the model return structured JSON, constrain it to a schema, and validate it with Pydantic before your code relies on it.