Sampling Parameters (Seeing the Effect)
Goal: understand the knobs that control how the model chooses each word — and
watch them work by writing the experiments yourself. By the end you’ll know what
temperature, top_p, and seed do, when to use each, and why cranking temperature up
invites hallucination.
Where this fits: Sections 1–4 were about what you send and receive. This is the first lesson where you shape the model’s behavior.
How a model picks the next token
Here’s the idea everything hangs on. At each step the model doesn’t output a word — it
outputs a probability distribution over all possible next tokens: maybe "blue" at
60%, "dark" at 12%, "quiet" at 8%, and a long tail of everything else.
Sampling turns that distribution into an actual choice. The parameters below all change how adventurous the choice is — from “always take the most likely token” to “happily pick from the unlikely tail.”
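To make that concrete, here is a minimal local sketch (no API call) of the two extremes. It assumes a toy distribution built from the percentages above; the "<other>" bucket and the function names are illustrative stand-ins, not anything a real server exposes.

```python
import random

# Toy next-token distribution using the percentages from the text. The
# "<other>" bucket stands in for the long tail (an illustrative simplification).
next_token_probs = {"blue": 0.60, "dark": 0.12, "quiet": 0.08, "<other>": 0.20}


def pick_greedy(probs: dict[str, float]) -> str:
    """Always return the single most likely token."""
    return max(probs, key=probs.get)


def pick_sampled(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the demo is repeatable
print("greedy :", pick_greedy(next_token_probs))
print("sampled:", [pick_sampled(next_token_probs, rng) for _ in range(8)])
```

Greedy picking returns "blue" every time; weighted sampling returns "blue" most of the time but occasionally reaches into the rest of the distribution.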
The knobs
temperature (the main one)
Temperature (float, usually 0.0–2.0) rescales the distribution before sampling:
- 0.0: greedy. Always take the most likely token. Deterministic. Best for facts, extraction, classification, code.
- ~0.7: balanced. Some variety, still coherent. Good for chat and general writing.
- >1.0: flattens the distribution, making unlikely tokens more probable. More creative and, far enough, incoherent. This is where the model wanders and makes things up.
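The rescaling can be sketched locally. This assumes the common formulation in which logits are divided by the temperature before the softmax, which is equivalent to raising each probability to the power 1/T and renormalizing. The toy distribution and the `apply_temperature` name are illustrative, and a real server may handle edge cases differently.

```python
# Toy distribution from the text; "<other>" stands in for the long tail.
next_token_probs = {"blue": 0.60, "dark": 0.12, "quiet": 0.08, "<other>": 0.20}


def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Return the distribution after temperature scaling (all probs must be > 0)."""
    if temperature <= 0:
        # Treat 0 as greedy: all probability mass goes to the top token.
        top = max(probs, key=probs.get)
        return {t: (1.0 if t == top else 0.0) for t in probs}
    # softmax(logits / T) is the same as p ** (1 / T), renormalized.
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}


for temp in (0.0, 0.7, 1.0, 1.3, 2.0):
    rescaled = apply_temperature(next_token_probs, temp)
    summary = ", ".join(f"{t}={p:.2f}" for t, p in rescaled.items())
    print(f"T={temp}: {summary}")
```

Below 1.0 the top token's share grows; above 1.0 it shrinks and the unlikely tokens gain. In a real vocabulary the tail is thousands of tokens rather than one bucket, so the shift toward unlikely choices is more pronounced than this toy suggests.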
The rest, briefly
- top_p (float, 0.0–1.0): nucleus sampling. Keep the smallest set of tokens whose probabilities sum to p, then sample from those. An alternative lever to temperature: change one or the other, not both.
- top_k (int): keep only the k most likely tokens.
- min_p (float): keep tokens above a minimum probability relative to the top one.
- frequency_penalty / presence_penalty (float): push the model away from repeating tokens it already used.
- seed (int): pin the random choices so a run is reproducible (below).
Practical default: start at temperature=0 whenever correctness matters; raise it
only when you want variety. Change one knob, observe, repeat.
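The three truncation knobs (top_k, top_p, min_p) can be sketched as filters over a toy distribution. This is a minimal illustration under stated assumptions: the first three tokens and their probabilities come from the text, the remaining tail tokens are invented for the example, and the function names are hypothetical. Servers differ in the order in which they apply these filters and in how they break ties.

```python
# First three entries come from the text; the tail tokens are invented.
probs = {
    "blue": 0.60, "dark": 0.12, "quiet": 0.08,
    "loud": 0.07, "green": 0.06, "soup": 0.04, "zebra": 0.03,
}


def renormalize(kept: dict[str, float]) -> dict[str, float]:
    """Rescale the surviving tokens so their probabilities sum to 1."""
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}


def by_probability(dist: dict[str, float]) -> list[tuple[str, float]]:
    """Tokens sorted from most to least likely."""
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)


def top_k_filter(dist: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k most likely tokens."""
    return renormalize(dict(by_probability(dist)[:k]))


def top_p_filter(dist: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose probabilities reach p."""
    kept, cumulative = {}, 0.0
    for token, prob in by_probability(dist):
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


def min_p_filter(dist: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep tokens whose probability is at least min_p times the top one."""
    threshold = min_p * max(dist.values())
    return renormalize({t: q for t, q in dist.items() if q >= threshold})


print("top_k=4    :", sorted(top_k_filter(probs, 4)))
print("top_p=0.7  :", sorted(top_p_filter(probs, 0.7)))
print("min_p=0.125:", sorted(min_p_filter(probs, 0.125)))
```

All three cut off the tail; they differ in how the cut-off is chosen: a fixed count, a cumulative probability, or a threshold relative to the top token.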
Watch temperature work
Reading about it isn’t the same as seeing it. Create work/temperature.py:
from openai import BadRequestError

from common import get_client, MODEL

client = get_client()
prompt = [{"role": "user", "content": "In one sentence, describe a city at night."}]


def generate(temperature: float) -> str:
    try:
        r = client.chat.completions.create(
            # 120 tokens of room so a reasoning model still has space for a
            # visible sentence after it finishes thinking (Sections 2-4).
            model=MODEL, messages=prompt, temperature=temperature, max_tokens=120,
        )
    except BadRequestError:
        # Servers disagree on the maximum temperature: some accept up to 2.0, some
        # higher, some reject anything above 1.0. Don't crash -- report and move on.
        return f"(this server rejected temperature={temperature})"
    # content can be None/empty on a reasoning model that spent the budget thinking.
    return (r.choices[0].message.content or "").strip()


for temp in (0.0, 0.7, 1.3):
    print(f"\n=== temperature = {temp} ===")
    print("1:", generate(temp))  # two samples each, so you can compare
    print("2:", generate(temp))
Run it (a few times):
python work/temperature.py
What to look for:
- At 0.0, the two samples should be identical: greedy decoding always takes the top token. (A server can still introduce small differences; see the seed caveat below.)
- At 0.7, they differ but both read well.
- At 1.3, they differ a lot, and you may see odd word choices or the sentence losing the thread. That drift is one mechanism behind hallucination: higher temperature means more willingness to pick a low-probability (often wrong) token. When people say “lower the temperature to reduce hallucinations,” this is what they mean.
The maximum temperature is server-specific. The OpenAI API caps temperature at 2.0; some servers accept more, and a few reject anything above 1.0. That’s why generate() above catches the error instead of crashing. Run python scripts/preflight.py to see your endpoint’s ceiling, and stay at or below it.
Reasoning-model note. Temperature affects both gpt-oss-120b’s thinking and its final text. For factual tasks, low temperature plus the model’s own reasoning (Section 6) is a strong combination. (Reference: examples/05/temperature_sweep.py.)
Reproducibility with seed
Sometimes you want randomness and the ability to reproduce a specific run — for a test,
a bug report, or a cache key. Create work/seed.py:
from common import get_client, MODEL

client = get_client()
prompt = [{"role": "user", "content": "Invent a name for a coffee shop."}]


def generate(seed: int) -> str:
    r = client.chat.completions.create(
        model=MODEL, messages=prompt, temperature=1.0, seed=seed, max_tokens=20,
    )
    # content can be None on a reasoning model that spent the budget thinking,
    # so fall back to an empty string instead of crashing on .strip().
    return (r.choices[0].message.content or "").strip()


print("seed 42:", generate(42))
print("seed 42:", generate(42))  # should match the line above
print("seed 99:", generate(99))  # should differ

Run it:
python work/seed.py
Same seed (int) + same inputs → same output. Change the seed → different output.
Caveat: seeds are best-effort, not a guarantee. Server batching, load, or a config change can still nudge results. Your clue that the underlying setup changed is response.system_fingerprint (Section 2): if it changes, identical inputs may diverge. (Reference: examples/05/seed_demo.py.)
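The idea behind a seed can be tried locally without any server. The sketch below is only an analogy using Python's own random module: the word lists and the `invent_names` helper are made up for illustration, and nothing here describes how an inference server implements `seed` internally.

```python
import random

# Hypothetical word lists, standing in for the model's possible choices.
ADJECTIVES = ["Copper", "Quiet", "Midnight", "Paper", "Velvet", "Salt"]
NOUNS = ["Kettle", "Lantern", "Finch", "Harbor", "Orchard", "Anvil"]


def invent_names(seed: int, count: int = 6) -> list[str]:
    """Draw `count` random names from a generator pinned to `seed`."""
    rng = random.Random(seed)  # private generator, so no global state leaks in
    return [f"{rng.choice(ADJECTIVES)} {rng.choice(NOUNS)}" for _ in range(count)]


print("seed 42:", invent_names(42))
print("seed 42:", invent_names(42))  # same seed, same list
print("seed 99:", invent_names(99))  # different seed, almost certainly a different list
```

The draws are still random in the sense that you did not choose them, but pinning the generator makes the same sequence replayable. Unlike this local generator, a server offers that property only on a best-effort basis.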
Security: A fixed seed and low temperature make behavior reproducible, which is what lets a safety test reliably catch a regression. Non-determinism hides bugs.
Challenges
- Find the breaking point. In work/temperature.py, push the sweep higher: add 1.8, then 2.0, then keep going until your server rejects the value. Success: you can name both the temperature at which output becomes nonsense and your server’s maximum accepted temperature (the point where generate() reports a rejection).
- Swap the lever. Write a version that sweeps top_p over 0.3, 0.7, 1.0 at fixed temperature=1.0. Success: you can describe how top_p feels different from temperature.
- Determinism for facts. Ask “What is the capital of Australia?” five times at temperature=0 and five times at 1.5. Success: the temperature=0 answers are identical and correct; the hot ones vary.
Recap
- The model emits a probability distribution; sampling parameters decide how boldly you pick from it.
- temperature (float): 0 = deterministic/factual; ~0.7 = balanced; >1 = creative and, eventually, incoherent. It is the dial most tied to hallucination.
- top_p / top_k / min_p are alternative truncation levers: change one, not several.
- seed (int) makes a run reproducible (best-effort; watch system_fingerprint).
Next
Section 6 — Reasoning / “Thinking” Models: we open up what gpt-oss-120b has been
doing all along — thinking before it answers — and look at reasoning tokens, the
reasoning_effort dial, and what it costs.