Sampling Parameters (Seeing the Effect)
Goal: understand the knobs that control how the model chooses each word — and
watch them work by writing the experiments yourself. By the end you’ll know what
temperature, top_p, and seed do, when to use each, and why cranking temperature up
invites hallucination.
Where this fits: Sections 1–3 were about what you send and receive. This is the first lesson where you shape the model’s behavior.
How a model picks the next token
Here’s the idea everything hangs on. At each step the model doesn’t output a word — it
outputs a probability distribution over all possible next tokens: maybe "blue" at
60%, "dark" at 12%, "quiet" at 8%, and a long tail of everything else.
Sampling turns that distribution into an actual choice. The parameters below all change how adventurous the choice is — from “always take the most likely token” to “happily pick from the unlikely tail.”
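To make the idea concrete, here is a minimal sketch in plain Python that uses the made-up percentages from the example above. The `<other>` bucket standing in for the long tail and the helper names `pick_greedy` and `pick_sampled` are assumptions for illustration; a real model works over tens of thousands of tokens.

```python
import random

# Toy next-token distribution from the text; "<other>" lumps the long tail together.
NEXT_TOKEN_PROBS = {"blue": 0.60, "dark": 0.12, "quiet": 0.08, "<other>": 0.20}


def pick_greedy(probs: dict[str, float]) -> str:
    """Always take the single most likely token."""
    return max(probs, key=probs.get)


def pick_sampled(probs: dict[str, float], rng: random.Random) -> str:
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print("greedy :", pick_greedy(NEXT_TOKEN_PROBS))
    print("sampled:", [pick_sampled(NEXT_TOKEN_PROBS, rng) for _ in range(10)])
```

The greedy pick never changes, while the sampled picks mostly land on the likely token and occasionally reach into the tail.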
The knobs
temperature (the main one)
Temperature (float, usually 0.0–2.0) rescales the distribution before sampling:
- 0.0: greedy. Always take the most likely token. Deterministic. Best for facts, extraction, classification, code.
- ~0.7: balanced. Some variety, still coherent. Good for chat and general writing.
- >1.0: flattens the distribution, making unlikely tokens more probable. More creative and, far enough, incoherent. This is where the model wanders and makes things up.
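One common way to implement this rescaling is to divide the model's raw scores (logits) by the temperature before the softmax. The sketch below assumes that approach and reuses the toy distribution from earlier; `apply_temperature` and the token names are hypothetical, and a serving stack may differ in detail. Temperature 0 is treated as a special case (plain argmax), since dividing by zero is undefined.

```python
import math

# Toy probabilities; logits are their logs, so temperature 1.0 reproduces them.
BASE_PROBS = {"blue": 0.60, "dark": 0.12, "quiet": 0.08, "<other>": 0.20}
LOGITS = {tok: math.log(p) for tok, p in BASE_PROBS.items()}


def apply_temperature(logits: dict[str, float], temperature: float) -> dict[str, float]:
    """Softmax over logits / temperature. Requires temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature 0 means greedy: take the argmax instead")
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


if __name__ == "__main__":
    for temp in (0.5, 1.0, 2.0):
        probs = apply_temperature(LOGITS, temp)
        print(temp, {tok: round(p, 3) for tok, p in probs.items()})
```

Below 1.0 the top token's share grows; above 1.0 the distribution flattens and the tail gains probability.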
The rest, briefly
- top_p (float, 0.0–1.0): nucleus sampling. Keep the smallest set of tokens whose probabilities sum to at least p, then sample from those. An alternative lever to temperature: change one or the other, not both.
- top_k (int): keep only the k most likely tokens.
- min_p (float): keep tokens above a minimum probability relative to the top one.
- frequency_penalty / presence_penalty (float): push the model away from repeating tokens it already used.
- seed (int): pin the random choices so a run is reproducible (below).
Practical default: start at temperature=0 whenever correctness matters; raise it
only when you want variety. Change one knob, observe, repeat.
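The three truncation knobs are easier to compare side by side. The following is a rough sketch on a toy distribution, assuming each filter drops tokens and then renormalizes what is left; the function names and token list are made up for illustration, and real servers may order or combine these steps differently.

```python
# Toy distribution over individual tokens (values sum to 1.0).
PROBS = {
    "blue": 0.60, "dark": 0.12, "quiet": 0.08, "cold": 0.07,
    "empty": 0.06, "loud": 0.04, "purple": 0.03,
}


def renormalize(probs: dict[str, float]) -> dict[str, float]:
    """Rescale the surviving tokens so their probabilities sum to 1."""
    total = sum(probs.values())
    return {tok: p / total for tok, p in probs.items()}


def top_k_filter(probs: dict[str, float], k: int) -> dict[str, float]:
    """Keep only the k most likely tokens."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    return renormalize(dict(ranked[:k]))


def top_p_filter(probs: dict[str, float], p: float) -> dict[str, float]:
    """Keep the smallest top-ranked set whose cumulative probability reaches p."""
    kept, cumulative = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cumulative += prob
        if cumulative >= p:
            break
    return renormalize(kept)


def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep tokens whose probability is at least min_p times the top probability."""
    threshold = min_p * max(probs.values())
    return renormalize({tok: p for tok, p in probs.items() if p >= threshold})


if __name__ == "__main__":
    print("top_k=2   :", list(top_k_filter(PROBS, 2)))
    print("top_p=0.75:", list(top_p_filter(PROBS, 0.75)))
    print("min_p=0.15:", list(min_p_filter(PROBS, 0.15)))
```

Note how top_k keeps a fixed count, while top_p and min_p adapt to the shape of the distribution: a confident distribution leaves fewer survivors.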
Watch temperature work
Reading about it isn’t the same as seeing it. Create work/temperature.py:
from common import get_client, MODEL

client = get_client()
prompt = [{"role": "user", "content": "In one sentence, describe a city at night."}]

def generate(temperature: float) -> str:
    r = client.chat.completions.create(
        model=MODEL, messages=prompt, temperature=temperature, max_tokens=60,
    )
    return r.choices[0].message.content.strip()

for temp in (0.0, 0.7, 1.3):
    print(f"\n=== temperature = {temp} ===")
    print("1:", generate(temp))  # two samples each, so you can compare
    print("2:", generate(temp))
Run it (a few times):
python work/temperature.py
What to look for:
- At 0.0, the two samples are identical — greedy decoding is deterministic.
- At 0.7, they differ but both read well.
- At 1.3, they differ a lot, and you may see odd word choices or the sentence losing the thread. That drift is the same mechanism behind hallucination: higher temperature means more willingness to pick a low-probability (often wrong) token. When people say “lower the temperature to reduce hallucinations,” this is what they mean.
Reasoning-model note. Temperature affects both gpt-oss-120b’s thinking and its final text. For factual tasks, low temperature plus the model’s own reasoning (Section 5) is a strong combination. (Reference: examples/04/temperature_sweep.py.)
Reproducibility with seed
Sometimes you want randomness and the ability to reproduce a specific run — for a test,
a bug report, or a cache key. Create work/seed.py:
from common import get_client, MODEL

client = get_client()
prompt = [{"role": "user", "content": "Invent a name for a coffee shop."}]

def generate(seed: int) -> str:
    r = client.chat.completions.create(
        model=MODEL, messages=prompt, temperature=1.0, seed=seed, max_tokens=20,
    )
    return r.choices[0].message.content.strip()

print("seed 42:", generate(42))
print("seed 42:", generate(42))  # should match the line above
print("seed 99:", generate(99))  # should differ
Run it:
python work/seed.py
Same seed (int) + same inputs → same output. Change the seed → different output.
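The same principle can be seen locally, without a model, using Python's own random number generator. This is only an analogy for what a seed does on the server: the shop names, weights, and `sample_names` helper below are invented for the sketch.

```python
import random

# Hypothetical candidates and weights, standing in for a model's distribution.
NAMES = ["Bean There", "Daily Grind", "Steam & Stone", "Second Cup", "Roast Office"]
WEIGHTS = [0.35, 0.25, 0.20, 0.12, 0.08]


def sample_names(seed: int, n: int = 5) -> list[str]:
    """Draw n weighted picks from a private generator pinned to `seed`."""
    rng = random.Random(seed)  # separate generator, so global state is untouched
    return rng.choices(NAMES, weights=WEIGHTS, k=n)


if __name__ == "__main__":
    print("seed 42:", sample_names(42))
    print("seed 42:", sample_names(42))  # same seed, same sequence of picks
    print("seed 99:", sample_names(99))  # different seed, different sequence
```

The picks are still random in the sense that they follow the weights; the seed only fixes which random sequence you get.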
Caveat: seeds are best-effort, not a guarantee. Server batching, load, or a config change can still nudge results. Your clue that the underlying setup changed is response.system_fingerprint (Section 2): if it changes, identical inputs may diverge. (Reference: examples/04/seed_demo.py.)
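A small helper can make the fingerprint check routine when you compare two seeded runs. The sketch below is an assumption-laden illustration: `fingerprints_match` is a hypothetical name, it only reads the `system_fingerprint` attribute from each response, and it treats a missing fingerprint as "cannot confirm", since some servers leave the field empty. The stand-in objects in the usage lines are fakes, not real API responses.

```python
from types import SimpleNamespace


def fingerprints_match(resp_a, resp_b) -> bool:
    """True only if both responses report the same, non-empty system_fingerprint."""
    fp_a = getattr(resp_a, "system_fingerprint", None)
    fp_b = getattr(resp_b, "system_fingerprint", None)
    if not fp_a or not fp_b:
        return False  # fingerprint missing: the backend setup cannot be confirmed
    return fp_a == fp_b


if __name__ == "__main__":
    # Fake responses for illustration; in practice pass two chat completion results.
    run_1 = SimpleNamespace(system_fingerprint="fp_example_a")
    run_2 = SimpleNamespace(system_fingerprint="fp_example_a")
    run_3 = SimpleNamespace(system_fingerprint="fp_example_b")
    print(fingerprints_match(run_1, run_2))
    print(fingerprints_match(run_1, run_3))
```

If the helper returns False for two runs that used the same seed and inputs, a mismatch in output is more likely a backend change than a bug in your code.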
Security: A fixed seed and low temperature make behavior reproducible, which is what lets a safety test reliably catch a regression. Non-determinism hides bugs.
Challenges
- Find the breaking point. In work/temperature.py, add 1.8 and 2.0 to the sweep. Success: you can name the temperature at which output becomes nonsense.
- Swap the lever. Write a version that sweeps top_p over 0.3, 0.7, 1.0 at fixed temperature=1.0. Success: you can describe how top_p feels different from temperature.
- Determinism for facts. Ask “What is the capital of Australia?” five times at temperature=0 and five times at 1.5. Success: the temperature=0 answers are identical and correct; the hot ones vary.
Recap
- The model emits a probability distribution; sampling parameters decide how boldly you pick from it.
- temperature (float): 0 = deterministic/factual; ~0.7 = balanced; >1 = creative and, eventually, incoherent. It is the dial most tied to hallucination.
- top_p / top_k / min_p are alternative truncation levers: change one, not several.
- seed (int) makes a run reproducible (best-effort; watch system_fingerprint).
Next
Section 5 — Reasoning / “Thinking” Models: we open up what gpt-oss-120b has been
doing all along — thinking before it answers — and look at reasoning tokens, the
reasoning_effort dial, and what it costs.