Hello World
Goal: write, by hand, your first two programs that talk to a language model — one using raw HTTP, one using the official SDK — and build the right mental model of what’s happening on the wire. You’ll finish understanding the single most important idea in the course: the difference between the messages you write and the string of tokens the model actually sees.
How this course works (read once). You write the code. Each section walks you through building small scripts yourself, step by step, running them as you go. You’ll write your files in the work/ folder. A complete reference solution for everything you build lives under examples/NN/; peek if you get stuck, but type it yourself first. That’s the hands-on part, and it’s where the learning happens.
Security is a through-line. Safety isn’t one chapter at the end — every section closes with a short Security note tied to that topic, because the mistakes that matter (leaked keys, prompt injection, running untrusted output) show up the moment you use each feature. Isolation is big enough to earn its own two sections (15–16, Sandboxing), and the defenses come together in Section 20.
Where this fits: this is lesson 1 of 24 — everything else builds on it. We start at the bottom on purpose. Once you’ve sent the literal request yourself, every later abstraction will feel like a convenience, not magic.
What you’re talking to
Throughout this course you’ll talk to a hosted, OpenAI-compatible inference server. Two phrases there matter:
- Inference server: a program that has a model loaded and answers requests. Ours runs vLLM, a fast serving engine, and serves gpt-oss-120b. You don’t run it; it’s hosted for you. You just send it HTTP requests.
- OpenAI-compatible: it speaks the same HTTP API that OpenAI popularized (POST /v1/chat/completions, GET /v1/models, and so on). Learn this protocol once and the same code talks to OpenAI, vLLM, llama.cpp, and dozens of others.
So “calling a model” is, mechanically, just an HTTP POST with some JSON. That’s the demystification we’re after — and you’re about to do it by hand.
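To make that concrete, here is a minimal sketch that builds the request without sending it, so it runs with no server and no network. It uses the standard library's urllib.request instead of the requests library used later in this lesson, and the host, token, and model id are placeholder values rather than real ones.

```python
import json
import urllib.request

# Placeholder values for illustration only; your real ones come from .env.
base_url = "https://your-host/v1"
api_key = "sk-placeholder"
model = "openai/gpt-oss-120b"

# The request body is plain JSON, encoded to bytes.
body = json.dumps({
    "model": model,
    "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
}).encode("utf-8")

# Build (but do not send) the request object.
request = urllib.request.Request(
    url=f"{base_url}/chat/completions",
    data=body,
    headers={
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    },
    method="POST",
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Everything a chat call needs is visible here: a method, a URL, two headers, and a JSON body.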
Get set up
Make sure you’ve done the one-time setup from the README: installed the dependencies, copied .env.example to .env and filled in your endpoint details, and created a work/ folder. You’ll write this section’s code in work/.
Your scripts read three values from the environment:
| Variable | What it is | Example |
|---|---|---|
| OPENAI_BASE_URL | Where the server lives. Ends in /v1. | https://your-host/v1 |
| OPENAI_API_KEY | Your auth token, sent as Authorization: Bearer <token>. | sk-... |
| MODEL | The id of the model the server serves. | openai/gpt-oss-120b |
Load them into your shell and confirm they’re set:
set -a; source .env; set +a
python -c "import os; print('MODEL =', os.environ['MODEL'])"
If that prints your model id, you’re ready. If it errors with a KeyError, your .env
isn’t loaded — redo the set -a; source .env; set +a line.
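If you would rather check all three variables at once, a small script along these lines could do it. This is a sketch, not part of the course's reference code: the helper name missing_vars is hypothetical, and it treats an empty value the same as a missing one.

```python
import os

REQUIRED = ("OPENAI_BASE_URL", "OPENAI_API_KEY", "MODEL")

def missing_vars(env) -> list[str]:
    """Return the required variable names that are unset or empty in env."""
    return [name for name in REQUIRED if not env.get(name)]

if __name__ == "__main__":
    missing = missing_vars(os.environ)
    if missing:
        print("missing:", ", ".join(missing))
    else:
        # Print only non-secret values; never echo the API key.
        print("MODEL =", os.environ["MODEL"])
```

Passing the environment in as an argument keeps the check easy to try against a plain dict.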
The /v1 in the base URL is part of the path: the chat endpoint is OPENAI_BASE_URL + /chat/completions. Note also that our hosted endpoint needs a real token, unlike many local servers, which accept any dummy value.
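The path-joining rule can be sketched in a few lines. The helper name chat_url is hypothetical and the host is a placeholder; the last line shows one way the /v1 segment can get lost if you reach for urljoin instead of plain concatenation.

```python
from urllib.parse import urljoin

def chat_url(base_url: str) -> str:
    """Append the chat endpoint to a base URL, tolerating a trailing slash."""
    return base_url.rstrip("/") + "/chat/completions"

print(chat_url("https://your-host/v1"))
print(chat_url("https://your-host/v1/"))

# Pitfall: urljoin treats a leading "/" as "replace the whole path",
# so the /v1 segment is dropped.
print(urljoin("https://your-host/v1", "/chat/completions"))
```

Stripping the trailing slash first means both spellings of the base URL produce the same endpoint.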
Write your first call — raw HTTP
We’ll use only the requests library, so nothing is hidden. Create a new file, work/hello.py, and write this first piece:
import os
import requests
base_url = os.environ["OPENAI_BASE_URL"].rstrip("/")
api_key = os.environ["OPENAI_API_KEY"]
model = os.environ["MODEL"]
print("talking to:", base_url, "as", model)
Run it from the repo root:
python work/hello.py
You should see your endpoint and model printed. No call yet — we just confirmed the settings load. Run early and often; that’s the rhythm of this course.
Now add the request. Append to work/hello.py:
payload = {
    "model": model,
    "messages": [
        {"role": "system", "content": "You are a concise, friendly assistant."},
        {"role": "user", "content": "Say hello in one short sentence."},
    ],
}
headers = {
    "Authorization": f"Bearer {api_key}",
    "Content-Type": "application/json",
}
response = requests.post(
    f"{base_url}/chat/completions",
    headers=headers,
    json=payload,
    timeout=30,
)
response.raise_for_status() # turn an HTTP error into a Python exception
data = response.json()
print(data["choices"][0]["message"]["content"])
Run it again:
python work/hello.py
You should get a one-sentence greeting back. You just called a language model by hand. Let’s make sure you know what every piece did:
Hit an SSL error instead? If this fails with CERTIFICATE_VERIFY_FAILED (or the SDK’s vaguer Connection error.), your machine can’t verify the endpoint’s certificate, which is common behind a corporate proxy. See “Troubleshooting: SSL / certificates” in the README: usually setting SSL_CERT_FILE to your CA bundle in your .env fixes it.
- payload is the entire request body. The important part is messages, a list of turns, each with a role and content (more on roles below).
- headers carry your token (Authorization: Bearer ...) and say the body is JSON.
- requests.post(...) sends it to <base_url>/chat/completions.
- response.raise_for_status() makes the program fail loudly if the server returned an error (a wrong token, a bad model id) instead of silently continuing.
- data["choices"][0]["message"]["content"] digs out the reply. We’ll dissect that shape thoroughly in Section 2; for now, that’s where the text lives.
See the whole response
The reply is one field of a larger object. Temporarily change your last line to print all of it:
import json
print(json.dumps(data, indent=2))
Run once more and read the JSON. You’ll see choices, a usage block with token
counts, and more. Don’t memorize it yet — Section 2 is devoted to it — just notice the
reply is a small part of a structured envelope.
Reference: a complete version of what you just wrote is examples/01/raw_http.py (it prints the full JSON, the reply, and usage). Compare it with your work/hello.py.
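If you want to practice navigating the envelope offline, the sketch below parses a hand-written sample. Every value in it is made up for illustration, and a real response from your server will carry more fields than the few shown here.

```python
import json

# Hand-written sample with made-up values; not captured from a real server.
sample = json.loads("""
{
  "id": "chatcmpl-example",
  "object": "chat.completion",
  "model": "openai/gpt-oss-120b",
  "choices": [
    {
      "index": 0,
      "message": {"role": "assistant", "content": "Hello there!"},
      "finish_reason": "stop"
    }
  ],
  "usage": {"prompt_tokens": 20, "completion_tokens": 4, "total_tokens": 24}
}
""")

# The reply text is one field nested inside the first choice.
reply = sample["choices"][0]["message"]["content"]
usage = sample["usage"]

print(reply)
print(usage["prompt_tokens"], "+", usage["completion_tokens"], "=", usage["total_tokens"])
```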
Write the same call — with the SDK
Writing the HTTP by hand was the point. Now see what the official openai client buys
you: it builds the same JSON and POSTs to the same endpoint, so you skip the plumbing.
Create work/hello_sdk.py:
import os
from openai import OpenAI
client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)
response = client.chat.completions.create(
    model=os.environ["MODEL"],
    messages=[
        {"role": "system", "content": "You are a concise, friendly assistant."},
        {"role": "user", "content": "Say hello in one short sentence."},
    ],
)
print(response.choices[0].message.content)
Run it:
python work/hello_sdk.py
Same greeting. Look closely at response.choices[0].message.content — the identical
path you used on the raw JSON, now as typed attributes instead of dictionary keys.
The SDK is a thin wrapper over the HTTP you already wrote. That’s why doing it by
hand first matters: when something breaks later, you know exactly what’s underneath.
Reference: examples/01/with_sdk.py. From here on we’ll mostly use the SDK, dropping to raw HTTP whenever the wire format teaches something.
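The "same path, different access style" point can be seen without the SDK at all. The sketch below parses one made-up JSON string twice, once into dictionaries and once into attribute-style objects. It is only an illustration: the real SDK uses its own typed response classes, not SimpleNamespace.

```python
import json
from types import SimpleNamespace

# Made-up JSON, trimmed to just the path we care about.
raw = '{"choices": [{"message": {"role": "assistant", "content": "Hello there!"}}]}'

# Dictionary access, as in work/hello.py.
as_dict = json.loads(raw)

# Attribute access, mimicking the shape of the SDK's response object.
as_obj = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))

print(as_dict["choices"][0]["message"]["content"])
print(as_obj.choices[0].message.content)
```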
Roles: the “sections” of a prompt
Each entry in messages has a role. You’ve already used the core three:
- system: standing instructions (who the assistant is, its tone, its rules). Set once, applies to the whole conversation.
- user: a turn from the human.
- assistant: a turn from the model. You also write assistant messages yourself to replay a conversation or to show worked examples (we’ll lean on this in Section 11).
Other roles you’ll meet later have become standard too:
- tool: results handed back from a tool the model asked to call. It replaced the older function role, now deprecated. (Section 13.)
- developer: a newer role from reasoning models; essentially system’s successor in an explicit platform > developer > user chain of command. (Section 5.)
For now: system, user, assistant is enough to hold a conversation.
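A typo in a role name is an easy mistake to make when writing messages by hand. The sketch below is a hypothetical local check (the name check_roles is not part of any SDK) that accepts only the three core roles; the server does its own validation, so treat this purely as a learning aid.

```python
CORE_ROLES = {"system", "user", "assistant"}

def check_roles(messages: list[dict]) -> None:
    """Raise ValueError if any message has an unknown role or non-string content."""
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in CORE_ROLES:
            raise ValueError(f"message {i}: unknown role {role!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError(f"message {i}: content must be a string")

messages = [
    {"role": "system", "content": "You are a concise, friendly assistant."},
    {"role": "user", "content": "Say hello in one short sentence."},
]
check_roles(messages)  # passes silently

try:
    check_roles([{"role": "usr", "content": "hi"}])  # typo in the role name
except ValueError as err:
    print("caught:", err)
```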
The one idea to take away: messages vs. the template
This is the most important concept in the lesson. There are two representations of the same conversation:
1. What you write: a tidy JSON list of {role, content} messages.
2. What the model receives: a single, flat string of tokens. The model has no concept of “roles” or “a list.” It only ever sees text.
The thing that converts #1 into #2 is the chat template — a small template (in Jinja2) that ships inside the model’s own tokenizer configuration. The server renders your messages through it. For a model in the widely-used “ChatML” format (easy to read), your two messages become this exact string:
<|im_start|>system
You are a concise, friendly assistant.<|im_end|>
<|im_start|>user
Say hello in one short sentence.<|im_end|>
<|im_start|>assistant
- <|im_start|> / <|im_end|> are special tokens marking boundaries; the role name is just text after the marker.
- The dangling <|im_start|>assistant at the end is the generation prompt: it tells the model “your turn, continue from here.”
Different model families use different templates. Our model, gpt-oss-120b, uses
OpenAI’s harmony format, which is richer than ChatML — it even gives the model
separate channels for private reasoning and the final answer (that reasoning is the
subject of Section 5). So the real rendered string for our model won’t look like the tidy
ChatML above. That’s the point: same messages, a different rendered string for every
model — which is why the template belongs to the model, not to the API. The server
applies the right one automatically; you never hand-write delimiters. But knowing they’re
there is what separates “it just works” from “I understand why.”
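The ChatML rendering shown earlier in this section can be reproduced with a few lines of Python. This is a hand-written stand-in for illustration only: real chat templates are written in Jinja2, may handle details this ignores, and the harmony format used by our model looks different. The function name render_chatml is hypothetical.

```python
def render_chatml(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Flatten role-tagged messages into a single ChatML-style string."""
    parts = []
    for message in messages:
        # Each turn: start marker, role name, newline, content, end marker.
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # The dangling assistant header that invites the model to continue.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

messages = [
    {"role": "system", "content": "You are a concise, friendly assistant."},
    {"role": "user", "content": "Say hello in one short sentence."},
]
print(render_chatml(messages))
```

Notice that the roles survive only as ordinary text after a marker. Nothing list-like remains in the output.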
Prove it yourself — measure the tokens
You can’t easily print the rendered string without the right tokenizer, and this
course uses no local tokenizer (no Hugging Face downloads, no tiktoken). But you can
measure the result through the server. Create work/tokens.py:
import os
from openai import OpenAI
client = OpenAI(
    base_url=os.environ["OPENAI_BASE_URL"],
    api_key=os.environ["OPENAI_API_KEY"],
)
def prompt_tokens(text: str) -> int:
    response = client.chat.completions.create(
        model=os.environ["MODEL"],
        messages=[{"role": "user", "content": text}],
        max_tokens=1,  # we only care about the INPUT size
    )
    return response.usage.prompt_tokens  # tokens AFTER templating
print(prompt_tokens("hi"))
print(prompt_tokens("hi " * 50))
Run it:
python work/tokens.py
The second number is much bigger — your longer message became more tokens. You’re
reading usage.prompt_tokens, the size of your messages after the template was
applied, straight from the server. Section 3 builds a whole lesson on this.
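Each measurement includes whatever the template itself adds around your text. If that overhead is the same for both calls (an assumption, though a reasonable one for two single-message prompts), subtracting one count from the other leaves only the tokens contributed by the extra content. The sketch below shows the arithmetic with a stand-in counter so it runs offline; fake_count is hypothetical and is not a tokenizer, so swap in your prompt_tokens function for real numbers.

```python
from typing import Callable

def content_delta(count: Callable[[str], int], short: str, long: str) -> int:
    """Tokens added by the extra content; a fixed template overhead cancels out."""
    return count(long) - count(short)

def fake_count(text: str) -> int:
    """Stand-in for prompt_tokens: a fixed overhead of 7 plus one 'token' per word.

    The overhead and the one-per-word rule are invented for this dry run.
    """
    return 7 + len(text.split())

print(content_delta(fake_count, "hi", "hi " * 50))
```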
Reference: examples/01/show_template.py goes further: if your endpoint exposes vLLM’s /tokenize helper, it prints the actual rendered harmony string and token ids. Run it to peek at the real delimiters.
Security: API keys are credentials. Read them from the environment, never hard-code them in a script or commit them to git — a leaked key is a billable account someone else controls.
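One habit that follows from this: when a script needs to confirm that a key is loaded, print a masked form rather than the key itself. A minimal sketch, where the helper name mask and the choice of four visible characters are illustrative assumptions:

```python
import os

def mask(secret: str, visible: int = 4) -> str:
    """Hide all but the last few characters so logs never contain the full key."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]

# Read the key from the environment; never hard-code it in the script.
api_key = os.environ.get("OPENAI_API_KEY", "")
print("key loaded:", mask(api_key) if api_key else "(not set)")
```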
Challenges
Write these from scratch (new files in work/). References are listed, but try first.
- A two-turn conversation. Build a messages list with four entries: system, user, assistant (a made-up earlier reply), then a new user follow-up that refers back to it (“and what about its capital?”). Send it. Success: the model’s answer clearly uses the earlier turn.
- Persona swap. Copy work/hello_sdk.py to work/pirate.py, change only the system message to make the assistant answer like a pirate, and rerun. Success: the user message is unchanged but the style flips.
- A token counter. Using work/tokens.py as a starting point, write a count(text) function and print the token count of: your name, a full sentence, and the same sentence in ALL CAPS. Success: the three counts differ. (Reference idea: examples/03/count_tokens.py, used in Section 3.)
- Discover the model id. Write a script that sends GET <base_url>/models (with the Authorization header) and prints each model id from the JSON. Success: it prints the id you put in MODEL. (Reference: examples/01/list_models.py.)
Recap
- Calling a model is just POST /v1/chat/completions with JSON; you wrote that by hand in work/hello.py.
- The openai SDK is a thin wrapper over the same call: same fields, nicer types.
- A prompt is a list of role-tagged messages; the core roles are system, user, assistant (with tool and developer later).
- Your messages are rendered by the model’s chat template into a string of tokens. That string is what the model truly sees, and you measured its size with usage.prompt_tokens.
Next
Section 2 — Anatomy of a Response: you’ll dig into the envelope your call returned —
choices, finish_reason, and the usage block — and learn exactly where each field
lives before we start changing the model’s behavior.