GPT-5.5 in Codex: Why It's Tuned Differently

Why OpenAI tunes GPT-5.5 differently in Codex, how agentic coding changes prompt strategy, and what developers should do next.

Ilia Ilinskii
Rephrase · June 1, 2026

Prompt engineering · 8 min read

When OpenAI ships the "same" model into two products, it's rarely the same experience. GPT-5.5 in Codex is the cleanest example yet: the base model is one thing, but the agent wrapper, workflow, and tuning make it feel like a different animal. That's not a bug. It's the point.

Key Takeaways

OpenAI tuned GPT-5.5 for agentic work in Codex, not just better chat responses.
The big win is fewer tokens per completed task, which can lower real cost and improve speed even when per-token API prices rise [1].
Codex-style prompts work best when you define the goal, constraints, and finish line clearly.
The same model can feel very different depending on whether it's optimized for conversation or task execution.
Tools like Rephrase can help you reshape rough instructions into prompts that fit agentic workflows faster.

What changed in GPT-5.5 inside Codex?

GPT-5.5 in Codex is tuned for finishing multi-step work, not just producing a good next token. OpenAI describes Codex as built for long-horizon technical tasks, where the agent needs to plan, use tools, check its work, and keep going until the job is done [1]. That shift changes the model's behavior more than people expect.

Why would OpenAI tune the same model differently?

The reason is simple: product context matters. A general chat model should be flexible, helpful, and conversational. A coding agent should be persistent, structured, and willing to execute. Coverage of the GPT-5.5 launch emphasizes token efficiency and task completion in Codex [2], which suggests the tuning favors fewer dead ends and more decisive action.

What does "agentic" tuning actually optimize?

Agentic tuning optimizes for sustained execution across tool calls, not one-shot eloquence. In Codex, the model has to reason over a repo, inspect outputs, recover from errors, and decide whether to continue or ask for help. OpenAI's own framing around GPT-5.5 points to benchmarks like Terminal-Bench and Expert-SWE because they reward multi-step success, not just final-answer style responses [2].
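To make "sustained execution" concrete, here is a minimal sketch of the act, check, recover, escalate loop described above. It is an illustration only, not how Codex is implemented: `StepResult`, `run_agent`, `make_flaky`, and the retry limit are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    ok: bool      # did the check for this step pass (for example, a test run)?
    output: str   # tool output the agent would inspect before deciding what to do


def run_agent(steps: List[Callable[[], StepResult]], max_retries: int = 2) -> str:
    """Run steps in order, retry failures, and ask for help when retries run out."""
    for index, step in enumerate(steps):
        for _attempt in range(max_retries + 1):
            result = step()
            if result.ok:
                break  # this step succeeded, move on to the next one
        else:
            # No attempt succeeded: stop and escalate instead of looping forever.
            attempts = max_retries + 1
            return f"ask_for_help: step {index} failed after {attempts} attempts"
    return "done"


def make_flaky(failures: int) -> Callable[[], StepResult]:
    """Build a hypothetical step that fails `failures` times, then passes."""
    remaining = {"count": failures}

    def step() -> StepResult:
        if remaining["count"] > 0:
            remaining["count"] -= 1
            return StepResult(ok=False, output="test failed")
        return StepResult(ok=True, output="tests passed")

    return step


recovered = run_agent([make_flaky(1)])                 # fails once, then recovers
escalated = run_agent([make_flaky(5)], max_retries=2)  # never passes within the limit
```

The point of the sketch is the shape of the loop: success is judged per step, failures trigger another attempt, and there is an explicit exit when the agent should stop and ask.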

Why does token efficiency matter more than raw intelligence?

Because in agent workflows, the cheapest model is often the one that reaches the finish line fastest with the fewest retries. OpenAI says GPT-5.5 matches GPT-5.4's per-token latency while using significantly fewer tokens on the same Codex tasks [2]. That means the product can be more capable without feeling slower, even if per-token pricing goes up.
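A quick worked example shows how this trade-off can play out. All numbers below are hypothetical and chosen only to illustrate the arithmetic; they are not OpenAI prices or measured token counts, and `cost_per_task` and `break_even_price` are names made up for this sketch.

```python
def cost_per_task(tokens_per_attempt: int, price_per_million: float, attempts: int = 1) -> float:
    """Cost of one completed task: tokens used across all attempts times the token price."""
    return tokens_per_attempt * attempts * price_per_million / 1_000_000


def break_even_price(old_price: float, old_tokens: int, new_tokens: int) -> float:
    """Highest new per-million price at which the task costs no more than before."""
    return old_price * old_tokens / new_tokens


# Hypothetical older model: more tokens per task, lower price per million tokens.
old_cost = cost_per_task(tokens_per_attempt=200_000, price_per_million=10.0)

# Hypothetical newer model: 40% fewer tokens per task, 50% higher price.
new_cost = cost_per_task(tokens_per_attempt=120_000, price_per_million=15.0)

# Price ceiling for the newer model before the task becomes more expensive.
ceiling = break_even_price(old_price=10.0, old_tokens=200_000, new_tokens=120_000)

print(f"old: ${old_cost:.2f}  new: ${new_cost:.2f}  break-even price: ${ceiling:.2f}")
```

Under these assumed numbers the task gets cheaper despite the higher token price, and a retry on the older model (`attempts=2`) would widen the gap further.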

How does Codex change the prompt shape?

Codex works better when prompts look like task specs, not creative requests. You want the target state, constraints, and acceptance criteria up front. A vague prompt invites wandering. A precise prompt gives the agent a path. In practice, this is why prompt tooling matters: apps like Rephrase can turn a rough ask into something structured enough for a coding agent.

Here's the kind of shift I mean:

Before:
Fix the auth bug and make it better.

After:
Inspect the login flow, identify the root cause of the session failure, and patch the smallest possible change.
Do not modify unrelated UI code. Return a brief summary of the fix, the files changed, and any risks.
If the bug cannot be reproduced, explain what additional logs you need.

That second prompt gives Codex a job, a boundary, and an exit condition.
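If you write prompts like this often, the job, boundary, and exit condition can be captured in a small template. The sketch below is one possible way to do it, not a format Codex requires: `TaskSpec`, its field names, and the section labels ("Goal", "Constraints", "Return", "If blocked") are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskSpec:
    goal: str                                               # the job
    constraints: List[str] = field(default_factory=list)    # the boundary
    deliverables: List[str] = field(default_factory=list)   # what to report back
    fallback: str = ""                                      # the exit condition when blocked

    def render(self) -> str:
        """Render the spec as a plain-text prompt, skipping empty sections."""
        lines = [f"Goal: {self.goal}"]
        if self.constraints:
            lines.append("Constraints:")
            lines.extend(f"- {item}" for item in self.constraints)
        if self.deliverables:
            lines.append("Return:")
            lines.extend(f"- {item}" for item in self.deliverables)
        if self.fallback:
            lines.append(f"If blocked: {self.fallback}")
        return "\n".join(lines)


# The "After" prompt from above, expressed as a spec.
spec = TaskSpec(
    goal="Inspect the login flow, identify the root cause of the session failure, "
         "and patch the smallest possible change.",
    constraints=["Do not modify unrelated UI code."],
    deliverables=["A brief summary of the fix", "The files changed", "Any risks"],
    fallback="If the bug cannot be reproduced, explain what additional logs you need.",
)

prompt = spec.render()
print(prompt)
```

A template like this mainly helps by making a missing section obvious: an empty `constraints` or `fallback` field is a prompt to think about scope and stopping rules before handing the task over.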

What GPT-5.5 in Codex tells us about prompt engineering

The biggest lesson is that model quality and prompt quality are not separate problems. As models get better at tool use, your prompt should become more operational. The best prompts for Codex aren't "clever." They're legible to an agent that needs to act, verify, and recover. That's the real shift OpenAI is betting on in GPT-5.5 [1][2].

How should developers prompt Codex differently?

Start by writing prompts as if you were handing a ticket to a very fast junior engineer. Say what success looks like, what not to touch, and when to stop. If you need a specific style of output, name it. If you need the model to inspect a repo before editing, say so. The more explicit you are, the less the agent has to infer.

What do real-world examples show?

Community feedback lines up with the product strategy. In developer discussions about earlier Codex releases, people kept noticing that these models perform best when the task is well-scoped and the instructions are structurally precise [3]. That matches OpenAI's emphasis on agentic persistence: the model is strongest when the prompt is a workflow, not a wish.

Before vs. after: prompt examples for Codex

Goal	Weak prompt	Better Codex prompt
Bug fix	"Find what's broken."	"Reproduce the failing test, identify the root cause, patch only the minimum files, and explain the change in one paragraph."
Refactor	"Clean this up."	"Refactor this module for readability without changing behavior. Preserve public APIs and list any edge cases you checked."
Feature work	"Add pagination."	"Implement cursor-based pagination for this endpoint, update tests, and note any schema or client changes required."

What I like about this format is that it gets the agent working from the same checklist as the developer: outcome, constraints, evidence. That's exactly how you get better results from GPT-5.5 in Codex.

Why this matters beyond OpenAI

This isn't just an OpenAI story. It's a lesson in how agentic AI will be packaged everywhere. The base model matters, but the surrounding tuning, system prompt, tool policy, and workflow design matter just as much. The same foundation can feel radically different when it's optimized for chat, code, research, or desktop automation [1][2]. That's why prompt engineering is becoming more product-specific, not less.

If you're still writing prompts like you're chatting with a generic assistant, you're leaving performance on the table. I'd start by tightening your task specs, then let a tool like Rephrase do the boring rewrite work so you can focus on the actual engineering.

References

Documentation & Research

[1] Introducing GPT-5.3-Codex - OpenAI Blog
[2] OpenAI Releases GPT-5.5, a Fully Retrained Agentic Model That Scores 82.7% on Terminal-Bench 2.0 and 84.9% on GDPval - MarkTechPost

Community Examples

[3] Claude Opus 4.6 vs. GPT-5.3 Codex: How I shipped 93,000 lines of code in 5 days - Lenny's Newsletter

Frequently asked questions

Why does GPT-5.5 behave differently in Codex?

Because Codex is optimized for long-horizon, tool-using work, OpenAI tunes the same base model for persistence, task completion, and fewer handoffs. The result is a model that can act more like an agent and less like a chat box.

What should I change in my prompts for Codex?

Be explicit about outcomes, constraints, and definitions of done. Codex responds best when you specify the target state, acceptable tradeoffs, and when it should ask for clarification instead of guessing.