A lot of "prompt engineering" advice still assumes one thing: your best shot is to cram everything into a single heroic prompt and hope the model obediently follows it.
That mental model breaks down fast with modern reasoning / thinking variants, including what people are calling "GPT-5.4 Thinking". These models aren't just autocomplete with better vibes. They're closer to a planner that can drift, self-correct, and sometimes overthink itself into a worse answer if you don't manage the loop.
So my take is simple: stop treating prompts like static instructions. Start treating them like a control system.
You need three capabilities in your prompts now: upfront planning, course correction, and a clear definition of what "done" means. And you need them because thinking models don't mainly fail by being "dumb". They fail by violating constraints, running away with a plan, or spending tokens reasoning long after they already had the right answer.
What's changing with "Thinking" models (and why your old prompts feel worse)
The biggest shift isn't that you must demand step-by-step reasoning. It's that these models already do internal work, and the failure modes have moved.
On planning-style tasks, recent research shows the dominant failure is often constraint violation: the model knows the rules (they're in the prompt), but it doesn't reliably apply them at the exact step where the rule matters. "Don't walk through walls" is easy to say and surprisingly hard to consistently execute mid-plan. The paper on Localized In-Context Learning (L‑ICL) makes this point sharply: models routinely generate plans that break domain constraints, and "more generic instruction" or even long retrieved demonstrations don't fix it well; targeted, localized corrections do [1].
At the same time, other work highlights the opposite problem: once a model is in thinking mode, it may keep reasoning after the correct answer is already reachable, producing redundant steps and sometimes even temporary wrong turns. ESTAR frames this as "redundant thinking" and shows you can often stop earlier without losing accuracy, saving huge token budgets [2]. That matters for prompting because if your prompt implicitly rewards long reasoning, you'll pay for it in latency and sometimes quality.
So: the "mega prompt" era is fading. Not because structure is bad, but because the best structure now is adaptive.
Upfront planning: don't ask for an answer first; ask for a plan that can be audited
When you prompt a thinking model, you're basically choosing a failure mode. Ask for an answer immediately and you'll get plausible-but-ungrounded output. Ask for a plan and you'll get something you can steer.
Here's the pattern I like: force the model to produce a plan with checkpoints before it commits to the final output. Think of it as creating "handlebars" you can grab later.
The specific twist, inspired by how L‑ICL treats planning mistakes, is to make the plan expose the first place it might break constraints, then bake in a way to fix it [1]. In practice, that means asking for assumptions, invariants, and "what would make this plan invalid".
If you're building agentic workflows, this also lines up with the direction of the Responses-style interfaces: you want outputs you can parse, route, and re-feed into the loop, not just prose you hope is correct [3].
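To make that concrete, here's a minimal Python sketch of the parse-and-route idea. `call_model` is a hypothetical stub standing in for whatever client you actually use (a Responses-style API or otherwise); the point is that the plan comes back as JSON you can validate before letting the model proceed, instead of prose you hope is correct.

```python
import json

def call_model(prompt: str) -> str:
    # Hypothetical stub for illustration; swap in your real client call.
    # Here it returns a canned plan so the sketch is self-contained.
    return json.dumps({
        "plan": ["outline", "draft"],
        "invariants": ["no PII"],
        "risks": ["scope creep"],
    })

def get_auditable_plan(task: str) -> dict:
    """Ask for the plan as structured JSON so it can be parsed,
    routed, and re-fed into the loop."""
    prompt = (
        f"Task: {task}\n"
        "Before answering, return ONLY JSON with keys:\n"
        '  "plan" (steps), "invariants" (rules you will not break),\n'
        '  "risks" (what would make this plan invalid).'
    )
    plan = json.loads(call_model(prompt))
    # Reject plans that don't expose their own failure points up front.
    for key in ("plan", "invariants", "risks"):
        if not plan.get(key):
            raise ValueError(f"model omitted required field: {key}")
    return plan
```

The validation step is the handlebar: a plan missing its `risks` field never reaches execution.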
Course correction: build a correction channel, not an apology channel
Most people's "iteration loop" is: generate → complain → regenerate.
Thinking models respond better to: generate → localize the error → patch the smallest thing that fixes it.
That is basically L‑ICL's thesis. The method finds the first failing step, injects a minimal correction example, and performance jumps massively with far less context than retrieval-based "show me full solutions" approaches [1]. The prompting lesson is obvious: stop giving the model generic "do better" feedback. Give it pinpointed deltas.
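Here's a toy sketch of that localize-then-patch loop (my illustration, not L‑ICL's actual algorithm). Constraints are plain predicates here; in a real pipeline they'd be a checker script or a second model pass. The output is the pinpointed delta you feed back to the model.

```python
def first_violation(draft_steps, constraints):
    """Return (index, constraint_name) for the FIRST step that breaks
    a rule, or None if the draft is clean."""
    for i, step in enumerate(draft_steps):
        for name, ok in constraints.items():
            if not ok(step):
                return i, name
    return None

def localized_feedback(draft_steps, constraints):
    """Turn the first violation into a pinpointed delta,
    instead of generic 'do better' feedback."""
    hit = first_violation(draft_steps, constraints)
    if hit is None:
        return "No violations found. Do not revise."
    i, name = hit
    return (f"Step {i} violates constraint '{name}': {draft_steps[i]!r}. "
            "Fix ONLY this step; leave the rest unchanged.")
```

Note the deliberate asymmetry: a clean draft gets an explicit "do not revise", so the model isn't invited to keep churning.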
In prompt terms, course correction works best when you ask for two artifacts:
One artifact is the user-facing output. The other is a machine-facing "diff": what changed, why it changed, and which constraint it now satisfies.
This mirrors what ReflexiCoder tries to train into models: structured reflection and correction as a disciplined trajectory, not endless looping. The interesting bit isn't the RL; it's the behavioral shape: reflect only if there's a bug, otherwise optimize once and stop [4]. You can steal that shape in your prompts today.
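That behavioral shape fits in a few lines. This is a hypothetical harness, not ReflexiCoder's training code; `has_bug` and `revise` stand in for your checker and your correction-prompted model call.

```python
def reflect_and_fix(draft, has_bug, revise, max_rounds=2):
    """Reflect only when a check actually fails, revise at most
    max_rounds times, and stop as soon as the draft passes."""
    for _ in range(max_rounds):
        if not has_bug(draft):
            return draft          # done: no endless looping
        draft = revise(draft)     # one targeted correction pass
    return draft                  # best effort after the budget
```

The discipline lives in the two exits: pass the check and you stop immediately; exhaust the budget and you stop anyway.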
What to stop doing: forcing long chains-of-thought as a ritual
There's a persistent superstition: "Always tell the model to think step-by-step."
Sometimes that helps. Sometimes it just forces the model to burn tokens and wander. ESTAR's results are a nice reality check: many reasoning trajectories converge early, and extra thinking can be redundant or even destabilizing; early stopping can preserve accuracy while cutting reasoning tokens dramatically [2].
So instead of demanding long reasoning, I prefer to request bounded reasoning: a plan, plus brief verification checks, plus explicit stop conditions.
Your prompt should make it easy for the model to say: "I'm done. Here's the answer. Here are the remaining uncertainties." Not: "Let me keep thinking until I fill the context window."
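As a driver loop, bounded reasoning looks like this. It's a rough analogy only: ESTAR itself operates on reasoning tokens inside the model, not on an external loop, but the stop-condition discipline is the same.

```python
def bounded_reasoning(step_fn, is_done, max_steps=5):
    """Run refinement steps under a hard cap, stopping as soon as
    the stop condition is met rather than filling the budget."""
    state, spent = None, 0
    for spent in range(1, max_steps + 1):
        state = step_fn(state)
        if is_done(state):
            break
    return state, spent
```

Returning `spent` alongside the answer matters: if the loop routinely hits the cap, your stop condition, not your budget, is what needs fixing.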
Practical examples (copy/paste prompts)
Below are prompts I'd actually use with a "GPT-5.4 Thinking" style model. They're deliberately compact. The structure is doing the work.
You are my senior engineer + editor.
Goal: Produce a design doc for {feature}.
Before writing:
1) Ask up to 5 clarifying questions (only the ones that change the design).
2) Propose an outline and a plan (max 10 lines).
3) List 5 invariants/constraints you will not violate (e.g., latency, privacy, backwards compatibility).
4) List the 3 most likely failure points in your plan and how you'll detect them.
Then write the design doc with:
- Assumptions
- Proposed approach
- Tradeoffs
- "Definition of done" (as testable acceptance criteria)
If you notice an invariant conflict, stop and ask me which invariant wins.
That last line is the course-correction hook. You're explicitly telling the model when to stop and reroute.
Here's a second prompt that bakes in "localized correction", inspired by L‑ICL's minimal-patch mindset [1]:
Task: Draft {deliverable} using the information in ###CONTEXT.
Rules:
- If any requirement is ambiguous, do NOT guess: ask a question.
- If you produce a draft, also produce a CHANGELOG listing corrections you made after self-checking.
Process:
A) Draft.
B) Self-check: find the first place the draft violates the constraints or spec.
C) Fix ONLY what is necessary to resolve that first violation.
D) Output final draft + CHANGELOG + Remaining uncertainties.
###CONTEXT
{paste context}
And if you want a "don't answer yet" planning prompt (a popular community pattern), here's a cleaned-up, production version. The community version is basically: force assumptions + ask questions before output [5]. I agree with the instinct; I just want it structured so it's repeatable.
Don't produce the final answer yet.
First output:
- Assumptions you're making (max 6)
- Information that would change your answer (max 6)
- The 2 questions that most reduce uncertainty
After I answer, produce:
- Final answer
- Key rationale (short)
- What would make this wrong
Closing thought: treat prompts like guardrails + feedback loops
If you take one thing from "GPT-5.4 Thinking" prompting, make it this: don't write prompts as if the model will execute a perfect linear script. It won't.
Write prompts that assume drift, detect drift early, and correct drift locally. Research is pointing in that direction: localized fixes beat giant demonstrations for constraint following [1], and disciplined stopping beats endless reasoning for efficiency and stability [2]. Your prompt should look less like a monologue and more like a loop.
Try it once: add explicit invariants, add a "stop and ask" conflict rule, and require a changelog after the first self-check. You'll feel the model snap into a more controllable mode immediately.
References
Documentation & Research
[1] Localizing and Correcting Errors for LLM-based Planners - arXiv cs.AI
https://arxiv.org/abs/2602.00276
[2] ESTAR: Early-Stopping Token-Aware Reasoning For Efficient Inference - arXiv
http://arxiv.org/abs/2602.10004v1
[3] Open Responses: What you need to know - Hugging Face Blog (re: OpenAI Responses API direction)
https://huggingface.co/blog/open-responses
[4] ReflexiCoder: Teaching Large Language Models to Self-Reflect on Generated Code and Self-Correct It via Reinforcement Learning - arXiv cs.CL
https://arxiv.org/abs/2603.05863
Community Examples
[5] ChatGPT gives you the answer you asked for. That's actually the problem. - r/ChatGPTPromptGenius
https://www.reddit.com/r/ChatGPTPromptGenius/comments/1rf6ogw/chatgpt_gives_you_the_answer_you_asked_for_thats/