Roblox dev has a very specific kind of pain: you can almost describe what you want in plain English ("make the NPC talk, then open the door, then give the player a key"), but the last 10% is where your game breaks. Dialogue becomes inconsistent. Logic becomes hand-wavy. Generated Luau looks plausible… until it hits Studio and explodes.
Prompt engineering is what closes that gap. Not "be polite and add more detail" prompt engineering. I mean designing prompts like you're writing a mini-spec and a test harness, so the model produces assets you can actually ship.
What's interesting is that research on tool-using agents keeps rediscovering the same lesson: models degrade when tasks get more complex, especially when they must output precise structured arguments (think "payloads") and keep state consistent over time [2]. That maps perfectly to Roblox: NPCs are stateful, gameplay is stateful, and scripts are nothing but structured payloads.
So here's the approach I use: treat the LLM like a junior teammate who's great at drafting, mediocre at constraints, and needs guardrails for anything that will run in production.
The core trick: separate "what happens" from "how it's implemented"
If you ask for "write me the script," you're forcing the model to make game design decisions, system design decisions, and Luau implementation decisions in one shot. That's how you get brittle code and nonsense features.
Instead, I split prompts into three artifacts, in order:
- A dialogue/quest spec (player-facing intent, NPC personality, constraints, state transitions).
- A state model (variables, events, allowed transitions, failure modes).
- The Luau implementation (modules, remotes, server/client split, and code).
This isn't just vibes. Benchmarks for tool-use agents show that "argument generation" (structured, exact outputs) becomes a major bottleneck under complexity [2]. If you don't isolate it, you make it worse. In Roblox terms: don't ask for final Luau until you've pinned down the state and the exact I/O.
I also keep outputs small. ExtractBench found that reliability drops sharply as structured output volume grows: models hit formatting errors, truncation, or silent failure as schemas get bigger [1]. Same thing happens when you ask an LLM to write 600 lines of Roblox code with three systems intertwined. So I prompt for "one module, one responsibility" and iterate.
NPC dialogue: prompt like a narrative designer, validate like an engineer
Most Roblox NPC dialogue has two jobs: entertain and drive state (quest flags, shop open/closed, reputation, cooldowns). LLMs are good at the entertaining part. They're unreliable at the state part unless you force structure.
A pattern I like is "dialogue as structured events," similar to what folks building LLM-driven games do with inline markup and an engine-mediated "intent parser" [3]. The model writes the what, your game decides the truth.
Here's a prompt template I actually use:
```
You are writing NPC dialogue for a Roblox game. The dialogue must be grounded in GAME STATE.
Return ONLY a JSON object.

NPC:
- name: "Mara"
- tone: dry humor, short sentences
- role: quest giver for "Power Cell Recovery"

GAME STATE (authoritative):
- playerLevel: 7
- hasPowerCell: false
- maraTrust: 2 (0-5)
- questStage: "NotStarted" | "Accepted" | "Complete"

PLAYER INPUT (verbatim):
"{playerMessage}"

RULES:
- Mara cannot claim the player has the power cell unless hasPowerCell=true.
- Mara cannot advance questStage unless she explicitly gives the quest or receives the item.
- If info is missing, Mara must ask one clarifying question.

OUTPUT SCHEMA:
{
  "npcLine": "string",
  "choices": [{"id": "string", "text": "string"}],
  "stateChanges": {
    "questStage": "NotStarted|Accepted|Complete|NO_CHANGE",
    "maraTrustDelta": -1|0|1
  }
}
```
Why JSON? Because you can parse it, enforce it, and reject it. And this lines up with what structured-extraction work calls treating the schema as an "executable specification" so you can score/validate outputs field by field [1]. You don't need a fancy evaluator; basic checks catch most failures.
One more thing: keep the dialogue turn short. If you're building a full conversation tree in one prompt, you're back in "large output volume" land where quality and validity drop [1].
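Those "basic checks" fit in a few lines. Here's a minimal sketch of such a validator in Python, using the field names from the schema above; the same checks port straight to whatever runs engine-side, and anything less than a clean pass gets rejected:

```python
import json

ALLOWED_STAGES = {"NotStarted", "Accepted", "Complete", "NO_CHANGE"}

def validate_dialogue(raw, game_state):
    """Reject any reply that breaks the schema or contradicts game state."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"

    if not isinstance(data.get("npcLine"), str) or not data["npcLine"]:
        return False, "npcLine missing or empty"

    changes = data.get("stateChanges", {})
    stage = changes.get("questStage")
    if stage not in ALLOWED_STAGES:
        return False, f"illegal questStage: {stage!r}"
    if changes.get("maraTrustDelta") not in (-1, 0, 1):
        return False, "maraTrustDelta out of range"

    # Grounding rule: Mara can't complete the quest without the cell.
    if not game_state["hasPowerCell"] and stage == "Complete":
        return False, "quest completed without the power cell"
    return True, "ok"
```

The rejection reason matters: it's exactly what you feed back to the model on a retry.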
Game logic: prompt for a state machine, not a script
Roblox gameplay bugs are often state bugs: doors that open twice, quests that skip stages, NPCs that forget what they told you.
ASTRA-bench evaluates tool-using agents, not games, but its failure modes are very game-like: performance drops as tasks require more planning, more context, and more correct tool arguments (payloads) [2]. That's your warning sign: if your prompt doesn't explicitly define state and transitions, the model will invent them.
So I ask the model for a state machine before code. Not a bullet list. A compact, checkable representation.
Example prompt:
```
Design the quest logic as a finite state machine.

Quest: "Power Cell Recovery"
States: NotStarted, Accepted, HasItem, Complete
Events: TalkToMara, CollectPowerCell, ReturnToMara, Abandon

Constraints:
- CollectPowerCell can only transition Accepted -> HasItem
- ReturnToMara can only transition HasItem -> Complete
- Any invalid event in a state must produce "NO_OP" plus a reason

Return ONLY JSON:
{
  "initialState": "...",
  "transitions": [
    {"from": "...", "event": "...", "to": "...", "effects": ["..."], "guard": "..."}
  ],
  "invalidEventHandling": {"default": "NO_OP", "reasonStyle": "short"}
}
```
Now you've got something you can unit test without Roblox Studio even open. You can validate that every (state,event) pair is handled. You can generate tests. You can later prompt the LLM to implement exactly this machine in Luau.
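A checker for that "every (state, event) pair is handled" property is short. This is a sketch in Python, assuming the FSM JSON shape from the prompt above (already parsed into a dict):

```python
def check_fsm(fsm, states, events):
    """Verify the model's FSM JSON: every transition references known
    states/events, and every (state, event) pair is either an explicit
    transition or covered by the declared NO_OP default."""
    problems = []
    if fsm.get("initialState") not in states:
        problems.append("initialState is not a known state")

    covered = set()
    for t in fsm.get("transitions", []):
        if t["from"] not in states or t["to"] not in states:
            problems.append(f"unknown state in {t}")
        if t["event"] not in events:
            problems.append(f"unknown event in {t}")
        covered.add((t["from"], t["event"]))

    if fsm.get("invalidEventHandling", {}).get("default") != "NO_OP":
        # Without a NO_OP default, every uncovered pair is undefined behavior.
        for s in states:
            for e in events:
                if (s, e) not in covered:
                    problems.append(f"unhandled pair: ({s}, {e})")
    return problems
```

An empty list means the machine is safe to hand back to the model for implementation; a non-empty one is your reprompt material.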
Script generation: make the model write code like it's submitting a PR
When I finally ask for Luau, I'm strict about boundaries: what runs on the server, what runs on the client, and what is "authoritative." The model is not allowed to "decide" game truth on the client.
I also force a PR-style output: file names, modules, and a narrow scope.
A good starting prompt:
```
Generate Luau code for Roblox Studio.

Goal:
Implement the quest FSM described below on the SERVER.
The client can request actions, but the server is authoritative.

Requirements:
- Use a ModuleScript "QuestService" in ServerScriptService
- Provide functions:
  - GetPlayerQuestState(player) -> stateTable
  - HandleEvent(player, eventName) -> (ok: boolean, newStateTable, message: string)
- Persist per-player state in memory (no DataStore in this version)
- Include minimal input validation and clear error messages
- Return code ONLY. No explanations.

FSM JSON:
{...paste the FSM...}
```
Why this works: you're constraining the "payload generation" problem to a small surface area (a couple functions with explicit I/O). ASTRA-bench's analysis basically says: models can often retrieve info, but they struggle to translate it into correct structured actions as complexity rises [2]. So we don't let complexity rise.
Also: I never request multiple systems in one go (NPC dialogue + quest logic + inventory + UI). ExtractBench shows output volume and breadth correlate with catastrophic failure modes (invalid structure, truncation, silent empties) [1]. If you want reliability, keep outputs tight.
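The shape of `HandleEvent` is worth seeing concretely. The real module is Luau, but the core is just a transition-table lookup, which you can sketch and unit test in Python before the model writes a line of Luau; this is an illustrative sketch of that contract, not the generated code itself:

```python
def make_handle_event(fsm):
    """Build a HandleEvent driven entirely by the FSM JSON, mirroring the
    QuestService contract: (ok, newState, message)."""
    table = {(t["from"], t["event"]): t for t in fsm["transitions"]}

    def handle_event(state, event):
        t = table.get((state, event))
        if t is None:
            # Invalid events never mutate state: the declared NO_OP default.
            return False, state, f"NO_OP: {event} not valid in {state}"
        return True, t["to"], f"{state} -> {t['to']}"

    return handle_event
```

If the model's Luau disagrees with this reference behavior on any (state, event) pair, the Luau is wrong, not your spec.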
Practical examples: the "engine is the user" mindset
One of the best practical framing tricks I've seen (from an LLM game dev write-up) is: in a game, the "user" is really the engine, not the player [3]. The engine asks the model for a specific artifact (a line, an action resolution, a structured event). The player is just part of the input.
That mindset stops you from building a fragile "LLM decides reality" game. You're using the LLM to propose content and actions. Your Roblox code enforces rules.
Even community discussions about "briefing" agents for games tend to circle around this idea: you're architecting intent and constraints, not just generating text [4]. The difference is: in Roblox, you have the luxury of being strict. Your engine can reject and reprompt.
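The reject-and-reprompt loop is a few lines of glue. A minimal sketch, where `call_model` stands in for whatever LLM API you use and `validate` is any checker that returns (ok, reason), like the ones above:

```python
def generate_validated(prompt, validate, call_model, max_attempts=3):
    """Ask the model, validate the output, and reprompt with the
    validator's concrete rejection reason until it passes or we give up."""
    attempt_prompt = prompt
    for _ in range(max_attempts):
        raw = call_model(attempt_prompt)
        ok, reason = validate(raw)
        if ok:
            return raw
        # Feed back the concrete failure; vague "try again" reprompts don't help.
        attempt_prompt = (
            prompt
            + f"\n\nYour previous output was rejected: {reason}."
            + " Return ONLY a corrected JSON object."
        )
    return None  # engine falls back to canned content
```

Returning `None` instead of raising is deliberate: a live game always needs a canned fallback line, never an exception in front of the player.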
Closing thought: treat prompts as interfaces, not requests
The moment you're generating NPC dialogue, game logic, and scripts, your prompt is basically an API contract. If it's fuzzy, the output will be fuzzy. If it's structured and small, you can validate it. If it's too big, it'll fail in ways that look "random" but are actually very predictable: formatting errors, missing fields, invented state, and broken payloads [1], [2].
My rule: if I can't write a simple validator for the output, I'm not done prompt-engineering yet.
References
Documentation & Research
[1] ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction - arXiv cs.LG
https://arxiv.org/abs/2602.12247
[2] ASTRA-bench: Evaluating Tool-Use Agent Reasoning and Action Planning with Personal User Context - arXiv cs.AI
https://arxiv.org/abs/2603.01357
From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents - arXiv cs.AI
https://arxiv.org/abs/2601.22607
Community Examples
[3] Intra: Design notes on an LLM-driven text adventure - Ian Bicking (HN discussion source)
https://ianbicking.org/blog/2025/07/intra-llm-text-adventure
[4] Beyond Chatbots: Using Prompt Engineering to "Brief" Autonomous Game Agents - r/PromptEngineering
https://www.reddit.com/r/PromptEngineering/comments/1rioa1b/beyond_chatbots_using_prompt_engineering_to_brief/