If you've been "prompt engineering" in ChatGPT-style chat boxes, Apple Intelligence can feel weird at first. Not because the models are worse. Because the product is different.
Most chat LLMs are a single surface: you write text, you get text back. Apple Intelligence is closer to a traffic controller. Your input might become an on-device generation request, a structured App Intent, a search across local content, or a handoff to a stronger model somewhere else. And on Apple platforms, that routing decision is a feature, not a bug.
That changes what "a good prompt" even means.
Apple Intelligence is a router, so your prompt competes with everything else
Here's the mental shift I've found most useful: in Apple's world, the best "prompt" often isn't a prompt. It's a clean, unambiguous intent that can be executed deterministically.
When Siri (and now Apple Intelligence features across the OS) can satisfy a request via an intent, like setting a timer, sending a message, or logging something in your app, that path is usually lower latency, more private, and more reliable than free-form generation. On-device inference stacks are heavily optimized for throughput and latency, but they still have constraints: context length, bandwidth, concurrency, and memory pressure are very real on consumer hardware [1], and decoding tends to run into memory-bandwidth limits as sequences get longer [2]. That's not an academic detail. It's why long, meandering prompts feel "expensive" on-device, and why concise, structured instructions win.
So when you write for Siri and on-device AI, you're effectively writing for a system that's asking: "Can I turn this into a tool call or intent? If yes, do that. If no, fall back to generation."
The practical consequence is brutal: verbosity that helps in a chatbot can hurt you on-device. You're adding tokens that increase compute and memory traffic for little gain [2]. If the system can't extract the actionable core quickly, it may route you down a different path, or ask follow-ups that feel "dumber" than you expected.
The on-device constraint that changes everything: treat tokens like battery
Even if you never touch Apple's frameworks directly, you should write prompts as if every extra sentence costs battery and latency, because it does.
Research on Apple Silicon inference shows big gains come from batching, caching, and avoiding repeated computation (like re-encoding the same image every turn) [1]. The point isn't "you need prefix caches in your prompt." The point is: on-device systems are designed to reward reuse and punish waste. If you can phrase a request so the system can reuse context (or avoid needing it), you'll see more consistent behavior.
Roofline-style benchmarking work on on-device LLMs makes the same theme painfully clear: decoding is often memory-bound, and operational intensity changes with sequence length and architecture; quantization helps most in memory-bound scenarios [2]. Translation: shorter prompts, fewer turns, clearer slots.
So I aim for prompts that are short, slot-like, and completion-friendly. Not because it's prettier. Because it's cooperative with the constraints.
What to write for Siri: make the "action" legible, then allow one clarification
When the user's speaking to Siri, you're not really writing prompts in the classic sense. You're designing utterances and the system's response strategy. The system needs to decide whether it can safely do something, or whether it needs a clarifying question.
So the best Siri-oriented instruction style is:
- state the action in the first clause,
- provide necessary parameters in natural language,
- include a single disambiguation hook.
I like to write with a "first token test": if Siri only heard the first half of the sentence, would it still know what action family this belongs to?
This is the opposite of "role + context + long constraints." For Siri, long roleplay is mostly noise. If you need reliability, your app should expose the action as an intent so Siri can call it deterministically, and your user should speak in ways that map cleanly to that intent.
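As a concrete sketch of that intent-first path, here's roughly what exposing a Siri-callable action looks like with Apple's App Intents framework. The intent and its storage type (`LogWaterIntent`, `WaterLog`) are hypothetical names for illustration; the `AppIntent` and `AppShortcut` surface is the framework's. Note how every phrase passes the "first token test" by leading with the action:

```swift
import AppIntents

// Hypothetical intent: "log something in your app" made deterministic.
struct LogWaterIntent: AppIntent {
    static var title: LocalizedStringResource = "Log Water"

    @Parameter(title: "Ounces") var ounces: Int

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // WaterLog is a stand-in for your app's own storage layer.
        WaterLog.shared.add(ounces: ounces)
        return .result(dialog: "Logged \(ounces) ounces.")
    }
}

// Phrases Siri matches against. Action family in the first clause.
struct WaterShortcuts: AppShortcutsProvider {
    static var appShortcuts: [AppShortcut] {
        AppShortcut(
            intent: LogWaterIntent(),
            phrases: ["Log water in \(.applicationName)",
                      "Log my water in \(.applicationName)"],
            shortTitle: "Log Water",
            systemImageName: "drop"
        )
    }
}
```

When the utterance matches a phrase, Siri executes the intent directly instead of generating, which is exactly the low-latency, private path described above.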
What to write for on-device generation: fewer instructions, tighter format
When Apple Intelligence does generate text locally, you still benefit from classic prompting ideas (be specific, ask for a format, provide examples). The difference is you should compress those ideas.
Because on-device, the system is juggling more than "make text good." It's also juggling "make text fast."
A pattern that works well is a compact "goal + constraints + output schema" prompt. No preamble. No motivational fluff. And I'm cautious with multi-step reasoning demands, because they tend to balloon tokens and time.
Another subtlety: when you're prompting on-device, you often have local context available implicitly (your note, your email thread, the selected text). Don't restate the whole thing in the prompt. Point to it. The OS already knows what's selected; your prompt should say what to do with it.
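If you're building the on-device generation yourself, Apple's FoundationModels framework makes the "tight output schema" idea literal: you declare the shape as a type instead of spelling it out in prose, and the app hands over the selection once rather than having the user restate it. A minimal sketch, where `StandupUpdate` and its field names are my own illustration and `@Generable` / `LanguageModelSession` are the framework's surface:

```swift
import FoundationModels

// Output schema as a type, not as prompt prose.
@Generable
struct StandupUpdate {
    @Guide(description: "What I did, one short sentence")
    var did: String
    @Guide(description: "What I'm doing next")
    var next: String
    @Guide(description: "Blockers, or 'none'")
    var blockers: String
}

func summarize(selection: String) async throws -> StandupUpdate {
    let session = LanguageModelSession()
    // Keep the instruction short; the schema carries the format.
    let response = try await session.respond(
        to: "Summarize for a standup update:\n\(selection)",
        generating: StandupUpdate.self
    )
    return response.content
}
```

The instruction shrinks to one line because the constraints moved into the type, which is the compression this section is arguing for.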
Practical prompts that play nicely with Siri + on-device AI
Below are prompts I'd actually ship in a product or recommend to a team. They're short on purpose.
1) Siri-style action request (intent-friendly)
Add an expense: $42.60 for lunch, category Meals, date today.
If anything is missing, ask one question.
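That prompt maps almost one-to-one onto an App Intent. A sketch with hypothetical names (`AddExpenseIntent`, `ExpenseStore`): the `requestValueDialog` on each parameter is what implements "if anything is missing, ask one question," because the system only prompts for parameters it couldn't fill from the utterance:

```swift
import AppIntents
import Foundation

struct AddExpenseIntent: AppIntent {
    static var title: LocalizedStringResource = "Add Expense"

    @Parameter(title: "Amount",
               requestValueDialog: "How much was it?")
    var amount: Double

    @Parameter(title: "Category",
               requestValueDialog: "Which category?")
    var category: String

    @Parameter(title: "Date") var date: Date

    func perform() async throws -> some IntentResult & ProvidesDialog {
        // ExpenseStore is a stand-in for your app's persistence layer.
        ExpenseStore.shared.add(amount: amount,
                                category: category,
                                date: date)
        return .result(dialog: "Added \(category) expense.")
    }
}
```

Say the full utterance and nothing is asked; drop the category and you get exactly one follow-up.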
2) On-device rewrite (Writing Tools vibe)
Rewrite the selected text to be clearer and shorter.
Keep the meaning. Keep names and numbers unchanged.
Return only the rewritten text.
3) Summarize a long note without copying it into the prompt
Summarize the selected note for a standup update.
Format:
1) What I did
2) What I'm doing next
3) Blockers
Max 80 words.
4) A "Siri fallback" prompt for ambiguity
This is the one place I borrow a community trick: if the system can't proceed, have it interview the user, but keep it bounded. The Reddit version asks for 5 questions; on-device I usually ask for 1-2 to reduce turn cost [3].
I want to do this: schedule a focus block this week.
Before acting, ask me the single most important question you need answered.
5) A prompt that reduces repeated back-and-forth
Because repeated turns cost tokens and time, I'll often ask for a draft plus options in one go:
Draft a reply to the selected email.
Tone: friendly, direct.
Give me two versions: short (2 sentences) and medium (5 sentences).
The takeaway I'd bet on: "prompting" on Apple becomes "interface design"
If you're building for Apple Intelligence, the craft is less "write a magical system prompt" and more: design the shortest path to a reliable outcome.
When an intent can do it, let the system do it. When generation is needed, keep prompts tight, specify output shape, and minimize turns. The hardware and serving research around on-device inference keeps pointing to the same truth: long contexts and repeated work are the enemy; caching, reuse, and shorter sequences are your friend [1], [2].
Try this the next time you test: rewrite your best prompt to half its length, but keep the first clause action-oriented and keep the output schema explicit. If the output gets better, you're feeling the Apple-style routing and on-device constraints in real time.
References
Documentation & Research
1. Native LLM and MLLM Inference at Scale on Apple Silicon - arXiv cs.LG - https://arxiv.org/abs/2601.19139
2. RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis - arXiv cs.LG - https://arxiv.org/abs/2602.11506
Community Examples
3. The "Logic Architect" Prompt: Engineering your own AI path - r/PromptEngineering - https://www.reddit.com/r/PromptEngineering/comments/1rilcm3/the_logic_architect_prompt_engineering_your_own/