Context windows keep growing, but your compute bill doesn't care. At 10,000 API calls per day, 300 tokens of bloat per prompt add up to roughly 90 million wasted tokens per month - and every extra token adds latency your users actually feel.
The narrative that "context windows are big enough now" misses the point. A 1M-token context window doesn't make tokens free - it just moves the bottleneck. In production systems running thousands of calls per day, three costs compound: API token pricing, inference latency per call, and KV-cache memory pressure.
Recent research on structured output generation found that even predictable, low-entropy tokens - things like delimiters, parameter names, and repeated labels - impose real latency costs during autoregressive decoding [1]. SimpleTool addressed this in function-calling pipelines by compressing redundant structural tokens 4-6x, achieving up to 9.6x end-to-end speedup [1]. The lesson transfers directly to prompt design: structure and repetition are expensive even when they feel necessary.
Instruction distillation means converting verbose, conversational instructions into dense, imperative directives. You strip articles, hedge words, and redundant framing - keeping every logical constraint, dropping every filler word.
This is the highest-leverage technique for system prompts and fixed templates, because those tokens are paid on every single call.
# BEFORE (47 tokens)
You are a helpful assistant. When the user provides a support ticket,
please make sure to classify it into one of the following categories,
and always respond in JSON format.
# AFTER (21 tokens)
Classify support tickets into: billing, technical, account, other.
Respond: {"category": "...", "confidence": 0-1}
That's a 55% token reduction with no loss of instruction fidelity. The model doesn't need "you are a helpful assistant" to classify tickets accurately. It needs the categories and the output schema.
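The stripping step can be partially automated. Below is a minimal sketch - the `FILLER_PATTERNS` list and `distill` helper are hypothetical, not a library API - that deletes common filler phrases before a manual pass:

```python
import re

# Hypothetical starter list of filler phrases; extend it from your own
# prompt logs. None of these carry a logical constraint.
FILLER_PATTERNS = [
    r"\byou are a helpful assistant\.?",
    r"\bplease make sure to\b",
    r"\bplease\b",
]

def distill(prompt: str) -> str:
    """Strip known filler phrases, then collapse leftover whitespace."""
    out = prompt
    for pat in FILLER_PATTERNS:
        out = re.sub(pat, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()
```

Treat the output as a draft, not a final prompt: run your evals after any automated rewrite to confirm no constraint was lost.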
The OPSDC research on reasoning compression found something related: simply telling a model to "be concise" - rather than specifying token budgets - was enough to achieve 57-59% token reduction on math benchmarks while improving accuracy by 9-16 points absolute [2]. Redundant tokens aren't neutral; they actively dilute signal.
Few-shot examples are powerful, but they're also expensive. Most prompts include more examples than the model needs, often out of caution rather than necessity.
Example pruning means auditing your few-shot set and removing examples that are redundant, similar to each other, or that demonstrate behaviors the model already handles well by default.
# BEFORE - 3 examples, 180 tokens
Input: "Cancel my subscription"
Output: {"intent": "cancellation", "urgency": "high"}
Input: "I want to cancel"
Output: {"intent": "cancellation", "urgency": "medium"}
Input: "Please cancel my account"
Output: {"intent": "cancellation", "urgency": "medium"}
# AFTER - 1 example, 60 tokens
Input: "Cancel my subscription"
Output: {"intent": "cancellation", "urgency": "high"}
Two of those three examples are teaching the same thing. One well-chosen example does the job. Reserve additional examples for genuinely distinct edge cases - ambiguous phrasing, unusual formats, or failure modes you've observed in production logs.
A practical rule: start with one example, run evals, and only add examples where the model demonstrably fails. Don't add examples preemptively.
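That rule can be enforced mechanically when auditing an existing few-shot set. The sketch below uses a hypothetical `prune_examples` helper; keying on `intent` is an assumption about what counts as a "distinct behavior" for this task - adjust the key to whatever label your examples actually teach:

```python
def prune_examples(examples: list[dict]) -> list[dict]:
    """Keep the first example per distinct behavior; drop the rest."""
    seen, kept = set(), []
    for ex in examples:
        key = ex["output"]["intent"]  # assumed behavior-defining label
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept
```

Applied to the three cancellation examples above, this keeps only the first - the same result as the manual AFTER version.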
RAG pipelines are where token bloat gets truly expensive. Retrieved chunks are often passed wholesale into the prompt, including sentences that are completely irrelevant to the query at hand.
Reference compression means preprocessing retrieved content to extract only the query-relevant sentences before passing it to the primary model. Research from KAIST on query-aware context compression shows this approach can maintain strong exact-match and F1 scores on QA tasks while significantly reducing the token footprint passed to the reader model [3]. Their framework evaluates which sentences change "clue richness" when removed - a useful mental model for manual compression too.
| Approach | Tokens (avg) | Latency (vs. baseline) | Accuracy (F1, vs. baseline) |
|---|---|---|---|
| Full chunk passthrough | 800 | baseline | baseline |
| Top-3 sentence extraction | 220 | -35% | -1.2% F1 |
| Query-aware compression | 180 | -42% | +0.8% F1 |
In practice, even a simple heuristic - truncate retrieved context to sentences containing the query's key noun phrases - beats full passthrough on both cost and quality. The full chunk almost always contains noise that the model attends to unnecessarily.
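That heuristic fits in a few lines. The sketch below is a hypothetical helper, not the cited paper's method; the stopword list is a placeholder you'd replace with a real one. It keeps only sentences that share a content word with the query:

```python
import re

# Placeholder stopword list - swap in a proper one for production.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "what", "how",
             "of", "to", "in", "on", "for", "my", "our"}

def compress_context(chunk: str, query: str) -> str:
    """Keep only sentences sharing a content word with the query."""
    qwords = {w for w in re.findall(r"[a-z]+", query.lower())
              if w not in STOPWORDS}
    sentences = re.split(r"(?<=[.!?])\s+", chunk)
    kept = [s for s in sentences
            if qwords & set(re.findall(r"[a-z]+", s.lower()))]
    return " ".join(kept)
```

Even this crude filter drops the fully irrelevant sentences before the reader model ever attends to them.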
Separately, research on semantic routing systems found that compressing 16K-token inputs down to 512 tokens before classification yielded a 12x jailbreak detection speedup with identical accuracy [4]. The principle holds broadly: compress before the expensive operation, not after.
This technique is the most aggressive and the most context-dependent. Shorthand conventions replace natural language phrases with compact symbols, abbreviations, or structured tokens that carry the same semantic payload.
It works reliably when you control the full pipeline - especially with fine-tuned models or when you've established conventions via a system prompt at session start.
# BEFORE (natural language, 38 tokens)
If the sentiment is positive, respond with "approve".
If the sentiment is negative, respond with "reject".
If the sentiment is neutral or unclear, respond with "review".
# AFTER (shorthand convention, 14 tokens)
Sentiment→action: pos=approve, neg=reject, neu=review
The SimpleTool paper formalizes this intuition: structured outputs have "substantial token redundancy" in delimiters and parameter names, and using special tokens to compress these elements delivers 4-6x token reduction with no accuracy loss [1]. You're applying the same principle at the prompt level rather than the decoding level.
Where shorthand breaks down is in user-facing or zero-shot contexts. If a user (or a general-purpose model with no system priming) encounters `pos=approve, neg=reject`, comprehension isn't guaranteed. Reserve shorthand for internal, machine-to-machine prompt templates.
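Defining the convention in code keeps the system prompt and any downstream parser in sync. A minimal sketch, assuming a sentiment-to-action task like the one above (the mapping and `shorthand_rule` helper are illustrative, not a standard API):

```python
# Shorthand table established once at session start; per-call
# templates then rely on the compact keys alone.
SENTIMENT_ACTIONS = {"pos": "approve", "neg": "reject", "neu": "review"}

def shorthand_rule(mapping: dict[str, str]) -> str:
    """Render the compact mapping rule for the system prompt."""
    return "Sentiment→action: " + ", ".join(
        f"{k}={v}" for k, v in mapping.items())
```

Because the same dict drives both the prompt and any output validation, the convention can't silently drift between the two.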
Here's a composite example - a production prompt for a customer support routing system - applying all four techniques:
# BEFORE - 210 tokens
You are a customer support routing assistant. Your job is to read
incoming customer messages and determine which team should handle
them. The teams are: billing, technical support, account management,
and general inquiries. Please analyze the message carefully,
consider the customer's likely intent, and respond in JSON with
the team name and your confidence score from 0 to 1. Here are
some examples to guide you:
Example 1: "My card was charged twice" → billing
Example 2: "I was billed incorrectly" → billing
Example 3: "The app keeps crashing" → technical
Example 4: "I can't log in" → technical
Now classify the following message:
# AFTER - 74 tokens
Route support message to: billing|technical|account|general.
Reply: {"team": "...", "confidence": 0-1}
Examples:
"My card was charged twice" → billing
"The app keeps crashing" → technical
Classify:
That's a 65% token reduction. Same routing accuracy, same output schema, two representative examples instead of four redundant ones.
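To connect that reduction back to the intro's numbers: at 10,000 calls per day, the 136 tokens saved per call (210 → 74) compound quickly. A quick sanity-check calculation:

```python
def monthly_token_savings(calls_per_day: int, tokens_saved_per_call: int,
                          days: int = 30) -> int:
    """Tokens saved per month at a given call volume."""
    return calls_per_day * tokens_saved_per_call * days

monthly_token_savings(10_000, 136)  # → 40,800,000 tokens per month
```

Multiply by your provider's per-token price to turn the figure into dollars.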
Compression has limits. Ambiguous tasks need more context, not less. If your prompt covers a complex multi-step workflow with conditional logic, stripping it down to imperative shorthand will cause the model to miss branches.
Also: compress instructions, not constraints. Safety rules, format requirements, and edge-case handlers should stay explicit. The tokens you save on filler should not come from the tokens that prevent hallucinations.
A good workflow is to use Rephrase to generate a compressed draft, then manually audit that every logical constraint survived the rewrite. Automated compression is a starting point, not a final pass.
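The manual audit can itself be checklist-driven. A minimal sketch - the `audit_constraints` helper is hypothetical, and substring matching is a crude proxy for "the constraint survived" - that flags required phrases missing from a compressed draft:

```python
def audit_constraints(compressed_prompt: str,
                      constraints: list[str]) -> list[str]:
    """Return required phrases missing from the compressed draft."""
    lowered = compressed_prompt.lower()
    return [c for c in constraints if c.lower() not in lowered]
```

Any non-empty result means a manual re-add before the prompt ships; an empty result still doesn't replace running your evals.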
Verbose prompts feel safer. More words seem like more guidance. But the research pushes back hard on this intuition: extra tokens in reasoning chains compound errors [2], extra tokens in structured outputs add latency [1], and extra tokens in retrieved context introduce noise [3].
The discipline of prompt compression is fundamentally the same as good writing: say exactly what you mean, then stop. In production at scale, that discipline has a direct dollar value.
**What is prompt compression?** Prompt compression is the process of rewriting prompts to use fewer tokens while preserving the information the model needs to respond accurately. Techniques include instruction distillation, example pruning, and shorthand conventions. It reduces cost and latency in production systems.

**How much can prompt compression save?** Results vary by technique and prompt type. Reference compression and instruction distillation typically achieve 30-60% token reduction. Research from OPSDC shows up to 57-59% token reduction on reasoning tasks while improving accuracy.

**What is instruction distillation?** Instruction distillation means rewriting verbose natural-language instructions into a minimal, imperative form - removing articles, hedging language, and redundant context while keeping every logical constraint intact.