Klarna became the poster child for AI customer support efficiency. Then it became the warning label.
The part I find most interesting is this: the likely failure was not "AI is bad at support." It was that the agent was dropped into production with the wrong informational world around it.
Klarna is the classic context engineering case because it shows what happens when an agent is optimized for throughput before it is optimized for judgment. The system may answer a lot of tickets cheaply, but if it lacks the right context, it becomes a fast, polite, high-scale mistake generator.[2]
In the most-cited account of the case, Klarna's AI agent reportedly handled two-thirds of customer inquiries, with work equivalent to 853 full-time employees and large cost savings.[2] But the story did not end there. Klarna's CEO later admitted the company had over-focused on cost and that service quality suffered.[2]
That reversal matters. It tells us something bigger than "customers still want humans." It tells us that agent deployment quality depends on the context layer, not just the model layer.
A support agent does not need abstract intelligence first. It needs the right facts at the right moment: order history, refund rules, loyalty considerations, previous conversations, edge-case policies, and when to escalate. If that bundle is missing, stale, or overloaded with junk, the agent will still respond. It just will not respond well.
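That bundle can be made concrete as a typed structure. Here is a minimal sketch in Python; every field name is an illustrative assumption, not anyone's real schema:

```python
from dataclasses import dataclass, field

@dataclass
class SupportContext:
    """The minimum facts a support agent needs before answering a refund ticket.

    All field names are illustrative assumptions, not a real company's schema.
    """
    order_history: list[str]        # recent orders, most recent first
    refund_policy: str              # policy text for the customer's market
    prior_interactions: list[str] = field(default_factory=list)  # last 30 days
    loyalty_tier: str = "standard"  # retention signal for trade-offs
    edge_case_flags: list[str] = field(default_factory=list)     # e.g. "delayed_delivery"

    def is_complete(self) -> bool:
        # An agent should escalate rather than answer from a hollow bundle.
        return bool(self.order_history and self.refund_policy)
```

The point of `is_complete` is the failure mode above: without a check like this, the agent still responds, it just responds from an empty bundle.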
Bad context engineering means the agent sees either too little, too much, or the wrong mix of information at the wrong time. In production systems, that usually creates answers that are technically plausible but strategically dumb.[2][3]
The recent context engineering literature is useful here because it moves the conversation beyond prompt tips. Vishnyakova describes context engineering as designing the full informational environment in which an agent acts, including memory, policies, tool outputs, and visibility boundaries.[2] Calboreanu makes a similar point from a practitioner angle: incomplete context was associated with 72% of iteration cycles in a 200-interaction observational study.[3]
That framing helps explain Klarna. A customer support agent can fail in at least four predictable ways:
| Failure mode | What the customer sees | Likely context problem |
|---|---|---|
| Formulaic answers | "Sorry, that's our policy" replies | Missing customer-specific history |
| Wrong trade-offs | Low-empathy cost-cutting | No encoded loyalty or retention goals |
| Inconsistent responses | Different answer each time | Weak retrieval and policy conflict handling |
| Premature closure | Ticket resolved, customer unhappy | Poor escalation and verification loops |
Here's the catch: none of these are fixed by adding "be more empathetic" to the prompt.
Long-running agents get worse over time because their context grows, drifts, and accumulates irrelevant material. As more steps, tool outputs, and partial conclusions pile up, the model becomes more likely to miss constraints, stop exploring too early, or carry forward distorted facts.[1]
That is not just theory. LOCA-bench, a benchmark for long-context agents, found that performance falls sharply as environment description length increases, even when the underlying task stays the same.[1] The paper highlights several failure modes that feel painfully relevant to support automation: weaker instruction following, insufficient exploration, and hallucination-like inconsistencies as context grows.[1]
In plain English: the longer the agent runs, the easier it is for it to become overconfident and underinformed.
That maps directly onto enterprise support. A customer service agent is almost never answering from one clean prompt. It is operating across conversation history, retrieval from knowledge bases, policy documents, CRM records, and tool results. If the context window becomes cluttered, the agent can start acting like a tired employee who skimmed half the file and guessed the rest.
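One common mitigation is to compact the conversation before each turn instead of appending forever. A rough sketch, where `summarize` is a stand-in for any real summarization step (a model call, an extractive heuristic):

```python
MAX_TURNS_VERBATIM = 6  # assumption: keep only the newest turns word-for-word

def compact_history(
    turns: list[str],
    summarize=lambda ts: f"[summary of {len(ts)} earlier turns]",
) -> list[str]:
    """Keep recent turns verbatim; collapse older ones into one summary entry.

    `summarize` is a placeholder for a real summarization call. The point is
    bounding context growth so late turns are not buried under early clutter.
    """
    if len(turns) <= MAX_TURNS_VERBATIM:
        return turns
    older, recent = turns[:-MAX_TURNS_VERBATIM], turns[-MAX_TURNS_VERBATIM:]
    return [summarize(older)] + recent
```

Compaction trades some fidelity on old turns for reliable attention on new ones, which is usually the right trade for support conversations.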
Cost optimization made the problem worse because it likely encouraged thinner context, fewer escalations, and more aggressive automation thresholds. That improves dashboard metrics fast, but it can quietly raise the real cost: lost trust, repeat contacts, churn, and brand damage.[2]
This is one of the sharpest insights from the Klarna discussion in the research: minimizing token spend and maximizing customer experience often push in opposite directions unless you engineer context carefully.[2]
I think this is where a lot of teams get trapped. They ask, "How do we make the agent cheaper?" before they ask, "What information must the agent always have to avoid making an expensive mistake?"
A bad support answer is not cheap. It is just cheap to generate.
Here's a simple before-and-after way to think about it:
Before:

```
Answer customer questions about refunds using our help center.
```

After:

```
Answer refund questions using:
- the customer's order history
- current refund policy by market
- prior support interactions from the last 30 days
- exceptions for delayed delivery, damaged items, and loyalty-tier customers
If policy is ambiguous or customer value is high, escalate instead of denying.
```
That second version is still not enough for a full agent system, but it points in the right direction: better context, explicit trade-offs, and safe exits.
Tools that sit in your workflow, like Rephrase, can help rewrite rough instructions into clearer prompts fast, but production agents still need context architecture behind the prompt. That is the real work.
Teams should design the context pipeline before scaling the agent. That means deciding what information is authoritative, what gets retrieved per task, what should be compressed, what must stay isolated, and when a human should take over.[2][3]
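Those decisions can be written down as a per-ticket retrieval plan. A sketch, where the ticket types, source names, and escalation triggers are all assumptions:

```python
# Which sources to retrieve per ticket type, and when to hand off to a human.
# All names are illustrative assumptions, not a real routing configuration.
CONTEXT_PLAN = {
    "refund": {
        "sources": ["order_history", "refund_policy", "loyalty_tier"],
        "escalate_if": ["policy_ambiguous", "high_customer_value"],
    },
    "fraud": {
        "sources": ["order_history", "auth_log"],
        "escalate_if": ["always"],  # fraud should never be fully automated here
    },
    "delivery": {
        "sources": ["order_history", "carrier_status"],
        "escalate_if": ["lost_parcel"],
    },
}

def plan_for(ticket_type: str) -> dict:
    # Unknown ticket types default to escalation instead of guessing.
    return CONTEXT_PLAN.get(ticket_type, {"sources": [], "escalate_if": ["always"]})
```

The useful property is the default: a ticket type the plan does not recognize escalates rather than letting the agent improvise from whatever context happens to be lying around.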
This is where I'd start.
First, define the minimum support context for each ticket type. Refunds, fraud, delivery issues, and account access do not need the same data.
Second, separate policy context from customer context. One tells the agent what is allowed. The other tells it what is wise.
Third, make escalation a feature, not a failure. Research on agent systems repeatedly suggests that structured verification and staged workflows outperform one-shot generation.[3]
Fourth, test for quality loss under context growth, not just first-response speed. More teams should benchmark the "ticket 7 in a messy conversation" case, not just the clean demo case.
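The "ticket 7" test can be as simple as replaying a long, noisy conversation and grading the reply at every turn. A harness sketch, where `agent` and `grade` are stand-ins for your own agent call and scoring function:

```python
def quality_over_turns(agent, grade, conversation: list[str]) -> list[float]:
    """Replay a conversation turn by turn and grade each reply.

    `agent(history)` and `grade(reply)` are stand-ins for a real agent call
    and evaluator. Only the trend across turns matters, not the scale.
    """
    scores: list[float] = []
    history: list[str] = []
    for turn in conversation:
        history.append(turn)
        scores.append(grade(agent(history)))
    return scores

def degrades(scores: list[float], tolerance: float = 0.1) -> bool:
    # Flag runs whose late-turn quality falls well below early-turn quality.
    early = sum(scores[:3]) / max(len(scores[:3]), 1)
    late = sum(scores[-3:]) / max(len(scores[-3:]), 1)
    return late < early - tolerance
```

A run that passes the first-response check but trips `degrades` is exactly the clean-demo-versus-messy-ticket gap described above.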
If you want more practical breakdowns like this, the Rephrase blog is a good place to keep digging into prompt and agent workflow design.
The real lesson is that AI agents fail at the system level before they fail at the sentence level. Klarna's story is a reminder that prompt engineering matters, but context engineering decides whether an agent belongs in production.[2][3]
My take is simple: Klarna did not just over-automate. It likely under-specified what the agent should know, how it should decide, and when it should stop pretending it knew enough.
That is why this case matters beyond customer support. Every team building agents for sales, ops, coding, or internal help desks should pay attention. If the context is wrong, scale just makes the mistake louder.
And yes, tools like Rephrase can help you tighten prompts fast across apps. But for real agent deployments, the durable advantage is not prettier wording. It is building the right context system behind the words.
What is context engineering? It is the practice of deciding what an agent sees, remembers, retrieves, and ignores at each step. It goes beyond prompt wording and covers memory, tools, policies, history, and data selection.
Can better prompts alone fix these failures? Usually not by themselves. Once an agent runs across multiple steps and tools, the bigger issue is often context design, retrieval quality, memory handling, and business rules rather than prompt phrasing alone.