Klarna became the poster child for AI customer support efficiency. Then it became the warning label.
The part I find most interesting is this: the likely failure was not "AI is bad at support." It was that the agent was dropped into production with the wrong informational world around it.
Klarna is the classic context engineering case because it shows what happens when an agent is optimized for throughput before it is optimized for judgment. The system may answer a lot of tickets cheaply, but if it lacks the right context, it becomes a fast, polite, high-scale mistake generator.[2]
In the most-cited account of the case, Klarna's AI agent reportedly handled two-thirds of customer inquiries, with work equivalent to 853 full-time employees and large cost savings.[2] But the story did not end there. Klarna's CEO later admitted the company had over-focused on cost and that service quality suffered.[2]
That reversal matters. It tells us something bigger than "customers still want humans." It tells us that agent deployment quality depends on the context layer, not just the model layer.
A support agent does not need abstract intelligence first. It needs the right facts at the right moment: order history, refund rules, loyalty considerations, previous conversations, edge-case policies, and when to escalate. If that bundle is missing, stale, or overloaded with junk, the agent will still respond. It just will not respond well.
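That bundle can be made concrete as a typed structure. Here is a minimal sketch in Python; every field name is an illustrative assumption, not anyone's real schema:

```python
from dataclasses import dataclass, field

@dataclass
class SupportContext:
    """The minimum facts a support agent needs before answering a refund ticket.

    All field names are illustrative assumptions, not a real company's schema.
    """
    order_history: list[str]        # recent orders, most recent first
    refund_policy: str              # policy text for the customer's market
    prior_interactions: list[str] = field(default_factory=list)  # last 30 days
    loyalty_tier: str = "standard"  # retention signal for trade-offs
    edge_case_flags: list[str] = field(default_factory=list)     # e.g. "delayed_delivery"

    def is_complete(self) -> bool:
        # An agent should escalate rather than answer from a hollow bundle.
        return bool(self.order_history and self.refund_policy)
```

The point of `is_complete` is the failure mode above: without a check like this, the agent still responds, it just responds from an empty bundle.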
Bad context engineering means the agent sees either too little, too much, or the wrong mix of information at the wrong time. In production systems, that usually creates answers that are technically plausible but strategically dumb.[2][3]
The recent context engineering literature is useful here because it moves the conversation beyond prompt tips. Vishnyakova describes context engineering as designing the full informational environment in which an agent acts, including memory, policies, tool outputs, and visibility boundaries.[2] Calboreanu makes a similar point from a practitioner angle: incomplete context was associated with 72% of iteration cycles in a 200-interaction observational study.[3]
That framing helps explain Klarna. A customer support agent can fail in at least four predictable ways:
| Failure mode | What the customer sees | Likely context problem |
|---|---|---|
| Formulaic answers | "Sorry, that's our policy" replies | Missing customer-specific history |
| Wrong trade-offs | Low-empathy cost-cutting | No encoded loyalty or retention goals |
| Inconsistent responses | Different answer each time | Weak retrieval and policy conflict handling |
| Premature closure | Ticket resolved, customer unhappy | Poor escalation and verification loops |
Here's the catch: none of these are fixed by adding "be more empathetic" to the prompt.
Long-running agents get worse over time because their context grows, drifts, and accumulates irrelevant material. As more steps, tool outputs, and partial conclusions pile up, the model becomes more likely to miss constraints, stop exploring too early, or carry forward distorted facts.[1]
That is not just theory. LOCA-bench, a benchmark for long-context agents, found that performance falls sharply as environment description length increases, even when the underlying task stays the same.[1] The paper highlights several failure modes that feel painfully relevant to support automation: weaker instruction following, insufficient exploration, and hallucination-like inconsistencies as context grows.[1]
In plain English: the longer the agent runs, the easier it is for it to become overconfident and underinformed.
That maps directly onto enterprise support. A customer service agent is almost never answering from one clean prompt. It is operating across conversation history, retrieval from knowledge bases, policy documents, CRM records, and tool results. If the context window becomes cluttered, the agent can start acting like a tired employee who skimmed half the file and guessed the rest.
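One common mitigation is to compact the conversation before each turn instead of appending forever. A rough sketch, where `summarize` is a stand-in for any real summarization step (a model call, an extractive heuristic):

```python
MAX_TURNS_VERBATIM = 6  # assumption: keep only the newest turns word-for-word

def compact_history(
    turns: list[str],
    summarize=lambda ts: f"[summary of {len(ts)} earlier turns]",
) -> list[str]:
    """Keep recent turns verbatim; collapse older ones into one summary entry.

    `summarize` is a placeholder for a real summarization call. The point is
    bounding context growth so late turns are not buried under early clutter.
    """
    if len(turns) <= MAX_TURNS_VERBATIM:
        return turns
    older, recent = turns[:-MAX_TURNS_VERBATIM], turns[-MAX_TURNS_VERBATIM:]
    return [summarize(older)] + recent
```

Compaction trades some fidelity on old turns for reliable attention on new ones, which is usually the right trade for support conversations.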
Cost optimization made the problem worse because it likely encouraged thinner context, fewer escalations, and more aggressive automation thresholds. That improves dashboard metrics fast, but it can quietly raise the real cost: lost trust, repeat contacts, churn, and brand damage.[2]
This is one of the sharpest insights from the Klarna discussion in the research: minimizing token spend and maximizing customer experience often push in opposite directions unless you engineer context carefully.[2]
I think this is where a lot of teams get trapped. They ask, "How do we make the agent cheaper?" before they ask, "What information must the agent always have to avoid making an expensive mistake?"
A bad support answer is not cheap. It is just cheap to generate.
Here's a simple before-and-after way to think about it:
Before:

```
Answer customer questions about refunds using our help center.
```

After:

```
Answer refund questions using:
- the customer's order history
- current refund policy by market
- prior support interactions from the last 30 days
- exceptions for delayed delivery, damaged items, and loyalty-tier customers
If policy is ambiguous or customer value is high, escalate instead of denying.
```
That second version is still not enough for a full agent system, but it points in the right direction: better context, explicit trade-offs, and safe exits.
Tools that sit in your workflow, like Rephrase, can help rewrite rough instructions into clearer prompts fast, but production agents still need context architecture behind the prompt. That is the real work.
Teams should design the context pipeline before scaling the agent. That means deciding what information is authoritative, what gets retrieved per task, what should be compressed, what must stay isolated, and when a human should take over.[2][3]
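Those decisions can be written down as a per-ticket retrieval plan. A sketch, where the ticket types, source names, and escalation triggers are all assumptions:

```python
# Which sources to retrieve per ticket type, and when to hand off to a human.
# All names are illustrative assumptions, not a real routing configuration.
CONTEXT_PLAN = {
    "refund": {
        "sources": ["order_history", "refund_policy", "loyalty_tier"],
        "escalate_if": ["policy_ambiguous", "high_customer_value"],
    },
    "fraud": {
        "sources": ["order_history", "auth_log"],
        "escalate_if": ["always"],  # fraud should never be fully automated here
    },
    "delivery": {
        "sources": ["order_history", "carrier_status"],
        "escalate_if": ["lost_parcel"],
    },
}

def plan_for(ticket_type: str) -> dict:
    # Unknown ticket types default to escalation instead of guessing.
    return CONTEXT_PLAN.get(ticket_type, {"sources": [], "escalate_if": ["always"]})
```

The useful property is the default: a ticket type the plan does not recognize escalates rather than letting the agent improvise from whatever context happens to be lying around.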
This is where I'd start.
First, define the minimum support context for each ticket type. Refunds, fraud, delivery issues, and account access do not need the same data.
Second, separate policy context from customer context. One tells the agent what is allowed. The other tells it what is wise.
Third, make escalation a feature, not a failure. Research on agent systems repeatedly suggests that structured verification and staged workflows outperform one-shot generation.[3]
Fourth, test for quality loss under context growth, not just first-response speed. More teams should benchmark the "ticket 7 in a messy conversation" case, not just the clean demo case.
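The "ticket 7" test can be as simple as replaying a long, noisy conversation and grading the reply at every turn. A harness sketch, where `agent` and `grade` are stand-ins for your own agent call and scoring function:

```python
def quality_over_turns(agent, grade, conversation: list[str]) -> list[float]:
    """Replay a conversation turn by turn and grade each reply.

    `agent(history)` and `grade(reply)` are stand-ins for a real agent call
    and evaluator. Only the trend across turns matters, not the scale.
    """
    scores: list[float] = []
    history: list[str] = []
    for turn in conversation:
        history.append(turn)
        scores.append(grade(agent(history)))
    return scores

def degrades(scores: list[float], tolerance: float = 0.1) -> bool:
    # Flag runs whose late-turn quality falls well below early-turn quality.
    early = sum(scores[:3]) / max(len(scores[:3]), 1)
    late = sum(scores[-3:]) / max(len(scores[-3:]), 1)
    return late < early - tolerance
```

A run that passes the first-response check but trips `degrades` is exactly the clean-demo-versus-messy-ticket gap described above.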
If you want more practical breakdowns like this, the Rephrase blog is a good place to keep digging into prompt and agent workflow design.
The real lesson is that AI agents fail at the system level before they fail at the sentence level. Klarna's story is a reminder that prompt engineering matters, but context engineering decides whether an agent belongs in production.[2][3]
My take is simple: Klarna did not just over-automate. It likely under-specified what the agent should know, how it should decide, and when it should stop pretending it knew enough.
That is why this case matters beyond customer support. Every team building agents for sales, ops, coding, or internal help desks should pay attention. If the context is wrong, scale just makes the mistake louder.
And yes, tools like Rephrase can help you tighten prompts fast across apps. But for real agent deployments, the durable advantage is not prettier wording. It is building the right context system behind the words.
What is context engineering? It is the practice of deciding what an agent sees, remembers, retrieves, and ignores at each step. It goes beyond prompt wording and covers memory, tools, policies, history, and data selection.
Can better prompts alone fix these failures? Usually not by themselves. Once an agent runs across multiple steps and tools, the bigger issue is often context design, retrieval quality, memory handling, and business rules rather than prompt phrasing alone.