LangGraph at Scale: What Klarna Shows

Learn how to design LangGraph production systems from Klarna-style workloads, with routing, state, and guardrails that hold up in production.

Ilia Ilinskii
Rephrase · June 9, 2026

Prompt engineering · 8 min read

I keep seeing the same mistake in agent projects: people treat orchestration like a magic upgrade. It isn't. The real story is harsher and more useful. Production LangGraph looks less like a demo and more like a constrained state machine with sharp edges.

Key Takeaways

LangGraph works best when the workflow has clear stages, not when you want vague "agent magic."
The failure mode is usually routing and state, not model IQ.
Production graphs need stage checks, narrow node prompts, and explicit fallback paths.
If the whole procedure fits in context, a single in-context prompt can outperform orchestration on quality.
Tools like Rephrase can speed up prompt cleanup across nodes and roles.

What does LangGraph production actually look like?

LangGraph in production looks like a workflow engine wrapped around an LLM, not a free-form chatbot. In practice, you are managing nodes, transitions, retries, and tool permissions. The hard part is not "making the model talk"; it is keeping it from taking the wrong branch, repeating work, or losing track of which stage it is in [1][2].

Klarna-style workloads make this obvious. Once you have something like 853 employees, multiple task types, and real business constraints, the system must know exactly what is legal to do next. That is where orchestration becomes governance, not just UX.

Why do orchestrated agents fail?

Orchestrated agents fail because each turn only sees a slice of the world. That makes the model locally competent and globally flaky. Research on procedural tasks shows that the same model often performs better when the full procedure is placed in the system prompt than when it is routed through LangGraph-style orchestration [1]. The issue is fragmentation: state, routing, and prompt injection all create new failure points.

The production lesson is simple. Every extra decision hub is another chance to misroute, repeat, or drift.

How should you design a production graph?

You should design a production graph like a business process, not like a conversation tree. The best pattern is to make each node do one thing, keep transitions explicit, and enforce preconditions before a tool call. SDOF's state-constrained dispatch work is a good example of why this matters: stage legality and precondition checks catch the kind of workflow violations that a plain graph can miss [2].

Here's the shape I'd use in serious systems:

Layer	What it does	Why it matters
Router	Chooses the next stage	Prevents random branching
Node prompt	Tells the model one job	Reduces prompt sprawl
Precondition check	Verifies required state	Blocks illegal actions
Tool layer	Executes side effects	Keeps actions auditable
Audit log	Records transitions	Makes debugging possible

That is the boring truth. And boring is good in production.
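The layers in the table can be sketched in plain Python, without touching LangGraph's own API. This is a minimal sketch under stated assumptions: the stage names, the `LEGAL_TRANSITIONS` table, the `REQUIRED_STATE` fields, and the `dispatch` helper are all hypothetical, made up for illustration rather than taken from Klarna, SDOF, or LangGraph.

```python
from dataclasses import dataclass, field

# Hypothetical stages and the transitions allowed between them.
LEGAL_TRANSITIONS = {
    "intake": {"triage"},
    "triage": {"billing", "support"},
    "billing": {"handoff"},
    "support": {"handoff"},
    "handoff": set(),
}

# Hypothetical preconditions: state keys required before a stage may be entered.
REQUIRED_STATE = {
    "triage": {"account_id"},
    "billing": {"account_id", "invoice_id"},
    "support": {"account_id"},
    "handoff": {"summary"},
}


@dataclass
class Run:
    stage: str = "intake"
    state: dict = field(default_factory=dict)
    audit_log: list = field(default_factory=list)


def dispatch(run: Run, proposed_stage: str) -> bool:
    """Advance only if the transition is legal and its preconditions hold."""
    if proposed_stage not in LEGAL_TRANSITIONS.get(run.stage, set()):
        result = "illegal_transition"
    else:
        missing = REQUIRED_STATE.get(proposed_stage, set()) - set(run.state)
        result = "missing_state:" + ",".join(sorted(missing)) if missing else "ok"
    # Every attempt is logged, including the rejected ones.
    run.audit_log.append({"from": run.stage, "to": proposed_stage, "result": result})
    if result == "ok":
        run.stage = proposed_stage
        return True
    return False


run = Run()
print(dispatch(run, "billing"))   # False: intake cannot jump straight to billing
print(dispatch(run, "triage"))    # False: account_id has not been collected
run.state["account_id"] = "acct-1"
print(dispatch(run, "triage"))    # True
print(run.audit_log[-1])
```

The point of the design is that the model only proposes the next stage. Plain code decides whether the move is allowed, and the rejected attempts stay in the log for debugging.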

What should node prompts look like?

Node prompts should be short, stage-specific, and hard to misunderstand. The production anti-pattern is a giant prompt that tries to explain the whole workflow at once. That works in notebooks and fails in real traffic. Research on compiled workflows and orchestration shows that the more structure you can move into the system, the less you need to depend on every node being "smart" [1][2].

A better node prompt says what this node owns, what it must never do, and what state it can assume.

You are the intake node.
Collect only missing booking details.
Do not present options yet.
If the budget is unclear, ask one clarification question.
If required fields are complete, hand off to the routing node.

That is much easier to maintain than a paragraph of policy soup.
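One way to keep prompts this narrow is to store them as structured data and render the text from it. The sketch below is an assumption of mine, not a LangGraph feature: `NodePrompt` and its fields (`owns`, `never`, `rules`) are hypothetical names, and the example simply reproduces the intake prompt above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class NodePrompt:
    name: str                # which node this prompt belongs to
    owns: str                # the single job the node is responsible for
    never: Tuple[str, ...]   # actions the node must not take
    rules: Tuple[str, ...]   # stage-specific rules, one sentence each

    def render(self) -> str:
        lines = [f"You are the {self.name} node.", self.owns]
        lines += [f"Do not {item}." for item in self.never]
        lines += list(self.rules)
        return "\n".join(lines)


intake = NodePrompt(
    name="intake",
    owns="Collect only missing booking details.",
    never=("present options yet",),
    rules=(
        "If the budget is unclear, ask one clarification question.",
        "If required fields are complete, hand off to the routing node.",
    ),
)

print(intake.render())
```

Keeping the prohibitions in their own field makes it easy to review, per node, what the model is not allowed to do.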

How do Klarna-style workloads change the architecture?

Klarna-style workloads push you toward more structure, not less. Once the task spans support, operations, staffing, or fulfillment, the agent is no longer chatting; it is executing a business process. That means domain-specific stages, explicit handoffs, and real constraints. The more operational the workflow, the more orchestration needs to behave like control software.

The interesting twist is that orchestration is not always the endgame. The same research that validates graphs also shows a competing pattern: if the procedure is stable and fits in context, in-context prompting can beat orchestration on quality [1]. So the production choice is not "graph versus no graph." It is "where should the control live?"
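To make that question concrete, here is a deliberately crude rule of thumb in code. It is my own reading of the trade-off, not a result from [1]: the function name `choose_control`, its parameters, and the four-characters-per-token estimate are all assumptions.

```python
def choose_control(procedure_text: str,
                   context_budget_tokens: int,
                   procedure_is_stable: bool,
                   needs_enforced_constraints: bool) -> str:
    """Return "graph" or "in_context" from a rough heuristic."""
    # Assumption: about 4 characters per token; use a real tokenizer if available.
    estimated_tokens = len(procedure_text) // 4
    fits_in_context = estimated_tokens <= context_budget_tokens

    if needs_enforced_constraints:
        # Side effects, approvals, or audit requirements: keep control in code.
        return "graph"
    if procedure_is_stable and fits_in_context:
        return "in_context"
    return "graph"


short_procedure = "Step 1: confirm the order ID. Step 2: explain the refund policy."
print(choose_control(short_procedure, 2000, True, False))   # in_context
print(choose_control(short_procedure, 2000, True, True))    # graph
```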

What does a real before/after prompt look like?

In production, the biggest prompt win is usually not smarter wording. It is removing ambiguity. Here is the kind of transformation I see all the time:

Before	After
"Help the user with the issue and be thorough."	"You are the triage node. Ask for the missing account detail, then route to billing or support. Do not resolve the case here."
"Be helpful and solve the request."	"You are the approval node. Only approve if the policy, budget, and manager consent are present."
"Continue the conversation naturally."	"You are the handoff node. Summarize state in one paragraph and stop generating after the transfer."

That difference sounds small. It isn't. It prevents the graph from becoming a polite mess.

Where does Rephrase fit in?

This is exactly the sort of workflow where Rephrase helps. If you are writing ten node prompts, a router prompt, and a fallback prompt, you do not want to hand-edit every version. Rephrase can rewrite rough drafts into cleaner, more specific prompts in seconds, which is useful when you are iterating on graph structure and node behavior at the same time.

I would not use it to invent the architecture. I would use it to tighten the language once the architecture is already right.

What should you measure in production?

You should measure transition accuracy, task completion, fallback rate, tool-call success, and how often the agent re-asks for data it already has. The most important metric is usually not raw response quality. It is whether the graph moves forward without illegal transitions or loops. In production, the expensive failures are usually state failures, not wording failures [2].

If the agent keeps "trying" to help but never advances the workflow, that's not a prompt problem anymore. It's an orchestration problem.
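Several of these metrics fall out of a transition log with very little code. The sketch below assumes a hypothetical event format (the `type`, `legal`, `to`, `ok`, and `already_known` fields are invented for illustration); task completion is left out because it needs per-run outcomes rather than per-event records.

```python
# Hypothetical event log; the field names are assumptions for illustration.
events = [
    {"type": "transition", "to": "triage", "legal": True},
    {"type": "transition", "to": "billing", "legal": True},
    {"type": "transition", "to": "intake", "legal": False},
    {"type": "transition", "to": "fallback", "legal": True},
    {"type": "tool_call", "ok": True},
    {"type": "tool_call", "ok": False},
    {"type": "ask", "field": "account_id", "already_known": False},
    {"type": "ask", "field": "account_id", "already_known": True},
]


def rate(numerator: int, denominator: int) -> float:
    """Safe ratio that returns 0.0 when there is nothing to count."""
    return numerator / denominator if denominator else 0.0


def summarize(events: list) -> dict:
    transitions = [e for e in events if e["type"] == "transition"]
    tool_calls = [e for e in events if e["type"] == "tool_call"]
    asks = [e for e in events if e["type"] == "ask"]
    return {
        "transition_accuracy": rate(sum(e["legal"] for e in transitions), len(transitions)),
        "fallback_rate": rate(sum(e["to"] == "fallback" for e in transitions), len(transitions)),
        "tool_call_success": rate(sum(e["ok"] for e in tool_calls), len(tool_calls)),
        "re_ask_rate": rate(sum(e["already_known"] for e in asks), len(asks)),
    }


print(summarize(events))
```

A rising `re_ask_rate` or falling `transition_accuracy` points at state and routing, which is where the text above says the expensive failures live.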

So what's the bottom line?

LangGraph in production is useful when you need explicit control, auditability, and stage-based execution. It is weaker when you are just adding orchestration because the task sounds sophisticated. Klarna-style scale does not magically make graphs better; it makes weak graphs fail louder. Build the smallest graph that enforces the business rules, keep node prompts sharp, and let the model do less, not more.

If you want more practical prompting breakdowns like this, browse the Rephrase blog. The best production prompt is usually the one that survives contact with real state.

References

Documentation & Research

1. In-Context Prompting Obsoletes Agent Orchestration for Procedural Tasks - arXiv
2. SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch - arXiv

Community Examples

3. Lessons from deploying RAG bots for regulated industries - r/LocalLLaMA

Frequently asked

What is LangGraph used for in production?

LangGraph is used to build stateful agent workflows with branching, loops, and tool use. It is best when the task needs explicit control over transitions, retries, and state.

When should I use a graph instead of one prompt?

Use a graph when the workflow has real stages, external tools, or legal/operational constraints. If the procedure is short enough to fit cleanly in context, a single prompt can still win.

How can I make LangGraph more reliable?

Keep nodes narrow, enforce stage checks, log every transition, and add precondition validation before tool calls. Tools like Rephrase can help you rewrite each node prompt faster.