
Tracing Multi-Agent Workflows with Trees

Master multi-agent tracing with tree-structured logs, causal graphs, and better debugging. Learn why trees beat flat spans and see examples inside.

Ilia Ilinskii
Rephrase · June 7, 2026

Prompt engineering · 7 min read


If you've ever stared at a wall of spans and thought, "Okay, but what actually caused this agent to go sideways?", you already know the problem. Multi-agent systems don't fail like single requests. They branch, recurse, hand off, and contaminate downstream state.

Key Takeaways

Flat span lists preserve time, but they lose causality.
Tree-shaped traces mirror how multi-agent workflows actually execute.
Causal backtracking makes root-cause analysis faster and more reliable.
Hierarchical traces are better for debugging, replay, and human review.
Tools like Rephrase can help you turn messy observations into cleaner prompts for tracing, debugging, and incident analysis.

Why do flat span lists break down in multi-agent workflows?

Flat span lists are fine when a system is mostly linear. In multi-agent workflows, they collapse too much structure: parent-child relationships, routing decisions, nested tool calls, and cross-agent dependencies disappear into a timestamped stream. That makes it hard to see which branch created the bad state and which branch merely inherited it [1][2].

The core issue is that time order is not the same as causal order.
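A small sketch makes the gap concrete. The span IDs, names, and timestamps below are invented for illustration, and the `parent_id` field is an assumption about what your tracer records. Sorting by time puts an unrelated event right before the failure, while following parent links recovers the branch that actually produced it.

```python
# Hypothetical spans: (span_id, parent_id, start_ms, name). All values are illustrative.
spans = [
    ("s1", None, 0, "router"),
    ("s2", "s1", 5, "agent.research"),
    ("s3", "s1", 6, "agent.verifier"),
    ("s4", "s2", 9, "tool.search"),
    ("s5", "s3", 10, "tool.lint"),
    ("s6", "s2", 12, "summarize (failed)"),
]

# Flat view: everything ordered by start time, branches interleaved.
by_time = [name for _, _, _, name in sorted(spans, key=lambda s: s[2])]

parent_of = {sid: pid for sid, pid, _, _ in spans}
name_of = {sid: name for sid, _, _, name in spans}


def ancestry(span_id):
    """Return span names from the root down to span_id by following parent links."""
    path = []
    while span_id is not None:
        path.append(name_of[span_id])
        span_id = parent_of[span_id]
    return list(reversed(path))


causal_path = ancestry("s6")
print("time order: ", by_time)
print("causal path:", causal_path)
```

In the time-ordered list, the event just before the failure is `tool.lint`, which belongs to the verifier branch. The causal path is router, then agent.research, then the failed step.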

What does a tree capture that a list misses?

A tree captures provenance. When one agent delegates to another, or one tool result influences a later routing decision, the trace should show a parent-child edge, not just a nearby timestamp. That matters because failures often originate upstream and only show up much later. In tree form, you can trace the path from symptom back to decision [1].

This is exactly why hierarchical plans are easier to debug: they preserve structure without forcing you to reconstruct it by hand [3].

Why do trees fit agent behavior better?

Multi-agent systems are usually graph-shaped in design but tree-like in execution. A coordinator picks a branch, a specialist calls a tool, the tool output creates a new subdecision, and the workflow continues. Even when the architecture is a DAG, the runtime trace often behaves like an execution tree because each decision opens a new path [2][3].

That's why a tree is such a good observability primitive. It matches how agents actually execute, not just how logs are appended.

How do causal graphs improve debugging?

Causal graphs turn "what happened" into "what caused what." In AgentTrace, the system reconstructs a causal graph from execution logs, then walks backward from the error node to candidate root causes [1]. The paper's key point is simple: you don't need to inspect every span if you can identify ancestor nodes that changed the trajectory.

That approach is stronger than eyeballing a trace because it encodes dependency, not just sequence.
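As a rough illustration of backward walking, here is a plain breadth-first search over reversed edges. It is not the AgentTrace algorithm itself, only the general idea; the edge list and node names are hypothetical.

```python
from collections import deque

# Hypothetical causal edges (cause -> effect); node names are illustrative.
edges = [
    ("router.choose_research", "research.search"),
    ("research.search", "memory.write:summary"),
    ("memory.write:summary", "verifier.check"),
    ("router.choose_verifier", "verifier.check"),
    ("verifier.check", "outcome.error"),
    ("router.choose_research", "billing.lookup"),
]


def backtrack(edges, error_node):
    """Breadth-first walk over reversed edges; returns ancestors nearest-first."""
    causes = {}
    for cause, effect in edges:
        causes.setdefault(effect, []).append(cause)

    seen, order, queue = {error_node}, [], deque([error_node])
    while queue:
        node = queue.popleft()
        for cause in causes.get(node, []):
            if cause not in seen:
                seen.add(cause)
                order.append(cause)
                queue.append(cause)
    return order


candidates = backtrack(edges, "outcome.error")
print(candidates)
```

Nodes that never fed the error, such as `billing.lookup` here, drop out of the candidate list, which is the practical payoff: you inspect ancestors only. A real system would also rank the candidates, which this sketch does not attempt.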

What should production traces record?

Production traces should record routing decisions, tool invocations, memory writes, message handoffs, and outcome events. Those are the moments where the workflow can fork or contaminate downstream state. The contamination paper shows that uncertainty can change decomposition and routing, not just final answers, which means the trace has to capture intermediate state transitions too [2].

If your tracing only records final outputs, you'll miss the bad branch entirely.
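One way to pin this down is a single typed event record. The sketch below is an assumption about shape, not a standard schema: the class names, field names, and payload keys are all hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict
from enum import Enum
from typing import Optional


class Kind(str, Enum):
    # The five event types listed above; the string values are made up.
    ROUTING = "routing_decision"
    TOOL = "tool_invocation"
    MEMORY_WRITE = "memory_write"
    HANDOFF = "message_handoff"
    OUTCOME = "outcome"


@dataclass
class TraceEvent:
    event_id: str
    parent_id: Optional[str]  # None for the root of a run
    agent: str
    kind: Kind
    payload: dict = field(default_factory=dict)  # intermediate state, not just final output

    def to_json(self) -> str:
        record = asdict(self)
        record["kind"] = self.kind.value  # store the plain string, not the enum object
        return json.dumps(record, sort_keys=True)


event = TraceEvent(
    event_id="e2",
    parent_id="e1",
    agent="router",
    kind=Kind.ROUTING,
    payload={"chosen": "research", "rejected": ["verifier"]},
)
line = event.to_json()
print(line)
```

Recording the rejected options alongside the chosen one is a design choice worth considering: it lets you see later which branches the router could have taken.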

How do trees help with contamination and drift?

Trees make divergence visible. In structured workflows, a small upstream perturbation can trigger a longer execution, a different branch, or a silent semantic error that still looks locally plausible [2]. That's the real production pain: the workflow can "recover" and still end up costly, or stay structurally similar and still be wrong.

A tree lets you compare branch shape, depth, and ancestor state. A flat list mostly just tells you that time passed.
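Here is a minimal sketch of that comparison, assuming each run is already available as a nested tree. Both runs are made-up examples and the helper names are hypothetical.

```python
# A run is a nested tuple: (node_name, [children]). Both runs are invented examples.
baseline = ("router", [
    ("agent.research", [("tool.search", [])]),
    ("agent.writer", []),
])
perturbed = ("router", [
    ("agent.research", [("tool.search", []), ("tool.search", [])]),
    ("agent.verifier", [("agent.research", [("tool.search", [])])]),
])


def depth(node):
    """Number of nodes on the longest root-to-leaf path."""
    _, children = node
    return 1 + max((depth(c) for c in children), default=0)


def size(node):
    """Total number of nodes in the tree."""
    _, children = node
    return 1 + sum(size(c) for c in children)


def first_divergence(a, b, path=()):
    """Return the path to the first point where two runs differ, or None."""
    (name_a, kids_a), (name_b, kids_b) = a, b
    if name_a != name_b:
        return path + (f"{name_a} != {name_b}",)
    here = path + (name_a,)
    for child_a, child_b in zip(kids_a, kids_b):
        found = first_divergence(child_a, child_b, here)
        if found:
            return found
    if len(kids_a) != len(kids_b):
        return here + ("child count differs",)
    return None


print(depth(baseline), size(baseline))
print(depth(perturbed), size(perturbed))
print(first_divergence(baseline, perturbed))
```

The perturbed run is deeper and larger, and the first divergence sits under `agent.research`, where an extra tool call appears. None of these three signals can be read off a flat list without first rebuilding the tree.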

Before and after: flat spans vs tree traces

View	What you see	What you miss or give up	Best use
Flat span list	Ordered events and latency	Causality, nesting, branch ancestry	Quick timeline, export, basic monitoring
Tree trace	Parent-child execution structure	Compactness; harder to scan at first	Debugging, replay, root-cause analysis
Causal graph	Dependencies between actions	Only as complete as the instrumentation behind it	Failure localization, contamination analysis

The important thing is not that lists are useless. It's that lists are the wrong source of truth once your workflow branches.

What does a better trace look like in practice?

Here's the pattern I'd use. Start with a tree of actions, attach typed metadata to each node, and preserve the parent edge from every agent handoff. Then layer in causal links for data dependencies. That gives you a runtime tree for navigation and a causal graph for diagnosis [1][2].

root
├─ router: choose research agent
│  ├─ agent.research: search docs
│  └─ tool.search: returns ambiguous evidence
└─ router: escalate to verifier
   └─ verifier: retries with stricter prompt

That structure tells you immediately where the system branched, what it consumed, and why the next agent existed.
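The diagram above can be modeled with a small node type that carries both kinds of edges. This is a sketch under assumptions: the `Node` fields, the node IDs, and the `depends_on` link from the escalation to the ambiguous search result are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    kind: str    # typed metadata, e.g. "router", "agent", "tool"
    label: str
    children: list = field(default_factory=list)    # runtime tree edges
    depends_on: list = field(default_factory=list)  # causal links: ids of nodes whose output was consumed

    def add(self, child):
        self.children.append(child)
        return child


def render(node, prefix="", is_last=True, is_root=True):
    """Return the tree as a list of lines drawn with box characters."""
    if is_root:
        lines, child_prefix = [node.label], ""
    else:
        lines = [prefix + ("└─ " if is_last else "├─ ") + node.label]
        child_prefix = prefix + ("   " if is_last else "│  ")
    for i, child in enumerate(node.children):
        lines += render(child, child_prefix, i == len(node.children) - 1, False)
    return lines


root = Node("n0", "root", "root")
r1 = root.add(Node("n1", "router", "router: choose research agent"))
r1.add(Node("n2", "agent", "agent.research: search docs"))
r1.add(Node("n3", "tool", "tool.search: returns ambiguous evidence"))
# Assumed causal link: the escalation happened because of the ambiguous evidence in n3.
r2 = root.add(Node("n4", "router", "router: escalate to verifier", depends_on=["n3"]))
r2.add(Node("n5", "agent", "verifier: retries with stricter prompt"))

print("\n".join(render(root)))
```

Keeping `children` and `depends_on` separate is the point of the design: the first is for navigation, the second is for diagnosis, and a node can depend on something that is not its parent.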

How should teams instrument this without overbuilding?

Don't start with perfect observability. Start with the minimum set of structural events: choose-agent, tool-call, message-send, message-receive, memory-write, and outcome. Then add tree reconstruction on top of those events. Structured logging frameworks for agents already argue for richer runtime telemetry because static auditing is too weak for nondeterministic workflows [4].
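Tree reconstruction on top of a flat log can be very small. The sketch below assumes every event carries an `id` and a `parent` field, which is an assumption about your logger rather than a given; the sample events are invented.

```python
from collections import defaultdict

# Hypothetical flat event log using the six event kinds named above.
events = [
    {"id": "e1", "parent": None, "kind": "choose-agent", "agent": "router"},
    {"id": "e2", "parent": "e1", "kind": "message-send", "agent": "router"},
    {"id": "e3", "parent": "e2", "kind": "message-receive", "agent": "research"},
    {"id": "e4", "parent": "e3", "kind": "tool-call", "agent": "research"},
    {"id": "e5", "parent": "e3", "kind": "memory-write", "agent": "research"},
    {"id": "e6", "parent": "e1", "kind": "outcome", "agent": "router"},
    {"id": "e7", "parent": "e99", "kind": "tool-call", "agent": "unknown"},
]


def build_tree(events):
    """Group event ids by parent; report roots and events whose parent is missing."""
    ids = {e["id"] for e in events}
    children = defaultdict(list)
    roots, orphans = [], []
    for e in events:
        parent = e["parent"]
        if parent is None:
            roots.append(e["id"])
        elif parent in ids:
            children[parent].append(e["id"])
        else:
            orphans.append(e["id"])  # parent was never logged
    return roots, dict(children), orphans


roots, children, orphans = build_tree(events)
print(roots, children, orphans)
```

Surfacing orphans instead of silently dropping them is deliberate. An event whose parent never arrived usually points to a gap in instrumentation, and that is worth knowing before you trust the tree.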

If you're building prompts or incident summaries from these logs, a tool like Rephrase can save time by turning rough notes into cleaner, action-oriented prompts for investigation or replay.

My take: tree-first tracing is the default now

If your system has more than one agent, flat spans are already a compromise. They're useful for transport, but not for understanding. Trees give you the shape of execution, and shape is what you need when the bug is hiding in a branch, not in the last line of output.

That's the big shift: trace the workflow the way it actually unfolds, then debug it the way causality works. If you want more practical prompting and AI workflow articles like this, check out the Rephrase blog.

References

Documentation & Research

[1] AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems. arXiv.
[2] Trace-Level Analysis of Information Contamination in Multi-Agent Systems. arXiv.
[3] STRUCTUREDAGENT: Planning with AND/OR Trees for Long-Horizon Web Tasks. arXiv.
[4] AgentTrace: A Structured Logging Framework for Agent System Observability. arXiv.

Frequently asked

Why are trees better than flat span lists for agent tracing?

Trees preserve parent-child relationships, causality, and nested decisions. Flat lists show order, but they hide why a branch happened and what it depended on.

How do I debug a failed multi-agent run?

Start at the error node and trace backward through the causal graph to the first decision that changed the path. That usually finds the root cause faster than reading the whole timeline.

Can flat spans still be useful?

Yes, as an export format or quick timeline. But for production debugging, they work best as a view on top of a tree, not as the source of truth.