Master multi-agent tracing with tree-structured logs, causal graphs, and better debugging. Learn why trees beat flat spans and see examples inside.
If you've ever stared at a wall of spans and thought, "Okay, but what actually caused this agent to go sideways?", you already know the problem. Multi-agent systems don't fail like single requests. They branch, recurse, hand off, and contaminate downstream state.
Key Takeaways
Flat span lists are fine when a system is mostly linear. In multi-agent workflows, they collapse too much structure: parent-child relationships, routing decisions, nested tool calls, and cross-agent dependencies disappear into a timestamped stream. That makes it hard to see which branch created the bad state and which branch merely inherited it [1][2].
The core issue is that time order is not the same as causal order.
A tree captures provenance. When one agent delegates to another, or one tool result influences a later routing decision, the trace should show a parent-child edge, not just a nearby timestamp. That matters because failures often originate upstream and only show up much later. In tree form, you can trace the path from symptom back to decision [1].
This is exactly why hierarchical plans are easier to debug: they preserve structure without forcing you to reconstruct it by hand [3].
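To make the time-versus-cause distinction concrete, here is a minimal sketch that assumes each span carries a `parent` field. The span IDs, names, and field layout are hypothetical and not taken from any particular tracing library.

```python
from collections import defaultdict

# Hypothetical flat spans, already sorted by start time.
spans = [
    {"id": "s1", "parent": None, "name": "coordinator", "start": 0},
    {"id": "s2", "parent": "s1", "name": "router: pick research", "start": 1},
    {"id": "s3", "parent": "s1", "name": "router: pick verifier", "start": 2},
    {"id": "s4", "parent": "s2", "name": "tool.search", "start": 3},
    {"id": "s5", "parent": "s3", "name": "verifier", "start": 4},
]


def build_children(spans):
    """Index spans by parent so the flat list can be navigated as a tree."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent"]].append(span["id"])
    return children


def path_to_root(span_id, spans):
    """Follow parent edges from one span back to the root."""
    by_id = {s["id"]: s for s in spans}
    path = []
    while span_id is not None:
        path.append(by_id[span_id]["name"])
        span_id = by_id[span_id]["parent"]
    return path


children = build_children(spans)
print(path_to_root("s5", spans))
```

In the timestamped stream, `tool.search` sits right before `verifier`, yet it never appears in the verifier's ancestry. The parent edges say the verifier came from a different routing decision, which is the information a flat list drops.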
Multi-agent systems are usually graph-shaped in design but tree-like in execution. A coordinator picks a branch, a specialist calls a tool, the tool output creates a new subdecision, and the workflow continues. Even when the architecture is a DAG, the runtime trace often behaves like an execution tree because each decision opens a new path [2][3].
That's why a tree is such a good observability primitive. It matches how agents actually execute, not just how logs are appended.
Causal graphs turn "what happened" into "what caused what." In AgentTrace, the system reconstructs a causal graph from execution logs, then walks backward from the error node to candidate root causes [1]. The paper's key point is simple: you don't need to inspect every span if you can identify ancestor nodes that changed the trajectory.
That approach is stronger than eyeballing a trace because it encodes dependency, not just sequence.
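The backward walk can be sketched as a plain breadth-first search over inverted edges. This is not AgentTrace's actual algorithm, only an illustration of the idea under assumed inputs; the node names and the `edges` layout are hypothetical.

```python
from collections import deque

# Hypothetical causal edges: each cause maps to the effects it influenced.
edges = {
    "route:research": ["tool:search"],
    "tool:search": ["memory:write_summary"],
    "memory:write_summary": ["route:verifier"],
    "route:verifier": ["verifier:answer"],
    "tool:calendar": ["memory:write_schedule"],  # unrelated branch
}


def ancestors_by_distance(edges, error_node):
    """Return causal ancestors of error_node, nearest first."""
    # Invert the graph so we can walk from effect back to cause.
    parents = {}
    for cause, effects in edges.items():
        for effect in effects:
            parents.setdefault(effect, []).append(cause)

    seen = {error_node}
    order = []
    queue = deque([error_node])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


candidates = ancestors_by_distance(edges, "verifier:answer")
print(candidates)
```

The unrelated `tool:calendar` branch never enters the candidate list, so the search space shrinks to the nodes that could actually have influenced the error.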
Production traces should record routing decisions, tool invocations, memory writes, message handoffs, and outcome events. Those are the moments where the workflow can fork or contaminate downstream state. The contamination paper shows that uncertainty can change decomposition and routing, not just final answers, which means the trace has to capture intermediate state transitions too [2].
If your tracing only records final outputs, you'll miss the bad branch entirely.
Trees make divergence visible. In structured workflows, a small upstream perturbation can trigger a longer execution, a different branch, or a silent semantic error that still looks locally plausible [2]. That's the real production pain: the workflow can "recover" and still end up costly, or stay structurally similar and still be wrong.
A tree lets you compare branch shape, depth, and ancestor state. A flat list mostly just tells you that time passed.
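As a rough sketch of what comparing branch shape can mean in practice, the snippet below computes depth and leaf count for two runs. The nested-dict layout and the node names are assumptions made for illustration.

```python
def node(name, *children):
    """Build a hypothetical trace node as a plain dict."""
    return {"name": name, "children": list(children)}


def depth(n):
    """Number of levels from this node down to its deepest leaf."""
    return 1 + max((depth(c) for c in n["children"]), default=0)


def leaf_count(n):
    """Number of terminal branches under this node."""
    if not n["children"]:
        return 1
    return sum(leaf_count(c) for c in n["children"])


def shape(n):
    return {"depth": depth(n), "leaves": leaf_count(n)}


# A baseline run and a run where an upstream perturbation added a retry branch.
baseline = node("root", node("router", node("tool.search")))
perturbed = node(
    "root",
    node("router", node("tool.search")),
    node("router", node("verifier", node("verifier.retry"))),
)

print(shape(baseline), shape(perturbed))
```

A jump in depth or leaf count between two runs of the same task is a cheap signal that something upstream changed the trajectory, even when the final output still looks plausible.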
| View | What you see | What you miss or trade off | Best use |
|---|---|---|---|
| Flat span list | Ordered events and latency | Causality, nesting, branch ancestry | Quick timeline, export, basic monitoring |
| Tree trace | Parent-child execution structure | Compactness; harder to eyeball at first | Debugging, replay, root-cause analysis |
| Causal graph | Dependency between actions | Implementation detail, unless well instrumented | Failure localization, contamination analysis |
The important thing is not that lists are useless. It's that lists are the wrong source of truth once your workflow branches.
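One way to read that: keep the tree as the stored form and derive the list when you need a timeline. A minimal sketch, assuming each node records a `start` time (field names are hypothetical):

```python
# Hypothetical trace tree; "start" is an illustrative logical timestamp.
tree = {
    "name": "root", "start": 0, "children": [
        {"name": "router: research", "start": 1, "children": [
            {"name": "tool.search", "start": 3, "children": []},
        ]},
        {"name": "router: verifier", "start": 2, "children": [
            {"name": "verifier", "start": 4, "children": []},
        ]},
    ],
}


def flatten(node, depth=0):
    """Depth-first walk that emits (start, depth, name) rows."""
    rows = [(node["start"], depth, node["name"])]
    for child in node["children"]:
        rows.extend(flatten(child, depth + 1))
    return rows


# The flat timeline is just a sorted view over the tree.
timeline = sorted(flatten(tree), key=lambda row: row[0])
for start, depth, name in timeline:
    print(start, depth, name)
```

Going this direction is lossless for the timeline and trivial to compute. Going the other way, from a list without parent edges back to a tree, is the part you cannot do reliably.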
Here's the pattern I'd use. Start with a tree of actions, attach typed metadata to each node, and preserve the parent edge from every agent handoff. Then layer in causal links for data dependencies. That gives you a runtime tree for navigation and a causal graph for diagnosis [1][2].
root
├─ router: choose research agent
│  ├─ agent.research: search docs
│  └─ tool.search: returns ambiguous evidence
└─ router: escalate to verifier
   └─ verifier: retries with stricter prompt
That structure tells you immediately where the system branched, what it consumed, and why the next agent existed.
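Here is a minimal sketch of that pattern: a node type with typed metadata, a parent edge, and a separate list of causal links. The `TraceNode` class, its fields, and the `caused_by` link from the search result to the escalation are my assumptions for illustration, not a published schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TraceNode:
    node_id: str
    kind: str                       # typed metadata, e.g. "router", "agent", "tool"
    label: str
    parent_id: Optional[str] = None  # structural edge: who spawned this node
    caused_by: list = field(default_factory=list)  # causal edges: whose data it consumed
    children: list = field(default_factory=list)

    def add(self, node_id, kind, label, caused_by=()):
        """Attach a child and preserve the parent edge."""
        child = TraceNode(node_id, kind, label, parent_id=self.node_id,
                          caused_by=list(caused_by))
        self.children.append(child)
        return child


def render(node, prefix="", is_last=True, is_root=True):
    """Return the tree as a list of box-drawing lines."""
    if is_root:
        lines = [node.label]
        child_prefix = ""
    else:
        lines = [prefix + ("└─ " if is_last else "├─ ") + node.label]
        child_prefix = prefix + ("   " if is_last else "│  ")
    for i, child in enumerate(node.children):
        last = i == len(node.children) - 1
        lines.extend(render(child, child_prefix, last, False))
    return lines


root = TraceNode("n0", "root", "root")
research = root.add("n1", "router", "router: choose research agent")
research.add("n2", "agent", "agent.research: search docs")
research.add("n3", "tool", "tool.search: returns ambiguous evidence")
# Assumed causal link: the ambiguous search result triggered the escalation.
escalate = root.add("n4", "router", "router: escalate to verifier", caused_by=["n3"])
escalate.add("n5", "verifier", "verifier: retries with stricter prompt")

print("\n".join(render(root)))
```

Keeping `parent_id` and `caused_by` as separate fields is the design choice that matters here. The first gives you the tree for navigation, and the second gives you cross-branch edges for diagnosis without turning the tree itself into a graph.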
Don't start with perfect observability. Start with the minimum set of structural events: choose-agent, tool-call, message-send, message-receive, memory-write, and outcome. Then add tree reconstruction on top of those events. Structured logging frameworks for agents already argue for richer runtime telemetry because static auditing is too weak for nondeterministic workflows [4].
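A minimal event schema for that starting set might look like the sketch below. The six event kinds come from the list above; the `TraceEvent` fields and the JSON-lines output format are assumptions for illustration.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional


class EventKind(str, Enum):
    CHOOSE_AGENT = "choose-agent"
    TOOL_CALL = "tool-call"
    MESSAGE_SEND = "message-send"
    MESSAGE_RECEIVE = "message-receive"
    MEMORY_WRITE = "memory-write"
    OUTCOME = "outcome"


@dataclass
class TraceEvent:
    event_id: str
    parent_id: Optional[str]  # None only for the root event
    kind: EventKind
    agent: str
    detail: str

    def to_json(self) -> str:
        """Serialize as one JSON line, with the kind stored as its string value."""
        record = asdict(self)
        record["kind"] = self.kind.value
        return json.dumps(record, sort_keys=True)


events = [
    TraceEvent("e1", None, EventKind.CHOOSE_AGENT, "router", "research"),
    TraceEvent("e2", "e1", EventKind.TOOL_CALL, "research", "search docs"),
    TraceEvent("e3", "e2", EventKind.OUTCOME, "research", "ambiguous evidence"),
]

for event in events:
    print(event.to_json())
```

Using an enum for the kind means an unknown event type such as `EventKind("final-answer")` raises a `ValueError` at write time, rather than showing up later as an unclassifiable node during tree reconstruction.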
If you're building prompts or incident summaries from these logs, a tool like Rephrase can save time by turning rough notes into cleaner, action-oriented prompts for investigation or replay.
If your system has more than one agent, flat spans are already a compromise. They're useful for transport, but not for understanding. Trees give you the shape of execution, and shape is what you need when the bug is hiding in a branch, not in the last line of output.
That's the big shift: trace the workflow the way it actually unfolds, then debug it the way causality works.
Documentation & Research
Trees preserve parent-child relationships, causality, and nested decisions. Flat lists show order, but they hide why a branch happened and what it depended on.
Start at the error node and trace backward through the causal graph to the first decision that changed the path. That usually finds the root cause faster than reading the whole timeline.
Flat span lists are still useful as an export format or quick timeline. But for production debugging, they work best as a view on top of a tree, not as the source of truth.