
Tracing

How tracing fits into the loop

Traditional software is largely deterministic: executions follow a pre-defined path. For LLM applications that's not the case. Agent executions can be messy; we are dealing with emergent behavior, rich and unexpected inputs and outputs, and variable execution order. You need something else to follow your agent's behavior: traces.

A trace is a structured record of what your application did for a single request: which steps it took, what data it saw, what it produced.

Tracing is central to the entire improvement loop. Every other step, from reviewing and building datasets to running experiments and evaluating, operates on traces.

If you're already familiar with traditional observability concepts, some of what follows may feel repetitive. Feel free to skim or skip ahead.

The anatomy of a trace

A trace can be as complex or as simple as your application requires, but all traces share the same basic structure. It's composed of a set of observations that map out the path your agent took.

An observation is a single step in the process. It has an input, an output, start/end time, and metadata about what happened during that step.
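
To make the shape concrete, here is a minimal, illustrative sketch of an observation as a data structure. The field names are simplified and hypothetical; real tracing SDKs record more detail.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any

@dataclass
class Observation:
    """Hypothetical, simplified shape of a single traced step."""
    name: str
    input: Any                 # what the step received
    output: Any                # what the step produced
    start_time: datetime
    end_time: datetime
    metadata: dict[str, Any] = field(default_factory=dict)  # anything else worth recording
```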

Hierarchy

A trace has a hierarchical tree structure. Nested inside are observations that can contain other observations, forming a parent-child structure that mirrors the actual execution of your AI application.

Trace: rag-chat-pipeline (4.23s)
├── Span: retrieve-documents (1.07s)
│   ├── Generation: embed-query (150ms)
│   ├── Span: vector-db-search (510ms)
│   │   └── Event: cache-miss (@ 225ms)
│   └── Generation: rerank-results (180ms)
└── Span: generate-answer (3.16s)
    ├── Generation: generate-answer (1.07s)
    └── Event: stream-complete (@ 1.93s)

You can see what happened in what order, and which steps were part of which larger step.
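
You rarely build this tree by hand; instrumentation derives it from your call structure. As a minimal sketch, assuming the decorator-based Langfuse Python SDK (v2-style imports; the pipeline functions themselves are illustrative), nesting falls out of ordinary function calls:

```python
from langfuse.decorators import observe

@observe()  # called from retrieve_documents, so it becomes a child observation
def vector_db_search(query: str) -> list[str]:
    return ["doc-1", "doc-2"]  # stand-in for a real vector store lookup

@observe()  # a child of the root observation
def retrieve_documents(query: str) -> list[str]:
    return vector_db_search(query)

@observe()  # the outermost decorated call becomes the root of the trace
def rag_chat_pipeline(question: str) -> str:
    docs = retrieve_documents(question)
    return f"answer grounded in {len(docs)} documents"  # stand-in for generation

rag_chat_pipeline("Why was I charged twice this month?")
```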

Observation data

Input and output. Every observation can have an input and an output. Most of the time it will have both; in some specific cases it might only have one of the two. It's important for interpretability that you set an input and/or output that makes sense for the type of action happening in that observation.
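
With a decorator-based SDK, function arguments and return values are usually captured as the observation's input and output by default. When those defaults are not meaningful, you can override them. A sketch assuming the v2-style Langfuse decorator API (`strip_tags` is a hypothetical helper):

```python
from langfuse.decorators import observe, langfuse_context

def strip_tags(html: str) -> str:
    return html.replace("<p>", "").replace("</p>", "")  # placeholder cleanup

@observe()
def summarize(raw_html: str) -> str:
    text = strip_tags(raw_html)
    summary = text[:80]  # stand-in for a real model call
    # Record the cleaned text instead of the raw HTML blob so the
    # observation stays readable when you review traces later.
    langfuse_context.update_current_observation(input=text, output=summary)
    return summary
```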

Observation types. To make it easy to differentiate between operations, observations come in different types. Each type captures a different kind of agent interaction.

| Action of an agent | Observation type | Typical observation input/output |
| --- | --- | --- |
| A call to a language model | generation | Full prompt or message history as input, the completion as output, plus metadata like the model name and token counts |
| A step that fetches information from an external source | retriever | Query and the returned documents |
| An invocation of a tool or function by an agent | tool | Which tool was called, the arguments, and the return value |
| General processes | span | Highly dependent on the use case |

Observation types make it easier to read traces and to filter. In a trace with 20 observations, being able to quickly spot the LLM calls saves time.
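
With the decorator API, observations default to spans; model calls can be typed explicitly. A sketch assuming the v2-style decorator API, where `as_type="generation"` is supported (other types such as tool or retriever are set similarly, depending on SDK version):

```python
from langfuse.decorators import observe

@observe(as_type="generation")  # an LLM call: recorded as a generation
def call_model(prompt: str) -> str:
    return "completion goes here"  # stand-in for a real model call

@observe()  # a general processing step: recorded as a span by default
def postprocess(completion: str) -> str:
    return completion.strip()
```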

Cost, latency, token usage

Beyond input and output, there are a few attributes on observations that are table stakes in any LLM application: cost, latency, and token usage. These are recorded per observation and aggregated at the trace level.
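
Latency and token usage are measured directly; cost is usually derived from token usage and per-token prices. A back-of-the-envelope sketch with made-up prices:

```python
# Hypothetical prices; real values depend on the model and provider.
PRICE_PER_M_INPUT_TOKENS = 3.00    # USD per 1M input tokens
PRICE_PER_M_OUTPUT_TOKENS = 15.00  # USD per 1M output tokens

def generation_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of a single generation observation, in USD."""
    return (
        (input_tokens / 1_000_000) * PRICE_PER_M_INPUT_TOKENS
        + (output_tokens / 1_000_000) * PRICE_PER_M_OUTPUT_TOKENS
    )

# A generation with 1,200 prompt tokens and 350 completion tokens:
print(generation_cost(1_200, 350))  # 0.0036 + 0.00525 = 0.00885 USD
```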

Traces vs sessions

Most of the time, a single trace will not capture an agent's entire lifecycle. Traces can be grouped into sessions. But where do you draw the line between a trace and a session?

Session
├── Trace
│   ├── Observation
│   ├── Observation
│   └── Observation
├── Trace
│   ├── Observation
│   ├── Observation
│   ├── Observation
│   └── Observation
└── Trace
    ├── Observation
    └── Observation

A general rule of thumb is: one trace corresponds to one invocation of your system, typically one API call or one agent execution. A session then groups multiple traces together, for example all the turns in a multi-turn conversation.

To make this concrete, here are two example applications and how their trace/session splits are designed:

A customer support chatbot embedded on a SaaS company's help page. Users open it to ask questions about their account, billing, or product usage. A typical session is a handful of back-and-forth turns until the issue is resolved or the conversation is handed off to a human agent.

The trace and session split looks like this:

Session: conversation_8f2a
├── Trace: "Why was I charged twice this month?"
├── Trace: "Can you refund the duplicate?"
└── Trace: "Thanks, when will I see it?"

That way, the chatbot's handling of a given input can be evaluated on a single trace in isolation, while the conversation as a whole can be inspected at the session level.
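
In code, this grouping is just a shared identifier set on each trace. A sketch assuming the v2-style Langfuse decorator API, where `update_current_trace` accepts a `session_id`:

```python
from langfuse.decorators import observe, langfuse_context

@observe()  # every chat turn becomes its own trace
def handle_chat_turn(conversation_id: str, user_message: str) -> str:
    # The shared session_id groups all turns of one conversation.
    langfuse_context.update_current_trace(session_id=conversation_id)
    return f"reply to: {user_message}"  # stand-in for the actual pipeline

handle_chat_turn("conversation_8f2a", "Why was I charged twice this month?")
```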

An automated code review agent that runs on every commit pushed to an open pull request. For each run, the agent reads the diff and the surrounding files, executes static checks, and posts inline review comments. A PR usually receives several review runs over its lifetime as the author iterates.

The trace and session split looks like this:

Session: PR #1234
├── Trace: review run on commit a3f9b
├── Trace: review run on commit c7d2e
└── Trace: review run on commit 8b1f0

Here a trace is one review iteration. The quality of the posted comments depends on everything the agent read along the way, so splitting each step of a review iteration into its own trace would scatter that context and make the run hard to follow.

The shared tradeoff: cut too small and a trace loses the cohesion to tell a story; cut too large and individual failures get buried inside an unreadable trace.

Where to start

If you're just getting started, focus on instrumenting one real workflow end to end before trying to cover every possible path.

  1. Set up tracing for one important request path in your application (see the sketch after this list).
  2. Make sure each observation captures useful input, output, and metadata for the step it represents.
  3. Review a handful of real traces manually to confirm that the structure is easy to follow and useful for debugging.
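
As a starting point, here is a compact end-to-end sketch that combines the pieces above, assuming the v2-style Langfuse decorator API (the two pipeline steps are illustrative):

```python
from langfuse.decorators import observe, langfuse_context

@observe(as_type="generation")  # step 2: the model call, typed as a generation
def draft_reply(question: str, docs: list[str]) -> str:
    return f"drafted reply using {len(docs)} documents"  # stand-in for the model

@observe()  # the root observation: one invocation = one trace
def support_bot(session_id: str, question: str) -> str:
    langfuse_context.update_current_trace(session_id=session_id)
    docs = ["billing FAQ"]  # step 1: fetch context (simplified)
    return draft_reply(question, docs)

support_bot("conversation_8f2a", "Why was I charged twice this month?")
langfuse_context.flush()  # ensure buffered traces are sent in short-lived scripts
```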

What comes next

Once traces are flowing, you can move on to the next step: monitoring. Monitoring is what connects traces to the loop of improving and iterating on your agent.

