Braintrust and Laminar both ingest your traces, both speak OpenTelemetry, and both let you ask questions about what your AI did in production. Pick either one and you will see your LLM calls, tool calls, and token counts. So the interesting question is not "which one logs traces." It is "which workflow does the product actually optimize for," because that decision shows up in every screen you touch afterward.
Braintrust optimizes for the eval loop: write a scorer, sweep it across prompts and models, catch the regression before the PR merges. Laminar optimizes for the debug loop: open a ten-minute agent run, read what it did, and find the tool call that broke it. Those are different products that happen to share a data model.
This is a comparison written by the Laminar team, so weigh it accordingly. We have tried to keep it grounded: every Braintrust feature, price, and limit below comes from Braintrust's own docs and pricing page as of June 2026, and we call out the cases where Braintrust is the better tool. Verify anything budget- or compliance-critical against the current vendor pages before you commit.
Short answer
- Laminar: best for catching agent failures in production, knowing why in seconds, and confirming the fix. You describe failures in plain English and Laminar flags them across every run, transcript view shows you why without scrolling a span tree, SQL runs over all of it (traces, signals, clusters, evals), the rollout debugger reruns the agent from any step, and recurring failures become eval datasets that catch the regression. Apache 2.0, and trace compression that changes the bill.
- Braintrust: best when your bottleneck is CI eval regression. Mature scorer harness, immutable experiments, a regression dashboard, and a polished playground for prompt iteration against frozen datasets.
If your work is shifting from "did this change break the eval set" to "why is this agent failing on prod," that shift is the whole reason this comparison exists.
Feature comparison
| Laminar | Braintrust | |
|---|---|---|
| Primary workflow | Debug agents in production | CI eval regression |
| Trace UX default | Transcript view (conversation) | Logs table to span tree; thread view available |
| Outcome tracking | Signals: plain-language extraction with a JSON schema, backfill + live | Online scoring (numeric scores) + Topics (auto-classification) |
| Emergent pattern discovery | Clusters over signal events (automatic) | Topics pipeline (beta) |
| SQL over data | Real SQL over traces, signals, clusters, evals (editor, API, CLI, MCP) | BTQL dialect over logs/experiments (rate-limited on Free/Pro) |
| Agent re-run | Rollout debugger: rerun from any span | Playground: iterate a prompt on a dataset |
| Evals framework | Yes (datasets, code + LLM-judge evaluators) | Yes, mature (experiments, autoevals, regression) |
| License | Apache 2.0 (full platform) | Closed source (SDKs + autoevals are OSS) |
| Self-host | Free, Docker Compose or Helm, all features | Enterprise-only hybrid deployment |
| Pricing shape | Data volume + Signals steps | Processed data (GB) + scores |
| Trace compression | Content-addressed dedup, ~20x on agents | Not documented |
| OpenTelemetry | Native | Ingestion path with a translation layer |
How each models a run
Both products use a trace-of-spans model, and the shapes rhyme. A Braintrust trace is one end-to-end execution; spans carry semantic types (llm, tool, function, score, task, eval) and record input, output, metrics, and scores. The same span format is reused for production logs and for eval runs, which is the clean idea at the center of Braintrust: instrument once, evaluate and observe with the same code.
Laminar's model is also span-based, also OpenTelemetry, also typed (LLM, TOOL, DEFAULT, plus evaluation types). The difference is not the schema. It is what the product does with it on the first screen.
Braintrust opens a trace into a span hierarchy: tool calls nest under LLM calls, and you click down the tree. There is a thread view that strips the hierarchy and lays out messages, tool calls, and scores in chronological order, which is the right shape for reading a conversation. But the default surface, and the one the rest of the product feeds, is the eval-centric trace.
Laminar makes the conversation the default. Transcript view renders the run top-to-bottom: the agent input, every LLM turn, each tool call with its arguments and result, and every subagent as a single collapsed card. The span tree is one click away when you need to confirm nesting or inspect attributes. On a 2,000-span trace this is the difference between a ten-second read and a ten-minute scroll.
Three things transcript view gives you without any extra instrumentation:
- The input to every agent and subagent, parsed out of the system and user messages and rendered as a labelled Input block. You do not open the first LLM span and scroll the messages array to find what you asked the agent to do.
- Subagents as cards, not a flood of spans. Click a card and it expands in place with that subagent's own turns and tool calls; the rest stay collapsed.
- One-line previews on every LLM turn and tool row, so you can scan a long run without expanding anything.
Signals vs Topics
This is the comparison most teams actually care about, and it is the one most likely to be described wrong, so here is the careful version.
Both products know that opening one trace at a time does not scale. When ten thousand agents ran overnight, you cannot read them. Both built a layer on top of raw traces to answer cross-run questions. They built different layers.
Braintrust splits the job in two. Online scoring runs scorers automatically against a sample of production traces and writes back a numeric score (you set a sampling rate, 1 to 10 percent for high-volume apps). Separately, Topics is an unsupervised pipeline: a daily batch job converts traces to narrative text, extracts summaries through built-in facets (Task, Sentiment, Issues), clusters the summaries, and once roughly 100 summaries accumulate it names the clusters (for example, "Refund requests"). Topics is built for discovery: surfacing failure modes you did not know to look for. It is also, per Braintrust's docs, in beta, with custom topics gated to Pro and above, and early access only for self-hosted deployments.
Laminar splits the job the other way. Signals are directed. You describe an outcome in plain language ("agent looped on the same tool without progress," "user asked for the same thing twice and gave up") and pair it with a JSON output schema. Laminar reads each matching trace, runs your prompt, and emits a structured event that conforms to your schema. The event is not a single number. It is a payload with whatever fields you defined, linked back to the trace, and queryable as if it were an extra column on the traces table.
The flexibility falls out of the approach. A Signal is just a prompt plus an output schema, run by a model that reads the entire trace. There is no fixed taxonomy of "scores" or "metrics" you have to map your problem onto. If you can describe the thing to a teammate, you can write a Signal for it, and the categories you get back are the ones you defined, not the ones the tool ships with. The same primitive covers all of these without changing tools or learning a new feature:
- Business outcomes: "agent completed checkout," "agent answered the question correctly."
- Logical failures: "agent looped on the same tool without progress," "agent gave up after one retry."
- Behavioral categorization: user intent, topic, any categorical field you want on the trace.
- User friction: "user asked for the same information twice," "user rephrased the request three times."
- Cost and waste: "long context, short answer," "tool called with malformed arguments and retried."
A scorer answers "how good, on a scale." A Signal answers "what happened, in the shape I asked for." That difference is why one Signal can replace a category enum, a quality score, and a free-text reason at once.
Two more properties separate Signals from online scoring:
- Backfill. A Signal runs in two modes. Triggers fire on new traces as they arrive (Batch or Realtime). Jobs run the same Signal across a historical slice. The failure mode you care about today is not the one your scorer caught last month. With a Job you name the new failure now and have it tagged across last week's traffic, no re-instrumentation and no re-labeling.
- Structured payloads, not scores. A Signal can extract an enum, a reason string, a confidence, a list of offending tool names in a single event. Those become filterable, queryable columns. A scorer gives you a float.
Every new project also ships with a Failure Detector Signal already running, categorizing issues as tool_error, api_error, logic_error, looping, wrong_tool, timeout, or other on any trace over 1,000 tokens. You get outcome tracking on day one without writing anything.
So is Laminar missing Braintrust's unsupervised discovery? No, that is the part that gets misread. Laminar runs Clusters on top of signal events automatically: it summarizes each event, embeds it, and groups events hierarchically into broad clusters that break down into specific sub-clusters. A cluster that was empty yesterday and full today is a new failure mode, the same discovery use case Topics targets. The difference is the input. Topics clusters all of your raw traffic on a daily batch; Laminar clusters the events a directed Signal already extracted, in close to real time, and lets you alert on a brand-new cluster.
The honest framing: if your need is "tell me what categories exist in traffic I have never looked at," Braintrust Topics is purpose-built for that and worth a look (once it is out of beta for your deployment). If your need is "track these specific outcomes, get a structured record per run, backfill them across history, and still get emergent clustering on top," that is the shape Signals plus Clusters fit.
Evals: where Braintrust leads
Credit where it is due. Braintrust is an eval company first, and it shows. The autoevals library ships a deep catalog of scorers (Factuality, Closed QA, Faithfulness, Context recall, Answer relevancy, plus heuristics like Levenshtein and JSON diff), all open source under MIT. Experiments are immutable and comparable, which is exactly what you want for regression detection: re-run the suite on a PR, diff the scores, block the merge if something dropped. The playground supports diff mode across models and prompts, scorer annotation, and Loop-powered prompt optimization.
Laminar has an evaluations framework too: datasets, code evaluators, LLM-as-a-judge evaluators, and human evaluators, with runs you compare across groups. It is solid. But if your single most important workflow is scorer-driven CI regression against frozen datasets, Braintrust's harness is more mature and more specialized, and we are not going to pretend otherwise. The two products compose cleanly: Braintrust in CI, Laminar in production.
Re-running the agent
When a bug is "the agent chose the wrong tool at span 1,400," a scorer sweep over a static dataset will not tell you why. You need to rerun the agent from that point.
Braintrust's playground iterates a prompt against dataset rows: change the prompt, rerun, diff the outputs. Good for prompt tuning, not built to replay a captured multi-step agent run with its surrounding tool calls and state intact.
Laminar's rollout debugger does that. Mark an agent entrypoint, set a checkpoint on a span, override the system prompt in the UI, and rerun from there. Laminar reuses the cached earlier steps so you do not wait for the agent to grind back to the interesting point, and the new trace lands in the same page. For browser agents the session replay is synced to spans, so you can see what the agent actually saw.
SQL over everything, not just traces
This is the part that compounds. Laminar ships an in-product SQL editor over ClickHouse, and it does not stop at the spans table. Every primitive in the product is a queryable table: spans and traces for raw execution, signal_events for the structured outcomes your Signals extract, signal_runs for which Signals ran and whether they failed, clusters for the emergent groupings, evaluation_datapoints for scores and executor output, dataset_datapoints for your datasets, and logs for streaming output.
That means the interesting questions become one query instead of a pipeline. "How many runs called tool X more than five times and then errored" is a query on spans. "Which traces did my checkout_failed Signal fire on, joined against the cluster they landed in, last 7 days" is a query on signal_events. "Of the runs my agent_gave_up Signal flagged, what was the median token count and which model" mixes signal payloads with span metrics. Your hand-written outcome labels, the automatic clusters built on top of them, and the raw spans underneath all live in the same query surface. No dataset export, no notebook, no API round-trip to stitch them together.
You can hit it four ways: the in-product editor, a REST API (POST /v1/sql/query), the CLI (lmnr-cli sql query), and an MCP server. The MCP path matters for the agentic workflow: point Claude Code or Cursor at it and the coding agent fixing your bug can query your production outcomes directly, then verify its own fix against the same tables.
Braintrust has BTQL, a SQL-like query language, plus bt sql on the CLI and CSV/JSON export. It is capable and it does reach across logs, experiments, and datasets. Two differences. It is its own dialect, not SQL, so you learn BTQL rather than reusing what you know. And the catch documented on Braintrust's own pricing material is that BTQL is rate-limited on the Free and Pro plans; the high-throughput tier is Enterprise.
The cost difference is structural, not a discount
Both products bill on the volume of trace data you send. That makes the comparison sound like a per-GB price match. It is not, because of how agent traces are shaped.
A long-running agent re-sends its entire conversation history on every turn. Turn 50 carries the content of turns 1 through 49 again. Stored naively, a k-turn trace holds roughly k(k+1)/2 copies of message content: a 50-turn run stores 1,275 message-equivalents for 50 unique messages. Storage scales with the square of run length. Double the run, quadruple the bytes.
Laminar deduplicates this. It hashes each message in canonical form, stores each unique message once per trace, and keeps compact hash arrays on spans instead of repeated full text, reconstructing the full history at query time. The result, measured across Laminar's customer base, is roughly 20x average storage reduction, reaching 50x on 50-turn coding agents with stable system prompts and long tool outputs. The metric you are billed on is exactly the metric that shrinks.
Braintrust's docs do not describe message-level content deduplication, so a 50-turn agent stores the full quadratic payload. On the same workload, the GB you pay for is the uncompressed number.
Then there is the second axis. Braintrust bills processed data and scores: Pro includes 50,000 scores per month, then $1.50 per 1,000. Online scoring and per-span outcomes burn through that count, and there is no tier between Pro and Enterprise, so a team that emits many per-span scores can hit the score ceiling well before the data ceiling. Laminar bills data volume plus Signals steps; there is no separate per-score meter on your production outcomes.
| Plan | Laminar | Braintrust |
|---|---|---|
| Free | $0: 1 GB, 7-day retention, 1 seat | $0: 1 GB, 10k scores, 14-day retention |
| Entry paid | Hobby $30/mo: 3 GB, 30-day retention, unlimited seats | Pro $249/mo: 5 GB, 50k scores, 30-day retention |
| Mid paid | Pro $150/mo: 10 GB, 6-month retention, unlimited seats | (no tier between Pro and Enterprise) |
| Top | Enterprise: custom, on-prem | Enterprise: custom, hybrid self-host |
Net: Laminar's Pro is $150/month for 10 GB at 6-month retention against Braintrust's $249/month for 5 GB at 30-day retention, and the 10 GB you send Laminar represents far more agent activity because of compression. Model it on your own trace volume, retention needs, and how many per-span scores you emit. Braintrust does not precisely define what counts as a processed GB, so confirm that before you extrapolate.
Open source and self-hosting
Laminar is Apache 2.0, the whole platform. You can run it with docker compose up for local work, or the production Helm chart for a real deployment, and every feature ships on the open-source image: Signals, the SQL editor, the rollout debugger, all of it. There is no Enterprise-gated feature tier for self-hosting. See hosting options.
Braintrust's platform is closed source. The SDKs (Apache 2.0 and MIT) and the autoevals library (MIT) are open, but Brainstore, the backend that actually stores and serves your data, is proprietary with no public repo. Self-hosting exists only as an Enterprise "hybrid deployment": Braintrust runs the control plane as SaaS while the data plane runs in your cloud via their Terraform modules. If your legal review requires OSS, if you need air-gapped, or if you just want to run the thing on your own Helm chart without a sales cycle, that is a hard gate.
OpenTelemetry
Both ingest OpenTelemetry, but at different depths. Laminar is OTel-native: OpenInference and OpenLLMetry spans flow in without re-instrumenting, and if you already run OTel pipelines, Laminar is a backend you point them at.
Braintrust accepts OTLP at a dedicated endpoint and maps GenAI semantic conventions onto its own format. But its own SDKs emit a proprietary span format, not OTel spans; OTel is supported through a BraintrustSpanProcessor wrapper and a server-side translation layer. In practice that translation layer is where the friction shows up (their knowledge base documents quirks like null inputs on OTel-sourced spans). It works. It is just not the native path.
Which should you choose
Choose Braintrust if:
- Your primary workload is scorer-driven regression testing in CI against frozen datasets.
- You want a deep, mature autoeval catalog and immutable experiments for reproducible regression detection.
- You are fine with closed-source SaaS and have no near-term need to self-host outside an Enterprise contract.
- Unsupervised discovery over all raw traffic (Topics) is the specific thing you want, and beta status is acceptable for your deployment.
Choose Laminar if:
- Your bottleneck has moved from "did this change regress" to "why is this agent failing in production."
- You are debugging long-running, multi-step, or multi-agent runs and want a transcript instead of a span tree.
- You want directed outcome tracking with structured payloads, historical backfill, automatic clustering, and alerts (Signals plus Clusters).
- You want real SQL over everything (traces, signals, clusters, evals), an agent rollout debugger, or browser-agent session replay.
- You need OSS self-hosting with every feature included, no Enterprise gate.
- Your agents are long-running and trace compression materially changes your bill.
The two are not mutually exclusive. OpenTelemetry supports multiple exporters, so a common pattern is Braintrust wired into CI for regression and Laminar in production for debugging, instrumented once. Over time, the production-facing subset of Braintrust scorers often becomes Laminar Signals, because Signals backfill across history and fire on new traces automatically.
Start with the free tier, instrument one agent, and read a real trace in transcript view. If the time-to-answer on a failing run does not change, tell us why.
Try Laminar free · Read the docs · Braintrust alternatives, ranked
Last updated: June 2026. Verify features and pricing against each vendor's current documentation before committing.