Evaluations

Evaluations are how you answer one question before shipping a prompt, model, or agent change: is this version better than the one in production. You answer it by running the new version against a fixed set of inputs, scoring every output, and comparing the scores to the previous run. Laminar evaluations are the offline testing layer for your agents and LLM pipelines. You define a list of inputs, a function that produces an output, and one or more functions that score it. Laminar runs them in parallel, traces every call, stores the scores, and compares them across runs and groups.

Evaluation run page with a datapoint table on the left and the selected datapoint's transcript on the right — A single run: the datapoint table with a score per evaluator, and the selected datapoint's trace showing the executor, the model call, and both evaluator scores

What an evaluation can help you answer

Evaluations are for questions you can answer with a score. Some of them are obvious, some are not:

Did the new prompt break anything? Same dataset, new prompt, see if any scores regressed.
Is the cheaper model good enough for this task? Swap gpt-5 for gpt-5-mini, run the same inputs, compare.
Does this tool actually do what I think it does? Score the tool’s output against a target, not the agent’s final answer.
Did the last hundred production traces expose a case my evals don’t cover? Pull the failing traces into a dataset and rerun the evaluation against them.

If you’ve ever changed a prompt and hoped nothing broke, this is the answer to that hope.

Explore a run, then track progress across runs

The run view above answers why a datapoint scored the way it did: every datapoint and its scores on the left, the selected datapoint’s full trace on the right, so you go from a score to the exact model call behind it in one click. The group view answers did the score move: runs sharing a group name are charted together, one line per score dimension, so a regression on any dimension is visible the moment the run lands.

Evaluation list for the capitals group with a trend chart at the top showing accuracy flat at 1.0 and length_ok falling to 0.0 — Three runs of the same evaluation, grouped: the progression chart shows length_ok dropping from 1.0 to 0.0 on the third run

Render eval traces your way

Every datapoint in a run has the same trace shape, so reviewing a run means reading the same few fields over and over. Custom rendering lets you replace the default trace view with one built for your eval: a small JSX component that receives the executor and evaluator spans and renders exactly what you need to judge a row.

Evaluation trace rendered through a custom template showing a side-by-side SQL diff between model output and target — A text-to-SQL eval datapoint rendered through a custom template: the question, the scores, and a word diff between the model's SQL and the target

The view sticks as you click through datapoints, so reviewing a whole run takes seconds per row. See Custom rendering for the full walkthrough.

Anatomy of an evaluation

An evaluation has four parts:

Datapoints: a list of { data, target?, metadata? } objects. data is what the executor receives, target is what the evaluators compare against, metadata is anything extra you want to filter or query on.
Executor: a function that takes data and returns whatever you want to score: the agent’s output, a tool call, a parsed field.
Evaluators: one or more functions that take the executor’s output (and optionally target) and return a number or a map of numbers. Each number is a score dimension.
Group name: a string that ties related runs together so Laminar can chart them over time.

When you call evaluate(), Laminar runs the executor on every datapoint in parallel, runs each evaluator against the output, records every call as a trace, and stores the scores on the datapoint.

Every datapoint becomes one trace with an EVALUATION root span, an EXECUTOR child, and one EVALUATOR child per scoring function. You can query those spans in the SQL editor the same way you query any other trace.

Run an evaluation from code or from the CLI

Evaluations are SDK-first. You write the file once, then run it whichever way fits your workflow:

As a script: python my_eval.py or tsx my-eval.ts. The script runs the evaluation and exits.
Via the CLI: lmnr eval (Python) or npx lmnr eval (TypeScript) picks up every file matching the eval convention in your evals/ directory and runs them. Use this in CI or when you want to run many evals at once.

Both paths talk to the same evaluate() function. The CLI is just a runner.

Next steps

Quickstart

Write your first evaluation, run it, and read the results.

Concepts

Datapoints, executors, evaluators, groups, and how they map to traces.

Compare runs

Group runs, read the progression chart, and do side-by-side diffs.

Datasets

Back evaluations with a Laminar dataset instead of hardcoded lists.

Overview

Tracing

Signals

Debugger

Datasets

Platform

Evaluations

What an evaluation can help you answer

Explore a run, then track progress across runs

Render eval traces your way

Anatomy of an evaluation

Run an evaluation from code or from the CLI

Next steps

Quickstart

Concepts

Compare runs

Datasets

​What an evaluation can help you answer

​Explore a run, then track progress across runs

​Render eval traces your way

​Anatomy of an evaluation

​Run an evaluation from code or from the CLI

​Next steps

Quickstart

Concepts

Compare runs

Datasets

What an evaluation can help you answer

Explore a run, then track progress across runs

Render eval traces your way

Anatomy of an evaluation

Run an evaluation from code or from the CLI

Next steps