An evaluation in isolation tells you your current score. What you actually want to know is whether the score moved. That's what groups, the progression chart, and side-by-side comparison are for.
Group runs to compare them
Pass groupName (TypeScript) or group_name (Python) to evaluate(). Every run with the same group name lands together on the evaluations page.
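Here is a minimal sketch in TypeScript. The datapoint shape, evaluator signature, and the callMyModel helper are illustrative assumptions; groupName is the parameter described above.

```typescript
import { evaluate } from '@lmnr-ai/lmnr';

// Placeholder for your actual model call (hypothetical helper).
async function callMyModel(prompt: string): Promise<string> {
  // ... call your LLM provider here
  return 'Paris';
}

await evaluate({
  data: [
    { data: { country: 'France' }, target: 'Paris' },
    { data: { country: 'Japan' }, target: 'Tokyo' },
  ],
  // Executor under test: asks the model for a one-word capital.
  executor: ({ country }: { country: string }) =>
    callMyModel(`What is the capital of ${country}? Answer in one word.`),
  evaluators: {
    // Exact match against the target capital.
    exact_match: (output: string, target: string) => (output === target ? 1 : 0),
  },
  // Keep this constant across runs so Laminar charts them together.
  groupName: 'capitals',
});
```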
Keep every run in the capitals group whether you're swapping models, prompts, or datasets. Changing the group name means Laminar can't chart the runs together.
Read the progression chart
The evaluations page shows every run in a group, newest first, with the group’s average score for each dimension plotted across the top.
In this example, length_ok fell from 1.0 to 0.0 on the most recent run because the prompt was changed to ask for a one-sentence fun fact instead of a one-word answer. Every output now exceeds the 50-character limit the evaluator checks for.
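A sketch of that regression; the prompts are illustrative, and the 50-character threshold is the one the evaluator above checks for:

```typescript
// length_ok scores 1 only when the output fits in 50 characters.
const length_ok = (output: string) => (output.length <= 50 ? 1 : 0);

// Earlier runs: a one-word answer such as "Paris" easily passes.
const promptBefore = (country: string) =>
  `What is the capital of ${country}? Answer in one word.`;

// Latest run: a one-sentence fun fact almost always exceeds 50 characters,
// so length_ok drops to 0.0 for the group's newest run.
const promptAfter = (country: string) =>
  `Give me a one-sentence fun fact about the capital of ${country}.`;
```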
Side-by-side comparison
Click any run to open its detail page. Use the Select compared evaluation dropdown to pick a second run from the same group. Laminar renders both score distributions on top of each other and shows per-row deltas.
The aggregate score shift (0.00 → 1.00 in this example) tells you the direction. The per-row deltas below tell you exactly which datapoints moved.
Filter by group in the list
The evaluations list at /evaluations groups runs by default. Click a group in the sidebar, or visit /evaluations?groupId=<group-name> directly. The progression chart and run list scope to the selected group.
Export comparisons
Hit Download on the evaluation detail page to export the datapoints, scores, and executor outputs as CSV. Useful for external analysis or for building regression test suites out of the rows that failed. For anything beyond CSV, query the underlying table with SQL:
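A minimal sketch against the evaluation_datapoints table referenced below; the evaluation_id filter is an assumption, so check the schema in the SQL editor for the exact column names:

```sql
-- Inspect the raw rows behind a run; column names vary, so start with SELECT *.
SELECT *
FROM evaluation_datapoints
WHERE evaluation_id = '<your-evaluation-id>'  -- assumed filter column
LIMIT 100;
```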
Next steps

- Datasets: Keep the dataset constant across runs so comparisons are apples-to-apples.
- SQL editor: Query evaluation_datapoints for bespoke comparisons and dashboards.
- Manual API: The lower-level API when evaluate() is too opinionated.
- SDK reference: Full parameters for evaluate, LaminarDataset, and EvaluationDataset.