> ## Documentation Index
> Fetch the complete documentation index at: https://laminar.sh/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Compare evaluation runs

An evaluation in isolation tells you your current score. What you actually want to know is whether the score moved. That's what groups, the progression chart, and side-by-side comparison are for.

## Group runs to compare them

Pass `groupName` / `group_name` to `evaluate()`. Every run with the same group name lands together on the evaluations page.

```typescript theme={null}
evaluate({
  data,
  executor,
  evaluators,
  name: 'Capitals v2 (harder countries)',
  groupName: 'capitals',
});
```

```python theme={null}
evaluate(
    data=data,
    executor=executor,
    evaluators=evaluators,
    name="Capitals v2 (harder countries)",
    group_name="capitals",
)
```

Pick one name per *thing you're testing*, not per *version*. The capitals eval stays in the `capitals` group whether you're swapping models, prompts, or datasets. Changing the group name means Laminar can't chart the runs together.

## Read the progression chart

The evaluations page lists every group in a sidebar, with a run count and the time of the latest run. Select a group and the right side shows its progression chart and run list, newest first.

Evaluations page with the capitals group selected in the sidebar and a progression chart showing accuracy flat at 1.0 and length_ok falling to 0.0

One line per score dimension. Each point is one run's aggregate for that dimension. A sudden drop on one line means a regression on that dimension.

A few controls worth knowing:

* **Aggregation dropdown** next to the group name. Average by default; switch to Sum, Min, Max, Median, p90, p95, or p99. Percentiles are the right choice when a few catastrophic rows hide behind a healthy average.
* **Legend toggles**. Click a score name in the chart legend to hide or show that line. Useful when one dimension's scale dwarfs another's.
* **Eye icons** on run rows. Click the eye to exclude a run from the chart, for example a half-finished run or a known-bad experiment that distorts the y-axis.

In the screenshot above, `length_ok` fell from 1.0 to 0.0 on the most recent run because the prompt was changed to ask for a one-sentence fun fact instead of a one-word answer. Every output now exceeds the 50-character limit the evaluator checks for.

## Side-by-side comparison

Click any run to open its detail page, then use the **Select compared evaluation** dropdown in the header to pick a baseline run from the same group. The comparison is in the URL (`?targetId=`), so you can share the link. Hit **Reset** to drop the baseline.

With a baseline selected:

* The score card shows `old → new` with a colored delta badge, for example `0.83 → 0.74` with `▼ 11.5%`. Switch the score dropdown to see the delta for each dimension.
* Every row in the datapoint table shows its own `old → new` delta per score column, so you can see exactly which datapoints moved.

Comparison view with the score card showing 0.83 to 0.74 and a datapoints table with per-row score deltas
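
The per-row `old → new` view can also be approximated offline, for instance on two exported runs. The following is a minimal sketch, not Laminar's implementation: it assumes each run has already been loaded into a dict keyed by a datapoint index, and `row_deltas` plus the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed shape: score name -> value for a single datapoint.
Scores = Dict[str, float]


def row_deltas(
    baseline: Dict[int, Scores],
    current: Dict[int, Scores],
    score: str,
) -> List[Tuple[int, float, float]]:
    """Return (index, old, new) for datapoints whose score changed.

    Only datapoints present in both runs are compared. Results are
    ordered with the largest drop first.
    """
    moved = []
    for idx in sorted(baseline.keys() & current.keys()):
        old = baseline[idx].get(score)
        new = current[idx].get(score)
        if old is None or new is None or old == new:
            continue
        moved.append((idx, old, new))
    # Sort by signed change so regressions come before improvements.
    moved.sort(key=lambda row: row[2] - row[1])
    return moved


# Made-up scores for illustration only.
baseline = {0: {"accuracy": 1.0}, 1: {"accuracy": 1.0}, 2: {"accuracy": 0.5}}
current = {0: {"accuracy": 1.0}, 1: {"accuracy": 0.0}, 2: {"accuracy": 0.75}}

for idx, old, new in row_deltas(baseline, current, "accuracy"):
    print(f"row {idx}: {old} -> {new}")
```

Sorting by signed change puts the regressed rows at the top, which is usually where you want to start reading traces.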

The score card tells you the direction. The per-row deltas tell you exactly which datapoints moved. The trace panel on the right shows you *why*: click a regressed row and read what the model actually produced.

## See why a score moved with a render template

Per-row deltas point at the regressed datapoints, but the answer to "what actually changed" is buried in the executor's output. For evals where the output is comparable to the target (SQL generation, structured extraction, translation), a [trace render template](/docs/platform/render-templates#trace-templates) turns the trace panel into a purpose-built diff view.

The example below is a text-to-SQL eval. The template pulls the question from the executor's input, the generated SQL from its output, and the target SQL from the evaluator's input, then renders a word-level diff:

Evaluation trace rendered through a custom SQL diff template showing model output and target side by side with highlighted differing tokens
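
To get a feel for what a word-level diff of generated versus target SQL surfaces, here is a minimal sketch using Python's standard `difflib`. It is an illustration only: the real template is JSX generated by Laminar, and the two SQL strings and the `word_diff` helper below are hypothetical.

```python
import difflib
from typing import List, Tuple


def word_diff(generated: str, target: str) -> List[Tuple[str, str]]:
    """Split both strings on whitespace and return (tag, text) chunks.

    Tags: 'equal' (in both), 'delete' (only in generated),
    'insert' (only in target).
    """
    a, b = generated.split(), target.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    chunks: List[Tuple[str, str]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            chunks.append(("equal", " ".join(a[i1:i2])))
            continue
        # 'replace' opcodes are emitted as a delete followed by an insert.
        if i2 > i1:
            chunks.append(("delete", " ".join(a[i1:i2])))
        if j2 > j1:
            chunks.append(("insert", " ".join(b[j1:j2])))
    return chunks


# Hypothetical model output and target for one datapoint.
generated = "SELECT name FROM users WHERE age > 30"
target = "SELECT u.name FROM users u WHERE u.age > 30"

markers = {"equal": "{}", "delete": "[-{}-]", "insert": "{{+{}+}}"}
print(" ".join(markers[tag].format(text) for tag, text in word_diff(generated, target)))
```

Splitting on whitespace is the simplest possible tokenizer; a real SQL diff would likely want to split on punctuation as well.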

Reading this view across a few rows answers the regression in seconds: without the style guide, the model stopped aliasing tables and switched join strategies, so `token_f1` dropped even where the SQL was semantically fine.

To build one, open any datapoint's trace, click the view dropdown (where **Transcript** and **Tree** live), pick **+ New template**, and describe the view: Laminar generates both the span filter (here `span_type IN ('EXECUTOR', 'EVALUATOR')`) and the JSX from the trace's real spans. Once created, the template appears in the same dropdown for every trace in the project, and the pane remembers your pick, so clicking through datapoints keeps the diff view open.

The full walkthrough, template source, and span payload shape are on the [Custom rendering](/docs/evaluations/custom-rendering) page.

## Filter by group in the list

The evaluations page groups runs by default. Click a group in the sidebar, or visit `/evaluations?groupId=` directly. The progression chart and run list scope to the selected group.

## Export comparisons

Hit **Download** on the evaluation detail page to export the datapoints, scores, and executor outputs as CSV. Useful for external analysis or for building regression test suites out of the rows that failed.

For anything beyond CSV, query the underlying table with SQL:

```sql theme={null}
SELECT
  evaluation_id,
  AVG(scores['accuracy']) AS avg_accuracy,
  AVG(scores['length_ok']) AS avg_length_ok
FROM evaluation_datapoints
GROUP BY evaluation_id
ORDER BY avg_accuracy DESC
```

The SQL editor exposes `evaluation_datapoints` (one row per datapoint, with `scores`, `target`, `executor_output`, and the joined trace). Each run's name and group live in a separate Postgres table that the SQL editor does not expose, so filter and group by `evaluation_id` here and map ids back to runs in the UI. See the [SQL editor](/docs/platform/sql-editor) page for more.

## Next steps

* Build custom trace views like the SQL diff above.
* Keep the dataset constant across runs so comparisons are apples-to-apples.
* Query `evaluation_datapoints` for bespoke comparisons and dashboards.
* Full parameters for `evaluate`, `LaminarDataset`, and `EvaluationDataset`.