# Evaluations

Evaluations are how you answer one question before shipping a prompt, model, or agent change: *is this version better than the one in production?* You answer it by running the new version against a fixed set of inputs, scoring every output, and comparing the scores to the previous run.

**Laminar evaluations are the offline testing layer for your agents and LLM pipelines.** You define a list of inputs, a function that produces an output, and one or more functions that score it. Laminar runs them in parallel, traces every call, stores the scores, and **[compares them](/evaluations/comparing-runs) across [runs](/evaluations/concepts) and [groups](/evaluations/comparing-runs#group-runs-to-compare-them)**.

<Frame caption="Three runs of the same evaluation, grouped: the progression chart shows length_ok dropping from 1.0 to 0.0 on the third run">
  <img src="https://mintcdn.com/laminarai/-q9WJgn2x9iWK3Su/images/evaluations/eval-list.png?fit=max&auto=format&n=-q9WJgn2x9iWK3Su&q=85&s=1773b413046f97e2c2337d11115f1d43" alt="Evaluation list for the capitals group with a trend chart at the top showing accuracy flat at 1.0 and length_ok falling to 0.0" width="1512" height="982" data-path="images/evaluations/eval-list.png" />
</Frame>

## What an evaluation can help you answer

Evaluations are for questions you can answer with a score. Some are obvious, others less so:

* **Did the new prompt break anything?** Keep the dataset, change the prompt, and check whether any scores regressed.
* **Is the cheaper model good enough for this task?** Swap `gpt-5` for `gpt-5-mini`, run the same inputs, and compare the scores.
* **Does this tool actually do what I think it does?** Score the tool's output against a target, not the agent's final answer.
* **Did the last hundred production traces expose a case my evals don't cover?** Pull the failing traces into a [dataset](/datasets/introduction) and rerun the evaluation against them.

If you've ever changed a prompt and hoped nothing broke, evaluations replace that hope with a score.
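
To make "see if any scores regressed" concrete, here is a minimal sketch of the comparison in plain Python. It does not use the Laminar SDK or reproduce how Laminar compares runs. The helper names `average_scores` and `regressions` are hypothetical, and the score values are made-up illustrative numbers shaped like the `accuracy` and `length_ok` dimensions in the screenshot above.

```python
from statistics import mean


def average_scores(run: list[dict[str, float]]) -> dict[str, float]:
    """Average each score dimension over all datapoints in one run."""
    names = run[0].keys()
    return {name: mean(point[name] for point in run) for name in names}


def regressions(
    baseline: list[dict[str, float]],
    candidate: list[dict[str, float]],
    tolerance: float = 0.0,
) -> dict[str, tuple[float, float]]:
    """Return {dimension: (baseline_avg, candidate_avg)} for every
    dimension whose average dropped by more than `tolerance`."""
    base_avg = average_scores(baseline)
    cand_avg = average_scores(candidate)
    return {
        name: (base_avg[name], cand_avg[name])
        for name in base_avg
        if name in cand_avg and cand_avg[name] < base_avg[name] - tolerance
    }


# Illustrative scores only: one dict per datapoint, one key per dimension.
baseline_run = [
    {"accuracy": 1.0, "length_ok": 1.0},
    {"accuracy": 1.0, "length_ok": 1.0},
]
candidate_run = [
    {"accuracy": 1.0, "length_ok": 0.0},
    {"accuracy": 1.0, "length_ok": 0.0},
]

print(regressions(baseline_run, candidate_run))
# → {'length_ok': (1.0, 0.0)}
```

In this sketch `accuracy` is unchanged and only `length_ok` is reported, which is the kind of per-dimension regression that an aggregate "looks fine" check would miss.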

## Anatomy of an evaluation

An evaluation has four parts:

* **Datapoints**: a list of `{ data, target?, metadata? }` objects. `data` is what the executor receives, `target` is what the evaluators compare against, `metadata` is anything extra you want to filter or query on.
* **Executor**: a function that takes `data` and returns whatever you want to score: the agent's output, a tool call, a parsed field.
* **Evaluators**: one or more functions that take the executor's output (and optionally `target`) and return a number or a map of numbers. Each number is a score dimension.
* **Group name**: a string that ties related runs together so Laminar can chart them over time.

When you call `evaluate()`, Laminar runs the executor on every datapoint in parallel, runs each evaluator against the output, records every call as a trace, and stores the scores on the datapoint.
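
The parts above can be sketched in plain Python. This is a local stand-in, not the Laminar SDK: `run_locally` and `run_one` are hypothetical helpers that only imitate the "run the executor, then every evaluator, in parallel" flow, without tracing or storing anything. The `CAPITALS` lookup is a placeholder for a real model or agent call, and the 20-character limit in `length_ok` is an arbitrary assumption. With the SDK you would pass the same datapoints, executor, and evaluators to `evaluate()`; see the [Quickstart](/evaluations/quickstart) for the exact arguments.

```python
from concurrent.futures import ThreadPoolExecutor

# Datapoints: `data` goes to the executor, `target` goes to the evaluators.
datapoints = [
    {"data": {"country": "France"}, "target": "Paris"},
    {"data": {"country": "Japan"}, "target": "Tokyo"},
]

# Hypothetical lookup standing in for a real model or agent call.
CAPITALS = {"France": "Paris", "Japan": "Tokyo"}


def executor(data: dict) -> str:
    """Produce the output that will be scored."""
    return CAPITALS.get(data["country"], "unknown")


def accuracy(output: str, target: str) -> float:
    """1.0 when the output matches the target exactly, else 0.0."""
    return 1.0 if output == target else 0.0


def length_ok(output: str, target: str) -> float:
    """1.0 when the output is at most 20 characters (arbitrary limit)."""
    return 1.0 if len(output) <= 20 else 0.0


# Each key is one score dimension.
evaluators = {"accuracy": accuracy, "length_ok": length_ok}


def run_one(datapoint: dict) -> dict:
    """Run the executor once, then every evaluator on its output."""
    output = executor(datapoint["data"])
    target = datapoint.get("target")
    return {name: fn(output, target) for name, fn in evaluators.items()}


def run_locally(points: list[dict]) -> list[dict]:
    """Score all datapoints in parallel, preserving input order."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(run_one, points))


scores = run_locally(datapoints)
print(scores)
# → [{'accuracy': 1.0, 'length_ok': 1.0}, {'accuracy': 1.0, 'length_ok': 1.0}]
```

Keeping the executor and evaluators as separate functions is what lets you swap the prompt or model without touching the scoring logic.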

<Note>
  Every datapoint becomes one trace with an `EVALUATION` root span, an `EXECUTOR` child, and one `EVALUATOR` child per scoring function. You can query those spans in the [SQL editor](/platform/sql-editor) the same way you query any other trace.
</Note>

## Run an evaluation from code or from the CLI

Evaluations are SDK-first. You write the file once, then run it whichever way fits your workflow:

* **As a script**: `python my_eval.py` or `tsx my-eval.ts`. The script runs the evaluation and exits.
* **Via the CLI**: `lmnr eval` (Python) or `npx lmnr eval` (TypeScript) picks up the files in your `evals/` directory that match the eval convention and runs them. Use this in CI or when you want to run many evals at once.

Both paths call the same `evaluate()` function. The CLI is just a runner.

## Next steps

<CardGroup cols={2}>
  <Card title="Quickstart" href="/evaluations/quickstart" icon="play">
    Write your first evaluation, run it, and read the results.
  </Card>

  <Card title="Concepts" href="/evaluations/concepts" icon="boxes">
    Datapoints, executors, evaluators, groups, and how they map to traces.
  </Card>

  <Card title="Compare runs" href="/evaluations/comparing-runs" icon="chart-line">
    Group runs, read the progression chart, and do side-by-side diffs.
  </Card>

  <Card title="Datasets" href="/evaluations/datasets" icon="database">
    Back evaluations with a Laminar dataset instead of hardcoded lists.
  </Card>
</CardGroup>
