Evaluations are how you answer one question before shipping a prompt, model, or agent change: is this version better than the one in production. You answer it by running the new version against a fixed set of inputs, scoring every output, and comparing the scores to the previous run. Laminar evaluations are the offline testing layer for your agents and LLM pipelines. You define a list of inputs, a function that produces an output, and one or more functions that score it. Laminar runs them in parallel, traces every call, stores the scores, and compares them across runs and groups.

What an evaluation can help you answer
Evaluations are for questions you can answer with a score. Some of them are obvious, some are not:

- Did the new prompt break anything? Same dataset, new prompt, see if any scores regressed.
- Is the cheaper model good enough for this task? Swap `gpt-5` for `gpt-5-mini`, run the same inputs, compare.
- Does this tool actually do what I think it does? Score the tool’s output against a target, not the agent’s final answer.
- Did the last hundred production traces expose a case my evals don’t cover? Pull the failing traces into a dataset and rerun the evaluation against them.
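
The first question above ("did anything regress?") is worth checking per datapoint, not only on the average. The following is a minimal sketch in plain Python, assuming you already have one score per datapoint from two runs. The `find_regressions` helper, the datapoint ids, and the score values are all hypothetical and are not part of the Laminar SDK.

```python
from statistics import mean


def find_regressions(baseline, candidate, tolerance=0.0):
    """Return {datapoint_id: (old_score, new_score)} for scores that dropped.

    `baseline` and `candidate` map a datapoint id to a single numeric score.
    A drop only counts as a regression if it exceeds `tolerance`.
    """
    regressions = {}
    for datapoint_id, old_score in baseline.items():
        new_score = candidate.get(datapoint_id)
        if new_score is None:
            continue  # datapoint missing from the new run, so not comparable
        if old_score - new_score > tolerance:
            regressions[datapoint_id] = (old_score, new_score)
    return regressions


# Hypothetical scores from two runs over the same dataset.
baseline_scores = {"q1": 1.0, "q2": 1.0, "q3": 0.0}
candidate_scores = {"q1": 1.0, "q2": 0.0, "q3": 1.0}

regressed = find_regressions(baseline_scores, candidate_scores)
print("regressed datapoints:", regressed)
print("baseline mean:", mean(baseline_scores.values()))
print("candidate mean:", mean(candidate_scores.values()))
```

In this made-up example both runs have the same mean score, yet `q2` went from passing to failing. A per-datapoint comparison surfaces that, while the average alone hides it.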
Anatomy of an evaluation
An evaluation has four parts:

- Datapoints: a list of `{ data, target?, metadata? }` objects. `data` is what the executor receives, `target` is what the evaluators compare against, `metadata` is anything extra you want to filter or query on.
- Executor: a function that takes `data` and returns whatever you want to score: the agent’s output, a tool call, a parsed field.
- Evaluators: one or more functions that take the executor’s output (and optionally `target`) and return a number or a map of numbers. Each number is a score dimension.
- Group name: a string that ties related runs together so Laminar can chart them over time.
When you call `evaluate()`, Laminar runs the executor on every datapoint in parallel, runs each evaluator against the output, records every call as a trace, and stores the scores on the datapoint.
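
To make the flow concrete, here is a rough local stand-in written in plain Python. It does not call the Laminar SDK and does no tracing or storage. It only mirrors the shape described above: datapoints with `data` and `target`, an executor, and evaluators that return either a number or a map of numbers. The names `run_datapoint`, `run_evaluation`, `exact_match`, and `length_stats`, along with the toy capital-city lookup, are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Datapoints: `data` goes to the executor, `target` goes to the evaluators.
datapoints = [
    {"data": {"country": "France"}, "target": "Paris", "metadata": {"region": "eu"}},
    {"data": {"country": "Japan"}, "target": "Tokyo"},
    {"data": {"country": "Canada"}, "target": "Ottawa"},
]

# Toy lookup standing in for a model; the Canada entry is deliberately wrong.
CAPITALS = {"France": "Paris", "Japan": "Tokyo", "Canada": "Toronto"}


def executor(data):
    """Stand-in for an agent or LLM call: returns the thing to be scored."""
    return CAPITALS[data["country"]]


def exact_match(output, target):
    """Evaluator returning a single number."""
    return 1 if output == target else 0


def length_stats(output, target):
    """Evaluator returning a map of numbers; each key is its own score dimension."""
    return {"output_chars": len(output), "length_diff": abs(len(output) - len(target))}


def run_datapoint(datapoint, executor, evaluators):
    """Run the executor once, then every evaluator against its output."""
    output = executor(datapoint["data"])
    scores = {}
    for name, evaluator in evaluators.items():
        result = evaluator(output, datapoint.get("target"))
        if isinstance(result, dict):
            scores.update(result)  # a map of numbers contributes several dimensions
        else:
            scores[name] = result
    return {"output": output, "scores": scores}


def run_evaluation(datapoints, executor, evaluators, max_workers=4):
    """Process datapoints in parallel threads; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda dp: run_datapoint(dp, executor, evaluators), datapoints))


results = run_evaluation(
    datapoints,
    executor,
    {"exact_match": exact_match, "length": length_stats},
)
for datapoint, result in zip(datapoints, results):
    print(datapoint["data"], result)
```

In a real evaluation the same three pieces are handed to Laminar's `evaluate()`, which additionally records the traces and stores the scores; check the SDK reference for its exact parameters.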
Every datapoint becomes one trace with an `EVALUATION` root span, an `EXECUTOR` child, and one `EVALUATOR` child per scoring function. You can query those spans in the SQL editor the same way you query any other trace.

Run an evaluation from code or from the CLI
Evaluations are SDK-first. You write the file once, then run it whichever way fits your workflow:

- As a script: `python my_eval.py` or `tsx my-eval.ts`. The script runs the evaluation and exits.
- Via the CLI: `lmnr eval` (Python) or `npx lmnr eval` (TypeScript) picks up every file matching the eval convention in your `evals/` directory and runs them. Use this in CI or when you want to run many evals at once.
Both paths call the same `evaluate()` function. The CLI is just a runner.
Next steps
- Quickstart: Write your first evaluation, run it, and read the results.
- Concepts: Datapoints, executors, evaluators, groups, and how they map to traces.
- Compare runs: Group runs, read the progression chart, and do side-by-side diffs.
- Datasets: Back evaluations with a Laminar dataset instead of hardcoded lists.
