Evaluations quickstart

In this quickstart you’ll write a tiny evaluation, run it, and read the results in Laminar. The example asks an LLM for the capital of a country and scores each answer on correctness and brevity.

Prerequisites

You need a Laminar project API key. In the Laminar dashboard, open the project settings and generate a project API key; this works in both the cloud and the self-hosted version of Laminar. You can pass the key when you initialize Laminar. If you don’t, Laminar reads it from the LMNR_PROJECT_API_KEY environment variable. Install the SDK and set your API key:

TypeScript

npm install @lmnr-ai/lmnr openai
export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>

Python

pip install lmnr openai
export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
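Because Laminar falls back to the LMNR_PROJECT_API_KEY environment variable, a quick preflight check can catch a missing key before any LLM calls are made. The following is a minimal sketch; the helper name require_api_key is hypothetical and is not part of the lmnr SDK.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the Laminar project API key or raise a clear error.

    Hypothetical helper for illustration; not part of the lmnr SDK.
    """
    key = env.get("LMNR_PROJECT_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "LMNR_PROJECT_API_KEY is not set; generate a key in the "
            "project settings and export it first."
        )
    return key
```

Calling require_api_key() at the top of an evaluation script would fail fast with a readable message instead of a later authentication error.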

Write the evaluation

The file below has all four parts of an evaluation: a list of datapoints, an executor that calls the LLM, two evaluators (one for correctness, one for output length), and a name so you can find it later.

TypeScript: capitals-eval.ts

import { evaluate } from '@lmnr-ai/lmnr';
import { OpenAI } from 'openai';

const openai = new OpenAI();

const capitalOfCountry = async (data: { country: string }) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-5-mini',
    messages: [
      {
        role: 'user',
        content:
          `What is the capital of ${data.country}? ` +
          'Answer only with the capital, no other text.',
      },
    ],
  });
  return response.choices[0].message.content ?? '';
};

const accuracy = (output: string, target: string) =>
  output.toLowerCase().includes(target.toLowerCase()) ? 1 : 0;

const lengthOk = (output: string) =>
  output.length > 0 && output.length < 50 ? 1 : 0;

evaluate({
  data: [
    { data: { country: 'France' }, target: 'Paris' },
    { data: { country: 'Germany' }, target: 'Berlin' },
    { data: { country: 'Japan' }, target: 'Tokyo' },
    { data: { country: 'Brazil' }, target: 'Brasília' },
    { data: { country: 'Australia' }, target: 'Canberra' },
    { data: { country: 'Canada' }, target: 'Ottawa' },
  ],
  executor: capitalOfCountry,
  evaluators: { accuracy, lengthOk },
  name: 'Capitals v1 (gpt-5-mini)',
  groupName: 'capitals',
  config: {
    instrumentModules: { OpenAI },
  },
});

Pass instrumentModules so Laminar instruments the OpenAI client. Without it, LLM calls still run but they won’t appear as spans on the executor trace.

Python: capitals_eval.py

from lmnr import evaluate
from openai import OpenAI

client = OpenAI()


def capital_of_country(data: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[
            {
                "role": "user",
                "content": (
                    f"What is the capital of {data['country']}? "
                    "Answer only with the capital, no other text."
                ),
            }
        ],
    )
    return response.choices[0].message.content or ""


def accuracy(output: str, target: str) -> int:
    return 1 if target.lower() in output.lower() else 0


def length_ok(output: str, *_args, **_kwargs) -> int:
    return 1 if 0 < len(output) < 50 else 0


evaluate(
    data=[
        {"data": {"country": "France"}, "target": "Paris"},
        {"data": {"country": "Germany"}, "target": "Berlin"},
        {"data": {"country": "Japan"}, "target": "Tokyo"},
        {"data": {"country": "Brazil"}, "target": "Brasília"},
        {"data": {"country": "Australia"}, "target": "Canberra"},
        {"data": {"country": "Canada"}, "target": "Ottawa"},
    ],
    executor=capital_of_country,
    evaluators={
        "accuracy": accuracy,
        "length_ok": length_ok,
    },
    name="Capitals v1 (gpt-5-mini)",
    group_name="capitals",
)

A few things worth pointing out:

The type of the executor’s data parameter matches the shape of each datapoint’s data field. Evaluators take the executor’s return value as their first argument and target as their second.
Evaluators return a number. An evaluator can also return a dict (an object in TypeScript) if it produces multiple score dimensions (for example { "precision": 0.9, "recall": 0.8 }).
groupName / group_name ties related runs together. Keep it stable across prompt and model changes so Laminar can compare runs and draw a progression chart.
You don’t need to call Laminar.initialize(); evaluate() initializes Laminar on its first call.
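To illustrate the point about multiple score dimensions, here is a minimal sketch of an evaluator that returns a dict. The function name token_overlap and its word-level precision/recall logic are assumptions made for this example; they do not come from the quickstart or from the SDK.

```python
def token_overlap(output: str, target: str) -> dict[str, float]:
    """Score word-level precision and recall of output against target.

    Illustrative evaluator (hypothetical): one function that produces
    two score dimensions, returned together as a dict.
    """
    out_tokens = set(output.lower().split())
    target_tokens = set(target.lower().split())
    common = out_tokens & target_tokens
    # Guard against empty inputs so the evaluator never divides by zero.
    precision = len(common) / len(out_tokens) if out_tokens else 0.0
    recall = len(common) / len(target_tokens) if target_tokens else 0.0
    return {"precision": precision, "recall": recall}
```

Following the pattern used in the file above, it would be registered alongside the other evaluators, for example evaluators={"accuracy": accuracy, "overlap": token_overlap}.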

Run it

You have two options. Pick whichever fits your workflow.

As a script

TypeScript

npx tsx capitals-eval.ts

Python

python capitals_eval.py

Via the CLI

The CLI discovers evaluation files in an evals/ directory and runs them all in one pass. Use this in CI or when you want to run a suite.

TypeScript

Files named *.eval.ts or *.eval.js under evals/ are picked up automatically:

project/
├─ src/
└─ evals/
   ├─ capitals.eval.ts
   └─ tone.eval.ts

npx lmnr eval              # runs every *.eval.{ts,js} under evals/
npx lmnr eval capitals-eval.ts   # runs a single file

Python

Files named eval_*.py or *_eval.py under evals/ are picked up automatically:

project/
├─ src/
└─ evals/
   ├─ capitals_eval.py
   └─ tone_eval.py

lmnr eval                  # runs every eval_*.py or *_eval.py under evals/
lmnr eval capitals_eval.py # runs a single file
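To check which filenames the Python naming patterns above would cover, the sketch below applies them with the standard-library fnmatch module. Only the two patterns come from the text; the matching logic is an illustrative assumption and is not the CLI’s actual discovery code.

```python
from fnmatch import fnmatch

# Patterns taken from the text above; the matching logic itself is an
# illustrative assumption, not the CLI's actual implementation.
PYTHON_EVAL_PATTERNS = ("eval_*.py", "*_eval.py")


def looks_like_eval_file(filename: str) -> bool:
    """Return True if filename matches one of the documented patterns."""
    return any(fnmatch(filename, pattern) for pattern in PYTHON_EVAL_PATTERNS)


for name in ["capitals_eval.py", "eval_tone.py", "helpers.py"]:
    print(name, looks_like_eval_file(name))
```

Under these patterns, capitals_eval.py and eval_tone.py match, while helpers.py does not.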

When the run finishes, the SDK prints a link to the evaluation page in Laminar.

Read the results

Every datapoint becomes one trace, and every trace gets one score per evaluator. The evaluation page is a split view: the datapoint table on the left, the trace for the selected row on the right. Pick a score dimension in the top-left dropdown to see its aggregate for the run. For non-binary scores, an aggregation picker appears next to it: Average is the default, and Sum, Min, Max, Median, and percentiles are also available. Score chips above the trace show every score for the selected datapoint.
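As a rough illustration of what the aggregation options mean, the sketch below computes them locally for one score dimension. The scores are made up for this example, and this is not how Laminar computes the values internally; percentiles are left out because their exact definition varies between tools.

```python
import statistics

# Hypothetical per-datapoint scores for one non-binary score dimension.
scores = [0.5, 1.0, 0.75, 1.0, 0.25, 1.0]

aggregates = {
    "Average": statistics.mean(scores),
    "Sum": sum(scores),
    "Min": min(scores),
    "Max": max(scores),
    "Median": statistics.median(scores),
}

for name, value in aggregates.items():
    print(f"{name}: {value}")
```

For these six scores the average is 0.75 while the median is 0.875, which shows how a single low score pulls the average down more than the median.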

[Figure: Evaluation detail page for one run. The datapoint table on the left shows six datapoints scoring 1 on accuracy and length_ok; the selected datapoint's trace on the right is shown as a transcript with the input, the executor span, the model call, and both evaluator scores.]

Change something and run it again

Keep groupName (group_name in Python) the same and change the thing you want to test. The second run lands in the same group, and Laminar charts the average of every score dimension over time. For example, in the Python version, change the prompt so the model returns a full explanatory sentence:

"content": (
    f"What is the capital of {data['country']}? "
    "Answer in ONE sentence explaining a fun fact."
),

Rerun the evaluation. Because each answer is now a full sentence that exceeds the 50-character limit, length_ok drops from 1.0 to 0.0 for every datapoint while accuracy stays at 1.0. That’s a regression you caught before production.
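The effect can be reproduced locally without calling a model. The sketch below reuses the two Python evaluators from the quickstart on hand-written outputs; both sample strings are hypothetical and only stand in for what a model might return.

```python
def accuracy(output: str, target: str) -> int:
    return 1 if target.lower() in output.lower() else 0


def length_ok(output: str, *_args, **_kwargs) -> int:
    return 1 if 0 < len(output) < 50 else 0


# Hypothetical model outputs, written by hand for illustration.
terse = "Paris"
chatty = "The capital of France is Paris, home to the Eiffel Tower since 1889."

for label, output in [("terse", terse), ("chatty", chatty)]:
    # accuracy only checks that the target appears somewhere in the output;
    # length_ok fails as soon as the output reaches 50 characters.
    print(label, accuracy(output, "Paris"), length_ok(output, "Paris"))
```

The terse answer scores 1 on both evaluators. The chatty answer is 68 characters long, so it still scores 1 on accuracy but 0 on length_ok.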

[Figure: Evaluations page with the capitals group selected in the sidebar after three runs. The progression chart shows length_ok going from 1.0 on v1 and v2 to 0.0 on v3 (the chatty prompt) while accuracy stays flat.]

See Compare runs for side-by-side diffs and per-datapoint deltas.

Add custom columns

In the evaluation results table, click Columns → Add column to create a computed column from SQL. This is useful for pulling fields out of metadata, data, target, or scores without changing your evaluation code. Example: extract a model name from JSON metadata:

simpleJSONExtractString(simpleJSONExtractRaw(metadata, 'llm'), 'model')

Pick a data type (usually String or Float64) and save. The expression runs per row on evaluation_datapoints. For more JSON parsing functions, see the SQL editor.
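For readers less familiar with the SQL JSON functions, the sketch below shows a rough Python equivalent of the example expression: take the llm object from metadata, then read its model field as a string. The metadata shape and the helper name extract_model are assumptions for illustration, and edge-case behavior may differ from the SQL functions.

```python
import json


def extract_model(metadata: str) -> str:
    """Return metadata["llm"]["model"] as a string, or "" if absent.

    Hypothetical helper that mirrors the intent of the SQL expression.
    """
    try:
        parsed = json.loads(metadata)
    except json.JSONDecodeError:
        return ""
    if not isinstance(parsed, dict):
        return ""
    llm = parsed.get("llm")
    if not isinstance(llm, dict):
        return ""
    model = llm.get("model")
    return model if isinstance(model, str) else ""


# Hypothetical metadata value; the shape is assumed for illustration.
row_metadata = '{"llm": {"model": "gpt-5-mini", "temperature": 0}}'
print(extract_model(row_metadata))
```

With the sample row above this prints gpt-5-mini; rows without an llm.model string yield an empty string.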

Next steps

Concepts

The executor, evaluator, datapoint, and group model in detail.

Compare runs

Group runs, read the progression chart, diff side-by-side.

Datasets

Point evaluate() at a Laminar dataset instead of a hardcoded list.

Manual API

Lower-level control for pipelines where evaluate() is too opinionated.
