Skip to main content

Documentation Index

Fetch the complete documentation index at: https://laminar.sh/docs/llms.txt

Use this file to discover all available pages before exploring further.

In this quickstart you’ll write a tiny evaluation, run it, and read the results in Laminar. The example asks an LLM for the capital of a country and scores each answer on correctness and brevity.

Prerequisites

To get the project API key, go to the Laminar dashboard, click the project settings, and generate a project API key. This is available both in the cloud and in the self-hosted version of Laminar. Specify the key at Laminar initialization. If not specified, Laminar will look for the key in the LMNR_PROJECT_API_KEY environment variable. Install the SDK and set your API key:
npm install @lmnr-ai/lmnr openai
export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>

Write the evaluation

The file below has all four parts of an evaluation: a list of datapoints, an executor that calls the LLM, two evaluators (one for correctness, one for output length), and a name so you can find it later.
capitals-eval.ts
import { evaluate } from '@lmnr-ai/lmnr';
import { OpenAI } from 'openai';

const openai = new OpenAI();

const capitalOfCountry = async (data: { country: string }) => {
  const response = await openai.chat.completions.create({
    model: 'gpt-5-mini',
    messages: [
      {
        role: 'user',
        content:
          `What is the capital of ${data.country}? ` +
          'Answer only with the capital, no other text.',
      },
    ],
  });
  return response.choices[0].message.content ?? '';
};

const accuracy = (output: string, target: string) =>
  output.toLowerCase().includes(target.toLowerCase()) ? 1 : 0;

const lengthOk = (output: string) =>
  output.length > 0 && output.length < 50 ? 1 : 0;

evaluate({
  data: [
    { data: { country: 'France' }, target: 'Paris' },
    { data: { country: 'Germany' }, target: 'Berlin' },
    { data: { country: 'Japan' }, target: 'Tokyo' },
    { data: { country: 'Brazil' }, target: 'Brasília' },
    { data: { country: 'Australia' }, target: 'Canberra' },
    { data: { country: 'Canada' }, target: 'Ottawa' },
  ],
  executor: capitalOfCountry,
  evaluators: { accuracy, lengthOk },
  name: 'Capitals v1 (gpt-5-mini)',
  groupName: 'capitals',
  config: {
    instrumentModules: { OpenAI },
  },
});
Pass instrumentModules so Laminar instruments the OpenAI client. Without it, LLM calls still run but they won’t appear as spans on the executor trace.
A few things worth pointing out:
  • The executor’s type on data matches the shape of each datapoint’s data field. Evaluators take the executor’s return as their first argument and target as their second.
  • Evaluators return a number. You can also return a dict if one evaluator produces multiple score dimensions (for example { "precision": 0.9, "recall": 0.8 }).
  • groupName / group_name ties related runs together. Keep it stable across prompt and model changes so Laminar can compare runs and draw a progression chart.
  • No Laminar.initialize() needed. evaluate() initializes Laminar on first call.

Run it

You have two options. Pick whichever fits your workflow.

As a script

npx tsx capitals-eval.ts

Via the CLI

The CLI discovers evaluation files in an evals/ directory and runs them all in one pass. Use this in CI or when you want to run a suite.
Files named *.eval.ts or *.eval.js under evals/ are picked up automatically:
project/
├─ src/
└─ evals/
   ├─ capitals.eval.ts
   └─ tone.eval.ts
npx lmnr eval              # runs every *.eval.{ts,js} under evals/
npx lmnr eval capitals-eval.ts   # runs a single file
When the run finishes, the SDK prints a link to the evaluation page in Laminar.

Read the results

Every datapoint becomes one trace. Every trace gets one score per evaluator.
Evaluation detail page showing 1.00 average on length_ok and six datapoints, each with accuracy and length_ok columns set to 1
Click any datapoint row to open the trace in a side panel. The trace shows the extracted input, the executor span, the LLM call nested under it, and the evaluator spans with their scores.
Transcript view of a single evaluation datapoint with the Input block, executor span, model span, and accuracy and length_ok evaluator scores

Change something and run it again

Keep groupName the same and change the thing you want to test. The second run lands in the same group, and Laminar charts the average of every score dimension over time. For example, change the prompt so the model returns a full explanatory sentence:
"content": (
    f"What is the capital of {data['country']}? "
    "Answer in ONE sentence explaining a fun fact."
),
Rerun. Because the output is now longer than 50 characters, length_ok drops from 1.0 to 0.0 for every datapoint while accuracy stays at 1.0. That’s a regression you caught before production.
Evaluations list filtered by capitals group with a progression chart at the top showing length_ok dropping to zero on the most recent run
See Compare runs for side-by-side diffs and per-datapoint deltas.

Add custom columns

In the evaluation results table, click Columns → Add column to create a computed column from SQL. This is useful for pulling fields out of metadata, data, target, or scores without changing your evaluation code. Example: extract a model name from JSON metadata:
simpleJSONExtractString(simpleJSONExtractRaw(metadata, 'llm'), 'model')
Pick a data type (usually String or Float64) and save. The expression runs per row on evaluation_datapoints. For more JSON parsing functions, see the SQL editor.

Next steps

Concepts

The executor, evaluator, datapoint, and group model in detail.

Compare runs

Group runs, read the progression chart, diff side-by-side.

Datasets

Point evaluate() at a Laminar dataset instead of a hardcoded list.

Manual API

Lower-level control for pipelines where evaluate() is too opinionated.