> ## Documentation Index
> Fetch the complete documentation index at: https://laminar.sh/docs/llms.txt
> Use this file to discover all available pages before exploring further.

# Evaluations quickstart

In this quickstart you'll write a tiny evaluation, run it, and read the results in Laminar. The example asks an LLM for the capital of a country and scores each answer on correctness and brevity.

## Prerequisites

To get the project API key, go to the Laminar dashboard, click the project settings,
and generate a project API key. This is available both in the cloud and in the self-hosted version of Laminar.

Specify the key at `Laminar` initialization. If not specified,
Laminar will look for the key in the `LMNR_PROJECT_API_KEY` environment variable.

Install the SDK and set your API key:

<Tabs>
  <Tab title="TypeScript">
    ```bash theme={null}
    npm install @lmnr-ai/lmnr openai
    export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
    ```
  </Tab>

  <Tab title="Python">
    ```bash theme={null}
    pip install lmnr openai
    export LMNR_PROJECT_API_KEY=<YOUR_PROJECT_API_KEY>
    ```
  </Tab>
</Tabs>

## Write the evaluation

The file below has all four parts of an evaluation: a list of datapoints, an executor that calls the LLM, two evaluators (one for correctness, one for output length), and a name so you can find it later.

<Tabs>
  <Tab title="TypeScript">
    ```typescript capitals-eval.ts theme={null}
    import { evaluate } from '@lmnr-ai/lmnr';
    import { OpenAI } from 'openai';

    const openai = new OpenAI();

    const capitalOfCountry = async (data: { country: string }) => {
      const response = await openai.chat.completions.create({
        model: 'gpt-5-mini',
        messages: [
          {
            role: 'user',
            content:
              `What is the capital of ${data.country}? ` +
              'Answer only with the capital, no other text.',
          },
        ],
      });
      return response.choices[0].message.content ?? '';
    };

    const accuracy = (output: string, target: string) =>
      output.toLowerCase().includes(target.toLowerCase()) ? 1 : 0;

    const lengthOk = (output: string) =>
      output.length > 0 && output.length < 50 ? 1 : 0;

    evaluate({
      data: [
        { data: { country: 'France' }, target: 'Paris' },
        { data: { country: 'Germany' }, target: 'Berlin' },
        { data: { country: 'Japan' }, target: 'Tokyo' },
        { data: { country: 'Brazil' }, target: 'Brasília' },
        { data: { country: 'Australia' }, target: 'Canberra' },
        { data: { country: 'Canada' }, target: 'Ottawa' },
      ],
      executor: capitalOfCountry,
      evaluators: { accuracy, lengthOk },
      name: 'Capitals v1 (gpt-5-mini)',
      groupName: 'capitals',
      config: {
        instrumentModules: { OpenAI },
      },
    });
    ```

    <Note>
      Pass `instrumentModules` so Laminar instruments the OpenAI client. Without it, LLM calls still run but they won't appear as spans on the executor trace.
    </Note>
  </Tab>

  <Tab title="Python">
    ```python capitals_eval.py theme={null}
    from lmnr import evaluate
    from openai import OpenAI

    client = OpenAI()


    def capital_of_country(data: dict) -> str:
        response = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[
                {
                    "role": "user",
                    "content": (
                        f"What is the capital of {data['country']}? "
                        "Answer only with the capital, no other text."
                    ),
                }
            ],
        )
        return response.choices[0].message.content or ""


    def accuracy(output: str, target: str) -> int:
        return 1 if target.lower() in output.lower() else 0


    def length_ok(output: str, *_args, **_kwargs) -> int:
        return 1 if 0 < len(output) < 50 else 0


    evaluate(
        data=[
            {"data": {"country": "France"}, "target": "Paris"},
            {"data": {"country": "Germany"}, "target": "Berlin"},
            {"data": {"country": "Japan"}, "target": "Tokyo"},
            {"data": {"country": "Brazil"}, "target": "Brasília"},
            {"data": {"country": "Australia"}, "target": "Canberra"},
            {"data": {"country": "Canada"}, "target": "Ottawa"},
        ],
        executor=capital_of_country,
        evaluators={
            "accuracy": accuracy,
            "length_ok": length_ok,
        },
        name="Capitals v1 (gpt-5-mini)",
        group_name="capitals",
    )
    ```
  </Tab>
</Tabs>

A few things worth pointing out:

* The executor's type on `data` matches the shape of each datapoint's `data` field. Evaluators take the executor's return as their first argument and `target` as their second.
* Evaluators return a number. You can also return a dict if one evaluator produces multiple score dimensions (for example `{ "precision": 0.9, "recall": 0.8 }`).
* `groupName` / `group_name` ties related runs together. Keep it stable across prompt and model changes so Laminar can [compare runs](/evaluations/comparing-runs) and draw a progression chart.
* No `Laminar.initialize()` needed. `evaluate()` initializes Laminar on first call.

## Run it

You have two options. Pick whichever fits your workflow.

### As a script

<Tabs>
  <Tab title="TypeScript">
    ```bash theme={null}
    npx tsx capitals-eval.ts
    ```
  </Tab>

  <Tab title="Python">
    ```bash theme={null}
    python capitals_eval.py
    ```
  </Tab>
</Tabs>

### Via the CLI

The CLI discovers evaluation files in an `evals/` directory and runs them all in one pass. Use this in CI or when you want to run a suite.

<Tabs>
  <Tab title="TypeScript">
    Files named `*.eval.ts` or `*.eval.js` under `evals/` are picked up automatically:

    ```
    project/
    ├─ src/
    └─ evals/
       ├─ capitals.eval.ts
       └─ tone.eval.ts
    ```

    ```bash theme={null}
    npx lmnr eval              # runs every *.eval.{ts,js} under evals/
    npx lmnr eval capitals-eval.ts   # runs a single file
    ```
  </Tab>

  <Tab title="Python">
    Files named `eval_*.py` or `*_eval.py` under `evals/` are picked up automatically:

    ```
    project/
    ├─ src/
    └─ evals/
       ├─ capitals_eval.py
       └─ tone_eval.py
    ```

    ```bash theme={null}
    lmnr eval                  # runs every eval_*.py or *_eval.py under evals/
    lmnr eval capitals_eval.py # runs a single file
    ```
  </Tab>
</Tabs>

When the run finishes, the SDK prints a link to the evaluation page in Laminar.

## Read the results

Every datapoint becomes one trace. Every trace gets one score per evaluator.

<Frame caption="Single evaluation run: average score at the top, per-datapoint table below">
  <img src="https://mintcdn.com/laminarai/-q9WJgn2x9iWK3Su/images/evaluations/eval-detail.png?fit=max&auto=format&n=-q9WJgn2x9iWK3Su&q=85&s=a8efd66f78d59724d09e3f2482cb49a3" alt="Evaluation detail page showing 1.00 average on length_ok and six datapoints, each with accuracy and length_ok columns set to 1" width="1512" height="982" data-path="images/evaluations/eval-detail.png" />
</Frame>

Click any datapoint row to open the trace in a side panel. The trace shows the extracted input, the executor span, the LLM call nested under it, and the evaluator spans with their scores.

<Frame caption="Drill-down into a single datapoint: the executor span, nested model span, and evaluator scores">
  <img src="https://mintcdn.com/laminarai/-q9WJgn2x9iWK3Su/images/evaluations/eval-transcript.png?fit=max&auto=format&n=-q9WJgn2x9iWK3Su&q=85&s=81c678da7f0a2079aa68e11e8a5b2385" alt="Transcript view of a single evaluation datapoint with the Input block, executor span, model span, and accuracy and length_ok evaluator scores" width="1512" height="982" data-path="images/evaluations/eval-transcript.png" />
</Frame>

## Change something and run it again

Keep `groupName` the same and change the thing you want to test. The second run lands in the same group, and Laminar charts the average of every score dimension over time.

For example, change the prompt so the model returns a full explanatory sentence:

```python theme={null}
"content": (
    f"What is the capital of {data['country']}? "
    "Answer in ONE sentence explaining a fun fact."
),
```

Rerun. Because the output is now longer than 50 characters, `length_ok` drops from 1.0 to 0.0 for every datapoint while `accuracy` stays at 1.0. That's a regression you caught before production.

<Frame caption="The capitals group after three runs. length_ok goes from 1.0 on v1 and v2 to 0.0 on v3 (gpt-3.5 with a chatty prompt); accuracy stays flat">
  <img src="https://mintcdn.com/laminarai/-q9WJgn2x9iWK3Su/images/evaluations/eval-list.png?fit=max&auto=format&n=-q9WJgn2x9iWK3Su&q=85&s=1773b413046f97e2c2337d11115f1d43" alt="Evaluations list filtered by capitals group with a progression chart at the top showing length_ok dropping to zero on the most recent run" width="1512" height="982" data-path="images/evaluations/eval-list.png" />
</Frame>

See [Compare runs](/evaluations/comparing-runs) for side-by-side diffs and per-datapoint deltas.

## Add custom columns

In the evaluation results table, click **Columns → Add column** to create a computed column from SQL. This is useful for pulling fields out of `metadata`, `data`, `target`, or `scores` without changing your evaluation code.

Example: extract a model name from JSON metadata:

```sql theme={null}
simpleJSONExtractString(simpleJSONExtractRaw(metadata, 'llm'), 'model')
```

Pick a data type (usually `String` or `Float64`) and save. The expression runs per row on `evaluation_datapoints`. For more JSON parsing functions, see the [SQL editor](/platform/sql-editor).

## Next steps

<CardGroup cols={2}>
  <Card title="Concepts" href="/evaluations/concepts" icon="boxes">
    The executor, evaluator, datapoint, and group model in detail.
  </Card>

  <Card title="Compare runs" href="/evaluations/comparing-runs" icon="chart-line">
    Group runs, read the progression chart, diff side-by-side.
  </Card>

  <Card title="Datasets" href="/evaluations/datasets" icon="database">
    Point `evaluate()` at a Laminar dataset instead of a hardcoded list.
  </Card>

  <Card title="Manual API" href="/evaluations/manual-evaluation" icon="wrench">
    Lower-level control for pipelines where `evaluate()` is too opinionated.
  </Card>
</CardGroup>
