Eval function
The primary tool for the Measure stage is the Eval function, available in the axiom/ai/evals package. It provides a simple, declarative way to define a test suite for your capability directly in your codebase.
An Eval is structured around a few key parameters:
- data: An async function that returns your collection of { input, expected } pairs, which serve as your ground truth.
- task: The function that executes your AI capability, taking an input and producing an output.
- scorers: An array of scorer functions that score the output against the expected value.
- metadata: Optional metadata for the evaluation, such as a description.
/evals/text-match.eval.ts
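The following is a minimal sketch of such a file, based on the parameters described above. The import path follows the package name referenced in this guide, the exact call shape (including the suite name as the first argument) should be checked against the SDK reference, and the exact-match scorer is a hypothetical helper defined inline.

```typescript
// /evals/text-match.eval.ts — sketch, not a verbatim reproduction of the original example.
import { Eval } from 'axiom/ai/evals';

// Hypothetical inline scorer: exact string match between output and expected.
const ExactMatch = ({ output, expected }: { output: string; expected: string }) => ({
  name: 'ExactMatch',
  score: output.trim() === expected.trim() ? 1 : 0,
});

Eval('text-match', {
  // Ground-truth collection of { input, expected } pairs.
  data: async () => [
    { input: 'What is the capital of France?', expected: 'Paris' },
    { input: 'What is 2 + 2?', expected: '4' },
  ],
  // The task under test. A real suite would call your AI capability here.
  task: async (input: string) => {
    return input.includes('France') ? 'Paris' : '4';
  },
  scorers: [ExactMatch],
  metadata: { description: 'Checks exact text matches against ground truth' },
});
```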
Get started
Prerequisites
- Node.js 22.20 or higher
- Existing AI SDK setup (e.g., @ai-sdk/openai, ai)
- Axiom account with API token and dataset
Install dependencies
Configure
Set up environment variables
Create a .env file:
Create instrumentation setup (optional)
If you are evaluating components of a production application that is instrumented with OpenTelemetry, you can see your application spans in Axiom. To enable this, your instrumentation setup must be a function that can be passed in axiom.config.ts. An example is shown below.
Create src/instrumentation.node.ts:
/src/instrumentation.node.ts
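The original file is not reproduced here; below is a rough sketch using the standard OpenTelemetry Node SDK. The Axiom OTLP endpoint, header names, and the idea of exporting a setup function for axiom.config.ts to call are assumptions to verify against your own setup.

```typescript
// /src/instrumentation.node.ts — sketch: start an OpenTelemetry NodeSDK that
// exports spans to Axiom over OTLP/HTTP. Endpoint and header names are assumptions.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

export function startInstrumentation() {
  const sdk = new NodeSDK({
    traceExporter: new OTLPTraceExporter({
      url: 'https://api.axiom.co/v1/traces',
      headers: {
        Authorization: `Bearer ${process.env.AXIOM_TOKEN}`,
        'X-Axiom-Dataset': process.env.AXIOM_DATASET ?? '',
      },
    }),
  });
  sdk.start();
  return sdk;
}
```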
Create src/tracer.ts:
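A possible companion tracer module, using only the OpenTelemetry API; the tracer name is a placeholder.

```typescript
// /src/tracer.ts — sketch: expose a named tracer for manual spans.
import { trace, type Tracer } from '@opentelemetry/api';

// The tracer name is a placeholder; use your application's name.
export const tracer: Tracer = trace.getTracer('my-ai-app');
```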
Create axiom.config.ts
Create a configuration file at the root of your project:
/axiom.config.ts
Set up flags
Flags let you parameterize your AI behavior (like model choice or prompting strategies) and run experiments with different configurations. They’re type-safe via Zod schemas and can be overridden at runtime. Create src/lib/app-scope.ts:
/src/lib/app-scope.ts
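The exact helper the Axiom SDK exports for app scopes is not reproduced here. As a rough illustration of the idea only, a Zod-typed flag schema with defaults and runtime overrides might look like this (all names below are hypothetical, not the Axiom API):

```typescript
// /src/lib/app-scope.ts — illustrative sketch of Zod-typed flags with defaults.
import { z } from 'zod';

// Hypothetical flag schema: model choice and prompting strategy.
export const flagSchema = z.object({
  model: z.enum(['gpt-4o', 'gpt-4o-mini']).default('gpt-4o-mini'),
  strategy: z.enum(['zero-shot', 'few-shot']).default('zero-shot'),
});

export type Flags = z.infer<typeof flagSchema>;

// Hypothetical helper that merges runtime overrides with schema defaults.
export function resolveFlags(overrides: Partial<Flags> = {}): Flags {
  return flagSchema.parse({ ...flagSchema.parse({}), ...overrides });
}
```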
Write real-world evals
Let’s build a practical evaluation for a support ticket classification system. Create an eval file src/evals/ticket-classification.eval.ts:
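The file itself is not reproduced here; the sketch below assumes the Eval signature from earlier, the AI SDK's generateText call, and a hypothetical category-match scorer. Model choice and prompt wording are placeholders.

```typescript
// /src/evals/ticket-classification.eval.ts — sketch of a support-ticket classification eval.
import { Eval } from 'axiom/ai/evals';
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

const CATEGORIES = ['billing', 'bug', 'feature-request', 'account'] as const;

// Hypothetical scorer: 1 if the predicted category matches, otherwise 0.
const CategoryMatch = ({ output, expected }: { output: string; expected: string }) => ({
  name: 'CategoryMatch',
  score: output.trim().toLowerCase() === expected ? 1 : 0,
});

Eval('ticket-classification', {
  data: async () => [
    { input: 'I was charged twice for my subscription this month.', expected: 'billing' },
    { input: 'The dashboard crashes when I open the settings page.', expected: 'bug' },
    { input: 'Could you add dark mode to the mobile app?', expected: 'feature-request' },
  ],
  task: async (input: string) => {
    // Ask the model to answer with the category name only.
    const { text } = await generateText({
      model: openai('gpt-4o-mini'),
      prompt:
        `Classify this support ticket into one of: ${CATEGORIES.join(', ')}.\n` +
        `Reply with the category only.\n\nTicket: ${input}`,
    });
    return text;
  },
  scorers: [CategoryMatch],
  metadata: { description: 'Classifies support tickets into known categories' },
});
```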
Score with scorers
A scorer is a function that grades a capability’s output. Scorers receive the input, the generated output, and the expected value, and return a score (typically 0-1).
Simple custom scorer
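A minimal sketch, assuming scorers receive { input, output, expected } and return an object with a name and a 0-1 score; the exact return shape expected by the SDK may differ.

```typescript
// Sketch of a simple custom scorer: does the output contain the expected answer?
type ScorerArgs = { input: string; output: string; expected: string };

export const ContainsAnswer = ({ output, expected }: ScorerArgs) => ({
  name: 'ContainsAnswer',
  score: output.toLowerCase().includes(expected.toLowerCase()) ? 1 : 0,
});
```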
Use AutoEvals library
The autoevals library provides pre-built scorers for common tasks like semantic similarity, factual correctness, and text matching:
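For example, autoevals ships a Levenshtein string-distance scorer and an LLM-based Factuality scorer. The sketch below assumes they can be passed directly to the scorers array.

```typescript
// Sketch: reuse pre-built scorers from the autoevals library.
import { Eval } from 'axiom/ai/evals';
import { Levenshtein, Factuality } from 'autoevals';

Eval('text-match-autoevals', {
  data: async () => [{ input: 'Name the largest planet.', expected: 'Jupiter' }],
  task: async (input: string) => {
    // Placeholder task; call your capability here.
    return 'Jupiter';
  },
  // Levenshtein scores string similarity; Factuality uses an LLM judge.
  scorers: [Levenshtein, Factuality],
});
```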
Scorer with metadata
Scorers can return additional metadata alongside the score:
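A sketch, assuming the scorer's return object can carry a metadata field next to the score, as described above.

```typescript
// Sketch: a scorer that returns metadata alongside its score.
type ScorerArgs = { input: string; output: string; expected: string };

export const LengthAwareMatch = ({ output, expected }: ScorerArgs) => {
  const matched = output.trim() === expected.trim();
  return {
    name: 'LengthAwareMatch',
    score: matched ? 1 : 0,
    // Extra context recorded with the result, e.g. for debugging in traces.
    metadata: {
      outputLength: output.length,
      expectedLength: expected.length,
      matched,
    },
  };
};
```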
Run evaluations
To run your evaluation suites from your terminal, install the Axiom CLI and use the following commands.
Run all evals
This runs all eval files matching **/*.eval.{ts,js}.
Run specific eval file
Run evals matching a glob pattern
Run eval by name
List available evals without running
Override flags
Flags allow you to run experiments by testing different configurations without changing code.
From CLI (dot notation)
Override individual flags:
From JSON file
Create experiment.json:
Analyze results in Console
Coming soon
When you run an Eval, the Axiom SDK captures a detailed OpenTelemetry trace for the entire run. This includes parent spans for the evaluation suite and child spans for each individual test case, task execution, and scorer result. These traces are enriched with eval.* attributes, allowing you to deeply analyze results in the Axiom Console.
After running evals, you’ll see:
- Pass/fail status for each test case
- Scores from each scorer
- Comparison to baseline (if available)
- Links to view detailed traces in Axiom
What’s next?
Once your capability meets your quality benchmarks in the Measure stage, it’s ready to be deployed. Additional next steps include:
- Baseline comparisons: Run evals multiple times to track regressions over time
- Experiment with flags: Test different models or strategies using flag overrides
- Advanced scorers: Build custom scorers for domain-specific metrics
- CI/CD integration: Add axiom eval to your CI pipeline to catch regressions