pydantic-evals 1.77.0


pip install pydantic-evals

  Latest version

Released: Apr 03, 2026


Meta
Author: Samuel Colvin, Marcelo Trylesinski, David Montague, Alex Hall, Douwe Maan
Requires Python: >=3.10

Classifiers

Development Status
  • 5 - Production/Stable

Environment
  • Console
  • MacOS X

Intended Audience
  • Developers
  • Information Technology
  • System Administrators

License
  • OSI Approved :: MIT License

Operating System
  • POSIX :: Linux
  • Unix

Programming Language
  • Python
  • Python :: 3
  • Python :: 3 :: Only
  • Python :: 3.10
  • Python :: 3.11
  • Python :: 3.12
  • Python :: 3.13
  • Python :: 3.14

Topic
  • Internet
  • Software Development :: Libraries :: Python Modules

Pydantic Evals


This is a library for evaluating non-deterministic (or "stochastic") functions in Python. It provides a simple, Pythonic interface for defining and running evaluations of such functions, and for analyzing the results.

While this library is developed as part of Pydantic AI, it only uses Pydantic AI for a small subset of generative functionality internally, and it is designed to be used with arbitrary "stochastic function" implementations. In particular, it can be used with other (non-Pydantic AI) AI libraries, agent frameworks, etc.

As with Pydantic AI, this library prioritizes type safety and use of common Python syntax over esoteric, domain-specific use of Python syntax.

Full documentation is available at ai.pydantic.dev/evals.

Example

While you'd typically use Pydantic Evals with more complex functions (such as Pydantic AI agents or graphs), here's a quick example that evaluates a simple function against a test case using both custom and built-in evaluators:

from pydantic_evals import Case, Dataset
from pydantic_evals.evaluators import Evaluator, EvaluatorContext, IsInstance

# Define a test case with inputs and expected output
case = Case(
    name='capital_question',
    inputs='What is the capital of France?',
    expected_output='Paris',
)

# Define a custom evaluator
class MatchAnswer(Evaluator[str, str]):
    def evaluate(self, ctx: EvaluatorContext[str, str]) -> float:
        if ctx.output == ctx.expected_output:
            return 1.0
        elif isinstance(ctx.output, str) and ctx.expected_output.lower() in ctx.output.lower():
            return 0.8
        return 0.0

# Create a dataset with the test case and evaluators
dataset = Dataset(
    name='capital_eval',
    cases=[case],
    evaluators=[IsInstance(type_name='str'), MatchAnswer()],
)

# Define the function to evaluate
async def answer_question(question: str) -> str:
    return 'Paris'

# Run the evaluation
report = dataset.evaluate_sync(answer_question)
report.print(include_input=True, include_output=True)
"""
                                    Evaluation Summary: answer_question
┏━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓
┃ Case ID          ┃ Inputs                         ┃ Outputs ┃ Scores            ┃ Assertions ┃ Duration ┃
┡━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩
│ capital_question │ What is the capital of France? │ Paris   │ MatchAnswer: 1.00 │ ✔          │     10ms │
├──────────────────┼────────────────────────────────┼─────────┼───────────────────┼────────────┼──────────┤
│ Averages         │                                │         │ MatchAnswer: 1.00 │ 100.0% ✔   │     10ms │
└──────────────────┴────────────────────────────────┴─────────┴───────────────────┴────────────┴──────────┘
"""

Using the library with more complex functions, such as Pydantic AI agents, is similar — all you need to do is define a task function wrapping the function you want to evaluate, with a signature that matches the inputs and outputs of your test cases.
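For instance, assuming a hypothetical agent object with an async run method that resolves to a result carrying an output attribute (as Pydantic AI agents provide), the wrapper pattern can be sketched with a stub standing in for the real model call, so it runs offline:

```python
import asyncio

class FakeAgent:
    """Stand-in for a Pydantic AI agent: run() resolves to an object with .output."""

    class Result:
        def __init__(self, output: str):
            self.output = output

    async def run(self, prompt: str) -> 'FakeAgent.Result':
        # A real agent would call a model here; the stub answers directly.
        return FakeAgent.Result('Paris')

agent = FakeAgent()

# The task function: its signature matches the Case above (str in, str out),
# so dataset.evaluate_sync(answer_question) can call it for each case.
async def answer_question(question: str) -> str:
    result = await agent.run(question)
    return result.output

print(asyncio.run(answer_question('What is the capital of France?')))
```

Swapping FakeAgent for a real agent leaves the task function, and therefore the dataset and evaluators, unchanged.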

Logfire Integration

Pydantic Evals uses OpenTelemetry to record traces for each case in your evaluations.

You can send these traces to any OpenTelemetry-compatible backend. For the best experience, we recommend Pydantic Logfire, which includes custom views for evals:

[Screenshots: Logfire Evals overview and case view]

You'll see full details about the inputs, outputs, token usage, execution durations, and more. You'll also have access to the full trace for each case, which is ideal for debugging, writing path-aware evaluators, or running similar evaluations against production traces.

Basic setup:

import logfire

logfire.configure(
    send_to_logfire='if-token-present',
    environment='development',
    service_name='evals',
)

...

my_dataset.evaluate_sync(my_task)

Read more about the Logfire integration in the documentation at ai.pydantic.dev/evals.

1.77.0 Apr 03, 2026
1.76.0 Apr 02, 2026
1.75.0 Apr 01, 2026
1.74.0 Mar 31, 2026
1.73.0 Mar 27, 2026
1.72.0 Mar 26, 2026
1.71.0 Mar 24, 2026
1.70.0 Mar 18, 2026
1.69.0 Mar 17, 2026
1.68.0 Mar 13, 2026
1.67.0 Mar 06, 2026
1.66.0 Mar 05, 2026
1.65.0 Mar 03, 2026
1.64.0 Mar 02, 2026
1.63.0 Feb 23, 2026
1.62.0 Feb 19, 2026
1.61.0 Feb 18, 2026
1.60.0 Feb 17, 2026
1.59.0 Feb 14, 2026
1.58.0 Feb 11, 2026
1.57.0 Feb 10, 2026
1.56.0 Feb 06, 2026
1.55.0 Feb 05, 2026
1.54.0 Feb 04, 2026
1.53.0 Feb 04, 2026
1.52.0 Feb 03, 2026
1.51.0 Jan 31, 2026
1.50.0 Jan 30, 2026
1.49.0 Jan 29, 2026
1.48.0 Jan 28, 2026
1.47.0 Jan 24, 2026
1.46.0 Jan 23, 2026
1.44.0 Jan 17, 2026
1.43.0 Jan 16, 2026
1.42.0 Jan 14, 2026
1.41.0 Jan 10, 2026
1.40.0 Jan 07, 2026
1.39.1 Jan 06, 2026
1.39.0 Dec 24, 2025
1.38.0 Dec 23, 2025
1.37.0 Dec 20, 2025
1.36.0 Dec 19, 2025
1.35.0 Dec 18, 2025
1.34.0 Dec 17, 2025
1.33.0 Dec 16, 2025
1.32.0 Dec 13, 2025
1.31.0 Dec 12, 2025
1.30.1 Dec 11, 2025
1.30.0 Dec 11, 2025
1.29.0 Dec 10, 2025
1.28.0 Dec 09, 2025
1.27.0 Dec 05, 2025
1.26.0 Dec 03, 2025
1.25.1 Nov 28, 2025
1.25.0 Nov 28, 2025
1.24.0 Nov 27, 2025
1.23.0 Nov 26, 2025
1.22.0 Nov 22, 2025
1.21.0 Nov 21, 2025
1.20.0 Nov 19, 2025
1.19.0 Nov 18, 2025
1.18.0 Nov 15, 2025
1.17.0 Nov 14, 2025
1.16.0 Nov 13, 2025
1.15.0 Nov 13, 2025
1.14.1 Nov 12, 2025
1.14.0 Nov 10, 2025
1.13.0 Nov 10, 2025
1.12.0 Nov 07, 2025
1.11.1 Nov 06, 2025
1.11.0 Nov 05, 2025
1.10.0 Nov 04, 2025
1.9.1 Oct 31, 2025
1.9.0 Oct 29, 2025
1.8.0 Oct 29, 2025
1.7.0 Oct 28, 2025
1.6.0 Oct 24, 2025
1.5.0 Oct 24, 2025
1.4.0 Oct 24, 2025
1.3.0 Oct 23, 2025
1.2.1 Oct 20, 2025
1.2.0 Oct 20, 2025
1.1.0 Oct 15, 2025
1.0.18 Oct 13, 2025
1.0.17 Oct 09, 2025
1.0.16 Oct 08, 2025
1.0.15 Oct 03, 2025
1.0.14 Oct 03, 2025
1.0.13 Oct 02, 2025
1.0.12 Oct 01, 2025
1.0.11 Sep 30, 2025
1.0.10 Sep 20, 2025
1.0.9 Sep 18, 2025
1.0.8 Sep 17, 2025
1.0.7 Sep 15, 2025
1.0.6 Sep 12, 2025
1.0.5 Sep 12, 2025
1.0.4 Sep 11, 2025
1.0.3 Sep 11, 2025
1.0.2 Sep 09, 2025
1.0.1 Sep 05, 2025
1.0.0 Sep 05, 2025
1.0.0b1 Aug 30, 2025
0.8.1 Aug 29, 2025
0.8.0 Aug 26, 2025
0.7.6 Aug 26, 2025
0.7.5 Aug 25, 2025
0.7.4 Aug 20, 2025
0.7.3 Aug 19, 2025
0.7.2 Aug 14, 2025
0.7.1 Aug 13, 2025
0.7.0 Aug 12, 2025
0.6.2 Aug 07, 2025
0.6.1 Aug 07, 2025
0.6.0 Aug 06, 2025
0.5.1 Aug 06, 2025
0.5.0 Aug 04, 2025
0.4.11 Aug 02, 2025
0.4.10 Jul 30, 2025
0.4.9 Jul 28, 2025
0.4.8 Jul 28, 2025
0.4.7 Jul 24, 2025
0.4.6 Jul 23, 2025
0.4.5 Jul 22, 2025
0.4.4 Jul 18, 2025
0.4.3 Jul 16, 2025
0.4.2 Jul 10, 2025
0.4.1 Jul 10, 2025
0.4.0 Jul 08, 2025
0.3.7 Jul 07, 2025
0.3.6 Jul 04, 2025
0.3.5 Jun 30, 2025
0.3.4 Jun 26, 2025
0.3.3 Jun 24, 2025
0.3.2 Jun 21, 2025
0.3.1 Jun 18, 2025
0.3.0 Jun 18, 2025
0.2.20 Jun 18, 2025
0.2.19 Jun 17, 2025
0.2.18 Jun 13, 2025
0.2.17 Jun 12, 2025
0.2.16 Jun 08, 2025
0.2.15 Jun 05, 2025
0.2.14 Jun 03, 2025
0.2.13 Jun 03, 2025
0.2.12 May 29, 2025
0.2.11 May 28, 2025
0.2.10 May 27, 2025
0.2.9 May 26, 2025
0.2.8 May 25, 2025
0.2.7 May 24, 2025
0.2.6 May 21, 2025
0.2.5 May 20, 2025
0.2.4 May 14, 2025
0.2.3 May 13, 2025
0.2.2 May 13, 2025
0.2.1 May 13, 2025
0.2.0 May 12, 2025
0.1.12 May 12, 2025
0.1.11 May 10, 2025
0.1.10 May 06, 2025
0.1.9 May 02, 2025
0.1.8 Apr 28, 2025
0.1.7 Apr 28, 2025
0.1.6 Apr 25, 2025
0.1.5 Apr 25, 2025
0.1.4 Apr 24, 2025
0.1.3 Apr 18, 2025
0.1.2 Apr 17, 2025
0.1.1 Apr 16, 2025
0.1.0 Apr 15, 2025
0.0.55 Apr 09, 2025
0.0.54 Apr 09, 2025
0.0.53 Apr 07, 2025
0.0.52 Apr 03, 2025
0.0.51 Apr 03, 2025
0.0.50 Apr 03, 2025
0.0.49 Apr 01, 2025
0.0.48 Mar 31, 2025
0.0.47 Mar 31, 2025

Wheel compatibility matrix

  • Platform: any (Python 3)

Files in release

Extras:
Dependencies:
anyio (>=0)
logfire-api (>=3.14.1)
pydantic-ai-slim (==1.77.0)
pydantic (>=2.12)
pyyaml (>=6.0.2)
rich (>=13.9.4)