CJE - Causal Judge Evaluation
Your LLM judge scores are noisy and biased. CJE calibrates them to what actually matters.
Quick Start
```bash
pip install cje-eval
```
```python
from cje import analyze_dataset

results = analyze_dataset(
    fresh_draws_data={
        "gpt-4o": [
            {"prompt_id": "eval_001", "judge_score": 0.85, "oracle_label": 0.9},
            {"prompt_id": "eval_002", "judge_score": 0.72, "oracle_label": 0.7},
            {"prompt_id": "eval_003", "judge_score": 0.68},
            {"prompt_id": "eval_004", "judge_score": 0.79},
        ],
        "claude-sonnet": [
            {"prompt_id": "eval_001", "judge_score": 0.78, "oracle_label": 0.82},
            {"prompt_id": "eval_002", "judge_score": 0.81, "oracle_label": 0.79},
            {"prompt_id": "eval_003", "judge_score": 0.75},
            {"prompt_id": "eval_004", "judge_score": 0.83},
        ],
    }
)

results.plot_estimates(save_path="ranking.png")  # requires pip install "cje-eval[viz]"
```
CJE learns the judge→oracle mapping from labeled samples and applies it everywhere. Label 5–25% of samples with your oracle (human raters, strong model, downstream metric). Any bounded scale works automatically (0–1, 0–100, Likert 1–5).
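CJE's actual calibration pipeline is more involved, but the core judge→oracle mapping can be pictured as a monotone regression fit on the labeled slice and applied to every score. A minimal sketch with made-up numbers, using scikit-learn's `IsotonicRegression` purely for illustration (this is not CJE's internal implementation):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Labeled slice: (judge_score, oracle_label) pairs -- invented numbers
judge_labeled = np.array([0.85, 0.72, 0.78, 0.81])
oracle_labels = np.array([0.90, 0.70, 0.82, 0.79])

# Fit a monotone map from the judge's scale to the oracle's scale
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge_labeled, oracle_labels)

# Apply the learned map to all judge scores, labeled or not
all_scores = np.array([0.68, 0.75, 0.79, 0.83])
calibrated = iso.predict(all_scores)
```

Because the map is monotone and clipped to the labeled range, any bounded judge scale can be pushed through it without manual rescaling, which is the intuition behind "any bounded scale works automatically."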
Default workflow: If you can generate fresh responses on a shared prompt set, use Direct + two-stage calibration. Use IPS/DR only when you truly need off-policy estimation and overlap diagnostics look healthy enough to trust reweighting.
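For intuition on why the IPS path demands healthy overlap, here is a generic importance-sampling sketch (not CJE's API; propensities and rewards are synthetic): rewards logged under one policy are reweighted by the target-to-logging probability ratio, and the effective sample size of the weights is a standard overlap diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
rewards = rng.uniform(0.5, 1.0, size=n)   # calibrated rewards on logged responses
p_log = np.full(n, 0.5)                   # logging-policy propensities (synthetic)
p_tgt = rng.uniform(0.05, 0.95, size=n)   # target-policy propensities (synthetic)

w = p_tgt / p_log                         # importance weights
ips_estimate = float(np.mean(w * rewards))

# Effective sample size: n means perfect overlap; values far below n
# mean a few heavy weights dominate and the estimate is fragile.
ess = float(w.sum() ** 2 / (w ** 2).sum())
```

When the ESS collapses relative to n, reweighting-based estimators (IPS/DR) become high-variance, which is exactly the regime where the Direct path on a shared prompt set is the safer default.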
What CJE covers: reward calibration, calibration-aware inference, transport audits, and overlap diagnostics for counterfactual OPE.
Real-World Validation
We ran CJE on 29,511 physician-labeled HealthBench records with two LLM judges. Both judges were overconfident — by 24.5 pp and 13.0 pp respectively — and disagreed with each other by up to 73 percentage points on specific criteria categories. After calibration with just 5% oracle labels (~1,400 records), both converged to the physician ground truth.
Read the full HealthBench audit →
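The overconfidence figures quoted above are gaps between the judge's mean score and the oracle's mean label on the labeled slice, expressed in percentage points. A toy illustration of the arithmetic (values invented; the real figures are in the audit):

```python
from statistics import mean

judge_scores = [0.85, 0.72, 0.78, 0.81]   # judge, on the labeled slice (invented)
oracle_labels = [0.60, 0.50, 0.58, 0.56]  # physician labels, same prompts (invented)

# Positive gap = judge scores the responses higher than physicians do
bias_pp = 100 * (mean(judge_scores) - mean(oracle_labels))
```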
Example output: calibrated estimates with valid confidence intervals
Documentation
| Resource | Description |
|---|---|
| Interactive Tutorial | Walk through a complete example in Colab — no setup required |
| CJE in 3 Minutes | Video: why raw judge scores mislead and how CJE fixes it |
| Technical Walkthrough | Video: calibration, evaluation, and transport auditing pipeline |
| Operational Playbook | End-to-end runbook: audits, drift correction, label budgeting |
| Planning Notebook | Optimize your evaluation budget with pilot data |
| Full Docs | Installation, assumptions, API reference, research notes |
Bridges: Already running evals in Promptfoo, TruLens, LangSmith, OpenCompass, or Inspect AI? Convert those outputs into CJE format with one command.
Technical deep dives: Calibration methods · Diagnostics · Estimators · Interface/API · Experiments
Development
```bash
git clone https://github.com/cimo-labs/cje.git
cd cje && poetry install && make test
```
Citation
If you use CJE in your research, please cite:
```bibtex
@misc{landesberg2025causaljudgeevaluationcalibrated,
  title={Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems},
  author={Eddie Landesberg},
  year={2025},
  eprint={2512.11150},
  archivePrefix={arXiv},
  primaryClass={stat.ME},
  url={https://arxiv.org/abs/2512.11150},
}
```
License
MIT — See LICENSE for details.