CJE - Causal Judge Evaluation
Your LLM judge scores are noisy and biased. CJE calibrates them to what actually matters.
Quick Start
```bash
pip install cje-eval
```
```python
from cje import analyze_dataset

results = analyze_dataset(
    fresh_draws_data={
        "gpt-4o": [
            {"prompt_id": "eval_001", "judge_score": 0.85, "oracle_label": 0.9},
            {"prompt_id": "eval_002", "judge_score": 0.72, "oracle_label": 0.7},
            {"prompt_id": "eval_003", "judge_score": 0.68},
            {"prompt_id": "eval_004", "judge_score": 0.79},
        ],
        "claude-sonnet": [
            {"prompt_id": "eval_001", "judge_score": 0.78, "oracle_label": 0.82},
            {"prompt_id": "eval_002", "judge_score": 0.81, "oracle_label": 0.79},
            {"prompt_id": "eval_003", "judge_score": 0.75},
            {"prompt_id": "eval_004", "judge_score": 0.83},
        ],
    }
)

results.plot_estimates(save_path="ranking.png")  # requires pip install "cje-eval[viz]"
```
CJE learns the judge→oracle mapping from labeled samples and applies it everywhere. Label 5–25% of samples with your oracle (human raters, strong model, downstream metric). Any bounded scale works automatically (0–1, 0–100, Likert 1–5).
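CJE's actual calibration pipeline is more involved, but the core judge→oracle mapping can be pictured as a monotone regression fit on the labeled slice and applied to every score. A minimal sketch with made-up numbers, using scikit-learn's `IsotonicRegression` purely for illustration (this is not CJE's internal implementation):

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Labeled slice: (judge_score, oracle_label) pairs -- invented numbers
judge_labeled = np.array([0.85, 0.72, 0.78, 0.81])
oracle_labels = np.array([0.90, 0.70, 0.82, 0.79])

# Fit a monotone map from the judge's scale to the oracle's scale
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(judge_labeled, oracle_labels)

# Apply the learned map to all judge scores, labeled or not
all_scores = np.array([0.68, 0.75, 0.79, 0.83])
calibrated = iso.predict(all_scores)
```

Because the map is monotone and clipped to the labeled range, any bounded judge scale can be pushed through it without manual rescaling, which is the intuition behind "any bounded scale works automatically."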
Default workflow: If you can generate fresh responses on a shared prompt set, use Direct + two-stage calibration. Use IPS/DR only when you truly need off-policy estimation and overlap diagnostics look healthy enough to trust reweighting.
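For intuition on why the IPS path demands healthy overlap, here is a generic importance-sampling sketch (not CJE's API; propensities and rewards are synthetic): rewards logged under one policy are reweighted by the target-to-logging probability ratio, and the effective sample size of the weights is a standard overlap diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
rewards = rng.uniform(0.5, 1.0, size=n)   # calibrated rewards on logged responses
p_log = np.full(n, 0.5)                   # logging-policy propensities (synthetic)
p_tgt = rng.uniform(0.05, 0.95, size=n)   # target-policy propensities (synthetic)

w = p_tgt / p_log                         # importance weights
ips_estimate = float(np.mean(w * rewards))

# Effective sample size: n means perfect overlap; values far below n
# mean a few heavy weights dominate and the estimate is fragile.
ess = float(w.sum() ** 2 / (w ** 2).sum())
```

When the ESS collapses relative to n, reweighting-based estimators (IPS/DR) become high-variance, which is exactly the regime where the Direct path on a shared prompt set is the safer default.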
What CJE covers: reward calibration, calibration-aware inference, transport audits, and overlap diagnostics for counterfactual OPE.
Real-World Validation
We ran CJE on 29,511 physician-labeled HealthBench records with two LLM judges. Both judges were overconfident — by 24.5 pp and 13.0 pp respectively — and disagreed with each other by up to 73 percentage points on specific criteria categories. After calibration with just 5% oracle labels (~1,400 records), both converged to the physician ground truth.
Read the full HealthBench audit →
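The overconfidence figures quoted above are gaps between the judge's mean score and the oracle's mean label on the labeled slice, expressed in percentage points. A toy illustration of the arithmetic (values invented; the real figures are in the audit):

```python
from statistics import mean

judge_scores = [0.85, 0.72, 0.78, 0.81]   # judge, on the labeled slice (invented)
oracle_labels = [0.60, 0.50, 0.58, 0.56]  # physician labels, same prompts (invented)

# Positive gap = judge scores the responses higher than physicians do
bias_pp = 100 * (mean(judge_scores) - mean(oracle_labels))
```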
Example output: calibrated estimates with valid confidence intervals
Documentation
| Resource | Description |
|---|---|
| Interactive Tutorial | Walk through a complete example in Colab — no setup required |
| CJE in 3 Minutes | Video: why raw judge scores mislead and how CJE fixes it |
| Technical Walkthrough | Video: calibration, evaluation, and transport auditing pipeline |
| Operational Playbook | End-to-end runbook: audits, drift correction, label budgeting |
| Planning Notebook | Optimize your evaluation budget with pilot data |
| Full Docs | Installation, assumptions, API reference, research notes |
Bridges: Already running evals in Promptfoo, TruLens, LangSmith, OpenCompass, or Inspect AI? Convert those outputs into CJE format with one command.
Technical deep dives: Calibration methods · Diagnostics · Estimators · Interface/API · Experiments
Development
```bash
git clone https://github.com/cimo-labs/cje.git
cd cje && poetry install && make test
```
Citation
If you use CJE in your research, please cite:
```bibtex
@misc{landesberg2025causaljudgeevaluationcalibrated,
  title={Causal Judge Evaluation: Calibrated Surrogate Metrics for LLM Systems},
  author={Eddie Landesberg},
  year={2025},
  eprint={2512.11150},
  archivePrefix={arXiv},
  primaryClass={stat.ME},
  url={https://arxiv.org/abs/2512.11150},
}
```
License
MIT — See LICENSE for details.