Moriya Dechtiar (Harvard University), Daniel Martin Katz (Illinois Tech – Chicago-Kent College of Law; Bucerius Center for Legal Technology & Data Science; Stanford CodeX – The Center for Legal Informatics; 273 Ventures; ALEA Institute), Sylvain Jaume (Massachusetts Institute of Technology), & Hongming Wang (Harvard University) have posted LLM as a Judge for Evaluating Contract Graphs: Multi-Judge Benchmarking and Agentic Uncertainty-Aware Refinement on SSRN. Here is the abstract:
We propose a calibrated, uncertainty-aware evaluation and agentic refinement framework that is structure-aware, semantically grounded, bias-aware, and compute-efficient. We apply this framework to the task of clause-level legal knowledge graph extraction from contracts, where both accuracy and confidence quantification matter. Instead of collapsing multiple evaluation signals into a single score, we treat inter-judge disagreement among complementary evaluators as a meaningful signal of epistemic uncertainty. Our multi-judge ensemble combines structural, semantic, and LLM-based judges and normalizes their outputs into a unified risk space. We show that the coefficient of variation over judge risks achieves 0.982 AUC for error detection and, after isotonic regression, yields well-calibrated probabilities. This enables selective refinement. We trigger an agentic Planner-Refiner-Verifier loop only when disagreement exceeds a validated threshold. On our dataset, selective refinement recovers 91% of errors at 67% precision while reducing compute by 90% versus blind refinement. We release ContractGraphEval v1.0, the first benchmark for clause-level extraction with calibrated uncertainty scores and reference-free features, enabling training of learned uncertainty estimators for production deployment.
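The core mechanism in the abstract — treating inter-judge disagreement as an uncertainty signal, calibrating it, and refining only above a threshold — can be sketched in a few lines. This is an illustrative reconstruction, not the authors' code: the judge names, risk values, and the 0.5 threshold are assumptions, and the minimal pool-adjacent-violators routine stands in for a full isotonic-regression implementation.

```python
from statistics import mean, pstdev

def disagreement_cv(risks):
    """Coefficient of variation across per-judge risk scores.
    High CV means the judges disagree, i.e. high epistemic uncertainty."""
    m = mean(risks)
    return pstdev(risks) / m if m > 0 else 0.0

def isotonic_fit(scores, labels):
    """Minimal pool-adjacent-violators (PAV) isotonic regression,
    mapping raw disagreement scores to monotone error probabilities.
    (scores, labels) would come from a small labeled validation set."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = [float(labels[i]) for i in order]
    blocks = []  # each block is [mean value, weight]
    for y in ys:
        blocks.append([y, 1.0])
        # merge adjacent blocks that violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            v2, w2 = blocks.pop()
            v1, w1 = blocks.pop()
            blocks.append([(v1 * w1 + v2 * w2) / (w1 + w2), w1 + w2])
    fitted = []
    for v, w in blocks:  # expand merged blocks back to per-point values
        fitted.extend([v] * int(w))
    return xs, fitted

def calibrated_prob(x, xs, fitted):
    """Step-function lookup: fitted value of the largest score <= x."""
    p = fitted[0]
    for xi, fi in zip(xs, fitted):
        if xi <= x:
            p = fi
    return p

# Hypothetical per-clause risk scores from three judges
# (structural, semantic, LLM-based); all values are illustrative.
easy_clause = [0.10, 0.12, 0.11]  # judges agree -> low CV -> accept
hard_clause = [0.05, 0.60, 0.30]  # judges disagree -> high CV -> refine

THRESHOLD = 0.5  # stands in for the paper's validated threshold
for risks in (easy_clause, hard_clause):
    cv = disagreement_cv(risks)
    action = "refine" if cv > THRESHOLD else "accept"
    print(f"CV={cv:.3f} -> {action}")
```

Only the high-disagreement clause would enter the Planner-Refiner-Verifier loop, which is how selective refinement avoids the compute cost of refining every extraction.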
