Diagnosing the Reliability of LLM-as-a-Judge via Item Response Theory
Abstract
A two-phase diagnostic framework based on Item Response Theory's Graded Response Model is introduced to assess the reliability of LLM-as-a-Judge by examining intrinsic consistency and human alignment.
While LLM-as-a-Judge is widely used in automated evaluation, existing validation practices primarily operate at the level of observed outputs, offering limited insight into whether LLM judges themselves function as stable and reliable measurement instruments. To address this limitation, we introduce a two-phase diagnostic framework for assessing the reliability of LLM-as-a-Judge, grounded in Item Response Theory (IRT). The framework adopts the Graded Response Model (GRM) of IRT and formalizes reliability along two complementary dimensions: (1) intrinsic consistency, defined as the stability of measurement behavior under prompt variations, and (2) human alignment, capturing correspondence with human quality assessments. We empirically examine diverse LLM judges with this framework and show that leveraging IRT-GRM yields interpretable signals for diagnosing judgments systematically. These signals provide practical guidance for verifying the reliability of LLM-as-a-Judge and identifying potential causes of unreliability.
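For readers unfamiliar with the GRM, the sketch below gives the standard Samejima parameterization of the model; the abstract does not specify the paper's exact notation, and the mapping of LLM judges onto items and of judged responses onto respondents is an assumption made here for illustration only.

% Standard Graded Response Model (GRM), stated for illustration.
% Assumed mapping (not confirmed by the abstract): judge/prompt variant = item i,
% judged response = respondent j with latent quality \theta_j.
% Probability of receiving grade k or higher from item i, with discrimination a_i
% and ordered thresholds b_{i,1} < \dots < b_{i,K-1}:
\[
  P^{*}_{ik}(\theta_j) \;=\; \frac{1}{1 + \exp\!\bigl[-a_i(\theta_j - b_{ik})\bigr]},
  \qquad k = 1, \dots, K-1,
\]
% with boundary conventions P^{*}_{i0}(\theta_j) = 1 and P^{*}_{iK}(\theta_j) = 0,
% so the probability of receiving exactly grade k is the difference of adjacent
% boundary curves:
\[
  P_{ik}(\theta_j) \;=\; P^{*}_{ik}(\theta_j) \;-\; P^{*}_{i,k+1}(\theta_j).
\]

Under this standard formulation, the item parameters a_i and b_{ik} characterize how a judge discriminates quality levels and where its grade thresholds fall, which is the kind of signal the framework inspects for consistency and alignment.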
Community
The following papers were recommended by the Semantic Scholar API:
- Grading Scale Impact on LLM-as-a-Judge: Human-LLM Alignment Is Highest on 0-5 Grading Scale (2026)
- Judging Against the Reference: Uncovering Knowledge-Driven Failures in LLM-Judges on QA Evaluation (2026)
- RULERS: Locked Rubrics and Evidence-Anchored Scoring for Robust LLM Evaluation (2026)
- Calibrating LLM Judges: Linear Probes for Fast and Reliable Uncertainty Estimation (2025)
- When Wording Steers the Evaluation: Framing Bias in LLM judges (2026)
- Becoming Experienced Judges: Selective Test-Time Learning for Evaluators (2025)
- A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth (2026)