Concept. Also known as: LLM judge, model-as-judge, automated LLM evaluation

LLM-as-a-judge

Using a language model to score or rank another model's outputs in place of human annotators.

In a nutshell

Evaluating open-ended language model outputs (summaries, code, conversation) at scale is expensive when humans do it. LLM-as-a-judge replaces the human annotator with a second model that reads a response and returns a score or a preference between candidates. The approach is fast and cheap enough to run on every output. The hard part is that the judge is not neutral: it tends to prefer longer answers, outputs that resemble its own style, and text it generated itself. Those biases can corrupt both benchmark results and fine-tuning signals without being obvious from the numbers alone.
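To make the pattern concrete, here is a minimal Python sketch of a scoring judge plus a rough check for the length preference described above. Everything named here is an assumption for illustration: the prompt wording, the 1 to 10 scale, the `Score: <n>` reply format, and the `call_model` callable, which stands in for whatever client actually sends a prompt to the judge model. None of it is taken from Zheng et al.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical prompt template; real judge prompts are usually longer
# and include a rubric.
JUDGE_PROMPT = (
    "You are grading a response to a user question.\n"
    "Question: {question}\n"
    "Response: {response}\n"
    "Rate the response from 1 to 10. Reply in the form 'Score: <n>'."
)


def judge_score(
    question: str,
    response: str,
    call_model: Callable[[str], str],
) -> Optional[int]:
    """Ask the judge for a 1-10 score; return None if no valid score is found.

    `call_model` is a placeholder: any function that takes a prompt string
    and returns the judge model's text reply.
    """
    reply = call_model(JUDGE_PROMPT.format(question=question, response=response))
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        return None
    score = int(match.group(1))
    return score if 1 <= score <= 10 else None


def longer_win_rate(
    pairs: Iterable[Tuple[str, str, str]],
) -> Optional[float]:
    """Fraction of pairwise verdicts in which the longer response won.

    Each item is (response_a, response_b, winner) with winner "a" or "b".
    Ties, unknown winners, and equal-length pairs are skipped.
    Returns None when no pair could be counted.
    """
    counted = 0
    longer_wins = 0
    for resp_a, resp_b, winner in pairs:
        if winner not in ("a", "b") or len(resp_a) == len(resp_b):
            continue
        counted += 1
        longer = "a" if len(resp_a) > len(resp_b) else "b"
        if winner == longer:
            longer_wins += 1
    return longer_wins / counted if counted else None
```

A longer-win rate well above 0.5 is only a prompt to look closer, not proof of bias, since longer answers can also be genuinely better. Comparing the judge's rate against the same statistic computed from human labels on a shared sample is one way to tell the two apart.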

Where it came from

Year: 2023
Source: Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (NeurIPS 2023)
Why it mattered: The paper named and systematised the pattern; the practice of using GPT-4 as an evaluator preceded it.
