Concept. Also known as: LLM judge, model-as-judge, automated LLM evaluation
LLM-as-a-judge
Using a language model to score or rank another model's outputs in place of human annotators.
In a nutshell
Evaluating open-ended language model outputs — summaries, code, conversation — at scale is expensive when humans do it. LLM-as-a-judge replaces the human annotator with a second model that reads a response and returns a score or preference. The approach is fast and cheap enough to run on every output. The hard part is that the judge is not neutral: it tends to prefer longer answers, outputs that resemble its own style, and text it generated itself. Those biases can corrupt both benchmarks and fine-tuning signal without being obvious from the numbers alone.
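As a rough illustration of the pattern and of one way to probe the length preference described above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the prompt wording, the "Verdict: A/B" reply format, the helper names `judge_pair` and `longer_answer_win_rate`, and `toy_verbose_judge`, which is a stand-in for a real judge model and simply picks the longer answer. In practice `model` would wrap a call to an actual LLM.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# Hypothetical prompt template; the wording is an assumption for illustration.
JUDGE_PROMPT = (
    "Question:\n{question}\n\n"
    "Answer A:\n{a}\n\n"
    "Answer B:\n{b}\n\n"
    "Which answer is better? Reply with 'Verdict: A' or 'Verdict: B'."
)


def judge_pair(model: Callable[[str], str], question: str, a: str, b: str) -> Optional[str]:
    """Ask the judge for a preference; return 'A', 'B', or None if the reply is unparseable."""
    reply = model(JUDGE_PROMPT.format(question=question, a=a, b=b))
    match = re.search(r"Verdict:\s*([AB])\b", reply)
    return match.group(1) if match else None


def longer_answer_win_rate(
    model: Callable[[str], str],
    pairs: Iterable[Tuple[str, str, str]],
) -> float:
    """Fraction of parseable verdicts in which the longer answer won."""
    wins = 0
    total = 0
    for question, a, b in pairs:
        if len(a) == len(b):
            continue  # no "longer" answer to compare against
        verdict = judge_pair(model, question, a, b)
        if verdict is None:
            continue  # skip replies that do not follow the expected format
        longer = "A" if len(a) > len(b) else "B"
        if verdict == longer:
            wins += 1
        total += 1
    return wins / total if total else float("nan")


def toy_verbose_judge(prompt: str) -> str:
    """Stand-in for a real judge model (assumption): always prefers the longer answer."""
    a = prompt.split("Answer A:\n")[1].split("\n\nAnswer B:")[0]
    b = prompt.split("Answer B:\n")[1].split("\n\nWhich answer")[0]
    return "Verdict: A" if len(a) >= len(b) else "Verdict: B"
```

A win rate near 1.0 for the longer answer is only a warning sign, not proof of bias, since longer answers are sometimes genuinely better. Comparing the judge's rate against human preferences on the same pairs would be one way to separate the two.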
Where it came from
Year: 2023
Source: Zheng et al., "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena" (NeurIPS 2023)
Why it mattered: The paper named and systematised the pattern; the practice of using GPT-4 as an evaluator preceded it.
