arXiv preprint – Evaluating Large Language Models at Evaluating Instruction Following

In this episode, we discuss Evaluating Large Language Models at Evaluating Instruction Following by Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, Danqi Chen. This paper examines how effectively large language models (LLMs) can evaluate other models' instruction following, and introduces a new meta-evaluation benchmark called LLMBar. The benchmark consists of 419 pairs of outputs, where one output in each pair follows a given instruction and the other does not, and is designed to challenge the evaluative capabilities of LLMs. The findings show that LLM evaluators vary widely in their ability to judge instruction adherence and that even the best evaluators leave room for improvement; the paper proposes new prompting strategies to enhance LLM evaluator performance.

