arXiv preprint – Self-Rewarding Language Models

In this episode, we discuss Self-Rewarding Language Models by Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. The paper introduces self-rewarding language models (SR-LMs), which provide their own rewards during training via LLM-as-a-Judge prompting, with the aim of self-improvement beyond the ceiling imposed by human feedback. Using Iterative Direct Preference Optimization, SR-LMs improve both their ability to follow instructions and the quality of the rewards they assign to their own outputs across successive iterations. The authors show that applying three iterations of this approach to Llama 2 70B yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, suggesting the potential for models that continue to improve in both capability and reward modeling.
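To make the training loop concrete, here is a minimal sketch of the iterative self-rewarding procedure as summarized above. It is not the authors' implementation: `generate`, `judge_score`, and `dpo_update` are hypothetical placeholders standing in for response sampling, LLM-as-a-Judge scoring, and a DPO training step.

```python
import random
from typing import Callable, List, Tuple

# (prompt, chosen_response, rejected_response)
Pair = Tuple[str, str, str]

def build_preference_pairs(
    prompts: List[str],
    generate: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    n_samples: int = 4,
) -> List[Pair]:
    """For each prompt, sample candidates and let the model's own judge
    (LLM-as-a-Judge prompting in the paper) score them to form DPO pairs."""
    pairs: List[Pair] = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        scored = sorted((judge_score(prompt, c), c) for c in candidates)
        (low, worst), (high, best) = scored[0], scored[-1]
        if high > low:  # keep a pair only when the judge actually discriminates
            pairs.append((prompt, best, worst))
    return pairs

def self_rewarding_training(
    prompts: List[str],
    generate: Callable[[str], str],
    judge_score: Callable[[str, str], float],
    dpo_update: Callable[[List[Pair]], None],
    iterations: int = 3,
) -> None:
    """Iterative DPO: each round's model produces the next round's responses
    and rewards, so instruction following and reward quality improve together."""
    for _ in range(iterations):
        pairs = build_preference_pairs(prompts, generate, judge_score)
        dpo_update(pairs)  # produces model M_{t+1} from M_t

# Toy run with random stand-ins, only to show the control flow.
if __name__ == "__main__":
    self_rewarding_training(
        prompts=["Explain Direct Preference Optimization in one sentence."],
        generate=lambda p: f"response-{random.randint(0, 9)}",
        judge_score=lambda p, r: random.random(),  # the paper uses a rubric-based score
        dpo_update=lambda pairs: print(f"DPO step on {len(pairs)} pairs"),
    )
```

The key design point the sketch illustrates is that a single model plays both roles: it generates candidate responses and also judges them, so each DPO round can raise the quality of both the responses and the rewards used in the next round.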

