arXiv preprint - Nash Learning from Human Feedback

In this episode we discuss Nash Learning from Human Feedback by Remi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, and Bilal Piot from Google DeepMind. The paper introduces Nash Learning from Human Feedback (NLHF), an approach to fine-tuning large language models (LLMs) on human preferences that differs from the usual reinforcement learning from human feedback (RLHF) pipeline. Rather than fitting a reward model, NLHF learns a preference model from pairwise comparisons, then refines the LLM's policy toward the Nash equilibrium of that model: a policy whose responses no competing policy can beat on average. The authors present Nash-MD, an algorithm based on mirror descent, along with gradient descent variants for implementing NLHF, and report results on a text summarization task, suggesting NLHF is a promising direction for aligning LLMs with human preferences.
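To make the Nash equilibrium idea concrete, here is a minimal sketch on a toy problem with three candidate responses. It is not the paper's Nash-MD algorithm: it uses plain multiplicative-weights self-play with averaged iterates as a simplified stand-in, and the preference matrix `PREF` and the function names are hypothetical, chosen only for illustration.

```python
import math


def nash_policy(pref, steps=500, lr=1.0):
    """Approximate a Nash equilibrium of a preference matrix by self-play.

    pref[i][j] is the probability that response i is preferred to response j,
    so pref[i][j] + pref[j][i] == 1. Returns the average of the policies
    visited during training, as a list of probabilities.
    """
    n = len(pref)
    policy = [1.0 / n] * n  # start from the uniform policy
    average = [0.0] * n
    for _ in range(steps):
        average = [a + p / steps for a, p in zip(average, policy)]
        # win_rate[i]: chance that response i beats a sample from the current policy
        win_rate = [sum(pref[i][j] * policy[j] for j in range(n)) for i in range(n)]
        # Multiplicative-weights step: upweight responses that beat the current policy
        weights = [p * math.exp(lr * w) for p, w in zip(policy, win_rate)]
        total = sum(weights)
        policy = [w / total for w in weights]
    return average


def best_challenger_win_rate(pref, policy):
    """Highest win rate any single response achieves against `policy`."""
    n = len(pref)
    return max(sum(pref[i][j] * policy[j] for j in range(n)) for i in range(n))


# Hypothetical preference probabilities (assumed values, not from the paper).
PREF = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

if __name__ == "__main__":
    policy = nash_policy(PREF)
    print("policy:", [round(p, 3) for p in policy])
    print("best challenger win rate:", round(best_challenger_win_rate(PREF, policy), 3))
```

At an exact equilibrium, no challenger wins more than half the time against the policy, which is what `best_challenger_win_rate` checks. In this toy matrix the first response is preferred to both others, so the policy should concentrate on it; preference cycles would instead yield a mixed policy.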
