arXiv Preprint - Contrastive Preference Learning: Learning from Human Feedback without RL

In this episode we discuss Contrastive Preference Learning: Learning from Human Feedback without RL by Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, and Dorsa Sadigh. Traditional approaches to Reinforcement Learning from Human Feedback (RLHF) assume that human preferences follow reward, but recent research suggests they instead follow regret under the user’s optimal policy. Learning a reward function from feedback therefore rests on a flawed assumption, and optimizing that learned reward with RL adds further difficulty. Contrastive Preference Learning (CPL) is proposed as a new approach that learns optimal policies directly from preferences without the need for RL, using the principle of maximum entropy and a contrastive objective. CPL is off-policy, applies to a wide range of problems, and can handle high-dimensional and sequential RLHF tasks.
