arXiv preprint - sDPO: Don’t Use Your Data All at Once

In this episode, we discuss sDPO: Don’t Use Your Data All at Once by Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, and Chanjun Park. The paper introduces stepwise DPO (sDPO), a technique for aligning large language models (LLMs) with human preferences that uses the available preference datasets in stages rather than all at once. sDPO extends direct preference optimization (DPO): the model aligned in one stage serves as the reference model for the next, so each stage trains against a progressively better-aligned reference. According to the paper, the resulting model outperformed other popular LLMs with more parameters, which the authors present as evidence for the stepwise approach.
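For listeners who think in code, the staged idea can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: `dpo_loss` is the standard per-pair DPO objective computed from sequence log-probabilities, while `sdpo`, `train_dpo`, and `chunks` are hypothetical names introduced here. The sketch also assumes each stage starts its policy from the previous stage's aligned model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are log-probabilities of the chosen and rejected responses
    under the policy being trained and under the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)), written to avoid overflow for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def sdpo(initial_model, chunks, train_dpo):
    """Stepwise DPO loop over preference-data chunks (illustrative only).

    `train_dpo` is a hypothetical callable that runs one ordinary DPO
    training pass and returns the aligned model. After each stage, that
    aligned model becomes the reference for the next stage.
    """
    reference = initial_model
    for chunk in chunks:
        # Assumption: the policy for this stage is initialized from the
        # current reference, i.e. the previous stage's aligned model.
        policy = train_dpo(policy_init=reference,
                           reference=reference,
                           data=chunk)
        reference = policy
    return reference
```

Plain DPO corresponds to calling `train_dpo` once on all the data with the original model as reference. The loop differs only in splitting the data and updating the reference between stages.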
