abstract:91db6feddd1d2b54.tex

1: \begin{abstract}

2: Direct Preference Optimization (DPO) and its variants have become the de facto standard for aligning large language models (LLMs) with human preferences or specific goals.

3: However, DPO requires high-quality preference data and suffers from unstable preference optimization.

4: In this work, we aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques.

5: For preference data generation, we demonstrate that existing scoring-based reward models produce unsatisfactory preference data and perform poorly on out-of-distribution tasks.

6: This significantly degrades LLM alignment performance when such data are used for preference tuning.

7: To ensure high-quality preference data generation, we propose an iterative pairwise ranking mechanism that derives a preference ranking of completions from pairwise comparison signals.

8: For training regularization,

9: we observe that preference optimization tends to converge better when the LLM-predicted likelihood of preferred samples is allowed to decrease slightly.

10: However, the widely used supervised next-word prediction regularization strictly prevents any likelihood reduction of preferred samples.

11: This observation motivates our design of a budget-controlled regularization formulation.

12: Empirically, we show that combining the two designs yields aligned models that surpass the existing state of the art on two popular benchmarks.

13: \end{abstract}

14: