\begin{abstract}
Direct Preference Optimization (DPO) and its variants have become the de facto standards for aligning large language models (LLMs) with human preferences or specific goals.
However, DPO requires high-quality preference data and suffers from unstable preference optimization.
In this work, we aim to improve the preference optimization pipeline by taking a closer look at preference data generation and training regularization techniques.
For preference data generation, we demonstrate that existing scoring-based reward models produce unsatisfactory preference data and perform poorly on out-of-distribution tasks.
This significantly degrades LLM alignment performance when such data are used for preference tuning.
To ensure high-quality preference data generation, we propose an iterative pairwise ranking mechanism that derives a preference ranking of completions using pairwise comparison signals.
For training regularization,
we observe that preference optimization tends to achieve better convergence when the LLM-predicted likelihood of preferred samples is slightly reduced.
However, the widely used supervised next-word prediction regularization strictly prevents any likelihood reduction of preferred samples.
This observation motivates our design of a budget-controlled regularization formulation.
Empirically, we show that combining the two designs leads to aligned models that surpass existing SOTA across two popular benchmarks.
\end{abstract}