Direct Preference Optimization: Your Language Model is Secretly a Reward Model (Explained)

By Seifeur Guizeni - CEO & Founder

Large unsupervised language models (LMs) learn broad world knowledge and some reasoning skills. However, controlling their behavior precisely is challenging due to the unsupervised nature of their training. Existing methods gather human labels of model generations’ quality and fine-tune the unsupervised LM to align with these preferences. This process often involves reinforcement learning from human feedback (RLHF). RLHF is a complex and often unstable procedure, as it first fits a reward model reflecting human preferences, followed by fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without deviating too far from the original model.


Large unsupervised language models (LMs) trained on extensive datasets acquire remarkable capabilities. However, since these models are trained on data generated by humans with varying goals and skillsets, selecting the desired responses and behavior from the model’s broad knowledge is crucial for building safe, performant, and controllable AI systems. Existing methods typically steer LMs to match human preferences using reinforcement learning (RL).

In this paper, a new parameterization of the reward model in RLHF is introduced to enable extracting the optimal policy in a closed form, solving the RLHF problem with a simple classification loss. This new algorithm, termed Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight. It eliminates the need for sampling from the LM during fine-tuning or significant hyperparameter tuning.

Direct Preference Optimization (DPO) demonstrates the capability to fine-tune LMs to align with human preferences better than existing methods. Notably, DPO surpasses PPO-based RLHF in controlling sentiment in generations and matches or improves response quality in summarization and single-turn dialogue. Furthermore, DPO simplifies the implementation and training process significantly.

Direct Preference Optimization: A New Approach

The process of sampling from the language model policy during training incurs significant computational costs. This paper introduces a method to directly optimize a language model to align with human preferences without the need for explicit reward modeling or reinforcement learning. Known as Direct Preference Optimization (DPO), this algorithm implicitly optimizes the same objective as existing RLHF algorithms while being simple to implement and easy to train. The DPO update aims to increase the relative log probability of preferred responses over dispreferred ones, incorporating a dynamic importance weight per example to prevent model degeneration.

Novel Approach with Direct Preference Optimization

DPO relies on a theoretical preference model, such as the Bradley-Terry model, to gauge how well a reward function aligns with empirical preference data. In contrast to existing methods that use the preference model to define a preference loss for training a reward model and then optimizing a policy, DPO defines the preference loss as a function of the policy directly. Using human preference data, DPO optimizes a policy through a simple binary cross-entropy objective, thereby yielding the optimal policy aligned with an implicit reward function derived from the preference data.

Experimentation and Effectiveness

The main contribution of Direct Preference Optimization (DPO) is a straightforward RL-free algorithm for training language models from preferences. Experimental results show that DPO is at least as effective as existing methods, including PPO-based RLHF, on tasks such as sentiment modulation, summarization, and dialogue, using language models with up to 6B parameters.

Self-supervised language models have shown improved performance through fine-tuning on datasets of instructions and human-written completions, thus enhancing their alignment with user intent. Instruction-tuning has facilitated the generalization and usability of large language models (LLMs). Subsequent works have focused on fine-tuning LLMs with datasets of human preferences to enhance proficiency in various tasks such as translation, summarization, story-telling, and instruction-following.

Despite the success of instruction tuning, incorporating human preference data remains a significant challenge in training large language models with reinforcement learning. Various approaches in both bandit and reinforcement learning settings have been proposed to learn policies from preferences, presenting potential directions for further exploration and improvement in this area.

Direct Preference Optimization

Earlier preference-based approaches first explicitly estimate the latent scoring function, which is essentially the reward model, and then optimize it. Instead of following this traditional two-stage approach, we propose a single-stage policy learning strategy that directly optimizes a policy to satisfy preferences.


The RLHF pipeline outlined by Ziegler et al. has three main phases: supervised fine-tuning (SFT), preference sampling and reward learning, and RL optimization.

In the first phase, a pre-trained language model (LM) is fine-tuned with supervised learning on high-quality data for a specific downstream task such as dialogue or summarization. This fine-tuned model, πSFT, serves as the starting point for the subsequent phases of the RLHF pipeline.

In the second phase, known as the Reward Modeling Phase, the fine-tuned model is prompted with inputs x to generate pairs of answers, which are then shown to human labelers who express a preference for one answer over the other. These preferences are assumed to be generated by a latent reward model, denoted r∗(x, y), that cannot be observed directly. The Bradley-Terry (BT) model is a common choice for relating rewards to preference probabilities. Framing preference modeling as binary classification, a parametrized reward model is fit by minimizing the negative log-likelihood on a static dataset of comparisons. The reward function is often initialized from the fine-tuned model, with enhancements to reduce variance and ensure effective learning.

During the RL Fine-Tuning Phase, the learned reward function provides feedback to the language model. The policy is optimized to maximize the expected reward, while a KL-divergence penalty weighted by β controls the deviation from the initial reference policy πref.
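The reward-modeling phase described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the names `bt_preference_prob` and `reward_nll` are hypothetical, and the rewards are plain floats rather than the outputs of a neural reward model.

```python
import math

def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bt_preference_prob(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that response a is preferred over response b."""
    return sigmoid(r_a - r_b)

def reward_nll(pairs) -> float:
    """Mean negative log-likelihood over (reward_chosen, reward_rejected) pairs."""
    total = sum(math.log(bt_preference_prob(r_w, r_l)) for r_w, r_l in pairs)
    return -total / len(pairs)

# Hypothetical reward scores for three labeled comparisons.
pairs = [(2.0, 0.5), (1.0, 1.0), (0.2, 1.5)]
loss = reward_nll(pairs)
```

The loss falls as the reward model scores the chosen response above the rejected one by a wider margin, which is what makes this a binary classification objective.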

Direct Preference Optimization

Our objective is to address the challenges associated with applying reinforcement learning algorithms for large-scale tasks like language model fine-tuning. Unlike conventional methods that involve learning and optimizing a reward model through RL, our approach focuses on directly optimizing policies using preferences. By leveraging a unique parameterization of the reward model, we can derive the optimal policy without the need for an RL training loop. Our key innovation lies in establishing an analytical mapping from reward functions to optimal policies, allowing us to transform a loss function defined over reward functions into a loss function over policies. This approach eliminates the necessity of fitting an explicit reward model separately while still aligning with existing models of human preferences like the Bradley-Terry model. Essentially, the policy network now embodies both the language model and the implicit reward, simplifying the optimization process.

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

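To use this result in practice, we start from Eq. 4, the closed-form optimal policy of the KL-constrained reward maximization objective:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x) \exp\left(\frac{1}{\beta} r(x, y)\right) \tag{4}$$

We can rearrange Eq. 4 to express the reward function in terms of its corresponding optimal policy πr, the reference policy πref, and the unknown partition function Z(·). Specifically, taking the logarithm of both sides of Eq. 4 and performing some algebra gives:

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \tag{5}$$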
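A small numerical check can make this rearrangement concrete. The sketch below assumes that Eq. 4 has the form πr(y | x) = πref(y | x) exp(r(x, y) / β) / Z(x), as in the original paper, and uses a made-up toy problem with three candidate responses. All numbers are hypothetical.

```python
import math

beta = 0.5
# Hypothetical toy setup: one prompt x with three candidate responses y.
pi_ref = [0.5, 0.3, 0.2]    # reference policy pi_ref(y | x)
reward = [1.0, 0.0, -1.0]   # reward r(x, y)

# Eq. 4: pi_r(y | x) = pi_ref(y | x) * exp(r(x, y) / beta) / Z(x)
unnormalized = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnormalized)       # partition function Z(x)
pi_r = [u / Z for u in unnormalized]

# Eq. 5: r(x, y) = beta * log(pi_r / pi_ref) + beta * log Z(x)
recovered = [
    beta * math.log(p / q) + beta * math.log(Z)
    for p, q in zip(pi_r, pi_ref)
]
```

In this toy case `recovered` matches `reward` up to floating-point error. With a real language model the sum defining Z(x) runs over all possible completions, which is why the partition function is treated as unknown.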

Reparameterization and Model Application

We can apply this reparameterization to the ground-truth reward r∗ and its corresponding optimal model π∗. The Bradley-Terry model depends only on the difference of rewards between two completions, that is, p∗(y1 ≻ y2 | x) = σ(r∗(x, y1) − r∗(x, y2)). When the reparameterization in Eq. 5 is substituted for r∗(x, y) in the preference model (Eq. 1), the partition function cancels out, so the human preference probability can be expressed solely in terms of the optimal policy π∗ and the reference policy πref. Consequently, the optimal RLHF policy π∗ under the Bradley-Terry model satisfies the preference model in Eq. 6:

$$p^*(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\left(\beta \log \frac{\pi^*(y_2 \mid x)}{\pi_{\mathrm{ref}}(y_2 \mid x)} - \beta \log \frac{\pi^*(y_1 \mid x)}{\pi_{\mathrm{ref}}(y_1 \mid x)}\right)} \tag{6}$$
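The cancellation of the partition function can be checked numerically. The sketch below uses a hypothetical three-response toy problem. It computes the Bradley-Terry preference probability once from the rewards and once from the policy log-ratios alone, without Z(x) appearing in the second computation.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.5
# Hypothetical toy setup: one prompt with three candidate responses.
pi_ref = [0.5, 0.3, 0.2]
reward = [1.0, 0.0, -1.0]

# Optimal policy for this reward: pi_ref * exp(r / beta), normalized.
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
pi_star = [w / sum(weights) for w in weights]

def pref_from_reward(i: int, j: int) -> float:
    """Bradley-Terry probability computed from the rewards."""
    return sigmoid(reward[i] - reward[j])

def pref_from_policy(i: int, j: int) -> float:
    """Same probability computed from policy log-ratios only (no Z needed)."""
    ratio_i = beta * math.log(pi_star[i] / pi_ref[i])
    ratio_j = beta * math.log(pi_star[j] / pi_ref[j])
    return sigmoid(ratio_i - ratio_j)
```

For every pair of responses the two functions agree, because the β log Z(x) term is the same for both responses and drops out of the difference.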

Model Derivation and Extension

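The derivation of Eq. 6, which uses the Bradley-Terry model, can be found in Appendix A.2 of the original paper. Similar expressions can be derived under the more general Plackett-Luce models [30, 21], as shown in Appendix A.3. With the probability of human preference data now framed in terms of the optimal policy rather than the reward model, we can write a maximum likelihood objective for a parametrized policy πθ, given in Eq. 7:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right] \tag{7}$$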
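The maximum likelihood objective over the policy can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the summed log-probability of each full response under the policy and under the reference model has already been computed, and the function name `dpo_loss` and the tuple layout are hypothetical.

```python
import math

def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(batch, beta: float = 0.1) -> float:
    """Mean DPO loss over a batch.

    Each item holds sequence log-probabilities:
    (policy_logp_chosen, policy_logp_rejected, ref_logp_chosen, ref_logp_rejected).
    """
    total = 0.0
    for pol_w, pol_l, ref_w, ref_l in batch:
        # beta * (log-ratio of chosen - log-ratio of rejected)
        margin = beta * ((pol_w - ref_w) - (pol_l - ref_l))
        total += -log_sigmoid(margin)
    return total / len(batch)

# Hypothetical log-probabilities for two preference pairs.
batch = [(-12.0, -15.0, -13.0, -14.0), (-20.0, -18.0, -19.0, -19.0)]
loss = dpo_loss(batch, beta=0.1)
```

When the policy equals the reference model, every margin is zero and the loss is log 2. Raising the policy's log-probability of the chosen response relative to the rejected one lowers the loss.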

Gradient Analysis and Mechanistic Understanding

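For a mechanistic understanding of DPO, it helps to analyze the gradient of the loss function LDPO. The gradient with respect to the parameters θ can be written as:

$$\nabla_\theta \mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\beta\, \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \Big[ \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big) \big[ \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big] \Big]$$

where $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$ is the reward implicitly defined by the language model πθ and the reference model πref. Intuitively, the gradient increases the likelihood of the preferred completions yw and decreases the likelihood of the dispreferred completions yl. Each example is weighted by the sigmoid term, which measures how much higher the implicit reward model rates the dispreferred completion, that is, how incorrectly it orders the two completions. The scaling by β accounts for the strength of the KL constraint.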
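The per-example weight in this gradient can be illustrated with a short sketch. The function names and all log-probability values below are hypothetical. The point is only to show how the sigmoid term behaves when the implicit reward orders a pair correctly versus incorrectly.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def implicit_reward(policy_logp: float, ref_logp: float, beta: float) -> float:
    """r_hat(x, y) = beta * (log pi_theta(y | x) - log pi_ref(y | x))."""
    return beta * (policy_logp - ref_logp)

def gradient_weight(pol_w, pol_l, ref_w, ref_l, beta=0.1) -> float:
    """sigma(r_hat(x, y_l) - r_hat(x, y_w)): the weight on one example's update."""
    r_w = implicit_reward(pol_w, ref_w, beta)
    r_l = implicit_reward(pol_l, ref_l, beta)
    return sigmoid(r_l - r_w)

# Implicit reward already ranks the preferred response higher: small weight.
well_ordered = gradient_weight(-5.0, -20.0, -10.0, -10.0, beta=1.0)
# Implicit reward ranks the dispreferred response higher: weight near 1.
mis_ordered = gradient_weight(-20.0, -5.0, -10.0, -10.0, beta=1.0)
```

Examples the model already orders correctly contribute little to the update, while badly mis-ordered examples receive close to the full gradient step.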

DPO Outline

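The general DPO pipeline has two steps. First, sample completions y1, y2 ∼ πref(· | x) for each prompt x and label them with human preferences to construct an offline preference dataset $\mathcal{D} = \{(x^{(i)}, y_w^{(i)}, y_l^{(i)})\}_{i=1}^{N}$. Second, optimize the language model πθ to minimize LDPO for the given πref, D, and desired β. In practice, publicly available preference datasets are often reused instead of sampling new completions. Because such datasets are typically sampled from πSFT, πref is initialized to πSFT whenever it is available. When πSFT is not available, πref is initialized by maximizing the likelihood of the preferred completions (x, yw). This procedure helps mitigate the distribution shift between the true reference distribution, which is unavailable, and the πref used by DPO.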
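The optimization step of this pipeline can be sketched on a toy problem. The code below is an illustration only, under strong simplifying assumptions: there is a single prompt, the policy is a softmax over a handful of candidate responses (a "tabular" policy, not a language model), and the policy starts at the reference policy. The function name `train_dpo` and all hyperparameter values are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def train_dpo(preferences, ref_probs, beta=0.5, lr=0.5, steps=200):
    """Gradient descent on the DPO loss for a tabular softmax policy.

    preferences: list of (chosen_index, rejected_index) for one fixed prompt.
    ref_probs:   reference policy over the candidate responses.
    """
    # Assumption: the policy is initialized at the reference policy.
    logits = [math.log(p) for p in ref_probs]
    n = len(preferences)
    for _ in range(steps):
        log_pi = [math.log(p) for p in softmax(logits)]
        grad = [0.0] * len(logits)
        for w, l in preferences:
            # Implicit reward margin r_hat(y_w) - r_hat(y_l).
            margin = beta * ((log_pi[w] - math.log(ref_probs[w]))
                             - (log_pi[l] - math.log(ref_probs[l])))
            weight = sigmoid(-margin)  # large when the pair is mis-ordered
            # For a softmax policy, d(log pi[w] - log pi[l]) / d logits
            # is +1 at index w and -1 at index l.
            grad[w] -= beta * weight / n
            grad[l] += beta * weight / n
        logits = [t - lr * g for t, g in zip(logits, grad)]
    return softmax(logits)

# Hypothetical labels: response 0 beats 1 and 2, and response 1 beats 2.
policy = train_dpo([(0, 1), (0, 2), (1, 2)], ref_probs=[1 / 3, 1 / 3, 1 / 3])
```

After training, the toy policy assigns the most probability to the response that wins the most comparisons. No reward model is fit and no sampling from the policy happens during training.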

Further implementation details and hyperparameters can be found in Appendix B of the original paper. Original work credits: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn.
