Direct Preference Optimization: Your Language Model is Secretly a Reward Model


Abstract

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely unsupervised nature of their training. Existing methods for gaining such steerability collect human labels of the relative quality of model generations and fine-tune the unsupervised LM to align with these preferences, often with reinforcement learning from human feedback (RLHF). However, RLHF is a complex and often unstable procedure, first fitting a reward model that reflects the human preferences, and then fine-tuning the large unsupervised LM using reinforcement learning to maximize this estimated reward without drifting too far from the original model. In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss. The resulting algorithm, which we call Direct Preference Optimization (DPO), is stable, performant, and computationally lightweight, eliminating the need for sampling from the LM during fine-tuning or performing significant hyperparameter tuning. Our experiments show that DPO can fine-tune LMs to align with human preferences as well as or better than existing methods. Notably, fine-tuning with DPO exceeds PPO-based RLHF in ability to control sentiment of generations, and matches or improves response quality in summarization and single-turn dialogue while being substantially simpler to implement and train.

1. Introduction

Large unsupervised language models (LMs) trained on very large datasets acquire surprising capabilities [ 11 , 7 , 40 , 8 ]. However, these models are trained on data generated by humans with a wide variety of goals, priorities, and skillsets. Some of these goals and skillsets may not be desirable to imitate; for example, while we may want our AI coding assistant to understand common programming mistakes in order to correct them, nevertheless, when generating code, we would like to bias our model toward the (potentially rare) high-quality coding ability present in its training data. Similarly, we might want our language model to be aware of a common misconception believed by 50% of people, but we certainly do not want the model to claim this misconception to be true in 50% of queries about it! In other words, selecting the model’s desired responses and behavior from its very wide knowledge and abilities is crucial to building AI systems that are safe, performant, and controllable [ 26 ]. While existing methods typically steer LMs to match human preferences using reinforcement learning (RL),

Equal contribution; more junior authors listed earlier.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).

Figure 1: DPO optimizes for human preferences while avoiding reinforcement learning. Existing methods for fine-tuning language models with human feedback first fit a reward model to a dataset of prompts and human preferences over pairs of responses, and then use RL to find a policy that maximizes the learned reward. In contrast, DPO directly optimizes for the policy best satisfying the preferences with a simple classification objective, fitting an implicit reward model whose corresponding optimal policy can be extracted in closed form.

we will show that the RL-based objective used by existing methods can be optimized exactly with a simple binary cross-entropy objective, greatly simplifying the preference learning pipeline.

At a high level, existing methods instill the desired behaviors into a language model using curated sets of human preferences representing the types of behaviors that humans find safe and helpful. This preference learning stage occurs after an initial stage of large-scale unsupervised pre-training on a large text dataset. While the most straightforward approach to preference learning is supervised fine-tuning on human demonstrations of high quality responses, the most successful class of methods is reinforcement learning from human (or AI) feedback (RLHF/RLAIF; [ 12 , 2 ]). RLHF methods fit a reward model to a dataset of human preferences and then use RL to optimize a language model policy to produce responses assigned high reward without drifting excessively far from the original model. While RLHF produces models with impressive conversational and coding abilities, the RLHF pipeline is considerably more complex than supervised learning, involving training multiple LMs and sampling from the LM policy in the loop of training, incurring significant computational costs.

In this paper, we show how to directly optimize a language model to adhere to human preferences, without explicit reward modeling or reinforcement learning. We propose Direct Preference Optimization (DPO), an algorithm that implicitly optimizes the same objective as existing RLHF algorithms (reward maximization with a KL-divergence constraint) but is simple to implement and straightforward to train. Intuitively, the DPO update increases the relative log probability of preferred to dispreferred responses, but it incorporates a dynamic, per-example importance weight that prevents the model degeneration that we find occurs with a naive probability ratio objective. Like existing algorithms, DPO relies on a theoretical preference model (such as the Bradley-Terry model; [ 5 ]) that measures how well a given reward function aligns with empirical preference data. However, while existing methods use the preference model to define a preference loss to train a reward model and then train a policy that optimizes the learned reward model, DPO uses a change of variables to define the preference loss as a function of the policy directly. Given a dataset of human preferences over model responses, DPO can therefore optimize a policy using a simple binary cross entropy objective, producing the optimal policy to an implicit reward function fit to the preference data.

Our main contribution is Direct Preference Optimization (DPO), a simple RL-free algorithm for training language models from preferences. Our experiments show that DPO is at least as effective as existing methods, including PPO-based RLHF, for learning from preferences in tasks such as sentiment modulation, summarization, and dialogue, using language models with up to 6B parameters.

2. Related Work

Self-supervised language models of increasing scale learn to complete some tasks zero-shot [ 31 ] or with few-shot prompts [ 6 , 25 , 11 ]. However, their performance on downstream tasks and alignment with user intent can be significantly improved by fine-tuning on datasets of instructions and human-written completions [ 23 , 36 , 13 , 39 ]. This ‘instruction-tuning’ procedure enables LLMs to generalize to instructions outside of the instruction-tuning set and generally increases their usability [ 13 ]. Despite the success of instruction tuning, relative human judgments of response quality are often easier to collect than expert demonstrations, and thus subsequent works have fine-tuned LLMs with datasets of human preferences, improving proficiency in translation [ 18 ], summarization [ 38 , 49 ], story-telling [ 49 ], and instruction-following [ 26 , 32 ]. These methods first optimize a neural network reward function for compatibility with the dataset of preferences under a preference model such as the Bradley-Terry model [ 5 ], then fine-tune a language model to maximize the given reward using reinforcement learning algorithms, commonly REINFORCE [ 45 ], proximal policy optimization (PPO; [ 37 ]), or variants [ 32 ]. A closely-related line of work leverages LLMs fine-tuned for instruction following with human feedback to generate additional synthetic preference data for targeted attributes such as safety or harmlessness [ 2 ], using only weak supervision from humans in the form of a text rubric for the LLM’s annotations. These methods represent a convergence of two bodies of work: one body of work on training language models with reinforcement learning for a variety of objectives [ 33 , 27 , 46 ] and another body of work on general methods for learning from human preferences [ 12 , 19 ]. Despite the appeal of using relative human preferences, fine-tuning large language models with reinforcement learning remains a major practical challenge; this work provides a theoretically-justified approach to optimizing relative preferences without RL.

Outside of the context of language, learning policies from preferences has been studied in both bandit and reinforcement learning settings, and several approaches have been proposed. Contextual bandit learning using preferences or rankings of actions, rather than rewards, is known as a contextual dueling bandit (CDB; [ 48 , 14 ]). In the absence of absolute rewards, theoretical analysis of CDBs substitutes the notion of an optimal policy with a von Neumann winner , a policy whose expected win rate against any other policy is at least 50% [ 14 ]. However, in the CDB setting, preference labels are given online, while in learning from human preferences, we typically learn from a fixed batch of offline preference-annotated action pairs [ 47 ]. Similarly, preference-based RL (PbRL) learns from binary preferences generated by an unknown ‘scoring’ function rather than rewards [9, 35]. Various algorithms for PbRL exist, including methods that can reuse off-policy preference data, but generally involve first explicitly estimating the latent scoring function (i.e. the reward model) and subsequently optimizing it [ 16 , 9 , 12 , 34 , 19 ]. We instead present a single stage policy learning approach that directly optimizes a policy to satisfy preferences.

3. Preliminaries

We review the RLHF pipeline of Ziegler et al. [ 49 ] (and later [ 38 , 1 , 26 ]). It usually includes three phases: 1) supervised fine-tuning (SFT); 2) preference sampling and reward learning; and 3) RL optimization.

SFT: RLHF typically begins by fine-tuning a pre-trained LM with supervised learning on high-quality data for the downstream task(s) of interest (dialogue, summarization, etc.), to obtain a model π SFT .

Reward Modelling Phase: In the second phase the SFT model is prompted with prompts x to produce pairs of answers (y1, y2) ∼ π SFT(y|x). These are then presented to human labelers who express preferences for one answer, denoted as yw ≻ yl | x, where yw and yl denote the preferred and dispreferred completion amongst (y1, y2) respectively. The preferences are assumed to be generated by some latent reward model r∗(x, y), which we do not have access to. There are a number of approaches used to model preferences, the Bradley-Terry (BT) [ 5 ] model being a popular choice (although more general Plackett-Luce ranking models [ 30 , 21 ] are also compatible with the framework if we have access to several ranked answers). The BT model stipulates that the human preference distribution p∗ can be written as:

p∗(y1 ≻ y2 | x) = exp( r∗(x, y1) ) / [ exp( r∗(x, y1) ) + exp( r∗(x, y2) ) ].    (1)

Assuming access to a static dataset of comparisons D = {x^(i), y_w^(i), y_l^(i)}_{i=1}^N sampled from p∗, we can parametrize a reward model rφ(x, y) and estimate the parameters via maximum likelihood. Framing the problem as binary classification, we have the negative log-likelihood loss:

L_R(rφ, D) = −E_(x, yw, yl)∼D [ log σ( rφ(x, yw) − rφ(x, yl) ) ]    (2)

where σ is the logistic function. In the context of LMs, the network rφ(x, y) is often initialized from the SFT model π SFT(y|x) with the addition of a linear layer on top of the final transformer layer that produces a single scalar prediction for the reward value [ 49 ]. To ensure a reward function with lower variance, prior works normalize the rewards, such that E_{x,y∼D}[rφ(x, y)] = 0 for all x.
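
To make Eq. 2 concrete, here is a minimal sketch of the reward-modeling loss in PyTorch; the `reward_model` callable and the tokenized inputs are hypothetical placeholders, not the authors' implementation.

```python
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_input_ids, rejected_input_ids):
    """Negative log-likelihood of the Bradley-Terry model (Eq. 2).

    reward_model maps a batch of tokenized (prompt, completion) sequences
    to one scalar reward per sequence, shape (B,).
    """
    r_chosen = reward_model(chosen_input_ids)      # r_phi(x, y_w), shape (B,)
    r_rejected = reward_model(rejected_input_ids)  # r_phi(x, y_l), shape (B,)
    # -log sigma(r_w - r_l), written as softplus(-(r_w - r_l)) for numerical stability
    return F.softplus(-(r_chosen - r_rejected)).mean()
```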

RL Fine-Tuning Phase: During the RL phase, we use the learned reward function to provide feedback to the language model. In particular, we formulate the following optimization problem:

max_πθ E_{x∼D, y∼πθ(y|x)} [ rφ(x, y) ] − β D_KL[ πθ(y|x) || π_ref(y|x) ]    (3)


where β is a parameter controlling the deviation from the base reference policy π ref, namely the initial SFT model π SFT. In practice, the language model policy πθ is also initialized to π SFT. The added constraint is important, as it prevents the model from deviating too far from the distribution on which the reward model is accurate, as well as maintaining the generation diversity and preventing mode-collapse to single high-reward answers. Due to the discrete nature of language generation, this objective is not differentiable and is typically optimized with reinforcement learning. The standard approach [ 49 , 38 , 1 , 26 ] has been to construct the reward function r(x, y) = rφ(x, y) − β(log πθ(y|x) − log π_ref(y|x)), and maximize it using PPO [37].
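
As an illustration of the shaped reward that the RL phase maximizes (a sketch under assumed tensor shapes, not a full PPO implementation):

```python
def shaped_reward(r_phi, policy_logps, ref_logps, beta=0.1):
    """Per-sequence reward maximized with PPO in the standard RLHF pipeline:
    r(x, y) = r_phi(x, y) - beta * (log pi_theta(y|x) - log pi_ref(y|x)).

    r_phi:        (B,) learned reward of each sampled completion
    policy_logps: (B,) summed log pi_theta(y|x) over the sampled tokens
    ref_logps:    (B,) summed log pi_ref(y|x) over the same tokens
    """
    return r_phi - beta * (policy_logps - ref_logps)
```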

4. Direct Preference Optimization

Motivated by the challenges of applying reinforcement learning algorithms on large-scale problems such as fine-tuning language models, our goal is to derive a simple approach for policy optimization using preferences directly. Unlike prior RLHF methods, which learn a reward and then optimize it via RL, our approach leverages a particular choice of reward model parameterization that enables extraction of its optimal policy in closed form, without an RL training loop. As we will describe next in detail, our key insight is to leverage an analytical mapping from reward functions to optimal policies, which enables us to transform a loss function over reward functions into a loss function over policies. This change-of-variables approach avoids fitting an explicit, standalone reward model, while still optimizing under existing models of human preferences, such as the Bradley-Terry model. In essence, the policy network represents both the language model and the (implicit) reward.

Deriving the DPO objective. We start with the same RL objective as prior work, Eq. 3, under a general reward function r . Following prior work [ 29 , 28 , 17 , 15 ], it is straightforward to show that the optimal solution to the KL-constrained reward maximization objective in Eq. 3 takes the form:

π_r(y|x) = (1 / Z(x)) π_ref(y|x) exp( (1/β) r(x, y) ),    (4)

where Z(x) = Σ_y π_ref(y|x) exp( (1/β) r(x, y) ) is the partition function. See Appendix A.1 for a complete derivation. Even if we use the MLE estimate of the ground-truth reward function r∗, it is still expensive to estimate the partition function Z(x) [ 17 , 15 ], which makes this representation hard to utilize in practice. However, we can rearrange Eq. 4 to express the reward function in terms of its corresponding optimal policy π_r, the reference policy π_ref, and the unknown partition function Z(·). Specifically, we first take the logarithm of both sides of Eq. 4 and then with some algebra we obtain:

r(x, y) = β log( π_r(y|x) / π_ref(y|x) ) + β log Z(x).    (5)

We can apply this reparameterization to the ground-truth reward r∗ and corresponding optimal model π∗ . Fortunately, the Bradley-Terry model depends only on the difference of rewards between two completions, i.e., p∗ ( y 1 ≻y 2 |x ) = σ ( r∗ ( x, y 1 ) −r∗ ( x, y 2 )) . Substituting the reparameterization in Eq. 5 for r∗ ( x, y ) into the preference model Eq. 1, the partition function cancels, and we can express the human preference probability in terms of only the optimal policy π∗ and reference policy π ref . Thus, the optimal RLHF policy π∗ under the Bradley-Terry model satisfies the preference model:

p∗(y1 ≻ y2 | x) = 1 / ( 1 + exp( β log( π∗(y2|x) / π_ref(y2|x) ) − β log( π∗(y1|x) / π_ref(y1|x) ) ) )    (6)

The derivation is in Appendix A.2. While Eq. 6 uses the Bradley-Terry model, we can similarly derive expressions under the more general Plackett-Luce models [30, 21], shown in Appendix A.3.
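
For completeness, here is a sketch of the cancellation (the full derivation is in Appendix A.2): substituting Eq. 5 for r∗ in the Bradley-Terry probability σ( r∗(x, y1) − r∗(x, y2) ), the β log Z(x) terms are shared by both completions and cancel:

```latex
\begin{aligned}
p^*(y_1 \succ y_2 \mid x)
  &= \sigma\bigl(r^*(x, y_1) - r^*(x, y_2)\bigr) \\
  &= \sigma\Bigl(\beta \log \tfrac{\pi^*(y_1|x)}{\pi_{\mathrm{ref}}(y_1|x)} + \beta \log Z(x)
      - \beta \log \tfrac{\pi^*(y_2|x)}{\pi_{\mathrm{ref}}(y_2|x)} - \beta \log Z(x)\Bigr) \\
  &= \sigma\Bigl(\beta \log \tfrac{\pi^*(y_1|x)}{\pi_{\mathrm{ref}}(y_1|x)}
      - \beta \log \tfrac{\pi^*(y_2|x)}{\pi_{\mathrm{ref}}(y_2|x)}\Bigr),
\end{aligned}
```

which is exactly Eq. 6 written with the logistic function σ.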

Now that we have the probability of human preference data in terms of the optimal policy rather than the reward model, we can formulate a maximum likelihood objective for a parametrized policy πθ . Analogous to the reward modeling approach (i.e. Eq. 2), our policy objective becomes:

L_DPO(πθ; π_ref) = −E_(x, yw, yl)∼D [ log σ( β log( πθ(yw|x) / π_ref(yw|x) ) − β log( πθ(yl|x) / π_ref(yl|x) ) ) ].    (7)

This way, we fit an implicit reward using an alternative parameterization, whose optimal policy is simply πθ. Moreover, since our procedure is equivalent to fitting a reparametrized Bradley-Terry model, it enjoys certain theoretical properties, such as consistency under suitable assumptions on the preference data distribution [ 4 ]. In Section 5, we further discuss theoretical properties of DPO in relation to other works.
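
Eq. 7 is an ordinary binary cross-entropy over scaled log-probability ratios, so it can be written in a few lines. The sketch below assumes the per-sequence log probabilities (summed over completion tokens) have already been computed; it illustrates the objective rather than reproducing the authors' released code.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO objective of Eq. 7 as a binary cross-entropy over log-ratios.

    Each argument is a (B,) tensor of summed per-token log probabilities
    log pi(y|x), for the chosen (y_w) or rejected (y_l) completion.
    """
    chosen_logratios = policy_chosen_logps - ref_chosen_logps        # log pi_theta(y_w|x) - log pi_ref(y_w|x)
    rejected_logratios = policy_rejected_logps - ref_rejected_logps  # log pi_theta(y_l|x) - log pi_ref(y_l|x)
    logits = beta * (chosen_logratios - rejected_logratios)
    # -log sigma(logits) == softplus(-logits); average over preference pairs
    return F.softplus(-logits).mean()
```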

What does the DPO update do? For a mechanistic understanding of DPO, it is useful to analyze the gradient of the loss function L DPO . The gradient with respect to the parameters θ can be written as:

∇θ L_DPO(πθ; π_ref) = −β E_(x, yw, yl)∼D [ σ( r̂θ(x, yl) − r̂θ(x, yw) ) ( ∇θ log πθ(yw|x) − ∇θ log πθ(yl|x) ) ],

where the weight σ( r̂θ(x, yl) − r̂θ(x, yw) ) is higher when the reward estimate is wrong, the ∇θ log πθ(yw|x) term increases the likelihood of yw, and the −∇θ log πθ(yl|x) term decreases the likelihood of yl. Here, r̂θ(x, y) = β log( πθ(y|x) / π_ref(y|x) ) is the reward implicitly defined by the language model πθ and reference model π_ref (more in Section 5). Intuitively, the gradient of the loss function L_DPO increases the likelihood of the preferred completions yw and decreases the likelihood of dispreferred completions yl. Importantly, the examples are weighed by how much higher the implicit reward model r̂θ rates the dispreferred completions, scaled by β, i.e., how incorrectly the implicit reward model orders the completions, accounting for the strength of the KL constraint. Our experiments suggest the importance of this weighting, as a naïve version of this method without the weighting coefficient can cause the language model to degenerate (Appendix Table 3).

DPO outline. The general DPO pipeline is as follows: 1) Sample completions y1, y2 ∼ π_ref(· | x) for every prompt x, and label with human preferences to construct the offline dataset of preferences D = {x^(i), y_w^(i), y_l^(i)}_{i=1}^N; and 2) optimize the language model πθ to minimize L_DPO for the given π_ref and D and desired β. In practice, one would like to reuse preference datasets publicly available, rather than generating samples and gathering human preferences. Since the preference datasets are sampled using π SFT, we initialize π_ref = π SFT whenever available. However, when π SFT is not available, we initialize π_ref by maximizing likelihood of preferred completions (x, yw), that is, π_ref = arg max_π E_{x, yw∼D}[ log π(yw|x) ]. This procedure helps mitigate the distribution shift between the true reference distribution, which is unavailable, and the π_ref used by DPO. Further details related to the implementation and hyperparameters can be found in Appendix B.
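
As a usage sketch of this pipeline (assuming a HuggingFace-style causal LM whose forward pass returns `.logits`; the helper names, masking convention, and the `dpo_loss` sketch above are illustrative assumptions):

```python
import torch

def sequence_logprob(model, input_ids, completion_mask):
    """Sum of log p(token | prefix) over completion tokens only.

    input_ids:       (B, T) prompt tokens followed by completion tokens
    completion_mask: (B, T) 1.0 for completion tokens, 0.0 for prompt/padding
    """
    logits = model(input_ids).logits[:, :-1, :]   # prediction for token t+1 given the prefix
    labels = input_ids[:, 1:]
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = torch.gather(logps, 2, labels.unsqueeze(-1)).squeeze(-1)
    return (token_logps * completion_mask[:, 1:]).sum(dim=-1)   # (B,)

# One DPO step (pi_ref frozen, pi_theta trained):
# with torch.no_grad():
#     ref_w = sequence_logprob(pi_ref, chosen_ids, chosen_mask)
#     ref_l = sequence_logprob(pi_ref, rejected_ids, rejected_mask)
# loss = dpo_loss(sequence_logprob(pi_theta, chosen_ids, chosen_mask),
#                 sequence_logprob(pi_theta, rejected_ids, rejected_mask),
#                 ref_w, ref_l, beta=0.1)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```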

5. Theoretical Analysis of DPO

In this section, we give further interpretation of the DPO method, provide theoretical backing, and relate advantages of DPO to issues with actor-critic algorithms used for RLHF (such as PPO [37]).

5.1 Your Language Model Is Secretly a Reward Model

DPO is able to bypass both fitting an explicit reward and performing RL to learn the policy using a single maximum likelihood objective. Note the optimization objective Eq. 5 is equivalent to a Bradley-Terry model with a reward parameterization r∗(x, y) = β log( π∗θ(y|x) / π_ref(y|x) ), and we optimize our parametric model πθ, equivalently to the reward model optimization in Eq. 2 under the change of variables. In this section we will build the theory behind this reparameterization, show that it does not constrain the class of learned reward models, and show that it allows for the exact recovery of the optimal policy. We begin by defining an equivalence relation between reward functions.

Definition 1. We say that two reward functions r(x, y) and r′(x, y) are equivalent iff r(x, y) − r′(x, y) = f(x) for some function f.

It is easy to see that this is indeed an equivalence relation, which partitions the set of reward functions into classes. We can state the following two lemmas:

Lemma 1. Under the Plackett-Luce, and in particular the Bradley-Terry, preference framework, two reward functions from the same class induce the same preference distribution.

Lemma 2. Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem.
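
As a one-line sanity check of Lemma 1 (a sketch, not the appendix proof): if r′(x, y) = r(x, y) + f(x), the prompt-dependent shift cancels inside the Bradley-Terry probability,

```latex
p(y_1 \succ y_2 \mid x)
  = \sigma\bigl((r(x, y_1) + f(x)) - (r(x, y_2) + f(x))\bigr)
  = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr),
```

so the two reward functions induce identical preference distributions; the same cancellation extends to the Plackett-Luce case.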

The proofs are straightforward and we defer them to Appendix A.5. The first lemma is a well-known under-specification issue with the Plackett-Luce family of models [ 30 ]. Due to this under-specification, we usually have to impose additional identifiability constraints to achieve any guarantees on the MLE estimates from Eq. 2 [ 4 ]. The second lemma states that all reward functions from the same class yield the same optimal policy, hence for our final objective, we are only interested in recovering an arbitrary reward function from the optimal class. We prove the following Theorem in Appendix A.6:

Theorem 1. Under mild assumptions, all reward classes consistent with the Plackett-Luce (and Bradley-Terry in particular) models can be represented with the reparameterization r(x, y) = β log( π(y|x) / π_ref(y|x) ) for some model π(y|x) and a given reference model π_ref(y|x).

Proof Sketch. Consider any reward function r ( x, y ) , which induces a corresponding optimal model πr ( y|x ) , specified by Eq. 4. We will show that a reward function from the equivalence class of r can be represented using the reparameterization given above. We define the projection f as

f(r; π_ref, β)(x, y) = r(x, y) − β log Σ_y π_ref(y|x) exp( (1/β) r(x, y) ).    (8)

The operator f simply normalizes the reward function with the logarithm of the partition function of π_r. Since the added normalization term is only a function of the prefix x, f(r; π_ref, β)(x, y) is a reward function in the equivalence class of r(x, y). Finally, replacing r with the RHS of Eq. 5 (which holds for any reward function), we have f(r; π_ref, β)(x, y) = β log( π_r(y|x) / π_ref(y|x) ). That is, the projection f produces a member of the equivalence class of r with the desired form, and we do not lose any generality in our reward model from the proposed reparameterization.

We can alternatively view Theorem 1 as specifying exactly which reward function within each equivalence class the DPO reparameterization selects, that is, the reward function satisfying:

Σ_y π_ref(y|x) exp( (1/β) r(x, y) ) = 1,    (9)

where each summand π_ref(y|x) exp( (1/β) r(x, y) ) is exactly π(y|x) under the Theorem 1 reparameterization, i.e., π(y|x) is a valid distribution (probabilities are positive and sum to 1). However, following Eq. 4, we can see that Eq. 9 is the partition function of the optimal policy induced by the reward function r(x, y). The key insight of the DPO algorithm is that we can impose certain constraints on the under-constrained Plackett-Luce (and Bradley-Terry in particular) family of preference models, such that we preserve the class of representable reward models, but explicitly make the optimal policy in Eq. 4 analytically tractable for all prompts x.

5.2 Instability of Actor-Critic Algorithms

We can also use our framework to diagnose instabilities with standard actor-critic algorithms used for RLHF, such as PPO. We follow the RLHF pipeline and focus on the RL fine-tuning step outlined in Section 3. We can draw connections to the control as inference framework [ 20 ] for the constrained RL problem outlined in Section 3. We assume a parameterized model πθ(y|x) and minimize D_KL[ πθ(y|x) || π∗(y|x) ], where π∗ is the optimal policy from Eq. 4 induced by the reward function rφ(x, y). With some algebra this leads to the optimization objective:


max_πθ E_{πθ(y|x)} [ rφ(x, y) − β log Σ_y π_ref(y|x) exp( (1/β) rφ(x, y) ) − β log( πθ(y|x) / π_ref(y|x) ) ],    (10)

where the first two terms inside the expectation together form f(rφ, π_ref, β)(x, y) and the final term is the KL penalty.

This is the same objective optimized in prior works [ 49 , 38 , 1 , 26 ] using the DPO-equivalent reward for the reward class of rφ. In this setting, we can interpret the normalization term in f(rφ, π_ref, β) as the soft value function of the reference policy π_ref. While this term does not affect the optimal solution, without it, the policy gradient of the objective could have high variance, making learning unstable. We can accommodate the normalization term using a learned value function, but that can also be difficult to optimize. Alternatively, prior works have normalized rewards using a human completion baseline, essentially a single-sample Monte-Carlo estimate of the normalizing term. In contrast, the DPO reparameterization yields a reward function that does not require any baselines.


[Figure 2 panels: left, IMDb sentiment generation, expected reward vs. KL(πθ || π_ref), comparing DPO (ours), PPO-GT (our impl. and TRL), PPO (our impl.), Unlikelihood, and Preferred-FT; right, TL;DR summarization win rate vs. reference summaries across sampling temperatures, comparing DPO, Preferred-FT, GPT-J, PPO, SFT, and Best of 128.]

Figure 2: Left. The frontier of expected reward vs KL to the reference policy. DPO provides the highest expected reward for all KL values, demonstrating the quality of the optimization. Right. TL;DR summarization win rates vs. human-written summaries, using GPT-4 as evaluator. DPO exceeds PPO’s best-case performance on summarization, while being more robust to changes in the sampling temperature.

6. Experiments

In this section, we empirically evaluate DPO’s ability to train policies directly from preferences. First, in a well-controlled text-generation setting, we ask: how efficiently does DPO trade off maximizing reward and minimizing KL-divergence with the reference policy, compared to common preference learning algorithms such as PPO? Next, we evaluate DPO’s performance on larger models and more difficult RLHF tasks, including summarization and dialogue. We find that with almost no tuning of hyperparameters, DPO tends to perform as well or better than strong baselines like RLHF with PPO as well as returning the best of N sampled trajectories under a learned reward function. Before presenting these results, we describe the experimental set-up; additional details are in Appendix C.

Tasks. Our experiments explore three different open-ended text generation tasks. For all experiments, algorithms learn a policy from a dataset of preferences D = {x^(i), y_w^(i), y_l^(i)}_{i=1}^N. In controlled sentiment generation, x is a prefix of a movie review from the IMDb dataset [ 22 ], and the policy must generate y with positive sentiment. In order to perform a controlled evaluation, for this experiment we generate preference pairs over generations using a pre-trained sentiment classifier, where p(positive | x, yw) > p(positive | x, yl). For SFT, we fine-tune GPT-2-large until convergence on reviews from the train split of the IMDb dataset (further details in App C.1). In summarization, x is a forum post from Reddit; the policy must generate a summary y of the main points in the post. Following prior work, we use the Reddit TL;DR summarization dataset [ 41 ] along with human preferences gathered by Stiennon et al. We use an SFT model fine-tuned on human-written forum post summaries 2 with the TRLX [ 42 ] framework for RLHF. The human preference dataset was gathered by Stiennon et al. on samples from a different, but similarly-trained, SFT model. Finally, in single-turn dialogue, x is a human query, which may be anything from a question about astrophysics to a request for relationship advice. A policy must produce an engaging and helpful response y to a user’s query; we use the Anthropic Helpful and Harmless dialogue dataset [ 1 ], containing 170k dialogues between a human and an automated assistant. Each transcript ends with a pair of responses generated by a large (although unknown) language model along with a preference label denoting the human-preferred response. In this setting, no pre-trained SFT model is available; we therefore fine-tune an off-the-shelf language model on only the preferred completions to form the SFT model.

Evaluation. Our experiments use two different approaches to evaluation. In order to analyze the effectiveness of each algorithm in optimizing the constrained reward maximization objective, in the controlled sentiment generation setting we evaluate each algorithm by its frontier of achieved reward and KL-divergence from the reference policy; this frontier is computable because we have access to the ground-truth reward function (a sentiment classifier). However, in the real world, the ground truth reward function is not known; therefore, we evaluate algorithms with their win rate against a baseline policy, using GPT-4 as a proxy for human evaluation of summary quality and response helpfulness in the summarization and single-turn dialogue settings, respectively. For summarization, we use reference summaries in the test set as the baseline; for dialogue, we use the preferred response in the

2 https://huggingface.co/CarperAI/openai_summarize_tldr_sft


[Figure 3 panels: left, Anthropic-HH dialogue win rate vs. the chosen responses across sampling temperatures, comparing DPO, Preferred-FT, Best of 128, and Pythia-2.8B; right, evolution of the dialogue win rate over fine-tuning steps for DPO at sampling temperatures 1.0 and 0.7.]

Figure 3: Left. Win rates computed by GPT-4 for Anthropic-HH one-step dialogue; DPO is the only method that improves over chosen responses in the Anthropic-HH test set. Right. Win rates for different sampling temperatures over the course of training. DPO’s improvement over the dataset labels is fairly stable over the course of training for different sampling temperatures.

test dataset as the baseline. While existing studies suggest LMs can be better automated evaluators than existing metrics [ 10 ], we conduct a human study to justify our usage of GPT-4 for evaluation in Sec. 6.4. We find GPT-4 judgments correlate strongly with humans, with human agreement with GPT-4 typically similar or higher than inter-human annotator agreement.

Methods. In addition to DPO, we evaluate several existing approaches to training language models to adhere to human preferences. Most simply, we explore zero-shot prompting with GPT-J [ 43 ] in the summarization task and 2-shot prompting with Pythia-2.8B [ 3 ] in the dialogue task. In addition, we evaluate the SFT model as well as Preferred-FT , which is a model fine-tuned with supervised learning on the chosen completion yw from either the SFT model (in controlled sentiment and summarization) or a generic LM (in single-turn dialogue). Another pseudo-supervised method is Unlikelihood [ 44 ], which simply optimizes the policy to maximize the probability assigned to yw and minimize the probability assigned to yl; we use an optional coefficient α ∈ [0, 1] on the ‘unlikelihood’ term. We also consider PPO [ 37 ] using a reward function learned from the preference data and PPO-GT , which is an oracle that learns from the ground truth reward function available in the controlled sentiment setting. In our sentiment experiments, we use two implementations of PPO-GT, one off-the-shelf version [ 42 ] as well as a modified version that normalizes rewards and further tunes hyperparameters to improve performance (we also use these modifications when running ‘normal’ PPO with learned rewards). Finally, we consider the Best of N baseline, sampling N responses from the SFT model (or Preferred-FT in dialogue) and returning the highest-scoring response according to a reward function learned from the preference dataset. This high-performing method decouples the quality of the reward model from the PPO optimization, but is computationally impractical even for moderate N as it requires sampling N completions for every query at test time.
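
For concreteness, the Best of N baseline can be summarized in a few lines (hypothetical `sample_fn` and `reward_fn` helpers; decoding details are omitted):

```python
def best_of_n(prompt, sample_fn, reward_fn, n=128):
    """Best of N baseline: draw n completions, keep the highest-reward one.

    sample_fn(prompt) -> completion          samples from the SFT / Preferred-FT model
    reward_fn(prompt, completion) -> float   scores with the learned reward model
    """
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward_fn(prompt, y))
```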

6.1 How well can DPO optimize the RLHF objective?

The KL-constrained reward maximization objective used in typical RLHF algorithms balances exploitation of reward while restricting the policy from deviating far from the reference policy. Therefore, when comparing algorithms, we must take into account both the reward achieved as well as the KL discrepancy; achieving slightly higher reward but with much higher KL is not necessarily desirable. Figure 2 shows the reward-KL frontier for various algorithms in the sentiment setting. We execute multiple training runs for each algorithm, using a different hyperparameter for policy conservativeness in each run (target KL ∈ {3, 6, 9, 12} for PPO, β ∈ {0.05, 0.1, 1, 5} for DPO, α ∈ {0.05, 0.1, 0.5, 1} for unlikelihood, random seeds for preferred-FT). This sweep includes 22 runs in total. Every 100 training steps until convergence, we evaluate each policy on a set of test prompts, computing the average reward under the true reward function as well as the average sequence-level KL 3 with the reference policy, KL(π || π_ref). We find that DPO produces by far the most efficient frontier, achieving the highest reward while still achieving low KL. This result is particularly notable for multiple reasons. First, DPO and PPO optimize the same objective, but DPO is notably more efficient; DPO's reward/KL tradeoff strictly dominates PPO's. Second, DPO achieves a better frontier than PPO even when PPO can access ground-truth rewards (PPO-GT).

3 That is, the sum of the per-timestep KL-divergences.

6.2 Can DPO scale to real preference datasets?

Next, we evaluate the fine-tuning performance of DPO on summarization and single-turn dialogue. For summarization, automatic evaluation metrics such as ROUGE can be poorly correlated with human preferences [ 38 ], and prior work has found that fine-tuning LMs using PPO on human preferences provides more effective summaries. We evaluate different methods by sampling completions on the test split of the TL;DR summarization dataset, and computing the average win rate against reference completions in the test set. The completions for all methods are sampled at temperatures varying from 0.0 to 1.0, and the win rates are shown in Figure 2 (right). DPO, PPO and Preferred-FT all fine-tune the same GPT-J SFT model 4 . We find that DPO has a win rate of approximately 61% at a temperature of 0.0, exceeding the performance of PPO at 57% at its optimal sampling temperature of 0.0. DPO also achieves a higher maximum win rate compared to the Best of N baseline. We note that we did not meaningfully tune DPO’s β hyperparameter, so these results may underestimate DPO’s potential. Moreover, we find DPO to be much more robust to the sampling temperature than PPO, the performance of which can degrade to that of the base GPT-J model at high temperatures. Preferred-FT does not improve significantly over the SFT model. We also compare DPO and PPO head-to-head in human evaluations in Section 6.4, where DPO samples at temperature 0.25 were preferred 58% of the time over PPO samples at temperature 0.

On single-turn dialogue, we evaluate the different methods on the subset of the test split of the Anthropic HH dataset [ 1 ] with one step of human-assistant interaction. GPT-4 evaluations use the preferred completions on the test as the reference to compute the win rate for different methods. As there is no standard SFT model for this task, we start with a pre-trained Pythia-2.8B, use Preferred-FT to train a reference model on the chosen completions such that completions are within distribution of the model, and then train using DPO. We also compare against the best of 128 Preferred-FT completions (we found the Best of N baseline plateaus at 128 completions for this task; see Appendix Figure 4) and a 2-shot prompted version of the Pythia-2.8B base model, finding DPO performs as well or better for the best-performing temperatures for each method. We also evaluate an RLHF model trained with PPO on the Anthropic HH dataset 5 from a well-known source 6 , but are unable to find a prompt or sampling temperature that gives performance better than the base Pythia-2.8B model. Based on our results from TL;DR and the fact that both methods optimize the same reward function, we consider Best of 128 a rough proxy for PPO-level performance. Overall, DPO is the only computationally efficient method that improves over the preferred completions in the Anthropic HH dataset, and provides similar or better performance to the computationally demanding Best of 128 baseline. Finally, Figure 3 shows that DPO converges to its best performance relatively quickly.

6.3 Generalization to a new input distribution

Table 1: GPT-4 win rates vs. ground-truth summaries for out-of-distribution CNN/DailyMail input articles.

| Alg. | Temp 0 | Temp 0.25 |
|---|---|---|
| DPO | 0.36 | 0.31 |
| PPO | 0.26 | 0.23 |

To further compare the performance of PPO and DPO under distribution shifts, we evaluate the PPO and DPO policies from our Reddit TL;DR summarization experiment on a different distribution, news articles in the test split of the CNN/DailyMail dataset [ 24 ], using the best sampling temperatures from TL;DR (0 and 0.25). The results are presented in Table 1. We computed the GPT-4 win rate against the ground-truth summaries in the datasets, using the same GPT-4 (C) prompt we used for Reddit TL;DR, but replacing the words “forum post” with “news article”. For this new distribution, DPO continues to outperform the PPO policy by a significant margin. This experiment provides initial evidence that DPO policies can generalize similarly well to PPO policies, even though DPO does not use the additional unlabeled Reddit TL;DR prompts that PPO uses.

4 https://huggingface.co/CarperAI/openai_summarize_tldr_sft
5 https://huggingface.co/reciprocate/ppo_hh_pythia-6B
6 https://github.com/CarperAI/trlx/tree/main/examples/hh


6.4 Validating GPT-4 judgments with human judgments

We conduct a human study to verify the reliability of GPT-4’s judgments, using the results of the TL;DR summarization experiment and two different GPT-4 prompts. The GPT-4 (S) (simple) prompt simply asks for which summary better summarizes the important information in the post. The GPT-4 (C) (concise) prompt also asks for which summary is more concise; we evaluate this prompt because we find that GPT-4 prefers longer, more repetitive summaries than humans do with the GPT-4 (S) prompt. See Appendix C.2 for the complete prompts. We perform three comparisons, using the highest (DPO, temp. 0.25), the lowest (PPO, temp. 1.0), and a middle-performing (SFT, temp. 0.25) method with the aim of covering a diversity of sample qualities; all three methods are compared against greedily-sampled PPO (its best-performing temperature). We find that with both prompts, GPT-4 tends to agree with humans about as often as humans agree with each other, suggesting that GPT-4 is a reasonable proxy for human evaluations (due to limited human raters, we only collect multiple human judgments for the DPO and PPO-1 comparisons). Overall, the GPT-4 (C) prompt generally provides win rates more representative of humans; we therefore use this prompt for the main results in Section 6.2. For additional details about the human study, including the web interface presented to raters and the list of human volunteers, see Appendix D.3.

Table 2: Comparing human and GPT-4 win rates and per-judgment agreement on TL;DR summarization samples. Humans agree with GPT-4 about as much as they agree with each other. Each experiment compares a summary from the stated method with a summary from PPO with temperature 0.

| | DPO | SFT | PPO-1 |
|---|---|---|---|
| N respondents | 272 | 122 | 199 |
| GPT-4 (S) win % | 47 | 27 | 13 |
| GPT-4 (C) win % | 54 | 32 | 12 |
| Human win % | 58 | 43 | 17 |
| GPT-4 (S)-H agree | 70 | 77 | 86 |
| GPT-4 (C)-H agree | 67 | 79 | 85 |
| H-H agree | 65 | – | 87 |

7. Discussion

Learning from preferences is a powerful, scalable framework for training capable, aligned language models. We have introduced DPO, a simple training paradigm for training language models from preferences without reinforcement learning. Rather than coercing the preference learning problem into a standard RL setting in order to use off-the-shelf RL algorithms, DPO identifies a mapping between language model policies and reward functions that enables training a language model to satisfy human preferences directly , with a simple cross-entropy loss, without reinforcement learning or loss of generality. With virtually no tuning of hyperparameters, DPO performs similarly or better than existing RLHF algorithms, including those based on PPO; DPO thus meaningfully reduces the barrier to training more language models from human preferences.

Limitations & Future Work. Our results raise several important questions for future work. How does the DPO policy generalize out of distribution, compared with learning from an explicit reward function? Our initial results suggest that DPO policies can generalize similarly to PPO-based models, but more comprehensive study is needed. For example, can training with self-labeling from the DPO policy similarly make effective use of unlabeled prompts? On another front, how does reward over-optimization manifest in the direct preference optimization setting, and is the slight decrease in performance in Figure 3-right an instance of it? Additionally, while we evaluate models up to 6B parameters, exploration of scaling DPO to state-of-the-art models orders of magnitude larger is an exciting direction for future work. Regarding evaluations, we find that the win rates computed by GPT-4 are impacted by the prompt; future work may study the best way to elicit high-quality judgments from automated systems. Finally, many possible applications of DPO exist beyond training language models from human preferences, including training generative models in other modalities.

Acknowledgements

EM gratefully acknowledges funding from a Knight-Hennessy Graduate Fellowship. CF and CM are CIFAR Fellows. This work was supported in part by the Stanford Accelerator for Learning (SAL) and Stanford Institute for Human-Centered Artificial Intelligence (HAI) Generative AI for the Future of Learning seed grant program. The Stanford Center for Research on Foundation Models (CRFM) provided part of the compute resources used for the experiments in this work. This work was supported in part by ONR grant N00014-20-1-2675.


References

[1] Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield- Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. McCandlish, C. Olah, B. Mann, and J. Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022.

[2] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirho- seini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R. Lasenby, R. Lar- son, S. Ringer, S. Johnston, S. Kravec, S. E. Showk, S. Fort, T. Lanham, T. Telleen-Lawton, T. Conerly, T. Henighan, T. Hume, S. R. Bowman, Z. Hatfield-Dodds, B. Mann, D. Amodei, N. Joseph, S. McCandlish, T. Brown, and J. Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022.

[3] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, and O. van der Wal. Pythia: A suite for analyzing large language models across training and scaling, 2023.

[4] H. Bong and A. Rinaldo. Generalized results for the existence and consistency of the MLE in the Bradley-Terry-Luce model. International Conference on Machine Learning , 2022. arXiv:2110.11487.

[5] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika , 39(3/4):324–345, 1952. doi: https://doi.org/10.2307/2334029.

[6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.

[7] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.

[8] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y. Zhang. Sparks of artificial general intelligence: Early experiments with GPT-4, 2023. arXiv preprint arXiv:2303.12712.

[9] R. Busa-Fekete, B. Szörényi, P. Weng, W. Cheng, and E. Hüllermeier. Preference-based reinforcement learning: evolutionary direct policy search using a preference-based racing algorithm. Machine Learning , 97(3):327–351, July 2014. doi: 10.1007/s10994-014-5458-8. URL https://doi.org/10.1007/s10994-014-5458-8 .

[10] Y. Chen, R. Wang, H. Jiang, S. Shi, and R.-L. Xu. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. ArXiv , abs/2304.00723, 2023.

[11] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.

[12] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf.


[13] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, and J. Wei. Scaling instruction-finetuned language models, 2022.

[14] M. Dudík, K. Hofmann, R. E. Schapire, A. Slivkins, and M. Zoghi. Contextual dueling bandits. In P. Grünwald, E. Hazan, and S. Kale, editors, Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 563–587, Paris, France, 03–06 Jul 2015. PMLR. URL https://proceedings.mlr.press/v40/Dudik15.html.

[15] D. Go, T. Korbak, G. Kruszewski, J. Rozen, N. Ryu, and M. Dymetman. Aligning language models with preferences through f-divergence minimization. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.

[16] A. Jain, B. Wojcik, T. Joachims, and A. Saxena. Learning trajectory preferences for manip- ulators via iterative improvement. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems , volume 26. Curran Associates, Inc., 2013. URL https://proceedings.neurips.cc/paper_files/paper/ 2013/file/c058f544c737782deacefa532d9add4c-Paper.pdf .


[17] T. Korbak, H. Elsahar, G. Kruszewski, and M. Dymetman. On reinforcement learning and distribution matching for fine-tuning language models with no catastrophic forgetting. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances inNeural Information Processing Systems , volume 35, pages 16203–16220. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/ 67496dfa96afddab795530cc7c69b57a-Paper-Conference.pdf .

[18] J. Kreutzer, J. Uyheng, and S. Riezler. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. In Proceedings of the 56th Annual Meetingof the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1777–1788, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/ P18-1165. URL https://aclanthology.org/P18-1165 .

[19] A. Kupcsik, D. Hsu, and W. S. Lee. Learning Dynamic Robot-to-Human Object Handover from Human Feedback, pages 161–176. Springer International Publishing, 01 2018. ISBN 978-3-319-51531-1. doi: 10.1007/978-3-319-51532-8_10.

[20] S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review, 2018.

[21] R. D. Luce. Individual choice behavior: A theoretical analysis. Courier Corporation , 2012.

[22] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 142–150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. URL http://www.aclweb.org/anthology/P11-1015.

[23] S. Mishra, D. Khashabi, C. Baral, and H. Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of theAssociation for Computational Linguistics (Volume 1: Long Papers) , pages 3470–3487, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long. 244. URL https://aclanthology.org/2022.acl-long.244 .

[24] R. Nallapati, B. Zhou, C. dos Santos, Ç. Gulçehre, and B. Xiang. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/K16-1028. URL https://aclanthology.org/K16-1028.


[25] D. Narayanan, M. Shoeybi, J. Casper, P. LeGresley, M. Patwary, V. Korthikanti, D. Vainbrand, P. Kashinkunti, J. Bernauer, B. Catanzaro, A. Phanishayee, and M. Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm. In Proceedings of the InternationalConference for High Performance Computing, Networking, Storage and Analysis , SC ’21, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384421. doi: 10.1145/3458817.3476209. URL https://doi.org/10.1145/3458817.3476209 .

[26] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 27730–27744. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/b1efde53be364a73914f58805a001731-Paper-Conference.pdf.

[27] R. Paulus, C. Xiong, and R. Socher. A deep reinforced model for abstractive summarization. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HkAClQgA-.

[28] X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177 , 2019.

[29] J. Peters and S. Schaal. Reinforcement learning by reward-weighted regression for operational space control. In Proceedings of the 24th international conference on Machine learning , pages 745–750, 2007.

[30] R. L. Plackett. The analysis of permutations. Journal of the Royal Statistical Society. Series C (Applied Statistics), 24(2):193–202, 1975. doi: https://doi.org/10.2307/2346567.

[31] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners, 2019. Ms., OpenAI.

[32] R. Ramamurthy, P. Ammanabrolu, K. Brantley, J. Hessel, R. Sifa, C. Bauckhage, H. Hajishirzi, and Y. Choi. Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=8aHzds2uUyB.

[33] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. CoRR , abs/1511.06732, 2015.

[34] D. Sadigh, A. D. Dragan, S. Sastry, and S. A. Seshia. Active preference-based learning of reward functions. In Robotics: Science and Systems (RSS) , 2017.

[35] A. Saha, A. Pacchiano, and J. Lee. Dueling rl: Reinforcement learning with trajectory preferences. In F. Ruiz, J. Dy, and J.-W. van de Meent, editors, Proceedings of The 26thInternational Conference on Artificial Intelligence and Statistics , volume 206 of Proceed-ings of Machine Learning Research , pages 6263–6289. PMLR, 25–27 Apr 2023. URL https://proceedings.mlr.press/v206/saha23a.html .

[36] V. Sanh, A. Webson, C. Raffel, S. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey, M. S. Bari, C. Xu, U. Thakker, S. S. Sharma, E. Szczechla, T. Kim, G. Chh- ablani, N. Nayak, D. Datta, J. Chang, M. T.-J. Jiang, H. Wang, M. Manica, S. Shen, Z. X. Yong, H. Pandey, R. Bawden, T. Wang, T. Neeraj, J. Rozen, A. Sharma, A. Santilli, T. Fevry, J. A. Fries, R. Teehan, T. L. Scao, S. Biderman, L. Gao, T. Wolf, and A. M. Rush. Multi- task prompted training enables zero-shot task generalization. In International Conference onLearning Representations , 2022. URL https://openreview.net/forum?id=9Vrb9D0WI4 .

[37] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017.

[38] N. Stiennon, L. Ouyang, J. Wu, D. M. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. Christiano. Learning to summarize from human feedback, 2022.


[39] R. Thoppilan, D. D. Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, Y. Li, H. Lee, H. S. Zheng, A. Ghafouri, M. Menegali, Y. Huang, M. Krikun, D. Lepikhin, J. Qin, D. Chen, Y. Xu, Z. Chen, A. Roberts, M. Bosma, V. Zhao, Y. Zhou, C.-C. Chang, I. Krivokon, W. Rusch, M. Pickett, P. Srinivasan, L. Man, K. Meier-Hellstern, M. R. Morris, T. Doshi, R. D. Santos, T. Duke, J. Soraker, B. Zevenbergen, V. Prabhakaran, M. Diaz, B. Hutchinson, K. Olson, A. Molina, E. Hoffman-John, J. Lee, L. Aroyo, R. Rajakumar, A. Butryna, M. Lamm, V. Kuzmina, J. Fenton, A. Cohen, R. Bernstein, R. Kurzweil, B. Aguera- Arcas, C. Cui, M. Croak, E. Chi, and Q. Le. Lamda: Language models for dialog applications, 2022.

[40] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.

[41] M. Völske, M. Potthast, S. Syed, and B. Stein. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization , pages 59–63, Copenhagen, Denmark, Sept. 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-4508. URL https://aclanthology.org/W17-4508 .

[42] L. von Werra, J. Tow, reciprocated, S. Matiana, A. Havrilla, cat state, L. Castricato, Alan, D. V. Phung, A. Thakur, A. Bukhtiyarov, aaronrmm, F. Milo, Daniel, D. King, D. Shin, E. Kim, J. Wei, M. Romero, N. Pochinkov, O. Sanseviero, R. Adithyan, S. Siu, T. Simonini, V. Blagojevic, X. Song, Z. Witten, alexandremuzio, and crumb. CarperAI/trlx: v0.6.0: LLaMa (Alpaca), Benchmark Util, T5 ILQL, Tests, Mar. 2023. URL https://doi.org/10.5281/zenodo.7790115.

[43] B. Wang and A. Komatsuzaki. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax , May 2021.

[44] S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston. Neural text generation with unlikelihood training. arXiv preprint arXiv:1908.04319 , 2019.

[45] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3–4):229–256, May 1992. ISSN 0885-6125. doi: 10.1007/BF00992696. URL https://doi.org/10.1007/BF00992696.

[46] Y. Wu and B. Hu. Learning to extract coherent summary via deep reinforcement learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI'18/IAAI'18/EAAI'18. AAAI Press, 2018. ISBN 978-1-57735-800-8.

[47] X. Yan, C. Luo, C. L. A. Clarke, N. Craswell, E. M. Voorhees, and P. Castells. Human preferences as dueling bandits. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '22, pages 567–577, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450387323. doi: 10.1145/3477495.3531991. URL https://doi.org/10.1145/3477495.3531991.

[48] Y. Yue, J. Broder, R. Kleinberg, and T. Joachims. The k-armed dueling bandits problem. Journal of Computer and System Sciences, 78(5):1538–1556, 2012. ISSN 0022-0000. doi: https://doi.org/10.1016/j.jcss.2011.12.028. URL https://www.sciencedirect.com/science/article/pii/S0022000012000281. JCSS Special Issue: Cloud Computing 2011.

[49] D. M. Ziegler, N. Stiennon, J. Wu, T. B. Brown, A. Radford, D. Amodei, P. Christiano, and G. Irving. Fine-tuning language models from human preferences, 2020.


Author Contributions

All authors provided valuable contributions to designing, analyzing, and iterating on experiments, writing and editing the paper, and generally managing the project’s progress.

RR proposed using autoregressive reward models in discussions with EM; derived the DPO objective; proved the theoretical properties of the algorithm and wrote the relevant sections and appendices. He also suggested and helped with organizing experiments and contributed some of the PPO and reward learning baselines.

AS initiated the discussion on using weighted regression methods as an alternative to PPO; initiated project-related organization; wrote the initial analysis connecting DPO with weighted regression and unlikelihood; designed and iterated on the DPO and baseline implementations and ran initial exploratory experiments for DPO; contributed substantially to experiment organization and design (datasets, baselines, evaluation); led model training and evaluation for controlled sentiment generation and summarization; iterated on the design of the GPT-4 evaluation (particularly for summarization); made substantial writing contributions to the abstract, prelims/method, and experiments; and contributed edits to the other sections.

EM provided input on early discussions on learning autoregressive reward functions; wrote the first implementation of DPO and ran the first DPO experiments; trained the large-scale (summarization and dialogue) DPO models used in paper experiments; conducted initial GPT-4 win rate evaluations and set up related infrastructure; recruited participants for, conducted, and analyzed results from the human study; wrote the abstract, introduction, related work, discussion, and most of experiments; and assisted with editing the rest of the paper.

CF, CM, & SE supervised the research, suggested ideas and experiments, and assisted in writing the paper.

A. Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

In this appendix, we will derive Eq. 4. Analogously to Eq. 3, we optimize the following objective:

$$\max_{\pi} \; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi}\big[r(x,y)\big] - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi(y|x)\,\|\,\pi_{\mathrm{ref}}(y|x)\big] \qquad (11)$$

under any reward function $r(x,y)$, reference model $\pi_{\mathrm{ref}}$, and a general non-parametric policy class. We now have:

$$\begin{aligned}
&\max_{\pi} \; \mathbb{E}_{x\sim\mathcal{D},\, y\sim\pi}\big[r(x,y)\big] - \beta\,\mathbb{D}_{\mathrm{KL}}\big[\pi(y|x)\,\|\,\pi_{\mathrm{ref}}(y|x)\big] \\
&= \max_{\pi} \; \mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi(y|x)}\left[r(x,y) - \beta\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right] \\
&= \min_{\pi} \; \mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} - \frac{1}{\beta}r(x,y)\right] \\
&= \min_{\pi} \; \mathbb{E}_{x\sim\mathcal{D}}\,\mathbb{E}_{y\sim\pi(y|x)}\left[\log\frac{\pi(y|x)}{\frac{1}{Z(x)}\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right)} - \log Z(x)\right] \qquad (12)
\end{aligned}$$

where we have the partition function:

$$Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right).$$

Note that the partition function is a function of only $x$ and the reference policy $\pi_{\mathrm{ref}}$, but does not depend on the policy $\pi$. We can now define

$$\pi^{*}(y|x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right),$$


which is a valid probability distribution as $\pi^{*}(y|x) \geq 0$ for all $y$ and $\sum_{y}\pi^{*}(y|x) = 1$. Since $Z(x)$ is not a function of $y$, we can then re-organize the final objective in Eq. 12 as:

$$\begin{aligned}
&\min_{\pi} \; \mathbb{E}_{x\sim\mathcal{D}}\left[\mathbb{E}_{y\sim\pi(y|x)}\left[\log\frac{\pi(y|x)}{\pi^{*}(y|x)}\right] - \log Z(x)\right] \qquad (13) \\
=\; &\min_{\pi} \; \mathbb{E}_{x\sim\mathcal{D}}\big[\mathbb{D}_{\mathrm{KL}}\big(\pi(y|x)\,\|\,\pi^{*}(y|x)\big) - \log Z(x)\big] \qquad (14)
\end{aligned}$$

Now, since Z ( x ) does not depend on π , the minimum is achieved by the policy that minimizes the first KL term. Gibbs’ inequality tells us that the KL-divergence is minimized at 0 if and only if the two distributions are identical. Hence we have the optimal solution:

$$\pi(y|x) = \pi^{*}(y|x) = \frac{1}{Z(x)}\,\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right) \qquad (15)$$

for all $x \in \mathcal{D}$. This completes the derivation.
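To make Eq. 15 concrete, here is a minimal numerical sketch (the vocabulary size, rewards, and reference logits are made up for illustration, not taken from the paper) that computes the closed-form optimal policy for a single prompt over a small discrete set of responses and checks that it normalizes:

```python
import torch

beta = 0.1                       # KL-penalty strength, matching the paper's default
ref_logits = torch.randn(5)      # hypothetical reference-policy logits over 5 candidate responses
rewards = torch.randn(5)         # hypothetical rewards r(x, y) for the same responses

pi_ref = torch.softmax(ref_logits, dim=-1)

# Eq. 15: pi*(y|x) = pi_ref(y|x) * exp(r(x, y) / beta) / Z(x)
unnormalized = pi_ref * torch.exp(rewards / beta)
Z = unnormalized.sum()           # partition function Z(x)
pi_star = unnormalized / Z

print(pi_star, pi_star.sum())    # a valid distribution: non-negative entries summing to 1
```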

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

It is straightforward to derive the DPO objective under the Bradley-Terry preference model, as we have

$$p^{*}(y_1 \succ y_2 \mid x) = \frac{\exp(r^{*}(x, y_1))}{\exp(r^{*}(x, y_1)) + \exp(r^{*}(x, y_2))} \qquad (16)$$

In Section 4 we showed that we can express the (unavailable) ground-truth reward through its corresponding optimal policy:

$$r^{*}(x, y) = \beta\log\frac{\pi^{*}(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta\log Z(x) \qquad (17)$$

Substituting Eq. 17 into Eq. 16 we obtain:

$$\begin{aligned}
p^{*}(y_1 \succ y_2 \mid x) &= \frac{\exp\!\left(\beta\log\frac{\pi^{*}(y_1|x)}{\pi_{\mathrm{ref}}(y_1|x)} + \beta\log Z(x)\right)}{\exp\!\left(\beta\log\frac{\pi^{*}(y_1|x)}{\pi_{\mathrm{ref}}(y_1|x)} + \beta\log Z(x)\right) + \exp\!\left(\beta\log\frac{\pi^{*}(y_2|x)}{\pi_{\mathrm{ref}}(y_2|x)} + \beta\log Z(x)\right)} \\
&= \frac{1}{1 + \exp\!\left(\beta\log\frac{\pi^{*}(y_2|x)}{\pi_{\mathrm{ref}}(y_2|x)} - \beta\log\frac{\pi^{*}(y_1|x)}{\pi_{\mathrm{ref}}(y_1|x)}\right)} \\
&= \sigma\!\left(\beta\log\frac{\pi^{*}(y_1|x)}{\pi_{\mathrm{ref}}(y_1|x)} - \beta\log\frac{\pi^{*}(y_2|x)}{\pi_{\mathrm{ref}}(y_2|x)}\right).
\end{aligned}$$

The last line is the per-instance loss in Equation 7.
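As a quick numerical illustration of why $Z(x)$ drops out, the sketch below (toy values only, assuming the optimal policy is built exactly as in Eq. 15) computes the Bradley-Terry probability once from raw rewards (Eq. 16) and once from policy log-ratios (the sigmoid expression above); the two agree up to floating-point error:

```python
import torch

beta = 0.1
ref_logits = torch.randn(5)
rewards = torch.randn(5)                     # hypothetical rewards for 5 candidate responses
log_pi_ref = torch.log_softmax(ref_logits, dim=-1)
log_pi_star = torch.log_softmax(log_pi_ref + rewards / beta, dim=-1)  # optimal policy from Eq. 15

y1, y2 = 0, 1
# Eq. 16: Bradley-Terry probability from the raw rewards
p_bt = torch.exp(rewards[y1]) / (torch.exp(rewards[y1]) + torch.exp(rewards[y2]))
# Same probability from the policy log-ratios; Z(x) cancels in the difference
p_dpo = torch.sigmoid(beta * ((log_pi_star[y1] - log_pi_ref[y1]) - (log_pi_star[y2] - log_pi_ref[y2])))
print(p_bt.item(), p_dpo.item())             # these match up to floating-point error
```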

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

The Plackett-Luce model [ 30 , 21 ] is a generalization of the Bradley-Terry model over rankings (rather than just pair-wise comparisons). Similar to the Bradley-Terry model, it stipulates that when presented with a set of possible choices, people prefer a choice with probability proportional to the value of some latent reward function for that choice. In our context, when presented with a prompt $x$ and a set of $K$ answers $y_1, \ldots, y_K$, a user would output a permutation $\tau : [K] \to [K]$, giving their ranking of the answers. The Plackett-Luce model stipulates that

$$p^{*}(\tau \mid y_1, \ldots, y_K, x) = \prod_{k=1}^{K} \frac{\exp(r^{*}(x, y_{\tau(k)}))}{\sum_{j=k}^{K}\exp(r^{*}(x, y_{\tau(j)}))} \qquad (18)$$

Notice that when K = 2 , Equation 18 reduces to the Bradley-Terry model. However, for the general Plackett-Luce model, we can still utilize the results of Eq. 5 and substitute the reward function parameterized by its optimal policy. Similarly to Appendix A.2, the normalization constant Z ( x ) cancels out and we’re left with:

$$p^{*}(\tau \mid y_1, \ldots, y_K, x) = \prod_{k=1}^{K} \frac{\exp\!\left(\beta\log\frac{\pi^{*}(y_{\tau(k)}|x)}{\pi_{\mathrm{ref}}(y_{\tau(k)}|x)}\right)}{\sum_{j=k}^{K}\exp\!\left(\beta\log\frac{\pi^{*}(y_{\tau(j)}|x)}{\pi_{\mathrm{ref}}(y_{\tau(j)}|x)}\right)} \qquad (19)$$


Similarly to the approach of Section 4, if we have access to a dataset $\mathcal{D} = \{\tau^{(i)}, y_1^{(i)}, \ldots, y_K^{(i)}, x^{(i)}\}_{i=1}^{N}$ of prompts and user-specified rankings, we can use a parameterized model and optimize this objective with maximum likelihood:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{\tau, y_1, \ldots, y_K, x\sim\mathcal{D}}\left[\log \prod_{k=1}^{K} \frac{\exp\!\left(\beta\log\frac{\pi_\theta(y_{\tau(k)}|x)}{\pi_{\mathrm{ref}}(y_{\tau(k)}|x)}\right)}{\sum_{j=k}^{K}\exp\!\left(\beta\log\frac{\pi_\theta(y_{\tau(j)}|x)}{\pi_{\mathrm{ref}}(y_{\tau(j)}|x)}\right)}\right] \qquad (20)$$
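As an illustration only (not the authors' released code), a minimal PyTorch sketch of the per-prompt loss in Eq. 20 might look as follows, assuming `pi_logps` and `ref_logps` hold sequence-level log-probabilities of the $K$ responses already sorted from most to least preferred, so that index $k$ plays the role of $\tau(k)$:

```python
import torch

def plackett_luce_dpo_loss(pi_logps, ref_logps, beta=0.1):
    """pi_logps, ref_logps: shape (K,), sequence-level log-probs of K responses under the
    policy and the reference model, ordered from most preferred (k=0) to least preferred."""
    scores = beta * (pi_logps - ref_logps)  # implicit rewards; Z(x) has already cancelled (Eq. 19)
    # Eq. 20: negative log-likelihood of the observed ranking, with a log-sum-exp over each suffix
    suffix_lse = torch.stack([torch.logsumexp(scores[k:], dim=0) for k in range(scores.shape[0])])
    return -(scores - suffix_lse).sum()
```

For $K = 2$ this reduces to the pairwise per-instance loss in Equation 7.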

A.4 Deriving the Gradient of the DPO Objective

In this section we derive the gradient of the DPO objective:

$$\nabla_\theta\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\nabla_\theta\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}\right)\right] \qquad (21)$$

We can rewrite the RHS of Equation 21 as

$$\nabla_\theta\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\frac{\sigma'(u)}{\sigma(u)}\,\nabla_\theta u\right], \qquad (22)$$

where $u = \beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)}$.

Using the properties of the sigmoid function, $\sigma'(x) = \sigma(x)(1-\sigma(x))$ and $\sigma(-x) = 1-\sigma(x)$, we obtain the final gradient

$$\nabla_\theta\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\beta\,\mathbb{E}_{(x, y_w, y_l)\sim\mathcal{D}}\left[\sigma\!\left(\beta\log\frac{\pi_\theta(y_l|x)}{\pi_{\mathrm{ref}}(y_l|x)} - \beta\log\frac{\pi_\theta(y_w|x)}{\pi_{\mathrm{ref}}(y_w|x)}\right)\big(\nabla_\theta\log\pi_\theta(y_w|x) - \nabla_\theta\log\pi_\theta(y_l|x)\big)\right]$$

After using the reward substitution $\hat{r}_\theta(x, y) = \beta\log\frac{\pi_\theta(y|x)}{\pi_{\mathrm{ref}}(y|x)}$, we obtain the final form of the gradient from Section 4.
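As a sanity check on this algebra (a toy sketch with made-up logits, not an experiment from the paper), one can compare the autograd gradient of the per-example loss with the analytic expression above for a single preference pair:

```python
import torch
import torch.nn.functional as F

beta = 0.1
torch.manual_seed(0)
logits = torch.randn(5, requires_grad=True)             # toy policy "parameters": logits over 5 responses
ref_logps = torch.log_softmax(torch.randn(5), dim=-1)   # frozen reference log-probs
w, l = 2, 4                                             # preferred / dispreferred response indices

pi_logps = torch.log_softmax(logits, dim=-1)
r_hat = beta * (pi_logps - ref_logps)                   # implicit rewards beta * log(pi_theta / pi_ref)
loss = -F.logsigmoid(r_hat[w] - r_hat[l])               # per-example DPO loss
loss.backward()                                         # autograd gradient w.r.t. the logits

# analytic form: -beta * sigmoid(r_hat_l - r_hat_w) * (grad log pi(y_w|x) - grad log pi(y_l|x)),
# where the gradient of log pi(y|x) w.r.t. the logits is one_hot(y) - softmax(logits)
with torch.no_grad():
    grad_logp = lambda y: F.one_hot(torch.tensor(y), 5).float() - torch.softmax(logits, dim=-1)
    analytic = -beta * torch.sigmoid(r_hat[l] - r_hat[w]) * (grad_logp(w) - grad_logp(l))

print(torch.allclose(logits.grad, analytic, atol=1e-6))  # True
```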

A.5 Proof of Lemmas 1 and 2

In this section, we will prove the two lemmas from Section 5.

Lemma 1 Restated. Under the Plackett-Luce preference framework, and in particular the Bradley-Terry framework, two reward functions from the same equivalence class induce the same preference distribution.

Proof. We say that two reward functions r ( x, y ) and r′ ( x, y ) are from the same equivalence class if r′ ( x, y ) = r ( x, y ) + f ( x ) for some function f . We consider the general Plackett-Luce (with the Bradley-Terry model a special case for K = 2 ) and denote the probability distribution over rankings induced by a particular reward function r ( x, y ) as pr . For any prompt x , answers y 1 , . . . , yK and ranking τ we have:

$$\begin{aligned}
p_{r'}(\tau \mid y_1, \ldots, y_K, x) &= \prod_{k=1}^{K} \frac{\exp(r'(x, y_{\tau(k)}))}{\sum_{j=k}^{K}\exp(r'(x, y_{\tau(j)}))} \\
&= \prod_{k=1}^{K} \frac{\exp(r(x, y_{\tau(k)}) + f(x))}{\sum_{j=k}^{K}\exp(r(x, y_{\tau(j)}) + f(x))} \\
&= \prod_{k=1}^{K} \frac{\exp(f(x))\exp(r(x, y_{\tau(k)}))}{\exp(f(x))\sum_{j=k}^{K}\exp(r(x, y_{\tau(j)}))} \\
&= \prod_{k=1}^{K} \frac{\exp(r(x, y_{\tau(k)}))}{\sum_{j=k}^{K}\exp(r(x, y_{\tau(j)}))} \\
&= p_{r}(\tau \mid y_1, \ldots, y_K, x),
\end{aligned}$$

which completes the proof.
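A small numerical check of Lemma 1 (toy rewards and an arbitrary prompt-only shift, chosen purely for illustration): shifting every reward by the same $f(x)$ leaves the Plackett-Luce log-probability of a ranking unchanged up to floating-point error.

```python
import torch

def plackett_luce_logprob(rewards):
    """Log-probability of the ranking (0, 1, ..., K-1) under the Plackett-Luce model,
    given rewards already ordered according to that ranking."""
    return sum(rewards[k] - torch.logsumexp(rewards[k:], dim=0) for k in range(len(rewards)))

rewards = torch.randn(4)          # hypothetical rewards r(x, y_tau(k)) for one prompt
shift = 3.7                       # an arbitrary prompt-only shift f(x)
print(plackett_luce_logprob(rewards), plackett_luce_logprob(rewards + shift))  # equal up to float error
```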


Lemma 2 Restated. Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem.

Proof. Let us consider two reward functions from the same class, such that $r'(x, y) = r(x, y) + f(x)$, and let us denote by $\pi_r$ and $\pi_{r'}$ the corresponding optimal policies. By Eq. 4, for all $x, y$ we have

$$\begin{aligned}
\pi_{r'}(y|x) &= \frac{1}{\sum_{y}\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r'(x,y)\right)}\,\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r'(x,y)\right) \\
&= \frac{1}{\sum_{y}\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}(r(x,y)+f(x))\right)}\,\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}(r(x,y)+f(x))\right) \\
&= \frac{1}{\exp\!\left(\frac{1}{\beta}f(x)\right)\sum_{y}\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right)}\,\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right)\exp\!\left(\frac{1}{\beta}f(x)\right) \\
&= \frac{1}{\sum_{y}\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right)}\,\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right) \\
&= \pi_{r}(y|x),
\end{aligned}$$

which completes the proof.
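Lemma 2 can likewise be checked numerically on a toy discrete example (all values are illustrative assumptions): the closed-form optimal policy of Eq. 4 is unchanged when the reward is shifted by a prompt-only term.

```python
import torch

beta = 0.1
ref_logits = torch.randn(6)
rewards = torch.randn(6)          # hypothetical rewards over 6 candidate responses
shift = -2.5                      # an arbitrary prompt-only shift f(x)

def optimal_policy(r):
    # closed-form optimal policy of the KL-constrained problem (Eq. 4 / Eq. 15)
    return torch.softmax(torch.log_softmax(ref_logits, dim=-1) + r / beta, dim=-1)

print(torch.allclose(optimal_policy(rewards), optimal_policy(rewards + shift), atol=1e-6))  # True
```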

A.6 Proof of Theorem 1

In this section, we will expand on the results of Theorem 1.

Theorem 1 Restated. Assume we have a reference model such that $\pi_{\mathrm{ref}}(y|x) > 0$ for all pairs of prompts $x$ and answers $y$, and a parameter $\beta > 0$. Then all reward equivalence classes, as defined in Section 5, can be represented with the reparameterization $r(x, y) = \beta\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}$ for some model $\pi(y|x)$.

Proof. Consider any reward function $r(x, y)$, which induces an optimal model $\pi_r(y|x)$ under the KL-constrained RL problem, with solution given by Eq. 4. Following Eq. 5, when we log-linearize both sides we obtain:

$$r(x, y) = \beta\log\frac{\pi_r(y|x)}{\pi_{\mathrm{ref}}(y|x)} + \beta\log Z(x)$$

where $Z(x) = \sum_{y}\pi_{\mathrm{ref}}(y|x)\exp\!\left(\frac{1}{\beta}r(x,y)\right)$ (notice that $Z(x)$ also depends on the reward function $r$). Using the operator $r'(x, y) = f(r, \pi_{\mathrm{ref}}, \beta)(x, y) = r(x, y) - \beta\log Z(x)$, we see that this new reward function is within the equivalence class of $r$, and we have:

$$r'(x, y) = \beta\log\frac{\pi_r(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$

which completes the proof.

We can further expand on these results. We can see that if $r$ and $r'$ are two reward functions in the same class, then

$$f(r, \pi_{\mathrm{ref}}, \beta)(x, y) = \beta\log\frac{\pi_r(y|x)}{\pi_{\mathrm{ref}}(y|x)} = \beta\log\frac{\pi_{r'}(y|x)}{\pi_{\mathrm{ref}}(y|x)} = f(r', \pi_{\mathrm{ref}}, \beta)(x, y)$$

where the second equality follows from Lemma 2. We have proven that the operator $f$ maps all reward functions from a particular equivalence class to the same reward function. Next, we show that for every equivalence class of reward functions, the reward function that has the reparameterization outlined in Theorem 1 is unique.

Proposition 1. Assume we have a reference model such that $\pi_{\mathrm{ref}}(y|x) > 0$ for all pairs of prompts $x$ and answers $y$, and a parameter $\beta > 0$. Then every equivalence class of reward functions, as defined in Section 5, has a unique reward function $r(x, y)$ that can be reparameterized as $r(x, y) = \beta\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}$ for some model $\pi(y|x)$.


Proof. We will proceed using proof by contradiction. Assume we have two reward functions from the same class, such that $r'(x, y) = r(x, y) + f(x)$. Moreover, assume that $r'(x, y) = \beta\log\frac{\pi'(y|x)}{\pi_{\mathrm{ref}}(y|x)}$ for some model $\pi'(y|x)$ and $r(x, y) = \beta\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}$ for some model $\pi(y|x)$, such that $\pi \neq \pi'$. We then have

$$r'(x, y) = r(x, y) + f(x) = \beta\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} + f(x) = \beta\log\frac{\pi(y|x)\exp\!\left(\frac{1}{\beta}f(x)\right)}{\pi_{\mathrm{ref}}(y|x)} = \beta\log\frac{\pi'(y|x)}{\pi_{\mathrm{ref}}(y|x)}$$

for all prompts $x$ and completions $y$. Then we must have $\pi(y|x)\exp\!\left(\frac{1}{\beta}f(x)\right) = \pi'(y|x)$. Since these are distributions, summing over $y$ on both sides, we obtain that $\exp\!\left(\frac{1}{\beta}f(x)\right) = 1$, and since $\beta > 0$, we must have $f(x) = 0$ for all $x$. Therefore $r(x, y) = r'(x, y)$. This completes the proof.

We have now shown that every reward class has a unique reward function that can be represented as outlined in Theorem 1, which is given by f ( r, π ref , β ) for any reward function in that class.
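To illustrate the projection $f(r, \pi_{\mathrm{ref}}, \beta)$ numerically (a toy sketch with assumed values, not an experiment from the paper), applying it to a reward function and to an equivalent, prompt-shifted version of that reward yields the same canonical reward:

```python
import torch

beta = 0.1
ref_logps = torch.log_softmax(torch.randn(6), dim=-1)
rewards = torch.randn(6)          # hypothetical reward function r(x, .) over 6 responses
shift = 1.3                       # an arbitrary prompt-only shift f(x), giving an equivalent reward

def project(r):
    # the operator f(r, pi_ref, beta): subtract beta * log Z(x),
    # where Z(x) = sum_y pi_ref(y|x) exp(r(x, y) / beta)
    log_Z = torch.logsumexp(ref_logps + r / beta, dim=0)
    return r - beta * log_Z

print(torch.allclose(project(rewards), project(rewards + shift), atol=1e-5))  # True
```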

B. DPO Implementation Details and Hyperparameters

DPO is relatively straightforward to implement; PyTorch code for the DPO loss is provided below:


```python
import torch.nn.functional as F

def dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta):
    """
    pi_logps: policy logprobs, shape (B,)
    ref_logps: reference model logprobs, shape (B,)
    yw_idxs: preferred completion indices in [0, B-1], shape (T,)
    yl_idxs: dispreferred completion indices in [0, B-1], shape (T,)
    beta: temperature controlling strength of KL penalty

    Each pair of (yw_idxs[i], yl_idxs[i]) represents the
    indices of a single preference pair.
    """
    pi_yw_logps, pi_yl_logps = pi_logps[yw_idxs], pi_logps[yl_idxs]
    ref_yw_logps, ref_yl_logps = ref_logps[yw_idxs], ref_logps[yl_idxs]

    pi_logratios = pi_yw_logps - pi_yl_logps
    ref_logratios = ref_yw_logps - ref_yl_logps

    losses = -F.logsigmoid(beta * (pi_logratios - ref_logratios))
    rewards = beta * (pi_logps - ref_logps).detach()

    return losses, rewards
```

Unless noted otherwise, we use $\beta = 0.1$, a batch size of 64, and the RMSprop optimizer with a learning rate of 1e-6 by default. We linearly warm up the learning rate from 0 to 1e-6 over 150 steps. For TL;DR summarization, we use $\beta = 0.5$, while the rest of the parameters remain the same.
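For concreteness, a hypothetical invocation of `dpo_loss` with dummy sequence-level log-probabilities might look like this (shapes and values are illustrative only; it assumes the function defined above is in scope):

```python
import torch

B, T = 8, 4                            # 8 completions forming 4 preference pairs
pi_logps = torch.randn(B) - 50.0       # hypothetical summed token log-probs under the policy
ref_logps = torch.randn(B) - 50.0      # and under the frozen reference model
yw_idxs = torch.tensor([0, 2, 4, 6])   # indices of the preferred completions
yl_idxs = torch.tensor([1, 3, 5, 7])   # indices of the dispreferred completions

losses, rewards = dpo_loss(pi_logps, ref_logps, yw_idxs, yl_idxs, beta=0.1)
print(losses.shape, losses.mean())     # per-pair losses, shape (4,); average before calling backward()
```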

C. Further Details on the Experimental Set-Up

In this section, we include additional details relevant to our experimental design.

C.1 IMDb Sentiment Experiment and Baseline Details

The prompts are prefixes of length 2-8 tokens from the IMDB dataset. We use the pre-trained sentiment classifier siebert/sentiment-roberta-large-english as a ground-truth reward model and gpt2-large as a base model. We use these larger models because we found the default ones to generate low-quality text and their rewards to be somewhat inaccurate. We first use supervised fine-tuning on a subset of the IMDB data for 1 epoch. We then use this model to sample 4 completions for 25000 prefixes and create 6 preference pairs for each prefix using the ground-truth reward model. The RLHF reward model is initialized from the gpt2-large model and trained for 3 epochs on the preference datasets, and we take the checkpoint with the highest validation set accuracy. The "TRL" run uses the hyperparameters in the TRL library. Our implementation uses a larger batch size of 1024 samples per PPO step.
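As a sketch of how such pairs can be constructed under our reading of this setup (not the authors' released code), scoring the 4 sampled completions with the ground-truth reward model and taking all 6 pairwise comparisons per prefix might look like:

```python
import itertools
import torch

def preference_pairs(completions, scores):
    """completions: list of 4 sampled completions for one prefix;
    scores: ground-truth reward-model scores, shape (4,).
    Returns the 6 (preferred, dispreferred) pairs implied by the scores."""
    pairs = []
    for i, j in itertools.combinations(range(len(completions)), 2):
        if scores[i] >= scores[j]:
            pairs.append((completions[i], completions[j]))
        else:
            pairs.append((completions[j], completions[i]))
    return pairs

pairs = preference_pairs(["c0", "c1", "c2", "c3"], torch.tensor([0.9, 0.1, 0.5, 0.7]))
print(len(pairs))  # 6
```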

C.2 GPT-4 prompts for computing summarization and dialogue win rates

A key component of our experimental setup is GPT-4 win rate judgments. In this section, we include the prompts used to generate win rates for the summarization and dialogue experiments. We use gpt-4-0314 for all our experiments. The order of the summaries or responses is randomly chosen for every evaluation.

Rafael Rafailov∗† Archit Sharma∗† Eric Mitchell∗†
Stefano Ermon†‡ Christopher D. Manning† Chelsea Finn†
†Stanford University ‡CZ Biohub
{rafailov,architsh,eric.mitchell}@cs.stanford.edu
