diff --git a/docs/user-guide/dpo.rst b/docs/user-guide/dpo.rst
index b29318772..62d66fcba 100644
--- a/docs/user-guide/dpo.rst
+++ b/docs/user-guide/dpo.rst
@@ -2,20 +2,30 @@
 .. _model-aligner-dpo:
 
-Model Alignment by Direct Preference Optimization (DPO)
-@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
+Model Alignment by DPO, RPO, and IPO
+@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
 
 The NeMo Framework supports efficient model alignment via the NeMo-Aligner codebase.
 
-All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire DPO pipeline using the newly released `2B GPT model with 4096 sequence length `__. The same tutorial also works for GPT models (such as LLaMa2) of any size.
+All algorithms in NeMo-Aligner will work with any GPT-based model that is from Megatron Core (in the config it has ``mcore_gpt=True``). For the purposes of this tutorial, we will go through the entire Direct Preference Optimization (DPO) pipeline using the newly released `2B GPT model with 4096 sequence length `__. The same tutorial also works for GPT models (such as LLaMa2) of any size.
+
+DPO with LoRA
+#############
 
 We support both full-parameter DPO training and LoRA DPO training. For full-parameter DPO, there exists an actor and a reference model. The actor is initialized with the reference model and is fully trainable. The reference model is frozen and used to calculate logprobs for KL-penalty loss (see `DPO paper `__). For LoRA-based DPO, the actor is initialized by the reference model plus LoRA weights, where only the LoRA weights are trainable. Therefore, it allows us to switch between the actor/reference models by simply enabling or disabling LoRA. In addition, there is no need to store two sets of LLM weights.
 
-Besides the vanilla DPO algorithm, we support other variants of DPO algorithms including Identity preference optimization (IPO) and Reward-aware preference optimization (RPO). The algorithm is identified with the ``dpo.preference_loss`` config variable. We support three sorts of RPO algorithms based on the distance metric: ``rpo_sq`` for squared distance; ``rpo_bwd_kl`` for Bernoulli backward KL divergence; ``rpo_fwd_kl`` for Bernoulli forward KL divergence. To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward``, which might come from Human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not existent in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used.
+RPO and IPO Variations
+######################
+
+Besides the vanilla DPO algorithm, we support other variants of DPO, including Identity Preference Optimization (IPO) and Reward-aware Preference Optimization (RPO).
+
+The algorithm is selected with the ``dpo.preference_loss`` config variable. We support three RPO variants, distinguished by their distance metric: ``rpo_sq`` for squared distance, ``rpo_bwd_kl`` for Bernoulli backward KL divergence, and ``rpo_fwd_kl`` for Bernoulli forward KL divergence.
+
+To use the RPO algorithm, each dataset example should have ``chosen_reward`` and ``rejected_reward`` values, which might come from human labelers or reward models. If ``chosen_reward`` and ``rejected_reward`` are not present in the data, ``dpo.default_chosen_reward`` and ``dpo.default_rejected_reward`` are used instead.
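+
+As an illustrative sketch only, the snippet below writes one such preference record as a JSONL line. The ``prompt``, ``chosen_response``, and ``rejected_response`` key names and the ``rpo_train.jsonl`` file name are assumptions made for this example; only the ``chosen_reward`` and ``rejected_reward`` fields are the ones described above.
+
+.. code-block:: python
+
+   import json
+
+   # Hypothetical preference record; key names other than chosen_reward and
+   # rejected_reward are placeholders for this sketch.
+   record = {
+       "prompt": "Explain KL divergence in one sentence.",
+       "chosen_response": "KL divergence measures how one probability distribution differs from a reference distribution.",
+       "rejected_response": "It is a kind of distance.",
+       "chosen_reward": 4.0,    # e.g., from a human labeler or a reward model; if absent, dpo.default_chosen_reward is used
+       "rejected_reward": 1.5,  # if absent, dpo.default_rejected_reward is used
+   }
+
+   # Append the record as one JSON object per line (JSONL).
+   with open("rpo_train.jsonl", "a", encoding="utf-8") as f:
+       f.write(json.dumps(record) + "\n")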
 
-Obtaining a Pretrained Model
+Obtain a Pretrained Model
 ############################
 
 To start, we must first get a pretrained model to align. There are two models we recommend to get started. The rest of the tutorial will work with either model, but for demonstration purposes, we will use the smaller 2B model.
diff --git a/docs/user-guide/index.rst b/docs/user-guide/index.rst
index 151983b8c..03c476332 100644
--- a/docs/user-guide/index.rst
+++ b/docs/user-guide/index.rst
@@ -28,10 +28,10 @@
    SteerLM is a novel approach developed by NVIDIA. SteerLM simplifies alignment compared to RLHF. It is based on SFT, but allows user-steerable AI by enabling you to adjust attributes at inference time.
 
 :ref:`Model Alignment by SteerLM 2.0 Method `
-   SteerLM 2.0 is an extenstion to SteerLM method that introduces an iterative training procedure to explicitly enforce the generated responses to follow the desired attribute distribution.
+   SteerLM 2.0 is an extension to the SteerLM method that introduces an iterative training procedure to explicitly enforce the generated responses to follow the desired attribute distribution.
 
-:ref:`Model Alignment by Direct Preference Optimization (DPO) `
-   DPO is a simpler alignment method compared to RLHF. DPO introduces a novel parameterization of the reward model in RLHF. This parameterization allows us to extract the corresponding optimal policy.
+:ref:`Model Alignment by DPO, RPO, and IPO `
+   DPO, RPO, and IPO are simpler alignment methods compared to RLHF. DPO introduces a novel parameterization of the reward model in RLHF, which allows us to extract the corresponding optimal policy. Similarly, RPO and IPO provide alternative parameterizations or optimization strategies, each contributing unique approaches to refining model alignment.
 
 :ref:`Fine-tuning Stable Diffusion with DRaFT+ `
    DRaFT+ is an algorithm for fine-tuning text-to-image generative diffusion models. It achieves this by directly backpropagating through a reward model. This approach addresses the mode collapse issues from the original DRaFT algorithm and improves diversity through regularization.