Previous: GPT-2 and GPT-3

We have seen how we can train pretty great language models like GPT-2 and GPT-3. However, these models are not specifically trained to generate responses in a way that a human may prefer. In this article, we will study InstructGPT to understand how a model can be finetuned to meet human preferences using Reinforcement Learning from Human Feedback (RLHF).

InstructGPT

InstructGPT shows how one can finetune language models such as GPT-3 so they respond in a way that is more aligned with what we humans prefer. For example, when a model is asked to summarize some text, we might want the style of the response to be different compared to when we ask the model to brainstorm ideas with us. InstructGPT starts from a pretrained GPT-3 model and applies a three-step approach, as outlined below.

First, we update our language model via supervised finetuning. In terms of finetuning method, this step is pretty straightforward. We only need to collect prompt-response pairs written by humans and finetune our model on this dataset. However, the dataset quality needs to be monitored closely, since we want the dataset to reflect human preferences. For example, the dataset should include some “brainstorming”, “summarization”, and various other types of interactions (see the paper for details).
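To make this concrete, here is a minimal sketch of what such a supervised finetuning loop could look like, assuming a Hugging Face GPT-2 model and a toy in-memory dataset; the actual InstructGPT data and training configuration are, of course, much larger and not reproduced here.

```python
# Minimal supervised finetuning sketch (not the actual InstructGPT setup).
# Assumes: PyTorch + Hugging Face transformers, and a tiny in-memory dataset.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical human-written prompt-response pairs.
pairs = [
    ("Summarize: The cat sat on the mat all day.", "A cat lounged on a mat."),
    ("Brainstorm names for a coffee shop.", "Bean There, Daily Grind, Brewed Awakening."),
]

optimizer = AdamW(model.parameters(), lr=1e-5)
model.train()
for prompt, response in pairs:
    # Concatenate prompt and response into one sequence for the causal LM.
    text = prompt + "\n" + response + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Passing input_ids as labels gives the standard next-token prediction loss.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

In practice, the loss is often masked so that only the response tokens contribute, and training uses proper batching over many examples; those details are omitted to keep the sketch short.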

Second, we train a reward model \(r\) – essentially a regression model – that gives a “score” to a prompt-response pair: it simply takes text as input and outputs a scalar value. We first need to collect another human-labeled dataset. This time, human labelers are asked to rank their preferences among different responses to a given prompt. To train the model, we sample a pair of responses to a single prompt and pass each prompt-response pair through our reward model separately to get two scalar values. The training objective is to ensure that the predicted reward for the preferred response is higher than for the non-preferred one. From the paper, the loss function over all response pairs for a given prompt is defined as:

\[L = - \frac{1}{\binom{K}{2}} E_{(x, y_w, y_l) \sim D} [\log (\sigma(r(x, y_w) - r(x, y_l)))],\]

where \(K\) denotes the number of different responses available for a single prompt \(x\), \(D\) denotes the dataset, \(r(x, y_w)\) denotes the predicted score for the preferred response \(y_w\), and \(r(x, y_l)\) denotes the predicted score for the non-preferred response \(y_l\).
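To make the objective concrete, the sketch below implements the pairwise term inside the expectation with PyTorch. The `RewardModel` class here is a toy stand-in (an embedding layer with a linear scoring head), not the GPT-3-based reward model from the paper.

```python
# Sketch of the pairwise reward-model loss; RewardModel is a toy stand-in,
# not the GPT-3-based reward model used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    def __init__(self, vocab_size=50257, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.score = nn.Linear(hidden, 1)  # maps pooled features to a scalar reward

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) for the concatenated prompt + response.
        pooled = self.embed(token_ids).mean(dim=1)   # crude pooling over the sequence
        return self.score(pooled).squeeze(-1)        # (batch,) scalar rewards

reward_model = RewardModel()

def pairwise_loss(preferred_ids, rejected_ids):
    # r(x, y_w) and r(x, y_l) from the equation above.
    r_w = reward_model(preferred_ids)
    r_l = reward_model(rejected_ids)
    # -log(sigmoid(r_w - r_l)); logsigmoid is numerically stabler than log(sigmoid(.)).
    return -F.logsigmoid(r_w - r_l).mean()

# Toy token ids standing in for tokenized (prompt + response) pairs.
preferred = torch.randint(0, 50257, (2, 16))
rejected = torch.randint(0, 50257, (2, 16))
loss = pairwise_loss(preferred, rejected)
loss.backward()
```

In the paper, all \(K\) responses for a prompt are placed in the same batch, and the loss is averaged over all \(\binom{K}{2}\) pairs, which is where the \(1/\binom{K}{2}\) factor comes from.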

Lastly, once we have a trained reward model, we update our language model further via reinforcement learning (RL). In the paper, InstructGPT was trained with Proximal Policy Optimization (PPO). Here, an episode refers to the complete generation of a token sequence in response to a specific prompt. We can then use the trained reward model to assign a reward value to the generated sequence. One reason to resort to RL is that supervised learning alone captures only so much of the nuance of what constitutes a good response. In addition, generating a dataset for supervised finetuning is much more expensive, since it requires a person to write complete prompt-response pairs manually. In contrast, once a system that uses the LLM is deployed, the data for updating the reward model can be collected more easily by asking users to provide feedback.
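A complete PPO implementation is beyond the scope of this article, but the sketch below shows the per-episode reward described in the paper: the reward model's score for the generated response, minus a KL penalty that keeps the updated policy close to the supervised finetuned model. The function name and the value of `beta` are illustrative assumptions, not taken from a released implementation.

```python
# Sketch of the per-episode RL reward used in RLHF: the reward-model score
# minus a KL penalty against the supervised-finetuned (SFT) model.
# Function and variable names here are illustrative, not OpenAI's code.
import torch

def rl_reward(reward_model_score, policy_logprobs, sft_logprobs, beta=0.02):
    """reward_model_score: scalar r(x, y) for the full generated response.
    policy_logprobs / sft_logprobs: per-token log-probs of the generated tokens
    under the current policy and under the frozen SFT model, shape (seq_len,).
    """
    # The summed per-token log-ratio is a single-sample estimate of KL(policy || SFT).
    kl_penalty = (policy_logprobs - sft_logprobs).sum()
    return reward_model_score - beta * kl_penalty

# Toy example: a 5-token generated response.
score = torch.tensor(1.3)                 # from the trained reward model
policy_lp = torch.log(torch.rand(5))      # placeholder per-token log-probs
sft_lp = torch.log(torch.rand(5))
print(rl_reward(score, policy_lp, sft_lp))
```

The KL penalty is what prevents the policy from drifting toward text that scores well on the reward model but no longer looks like fluent language.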

Summary

We can finetune a pretrained LLM like GPT-3 to align better with human preferences. This process begins with supervised finetuning, which is then followed by RLHF. The result is a language model that generates responses humans prefer when interacting with it.

References

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS), 2022.

Next: TBD