Human Preferences collection for RLHF
Overview
This project helps you bring your LLM closer to ChatGPT-level quality by collecting comparison data that establishes human preferences for the responses generated by your supervised model.
By ranking multiple responses by quality, you can train a reward model that captures human preferences. This reward model then drives the Reinforcement Learning step, optimizing the performance of the fine-tuned foundation model.
Further Reading and Resources
- Gathering Human Feedback Tutorial: A Jupyter Notebook tutorial that guides you step by step through collecting comparison data, establishing human preferences, and incorporating this feedback into reward model training.
- RLHF Resources: A collection of links, tutorials, and best practices on how to collect data and build an end-to-end Reinforcement Learning from Human Feedback (RLHF) system to fine-tune Generative AI models.
- Awesome-Human-in-the-Loop List: An awesome list of human-in-the-loop resources and references for retraining models.
- Talk: Improving Machine Learning from Human Feedback: A talk from PyData Berlin on how to improve machine learning from human feedback using RLHF.
- Workshop: Getting Started with Reinforcement Learning: A workshop on how to get started with Reinforcement Learning.
- Guide: Five Large Language Models you can Fine-Tune Today
How to collect the dataset
The dataset for RLHF consists of two parts:
- Input prompts
- Alternative generated responses for each prompt
To simplify the task for the human labeler, it is recommended to present two responses per prompt to choose between.
Start with an initial set of prompts and responses, where each item is a JSON object with the following structure:
[{
"prompt": "The quick brown fox...",
"answer1": "jumps over the lazy dog.",
"answer2": "bags few lynx."
}, ...]
Collect examples either by writing them manually or by using your baseline model to generate multiple alternative hypotheses, as in the sketch below.
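For example, a minimal sketch for sampling two candidate responses per prompt might look like the following. It assumes a Hugging Face causal language model and uses gpt2 only as a placeholder for your supervised baseline; the prompts list and sampling parameters are illustrative:

# Sketch: sample two alternative responses per prompt with a baseline model.
# "gpt2" is a placeholder; replace it with your supervised model.
import json
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompts = [
    "The quick brown fox...",
    # add your own prompts here
]

dataset = []
for prompt in prompts:
    # Draw two sampled continuations for the same prompt.
    outputs = generator(
        prompt,
        max_new_tokens=64,
        do_sample=True,
        temperature=0.9,
        num_return_sequences=2,
    )
    dataset.append({
        "prompt": prompt,
        "answer1": outputs[0]["generated_text"][len(prompt):].strip(),
        "answer2": outputs[1]["generated_text"][len(prompt):].strip(),
    })

with open("dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)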
Once your dataset is collected in a dataset.json file, create a project and upload the dataset to Label Studio.
Starting your labeling project
- Create a new project in Label Studio
- Go to Settings > Labeling Interface > Browse Templates > Generative AI > Human Preference collection for RLHF
- Save the project
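If you prefer to automate this step, the project can also be created with the Python SDK. The snippet below is only a sketch: LABEL_CONFIG_XML is a placeholder for the template's XML configuration shown later in this guide.

from label_studio_sdk import Client

ls = Client(url='<YOUR-LABEL-STUDIO-URL>', api_key='<YOUR-API_KEY>')

# LABEL_CONFIG_XML stands for the <View>...</View> string from the
# "How to configure the labeling interface" section below.
project = ls.start_project(
    title='Human Preference collection for RLHF',
    label_config=LABEL_CONFIG_XML,
)
print(project.id)  # use this value as PROJECT_ID when importing the dataset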
Import the dataset
Using the Python SDK, you can import the dataset with the input prompts into Label Studio. With the PROJECT_ID of the project you’ve just created, run the following code:
from label_studio_sdk import Client
ls = Client(url='<YOUR-LABEL-STUDIO-URL>', api_key='<YOUR-API_KEY>')
project = ls.get_project(id=PROJECT_ID)
project.import_tasks('dataset.json')
Then you can start annotating the dataset by selecting the preferred response for each prompt.
How to configure the labeling interface
The Human Preference collection for RLHF template includes the following labeling interface in XML format:
<View className="root">
<Style>
.root {
box-sizing: border-box;
margin: 0;
padding: 0;
font-family: 'Roboto', sans-serif;
line-height: 1.6;
background-color: #f0f0f0;
}
.container {
margin: 0 auto;
padding: 20px;
background-color: #ffffff;
border-radius: 5px;
box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.1), 0 6px 20px 0 rgba(0, 0, 0, 0.1);
}
.prompt {
padding: 20px;
background-color: #0084ff;
color: #ffffff;
border-radius: 5px;
margin-bottom: 20px;
box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
}
.answers {
display: flex;
justify-content: space-between;
flex-wrap: wrap;
gap: 20px;
}
.answer-box {
flex-basis: 49%;
padding: 20px;
background-color: rgba(44, 62, 80, 0.9);
color: #ffffff;
border-radius: 5px;
box-shadow: 0 2px 4px 0 rgba(0, 0, 0, 0.1), 0 3px 10px 0 rgba(0, 0, 0, 0.1);
}
.answer-box p {
word-wrap: break-word;
}
.answer-box:hover {
background-color: rgba(52, 73, 94, 0.9);
cursor: pointer;
transition: all 0.3s ease;
}
.lsf-richtext__line:hover {
background: unset;
}
.answer-box .lsf-object {
padding: 20px
}
</Style>
<View className="container">
<View className="prompt">
<Text name="prompt" value="$prompt" />
</View>
<View className="answers">
<Pairwise name="comparison" toName="answer1,answer2"
selectionStyle="background-color: #27ae60; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.2); border: 2px solid #2ecc71; cursor: pointer; transition: all 0.3s ease;" />
<View className="answer-box">
<Text name="answer1" value="$answer1" />
</View>
<View className="answer-box">
<Text name="answer2" value="$answer2" />
</View>
</View>
</View>
</View>
<!--{ "data" : {
"prompt": "What are the key benefits of using Reinforcement Learning from Human Feedback (RLHF) for dataset collection in the context of Large Language Model (LLM) generation?",
"answer1": "Reinforcement Learning from Human Feedback (RLHF) for dataset collection in Large Language Model (LLM) generation provides key benefits such as improved model performance through direct optimization, better alignment with human values by incorporating human feedback, and the ability to iteratively refine the model based on user interactions, resulting in a more user-friendly and efficient language model.",
"answer2": "Using Reinforcement Learning from Human Feedback (RLHF) for dataset collection in Large Language Model (LLM) generation offers advantages such as enhanced model capabilities by optimizing for desired outcomes, greater adaptability to human preferences through the inclusion of human feedback, and the opportunity to continuously improve the model based on user experiences, ultimately leading to a more effective and responsive language model."
}}
-->
The <Style> section defines a custom UI design for the labeling interface, along with the layout provided by the <View> tag. In this example, we use a simple layout with a prompt and two answer boxes. The <Pairwise> tag defines the pairwise comparison between the answers displayed in the <Text> tags.
The displayed text is taken from the $prompt, $answer1, and $answer2 variables, which are defined in the <Text> tags with the value attribute.
Additionally, you can modify the "prompt", "answer1", and "answer2" values in the XML comment section to preview how the template looks with your data.
Export the dataset
Depending on the complexity of your problem statement, you typically need hundreds to thousands of labeled tasks to fine-tune your LLM.
After you’ve labeled enough tasks, you can export the dataset in the following raw Label Studio JSON format:
[
  {
    "id": 1,
    "data": {
      "prompt": "What are the key benefits of using Reinforcement Learning from Human Feedback (RLHF) for dataset collection in the context of Large Language Model (LLM) generation?",
      "answer1": "Reinforcement Learning from Human Feedback (RLHF) for dataset collection in Large Language Model (LLM) generation provides key benefits such as ...",
      "answer2": "Using Reinforcement Learning from Human Feedback (RLHF) for dataset collection in Large Language Model (LLM) generation offers advantages such as ..."
    },
    "annotations": [
      {
        "id": 1,
        "created_at": "2021-03-03T14:00:00.000000Z",
        "result": [
          {
            "from_name": "comparison",
            "to_name": "answer1,answer2",
            "type": "pairwise",
            "value": {
              "selected": "left"
            }
          }
        ],
        // other fields
The above represents the list of tasks with annotations. Each task carries the input prompt and both candidate responses in the data.prompt, data.answer1, and data.answer2 fields, and each "annotations" item stores the pairwise choice under result.value.selected, where "left" means answer1 was preferred and "right" means answer2 was preferred.
You can create more than one annotation per task.
Alternatively, you can download the same data in CSV format:
prompt,answer1,answer2,comparison
"What are the key benefits...","Reinforcement Learning from...","Using Reinforcement Learning...","left"
How to fine-tune the model
Your labeled examples can be used to fine-tune open-source LLMs such as GPT-2, T5, Falcon, LLaMA, etc. You can check the complete list of models on the HuggingFace LLM leaderboard, then download and fine-tune a model from the Model Hub.
Alternatively, there are fine-tuning services available: OpenAI, Cohere, [AI21 Studio](https://www.ai21.com/studio/foundation-models), MosaicML, Google Cloud AI Platform, AzureML, etc.
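To illustrate how the exported comparisons feed the reward model described in the Overview, here is a compact, non-production sketch: it scores each (prompt, response) pair with a sequence-classification head and minimizes a Bradley-Terry style pairwise loss. The base model name, file names, and hyperparameters are placeholders.

import json
import torch
from torch.nn.functional import logsigmoid
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "distilroberta-base"  # placeholder backbone for the reward model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# preference_pairs.json comes from the export conversion step above
with open("preference_pairs.json") as f:
    pairs = json.load(f)

def reward(prompt, response):
    # Scalar reward for a (prompt, response) pair.
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True, max_length=512)
    return model(**inputs).logits[0, 0]

model.train()
for pair in pairs:
    r_chosen = reward(pair["prompt"], pair["chosen"])
    r_rejected = reward(pair["prompt"], pair["rejected"])
    # Pairwise loss: push the chosen response's reward above the rejected one's.
    loss = -logsigmoid(r_chosen - r_rejected)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.save_pretrained("reward-model")
tokenizer.save_pretrained("reward-model")

The saved reward model can then be plugged into an RL fine-tuning loop (for example, PPO) to optimize your supervised model against the learned human preferences.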