ML and AI Blogs | Issue# 6 [November 07, 2024]
Expectations
We will use Vertex AI Pipelines from Google Cloud Platform (GCP) to fine-tune Llama 2 7B using Reinforcement Learning from Human Feedback (RLHF) and use it to perform batch inference for text summarization tasks. The goal, however, is not to focus on performance or the quality of the summarization outputs, but to walk through the end-to-end workflow (see section “Workflow”) required for such a project. This tutorial takes inspiration from DeepLearning.AI’s course on RLHF.
RLHF tuning (or fine-tuning, which is essentially the same as tuning in this context) a base Large Language Model (LLM) – Meta’s Llama 2 7B in this case – can be quite a compute-, time-, and cost-heavy task. The human preference dataset (see section “Workflow”) we will be using (see section “Preparing the Data”) in this work is much larger than the dataset size recommended in GCP’s documentation on RLHF Tuning LLMs with Vertex AI. We will therefore tune Llama 2 7B for 1 epoch each for the reward model and the Reinforcement Learning (RL) part, or the reinforcer (see section “How it Works” to learn more about these terms). One could tune longer for improved results (see section “Thoughts”). However, 1 epoch of training is sufficient to demonstrate the RLHF tuning workflow using Llama 2 7B and Vertex AI’s RLHF pipeline for text summarization tasks.
NOTE: The RLHF tuning feature is in Preview at the time of writing this guide. More here.
Workflow
The RLHF model tuning workflow on Vertex AI includes the following steps for text summarization tasks.
- Prepare the human preference dataset. Each example in the dataset records the preference between two options (summaries in this case) that were presented to a human. 5,000 to 10,000 examples are recommended. The preference dataset contains the following fields.
input_text: Prompt or instruction, a Reddit post in this case.
candidate_0: First summary generated by an LLM.
candidate_1: Second summary generated by an LLM.
choice: Preference of the human expert, expressed as 0 (candidate_0) or 1 (candidate_1).
Example records for these datasets are sketched at the end of this list.
- Prepare the prompt dataset containing unlabeled prompts only. Prompts can be the same as those from the preference dataset, or different.
- Provide an optional evaluation dataset that only includes unlabeled prompts for prediction (i.e., for generating summaries) after the model is tuned. If provided, inference is performed on it after the RLHF tuning job completes.
- All of the datasets required for RLHF tuning LLMs using Vertex AI Pipelines should be in the JSON Lines (JSONL) format.
- Upload the aforementioned datasets to a Cloud Storage bucket. The path to your bucket is gs://name-of_your_bucket. The datasets do not have to be in the same storage bucket, but they are required to be in Cloud Storage buckets.
- Create an RLHF model tuning job using an RLHF pipeline on Vertex AI.
NOTE: The RLHF pipeline on Vertex AI exists in the Google Cloud Pipeline Components library. To run it, we will need to import, compile, and execute it.
- Once tuned, the model is deployed to a Vertex AI endpoint with the same name as that of the tuned model.
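Before moving on, here is a hypothetical, minimal sketch of what a single JSONL record looks like for each dataset. The field contents are made up; only the schema, which matches the datasets we build in section “Preparing the Data”, matters.
import json
# Hypothetical records, shown only to illustrate the JSONL schemas the RLHF pipeline expects.
preference_example = {
    "input_text": "POST: I stayed up all night before my final exam and ...",
    "candidate_0": "Pulled an all-nighter before my final and overslept.",
    "candidate_1": "I studied all night.",
    "choice": 0,  # The human preferred candidate_0.
}
prompt_example = {"input_text": "POST: Today I locked my keys in the car while it was running ..."}
# The evaluation dataset uses the same schema as the prompt dataset.
print(json.dumps(preference_example))
print(json.dumps(prompt_example))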
The RLHF pipeline on Vertex AI outputs training logs with relevant metrics and curves to TensorBoard. The steps to follow after RLHF tuning are listed below.
- Access the TensorBoard logs from Cloud Storage to visualize and review the evaluation performance of the RLHF-tuned model.
- Access the evaluation / inference results from Cloud Storage to view the Reddit posts and their summaries generated by the tuned model. Perform further analysis as required.
See DeepLearning.AI’s course, GCP’s documentation, and other resources in the subsection “Resources” to learn more.
How it Works
Key Concepts
Let us review some of the key aspects of the RLHF pipeline on Vertex AI that we will use to tune Llama 2 7B, which we will then use to perform batch or bulk inference for text summarization tasks.
- Base Model: The base LLM to tune, Llama 2 7B in this case.
- Prompt: Reddit posts in this case. See section “Preparing the Data” for more.
- Completion: Text response / summary produced by the base model.
- Learning Goal: To tune the base LLM to produce completions aligned with human preferences that maximize the rewards produced by the reward model.
- Reward Model: Another LLM that takes in a prompt and a completion during inference and assigns a scalar value / score to indicate how good the completion is for the prompt. It is trained using the preference dataset: during training, the reward model takes in a prompt and its two completions (candidate_0 and candidate_1) and assigns a score to each of them. The loss function (rank_loss) combines the two scores and tries to maximize the difference between the scores of the winning and losing candidates (a minimal sketch of such a ranking loss appears after this list). The higher the score (reward), the better the alignment of the completion with human preferences.
- RL: In RL, an agent learns to achieve an objective by interacting with an environment and receiving rewards or penalties based on its actions. Through trial and error, the agent learns an optimal policy – a strategy for making decisions that maximizes cumulative rewards, even in tasks where the optimal solution is unknown or complex.
- RL Loop: This takes in the prompt dataset. The base LLM (specifically, its weights) to tune is the policy in this case. It receives a reward from the reward model each time it generates a completion, indicating how aligned the completion is with human preferences. Using RL to learn the policy that maximizes the rewards results in a fine-tuned base LLM that generates completions with high scores / rewards from the reward model. In RLHF, the policy gradient method Proximal Policy Optimization (PPO) is used to learn the optimal policy and update the weights of the base LLM. The expectation is that each time the weights of the base LLM get updated, the policy gets a little better at generating completions aligned with human preferences.
- Evaluation Results: The evaluation dataset is used for inference after the RLHF tuning job is completed and the base LLM is tuned.
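Vertex AI does not publish the exact implementation of rank_loss, but the standard pairwise ranking formulation conveys the idea. The sketch below is purely illustrative and uses made-up scores; it is not Vertex AI's actual code.
import numpy as np
def pairwise_rank_loss(score_chosen, score_rejected):
    # -log(sigmoid(score_chosen - score_rejected)): the loss shrinks as the reward model
    # scores the human-preferred completion higher than the rejected one, which pushes the
    # gap between the winning and losing candidates to grow.
    return -np.log(1.0 / (1.0 + np.exp(-(score_chosen - score_rejected))))
# Made-up scores for illustration.
print(pairwise_rank_loss(2.0, 0.5))  # Small loss: the preferred completion already scores higher.
print(pairwise_rank_loss(0.5, 2.0))  # Large loss: the candidates are ranked the wrong way around.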
This work will utilize parameter-efficient fine-tuning. This paradigm aims to make the fine-tuning process compute-, time-, and cost-efficient by training only a subset of model parameters. This could be a subset of the existing parameters or a new set of parameters. The RLHF pipeline on Vertex AI takes care of this for us.
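No user code is needed for this, since the Vertex AI pipeline applies parameter-efficient tuning internally. Purely as an illustration of the idea – freeze the existing weights and train only a small set of new parameters – here is a hypothetical PyTorch-style sketch; the layer size and adapter design are made up and are not what the pipeline actually does.
import torch
import torch.nn as nn
class TinyAdapter(nn.Module):
    # A small set of new, trainable parameters layered on top of a frozen base layer.
    def __init__(self, hidden_size, bottleneck=16):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # Residual adapter.
base_layer = nn.Linear(4096, 4096)  # Stand-in for one layer of the frozen base LLM.
for p in base_layer.parameters():
    p.requires_grad = False  # Existing parameters stay frozen.
adapter = TinyAdapter(4096)  # Only these new parameters would be trained.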
See subsection “Resources” to learn more.
Resources
The following resources should be helpful in navigating the GCP set-up, as well as understanding and managing the workflow for RLHF tuning Llama 2 7B in order to perform bulk inference using Vertex AI Pipelines.
- DeepLearning.AI’s course on RLHF in association with Google Cloud.
- GCP’s documentation about tuning Pathways Language Models (PaLM) text models by using RLHF tuning. Although PaLM models are deprecated, the information is very helpful in the context of RLHF tuning LLMs using Vertex AI.
- GCP’s blog on “RLHF Tuning with Vertex AI.”
- GCP’s documentation about “Introduction to Vertex AI Pipelines.”
- GCP’s Colab notebook on “Vertex AI Model Garden – LLaMA2 (RLHF).”
- GCP’s Colab notebook on “Vertex AI LLM Reinforcement Learning from Human Feedback.”
- GCP’s Colab notebook on “Vertex AI LLM Batch Inference with RLHF-tuned Models.”
Let us load the relevant Python libraries.
Configurations
GCP
Follow the steps below to use GCP for RLHF tuning LLMs.
- Create a GCP account if you do not already have one and log in to the account. Go for the 90-day, $300 free trial to use free credits made available to new GCP users (unless you are willing to pay).
- Create a new GCP project. Make a note of the Project ID and Project number for later use. Information about the project can also be found under your GCP dashboard. More on creating and managing GCP projects here.
- Set up a Cloud Billing account required to use GCP services that are chargeable. More on managing Cloud Billing here.
- Enable the IAM, Cloud Storage, Vertex AI, and BigQuery APIs.
- Create a service account for your GCP project. A service account is a special type of account that allows applications, services, and resources within GCP to authenticate and interact securely with other GCP services and resources. A service account is required to create Vertex AI Pipeline jobs. See this course and these [1 and 2] GCP resources for more.
- Ensure that the project’s service account has the following roles enabled: Service Account User, Storage Admin, Vertex AI Administrator, and BigQuery User.
- Ensure that the Compute Engine default service account has the following roles enabled, if not already set by default: Editor and Service Account User.
- Create and download a private key for your project’s service account that will be used as authentication credentials for using Vertex AI services. See this course to learn more.
- Set up a Cloud Storage bucket to store the datasets. The bucket will also be used to store the information generated when we run the RLHF pipeline on Vertex AI. Cloud Storage paths start with gs://. The best region to select will depend on where you are located, as well as the requirements of the service that will interact with the data. US or EU multi-region are good defaults. More on setting up Cloud Storage buckets and uploading objects to buckets here [1 and 2]. A programmatic alternative for creating the bucket is sketched below.
More on setting up GCP for RLHF tuning LLMs using Vertex AI can be found in this course and other resources in the subsection “Resources.”
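For those who prefer to script the bucket creation, here is a minimal sketch using the google-cloud-storage client instead of the console. It assumes the credentials object and PROJECT_ID produced later in the “Authentication” section, and that the placeholder bucket name is replaced with your own globally unique name.
from google.cloud import storage
# Create the Cloud Storage bucket programmatically (a sketch; see the assumptions above).
client = storage.Client(project=PROJECT_ID, credentials=credentials)
bucket = client.create_bucket("your-cloud-storage-bucket-name", location="us-central1")
print(f"Created bucket: gs://{bucket.name}")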
Environment Variables
Information about the GCP authentication credentials, Cloud Storage bucket, and ports to be used to bind with TensorBoard for viewing the evaluation logs after RLHF tuning Llama 2 7B will be treated as environment variables and stored in the .env file on Google Drive. The .env file will consist of the following environment variables.
- SERVICE_ACCOUNT_KEY: Base64-encoded version of your private service account key for the GCP project’s service account. To secure the private service account key, ensure it is Base64-encoded before storing it in the .env file (see the encoding sketch right after this list).
- PROJECT_ID: your-gcp-project_ID
- BUCKET_NAME: your-cloud-storage-bucket-name
- STAGING_BUCKET: gs://your-cloud-storage-bucket-name
- PORT1: Port for binding with TensorBoard for viewing the reward model logs.
- PORT2: Port for binding with TensorBoard for viewing the reinforcer (RL part) logs.
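The Base64 encoding is a one-time, local step. Here is a minimal sketch, assuming the downloaded key has been saved as service_account_key.json (a hypothetical filename); the resulting string is what goes into SERVICE_ACCOUNT_KEY.
import base64
# Encode the downloaded service account key (hypothetical filename) as a Base64 string.
with open("service_account_key.json", "rb") as f:
    key_b64 = base64.b64encode(f.read()).decode("ascii")
print(key_b64[:40] + "...")  # Never print or commit the full key.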
TensorBoard usually binds to the ports north of 6000 on the Virtual Machine (VM) allocated by Colab. The default port is generally 6006.
NOTE: Alternatively, one can consider storing the environment variables directly in the Colab notebook (temporary), using GCP’s Secret Manager for secure storage, using Colab Secrets (temporary), uploading a Python config file (temporary), or connecting to an external vault service. The temporary methods mean that the environment variables will be lost when the Colab session resets.
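As one example of the Colab Secrets route, the sketch below reads the Base64-encoded key from Colab’s Secrets panel, assuming it has been stored there under the name SERVICE_ACCOUNT_KEY, instead of loading a .env file from Drive.
# Read the Base64-encoded key from Colab Secrets (a sketch; see the assumption above).
from google.colab import userdata
SERVICE_ACCOUNT_KEY_STRING_B64 = userdata.get("SERVICE_ACCOUNT_KEY")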
Setup
Install the necessary libraries for data handling, cloud integration, ML pipelines, environment management, and logging.
!pip install datasets
!pip install google-cloud-pipeline-components
!pip install kfp
!pip3 install google-cloud-aiplatform
!pip install python-dotenv
!pip install tensorboard
Import libraries for dataset handling, data manipulation, model training, Google Drive integration, cloud storage, authentication, and RLHF pipeline setup.
from datasets import load_dataset
import pandas as pd
from sklearn.model_selection import train_test_split
from google.colab import drive
from dotenv import load_dotenv
import os
import base64
import json
from google.oauth2.service_account import Credentials
from google.auth.transport.requests import Request
from google.cloud import storage
import google.cloud.aiplatform as aiplatform
# Import (RLHF is currently in preview)
from google_cloud_pipeline_components.preview.llm import rlhf_pipeline
# Import from KubeFlow pipelines
from kfp import compiler
import math
Preparing the Data
Load the preference dataset and convert it to a pandas DataFrame for easier processing and visualization. This would be OpenAI’s summarize_from_feedback dataset from Hugging Face, which is a collection of Reddit posts based on the TL;DR dataset. We will work with the train split from the comparisons part of the dataset, which contains over 92,000 examples.
# Load the preference dataset.
summarize_from_feedback_dataset = load_dataset("openai/summarize_from_feedback", "comparisons")
# Check the available splits in the preference dataset.
print(summarize_from_feedback_dataset)
# Convert the preference data into a pandas DataFrame.
sff_df = pd.DataFrame(summarize_from_feedback_dataset["train"])
sff_df.head()

Create and view a cleaner preference dataset aligned with the requirements of the RLHF pipeline on Vertex AI.
# Create and view a cleaner preference dataset aligned with the requirements of the RLHF pipeline on Vertex AI.
preference_dataset = pd.DataFrame({
    "input_text": sff_df["info"].apply(lambda x: x["post"]),
    "candidate_0": sff_df["summaries"].apply(lambda x: x[0]["text"] if len(x) > 0 else None),
    "candidate_1": sff_df["summaries"].apply(lambda x: x[1]["text"] if len(x) > 1 else None),
    "choice": sff_df["choice"]
})
# Display the first five rows of the new preference dataset.
print("Cleaned preference dataset:")
preference_dataset.head()

Check the size of the preference dataset.
# Check the size of the preference dataset.
preference_dataset.shape[0]

Check for NaN values in the preference dataset and summary information about the dataset.
# Check for NaN values in the preference dataset.
print("NaN values in preference dataset:")
print(preference_dataset.isnull().sum())
# Display summary information about the preference dataset.
print("Preference dataset information:")
print(preference_dataset.info())

Save the preference dataset as a JSONL file.
# Save the new preference dataset as a JSONL file.
preference_dataset.to_json("/tmp/preference_dataset.jsonl", orient="records", lines=True)
Load the prompt dataset and convert it to a pandas DataFrame for easier processing and visualization. This would be the reddit_tifu dataset from Hugging Face, which is a collection of Reddit posts with short and long summaries. The train split is the only split available, and the dataset has close to 80,000 examples. Note that this course uses a separate prompt dataset.
# Load the prompt dataset.
reddit_tifu_dataset = load_dataset("reddit_tifu", "short")
# Check the available splits in the prompt dataset.
print(reddit_tifu_dataset)
# Convert the prompt data into a pandas DataFrame.
rt_df = pd.DataFrame(reddit_tifu_dataset["train"])
rt_df.head()

Create and view a cleaner prompt dataset aligned with the requirements of the RLHF pipeline on Vertex AI.
# Create and view a cleaner prompt dataset aligned with the requirements of the RLHF pipeline on Vertex AI.
prompt_ds = pd.DataFrame({
    "input_text": rt_df["documents"]
})
# Display the first five rows of the new prompt dataset.
print("Cleaned prompt dataset:")
prompt_ds.head()

Check the size of the prompt dataset.
# Check the size of the prompt dataset.
prompt_ds.shape[0]

Check for NaN values in the prompt dataset and summary information about the dataset.
# Check for NaN values in the prompt dataset.
print("NaN values in prompt dataset:")
print(prompt_ds.isnull().sum())
# Display summary information about the prompt dataset.
print("Prompt dataset information:")
print(prompt_ds.info())

Split the original prompt dataset into train (80%) and eval (20%) sets. Going forward, we will refer to the original prompt dataset’s train set as the prompt dataset and the eval set as the evaluation dataset.
# Split the prompt dataset into train (80%) and eval (20%) sets.
prompt_dataset, eval_dataset = train_test_split(prompt_ds, test_size=0.2, random_state=42)
Display the first five rows of the new prompt dataset to verify.
# Display the first five rows of the new prompt dataset to verify.
# This is the train split of the original prompt dataset.
print("Train split of the original prompt dataset:")
prompt_dataset.head()

Display the first five rows of the evaluation dataset created from the original prompt dataset to verify.
# Display the first five rows of the evaluation dataset created from the original prompt dataset to verify.
print("Eval split of the prompt dataset:")
eval_dataset.head()
NOTE: Any objectionable content has been redacted in the image below.

Save the prompt and evaluation datasets as JSONL files.
# Save the prompt and evaluation datasets as JSONL files.
prompt_dataset.to_json('/tmp/prompt_dataset.jsonl', orient="records", lines=True)
eval_dataset.to_json('/tmp/eval_dataset.jsonl', orient="records", lines=True)
Authentication
Define the authenticate() function to obtain and handle the GCP service account credentials by authenticating with the private service account key for the project. These credentials allow the program to interact with any GCP service the account has permission to access.
The authenticate() function loads the environment variables from the .env file on Google Drive, decodes the private service account key (converting it from a Base64 string into JSON format) for secure GCP authentication, refreshes the credentials if needed, and retrieves project-specific details (project ID, bucket name, and staging bucket) for further cloud operations.
def authenticate():
    drive.mount("/content/drive")
    # Load the .env file from Google Drive.
    load_dotenv("/content/drive/MyDrive/Projects/RLHF/GCP/Access/.env.txt")
    # Decode the private key from the service account and store it in a dictionary.
    # Retrieve the Base64-encoded private service account key from the .env file.
    SERVICE_ACCOUNT_KEY_STRING_B64 = os.getenv('SERVICE_ACCOUNT_KEY')
    # Encode the retrieved Base64 string into bytes using ASCII encoding.
    SERVICE_ACCOUNT_KEY_BYTES_B64 = SERVICE_ACCOUNT_KEY_STRING_B64.encode("ascii")
    # Decode the Base64-encoded bytes back into their original byte format.
    SERVICE_ACCOUNT_KEY_STRING_BYTES = base64.b64decode(SERVICE_ACCOUNT_KEY_BYTES_B64)
    # Decode the byte representation of the private service account key into a string.
    SERVICE_ACCOUNT_KEY_STRING = SERVICE_ACCOUNT_KEY_STRING_BYTES.decode("ascii")
    # Parse the JSON string into a Python dictionary for easy access to its contents.
    SERVICE_ACCOUNT_KEY = json.loads(SERVICE_ACCOUNT_KEY_STRING)
    # Create credentials based on the private key from the service account.
    credentials = Credentials.from_service_account_info(
        SERVICE_ACCOUNT_KEY,
        scopes=["https://www.googleapis.com/auth/cloud-platform"]
    )
    # Refresh credentials if expired.
    if credentials.expired:
        credentials.refresh(Request())
    # Set project ID, bucket name, and staging bucket according to the environment variables.
    PROJECT_ID = os.getenv("PROJECT_ID")
    BUCKET_NAME = os.getenv("BUCKET_NAME")
    STAGING_BUCKET = os.getenv("STAGING_BUCKET")
    return credentials, PROJECT_ID, BUCKET_NAME, STAGING_BUCKET
Authenticate and fetch credentials and project-specific information (redacted in the image below) from GCP.
# Authenticate and fetch credentials and project-specific information from GCP.
credentials, PROJECT_ID, BUCKET_NAME, STAGING_BUCKET = authenticate()
# Confirm the project ID, bucket name, and staging bucket.
print(PROJECT_ID)
print(BUCKET_NAME)
print(STAGING_BUCKET)

Store Data on Cloud Storage
Initialize the client to upload the JSONL datasets to our Cloud Storage bucket.
# Initialize the client to upload the JSONL datasets to the Cloud Storage bucket.
client = storage.Client(credentials=credentials, project=PROJECT_ID)
# Reference the bucket.
bucket = client.bucket(BUCKET_NAME)
Reference the target preference dataset file and upload it to our Cloud Storage bucket.
# Reference the target preference dataset file on Cloud Storage.
blob_preference = bucket.blob("datasets/preference/preference_dataset.jsonl")
# Upload the preference dataset to our Cloud Storage bucket.
blob_preference.upload_from_filename("/tmp/preference_dataset.jsonl")
Reference the target prompt and evaluation dataset files and upload them to our Cloud Storage bucket.
# Reference the target prompt dataset file on Cloud Storage.
blob_prompt_train = bucket.blob("datasets/prompt/train/prompt_dataset.jsonl")
# Upload the prompt dataset to our Cloud Storage bucket.
blob_prompt_train.upload_from_filename("/tmp/prompt_dataset.jsonl")
# Reference the target evaluation dataset file on Cloud Storage.
blob_prompt_eval = bucket.blob("datasets/prompt/eval/eval_dataset.jsonl")
# Upload the evaluation dataset to our Cloud Storage bucket.
blob_prompt_eval.upload_from_filename("/tmp/eval_dataset.jsonl")
Vertex AI Pipeline Job
Set us-central1 as the region for the RLHF pipeline on Vertex AI. See GCP’s documentation and section “Keeping GCP Costs in Check” for more about setting the region.
# Set the region for the RLHF pipeline on Vertex AI.
REGION = "us-central1"
Initialize and connect to Vertex AI.
# Initialize Vertex AI.
aiplatform.init(project = PROJECT_ID, location = REGION, credentials = credentials)
Define a path to the RLHF pipeline’s YAML file (rlhf_pipeline.yaml). The YAML file is a blueprint for the pipeline. It contains key configurations defining the pipeline, such as model parameters, hyperparameters, data paths, training settings, and other options needed for training and evaluating the base LLM. Additionally, the pipeline definition includes tasks, their order, input / output arguments, and other metadata needed for pipeline orchestration.
# Define a path to the RLHF pipeline's YAML file.
RLHF_PIPELINE_PKG_PATH = "rlhf_pipeline.yaml"
Execute the compile() function to compile the RLHF pipeline on Vertex AI. The rlhf_pipeline is the pipeline function, which contains the steps and components of the ML pipeline. The compile() function compiles the rlhf_pipeline function into the rlhf_pipeline.yaml file. The YAML file describes the pipeline in a format that Kubeflow Pipelines can understand and execute.
# Execute the compile() function to compile the RLHF pipeline.
compiler.Compiler().compile(
    pipeline_func=rlhf_pipeline,
    package_path=RLHF_PIPELINE_PKG_PATH
)
View the first few lines of content in the RLHF pipeline’s YAML file.
# Print the first few lines of the RLHF pipeline's YAML file.
!head rlhf_pipeline.yaml
# View the contents of the RLHF pipeline's YAML file.
# !cat rlhf_pipeline.yaml

Set the batch size for the RLHF pipeline.
# Define the batch size for the RLHF pipeline.
BATCH_SIZE = 128
Determine the number of training steps for the reward model. See this course for more.
# Get the total number of rows in the preference dataset.
PREFERENCE_DATASET_SIZE = preference_dataset.shape[0]
print(PREFERENCE_DATASET_SIZE)
# Calculate the number of steps per epoch for reward model training.
REWARD_STEPS_PER_EPOCH = math.ceil(PREFERENCE_DATASET_SIZE / BATCH_SIZE)
print(REWARD_STEPS_PER_EPOCH)
# Define the number of epochs for reward model training.
REWARD_NUM_EPOCHS = 1
# Calculate the number of training steps for the reward model.
reward_model_train_steps = REWARD_STEPS_PER_EPOCH * REWARD_NUM_EPOCHS
print(reward_model_train_steps)

Determine the number of steps in the RL training. See this course for more.
# Get the number of rows in the prompt dataset.
PROMPT_DATASET_SIZE = prompt_dataset.shape[0]
print(PROMPT_DATASET_SIZE)
# Calculate the number of steps per epoch for RL training.
RL_STEPS_PER_EPOCH = math.ceil(PROMPT_DATASET_SIZE / BATCH_SIZE)
print(RL_STEPS_PER_EPOCH)
# Define the number of epochs to be used for RL training.
RL_NUM_EPOCHS = 1
# Calculate the number of steps in the RL training.
reinforcement_learning_train_steps = RL_STEPS_PER_EPOCH * RL_NUM_EPOCHS
print(reinforcement_learning_train_steps)

Set the paths to the preference, prompt, and evaluation datasets on Cloud Storage.
# Path to the preference dataset.
PREFERENCE_DATASET_PATH = "gs://rlhf-bucket_1/datasets/preference/preference_dataset.jsonl"
# Path to the prompt dataset.
PROMPT_DATASET_PATH = "gs://rlhf-bucket_1/datasets/prompt/train/prompt_dataset.jsonl"
# Path to the evaluation dataset.
EVAL_DATASET_PATH = "gs://rlhf-bucket_1/datasets/prompt/eval/eval_dataset.jsonl"
Set the parameter values required to run the Vertex AI Pipeline job for RLHF tuning Llama 2 7B. Learn more.
NOTE: In the context of RLHF, reward hacking occurs when the base LLM learns to exploit the reward function to maximize its score, rather than genuinely improving its responses. If given too many training steps for RL, the policy (base model) may figure out a way to do reward hacking. Reward hacking is controlled by the parameter kl_coeff. See these [1, 2, and 3] resources for more. For more information about kl_coeff, see subsection “Performance Monitoring.”
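To make the role of kl_coeff concrete: conceptually, the effective reward optimized during RL is the reward model’s score minus kl_coeff times the KL divergence between the tuned policy and the original base model, so a larger kl_coeff keeps the policy closer to the base model and discourages reward hacking. Vertex AI’s exact formulation is not public; the sketch below, with made-up numbers, only illustrates this general pattern.
def penalized_reward(reward_score, kl_divergence, kl_coeff):
    # Conceptual RLHF objective per completion: the reward model's score minus a
    # penalty for drifting away from the original (untuned) base model.
    return reward_score - kl_coeff * kl_divergence
# Made-up numbers for illustration.
print(penalized_reward(reward_score=1.8, kl_divergence=4.0, kl_coeff=0.1))  # 1.4
print(penalized_reward(reward_score=1.8, kl_divergence=4.0, kl_coeff=0.5))  # -0.2: drifting this far is no longer worth it.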
# Parameter values required to run the Vertex AI Pipeline job for RLHF tuning Llama 2 7B.
parameter_values={
    "preference_dataset": PREFERENCE_DATASET_PATH,
    "prompt_dataset": PROMPT_DATASET_PATH,
    "eval_dataset": EVAL_DATASET_PATH,
    "large_model_reference": "llama-2-7b",
    "reward_model_train_steps": reward_model_train_steps,
    "reinforcement_learning_train_steps": reinforcement_learning_train_steps,
    "reward_model_learning_rate_multiplier": 1.0,
    "reinforcement_learning_rate_multiplier": 0.2,
    "kl_coeff": 0.1,  # Increased to reduce reward hacking.
    "instruction": "Summarize in less than 50 words"}
Create, configure, and run a Vertex AI Pipeline job for RLHF tuning Llama 2 7B. After tuning, Vertex AI will store the results of batch prediction and other artifacts in our Cloud Storage bucket, as defined by the parameter pipeline_root. See more.
# Create and configure a Vertex AI Pipeline job for RLHF tuning Llama 2 7B.
job = aiplatform.PipelineJob(
    display_name="rlhf-tuning",
    pipeline_root=STAGING_BUCKET,
    template_path=RLHF_PIPELINE_PKG_PATH,
    parameter_values=parameter_values)
# Run the RLHF pipeline job on Vertex AI.
job.run()

To view the RLHF pipeline on Vertex AI and its components, navigate as follows. See this course to learn more.
GCP Console –> Your-GCP-project –> Vertex AI –> Pipelines –> RUNS –> Your-RLHF-pipeline-region –> Your-RLHF-pipeline
Evaluation
Performance Monitoring
The RLHF pipeline on Vertex AI outputs some logs with training curves for pertinent train-time metrics to TensorBoard. The TensorBoard logs for the reward model can be found by navigating to our Cloud Storage bucket in one of the following ways.
- Directly:
your-cloud-storage-bucket-name/your-gcp-project-number/your-rlhf-train-template_ID/reward-model-trainer_ID/tensorboard_metrics/logs/train/your-tensorboard-logs-events-file
- URI from the Vertex AI Pipeline Job: GCP Console –> Your-GCP-project –> Vertex AI –> Pipelines –> RUNS –> Your-RLHF-pipeline-region –> Your-RLHF-pipeline –> RewardModelTrainer –> tensorboard_metrics –> Pipeline run analysis –> NODE INFO –> Artifact Info –> URI –> Your-reward-model-TensorBoard-logs
The TensorBoard logs for the reinforcer (RL training) can be found by navigating to our Cloud Storage bucket in one of the following ways.
- Directly:
your-cloud-storage-bucket-name/your-gcp-project-number/your-rlhf-train-template_ID/reinforcer_ID/tensorboard_metrics/logs/train/your-tensorboard-logs-events-file
- URI from the Vertex AI Pipeline Job: GCP Console –> Your-GCP-project –> Vertex AI –> Pipelines –> RUNS –> Your-RLHF-pipeline-region –> Your-RLHF-pipeline –> Reinforcer –> tensorboard_metrics –> Pipeline run analysis –> NODE INFO –> Artifact Info –> URI –> Your-reinforcer-TensorBoard-logs
Store the logs for the reward model and the reinforcer (RL part) on Google Drive for easy and secured access.
# Path to the TensorBoard logs for the reward model.
reward_logs = "/content/drive/MyDrive/Projects/RLHF/GCP/Logs/Reward-Logs"
# Path to the TensorBoard logs for the reinforcer.
reinforcer_logs = "/content/drive/MyDrive/Projects/RLHF/GCP/Logs/Reinforcer-Logs"
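The copy itself can be scripted. The sketch below reuses the bucket client created in the “Store Data on Cloud Storage” section; the prefixes are placeholders for the reward-model-trainer and reinforcer paths described above and must be replaced with the actual IDs from your pipeline run.
# Download the TensorBoard event files from Cloud Storage to Google Drive (a sketch).
def download_logs(prefix, destination_dir):
    os.makedirs(destination_dir, exist_ok=True)
    for blob in bucket.list_blobs(prefix=prefix):
        blob.download_to_filename(os.path.join(destination_dir, os.path.basename(blob.name)))
# Placeholder prefixes; replace with the actual paths from your pipeline run.
download_logs("your-gcp-project-number/your-rlhf-train-template_ID/reward-model-trainer_ID/tensorboard_metrics/logs/train", reward_logs)
download_logs("your-gcp-project-number/your-rlhf-train-template_ID/reinforcer_ID/tensorboard_metrics/logs/train", reinforcer_logs)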
Load TensorBoard to view the training curves after the Vertex AI Pipeline job completes execution.
# Load TensorBoard.
%load_ext tensorboard
Set the port from the .env file to launch TensorBoard and view the training curves for the reward model. Specifically, we will look at the rank_loss metric for the reward model. Rank loss is a loss function used to train reward models by comparing pairs of model outputs and penalizing incorrect rankings, helping the base model better align with human preferences.
Ideally, the rank loss should decrease and then plateau when RLHF tuning is completed. See this course for more. See the rank loss below. Additionally, here is the Colab notebook covering this work for more training curves for the reward model reported by the RLHF pipeline on Vertex AI to TensorBoard.
# Set port from the .env file to launch TensorBoard and view the training curves for the reward model.
port = %env PORT1
%tensorboard --logdir $reward_logs --port $port --bind_all

Set the port from the .env file to launch TensorBoard and view the training curves for the reinforcer (RL training). We will focus on the metrics kl_loss and reward.
- Kullback-Leibler (KL) Divergence Loss or KL Loss (kl_loss): An evaluation metric that measures how much the tuned base model’s policy diverges from its initial policy. A lower KL loss indicates closer alignment with the original model, preserving base behavior and avoiding overly confident outputs, which improves stability.
- Reward (reward): Reward represents the cumulative score the tuned base model achieves based on the reward model’s scoring system, indicating how well it aligns with human preferences. A higher reward score indicates that the tuned model’s outputs are more closely aligned with the human preferences.
Ideally, both KL loss and reward should increase and then stabilize (plateau) when the RLHF tuning is completed. See this course for more. See the training curves for the KL loss and the reward below. Additionally, here is the Colab notebook covering this work for more training curves for the reinforcer reported by the RLHF pipeline on Vertex AI to TensorBoard.
# Set port from the .env file to launch TensorBoard and view the training curves for the reinforcer (RL training).
port = %env PORT2
%tensorboard --logdir $reinforcer_logs --port $port --bind_all


Results
The evaluation results (a JSONL file) for the tuned model will be stored in Google Drive for easy and secured access. The evaluation results can be found by navigating to our Cloud Storage bucket in one of the following ways.
- Directly:
your-cloud-storage-bucket-name/your-gcp-project-number/your-rlhf-train-template_ID/bulk-inferrer_ID/output_prediction/merged/your-evaluation-results-file
- URI from the Vertex AI Pipeline Job: GCP Console –> Your-GCP-project –> Vertex AI –> Pipelines –> RUNS –> Your-RLHF-pipeline-region –> Your-RLHF-pipeline –> Bulk Inferrer –> Output Parameters –> output_prediction_gcs_path –> Your-evaluation-results-file
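As with the TensorBoard logs, this download can be scripted. The sketch below reuses the bucket client from earlier; the blob name is a placeholder that must be replaced with the actual path from your pipeline run.
# Download the merged bulk-inference output from Cloud Storage to Google Drive (a sketch).
blob_eval_results = bucket.blob("your-gcp-project-number/your-rlhf-train-template_ID/bulk-inferrer_ID/output_prediction/merged/your-evaluation-results-file")
blob_eval_results.download_to_filename("/content/drive/MyDrive/Projects/RLHF/GCP/Eval/eval_tuned.jsonl")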
Load each line of the evaluation results from Google Drive into a list as JSON objects for review.
# Define the file path for the tuned evaluation data.
eval_tuned_path = '/content/drive/MyDrive/Projects/RLHF/GCP/Eval/eval_tuned.jsonl'
# Initialize an empty list to store the loaded tuned evaluation data.
eval_data_tuned = []
# Open the file with the evaluation data and read it line by line.
with open(eval_tuned_path) as f:
    for line in f:
        # Parse each line as JSON and append it to the eval_data_tuned list.
        eval_data_tuned.append(json.loads(line))
Define a function to print dictionary contents with optional indentation.
def print_d(d, indent=0):
    # Iterate over each key-value pair in the dictionary.
    for key, val in d.items():
        # Create indentation based on the current indent level.
        indentation = " " * indent
        # Print a line separator for visual clarity.
        print(f"{indentation}" + "-" * 50)
        # Print the key with current indentation.
        print(f"{indentation}key:{key}\n")
        # Check if the value is a nested dictionary.
        if isinstance(val, dict):
            # Print "val" to indicate there's more nested content.
            print(f"{indentation}val")
            # Recursively call print_d to print the nested dictionary with increased indentation.
            print_d(val, indent=indent + 1)
        else:
            # If the value is not a dictionary, print it directly.
            print(f"{indentation}val:{val}")
Look at any of the summaries produced by the tuned model. For the complete output, see the Colab notebook for this work.
# Look at any of the summaries produced by the tuned model.
print_d(eval_data_tuned[101])
NOTE: Any objectionable content has been redacted in the image below.

Extract all of the prompts (Reddit posts) and their completions (generated summaries) and visualize them side-by-side. For the complete output, see the Colab notebook for this work.
# Extract all of the prompts from eval_data_tuned.
prompts = [sample['inputs']['inputs_pretokenized']
           for sample in eval_data_tuned]
# Extract completions generated by the tuned model.
tuned_completions = [sample['prediction']
                     for sample in eval_data_tuned]
# Create a DataFrame to store prompts and model completions.
results = pd.DataFrame(
    data={'prompt': prompts,
          'tuned_model': tuned_completions})
# Set display options to show full column width for readability.
pd.set_option('display.max_colwidth', None)
# Display the results DataFrame.
results
NOTE: Any objectionable content has been redacted in the image below.

Thoughts
Following this tutorial would help one learn the intricacies of fine-tuning LLMs using RLHF on Vertex AI for text summarization tasks. However, here are some notes that might be beneficial.
- Reward Model Performance: The decreasing rank loss of the reward model after 1 epoch of training suggests that the metric would likely have approached its ideal behavior (decrease and then plateau) had the model been trained for an optimal number of epochs. The optimal number of epochs for the reward model is 20-30, as recommended by the GCP team [1 and 2].
- RL Loop: The increasing KL loss and reward of the RL part after 1 epoch of training suggest that these metrics would likely have approached their ideal behavior (increase and then stabilize) had training run for an optimal number of epochs. The GCP team suggests that the optimal number of epochs for RL training is 10-20; see more [1 and 2]. A rough step-count estimate for these epoch recommendations is sketched after this list.
- Summarization Quality: The quality of the completions / summaries generated for the prompts / Reddit posts would certainly improve with the reward model and RL training run for the optimal number of epochs. Additionally, using optimal values for parameter_values – the parameters required for running the Vertex AI Pipeline job – would further enhance the outcomes. Training the reward model and RL components for just 1 epoch is insufficient for an LLM like Llama 2 to effectively learn from feedback data. Typically, LLMs require 20-30 epochs, especially with around 90,000 examples in the preference dataset, to stabilize their responses and align closely with human preferences encoded in the reward model. Insufficient training time can result in outputs that are unrefined, repetitive, nonsensical, or even missing altogether, as we observe in this tutorial.
- Optimal Parameter Values: This course from DeepLearning.AI recommends 10000 as the optimal value for the RLHF pipeline parameters reward_model_train_steps and reinforcement_learning_train_steps.
- Models Supported: Apart from Meta’s Llama 2 and its variants, other text models (corresponding to the RLHF pipeline parameter large_model_reference) that support RLHF tuning on Vertex AI include text-bison@002 and the t5-small, t5-large, t5-xl, and t5-xxl Finetuning Language Models Text-To-Text Transfer Transformer (Flan-T5) models. Learn more.
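As a rough back-of-the-envelope estimate, reusing the per-epoch step counts computed earlier in this notebook and taking midpoints of the recommended epoch ranges, those recommendations translate to roughly the following numbers of training steps.
# Rough step counts for the GCP-recommended epoch ranges (midpoints used here).
reward_steps_at_25_epochs = REWARD_STEPS_PER_EPOCH * 25  # Reward model: 20-30 epochs recommended.
rl_steps_at_15_epochs = RL_STEPS_PER_EPOCH * 15  # RL training: 10-20 epochs recommended.
print(reward_steps_at_25_epochs, rl_steps_at_15_epochs)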
If time and compute cost are not constraints, we recommend RLHF tuning Llama 2 7B with the optimal parameter values recommended above for the best results.
Keeping GCP Costs in Check
Tuning LLMs using the RLHF pipeline on Vertex AI is a compute-, time-, and cost-heavy task. If the costs are not monitored and controlled, they can easily spin out of control. Here are some notes that can help.
- For the RLHF pipeline on Vertex AI, developers can choose the region that is geographically closer to them to minimize latency, the choices being us-central1 and europe-west4. As per GCP’s documentation, for RLHF tuning, the accelerator type and count are determined by the selected region: jobs in us-central1 use eight Nvidia A100 80GB GPUs, while jobs in europe-west4 use 32 TPU v3s.
- For this tutorial, running the Vertex AI Pipeline job in the us-central1 region took over 6 hours for a single epoch each of reward model and RL training.
- The pipeline jobs in the europe-west4 region use 32 TPU v3s. TPUs are generally more aligned with Google infrastructure, and training could be faster (and probably cheaper). See here for more. We performed RLHF tuning with 20% of the preference dataset, 10% of the prompt dataset, 10 epochs for the reward model, and 7 epochs for RL training, and it took us a little over 4 hours. Of course, the performance was worse given the smaller training data.
- You also pay for the GCP resources you use with the RLHF pipeline on Vertex AI, such as Compute Engine resources consumed by pipeline components (charged at the same rate as for Vertex AI training). Learn more.
- You are responsible for the cost of any services (such as Cloud Storage, Dataflow, BigQuery, Cloud Key Management Service, etc.) called by the RLHF pipeline on Vertex AI. More here.
- We did enable the BigQuery API and the BigQuery User role for the project’s service account, as suggested by DeepLearning.AI’s course. However, BigQuery services were not used for this work.
- We recommend using the free trial to use the free credits made available to new GCP users (unless you are willing to pay).
- Monitor the costs accrued under your Cloud Billing account at all times, and set budgets and budget alerts.
- Learn about the pricing of Vertex AI, Vertex AI Pipelines, Cloud Storage, and use the pricing calculator to generate a cost estimate based on your projected usage.
- To prevent the accrual of charges, ensure that you delete all of the resources, such as the Vertex AI Pipeline job, Vertex AI endpoint and model, Cloud Storage bucket, and the RLHF pipeline’s YAML file. Alternatively, you can delete your GCP project and / or disable billing. For cleaning up resources programmatically, see the Colab notebooks [1 and 2] recommended by the GCP team.
Next Steps
Here are some things to try to improve the performance of the RLHF-tuned base model.
- Epochs: Run the Vertex AI Pipeline job for the recommended optimal number of epochs for reward model and RL training. Adjust the number of epochs based on the plots for the rank loss, KL loss, and reward. Stop further training once the metrics converge and performance begins to plateau or stabilize.
- RLHF Pipeline Parameter Values: Use the optimal parameter values (see section “Thoughts”) for reward_model_train_steps and reinforcement_learning_train_steps. However, since the prompt dataset differs from that of DeepLearning.AI’s course on RLHF, we recommend experimenting with different sets of values for the configurable parameters in parameter_values that are required to run the Vertex AI Pipeline job for RLHF tuning LLMs. This will help you find the optimal parameter values for RLHF tuning the base model.
- Instructions: Try other prompts / instructions for the base model, i.e., values for the parameter instruction in parameter_values. You can use different instructions, but ensure the same instruction is included in the prompt when collecting your preference dataset, so responses and human preferences align with it.
- Models: Try other models for the base LLM. Refer to GCP’s documentation on how to proceed.
- Data: Experiment with other datasets. You can also try subsets of the datasets used in this tutorial, or other smaller datasets, and run the Vertex AI Pipeline job for the optimal number of epochs for reward model and RL training. Larger datasets, like the ones used in this tutorial, make the Vertex AI Pipeline job take longer to RLHF-tune LLMs, whereas smaller datasets would be more time- and cost-efficient. You could also consider experimenting with the recommended size for the preference dataset.
- Resources: See the guides in the subsection “Resources” for a more comprehensive understanding about RLHF tuning LLMs using Vertex AI.
Here is the link to the GitHub repo for this work.
Thank you for reading through! I genuinely hope you found the content useful. Feel free to reach out to us at [email protected] and share your feedback and thoughts to help us make it better for you next time.
Acronyms used in the blog that have not been defined earlier: (a) Machine Learning (ML), (b) Artificial Intelligence (AI), (c) Billion (B), (d) JavaScript Object Notation (JSON), (e) Identity (ID), (f) Identity and Access Management (IAM), (g) United States (US), (h) European Union (EU), (i) Not a Number (NaN), (j) eval (evaluation), (k) American Standard Code for Information Interchange (ASCII), (l) YAML Ain’t Markup Language (YAML), (m) Graphics Processing Unit (GPU), and (n) Tensor Processing Unit (TPU).