Fine-tuning GPT-4o for English-to-Bengali Translation

ML and AI Blogs | Issue# 1 [July 19, 2024]

The world of technology is abuzz with advancements in Generative AI (GenAI) and Large Language Models (LLMs), such as ChatGPT, Gemini, Claude, and Llama. While they assist us with our daily tasks (planning a holiday, summarizing documents, writing code, generating text and images, etc.), it is also important for us to leverage the powerful capabilities of the LLMs for custom, domain-specific tasks. Such tasks can be easily accomplished by guiding the LLMs well and by using domain-specific data. In that light, we will explore the use case of translation.

Motivation

The work presented here is a basic implementation of an English-to-Bengali translation service or chatbot using GPT-4o. We will use the Bn-EN PMIndia Parallel Corpus data for fine-tuning our LLM – GPT-4o.

Of course, the GPT models are already great at translation tasks. But if we have a domain-specific task at hand, fine-tuning the LLM will almost certainly yield better results. For example, many languages have very little translation data (say, to / from English) available for pre-training, and the GPT models are therefore limited in their knowledge of such languages. If the goal is to translate between such a language and English or another language, it is worth creating a smaller dataset of fine-tuning examples for the LLM to learn from.

Note: We will not be using the OpenAI Fine-tuning API for fine-tuning the LLM. Instead, we will provide the examples in the prompt and ask the LLM to learn from them. Also, at the time of writing, as per the OpenAI documentation, fine-tuning for GPT-4 and GPT-4o through the Fine-tuning API is in an experimental access program, and eligible users can request access in the fine-tuning UI when creating a new fine-tuning job. GPT-3.5 Turbo, however, can be fine-tuned using the Fine-tuning API.

The idea behind this work is to help you get warmed up with the OpenAI ChatGPT APIs and apply them to valid use cases of your own.

Getting Started

Let’s go ahead and install the relevant Python packages.

!pip install openai
!pip install python-dotenv
!pip install ipywidgets

Next, we’ll load the relevant Python libraries.

import openai
import os
import csv
import random
import ipywidgets as widgets
from dotenv import load_dotenv, find_dotenv
from google.colab import files, drive
from IPython.display import display, clear_output

We need an OpenAI API key to use the APIs offered by OpenAI. We will load the key from a .env file, which is a secure way of using the API key without exposing it in code. We need to ensure that the OpenAI API key is added to the .env file so our application can read and use it:

OPENAI_API_KEY=your-OpenAI-api-key

One can upload the .env file directly into the ‘/content’ folder in Colab or use the following code to upload and load the file.

files.upload()
load_dotenv('/content/.env')

# Find and load the .env file.
_ = load_dotenv(find_dotenv())

# Fetch the OpenAI API key from the .env file.
openai.api_key = os.getenv('OPENAI_API_KEY')
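
As an optional sanity check (not part of the original flow), we can confirm that the key was picked up, without printing the key itself.

# Optional: verify that the API key was loaded from the .env file (without printing it).
assert openai.api_key, "OPENAI_API_KEY was not found. Check your .env file."
print(f"API key loaded ({len(openai.api_key)} characters).")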

Loading the Dataset

We will import the dataset from Google Drive. We will use the Bn-EN PMIndia Parallel Corpus data for fine-tuning GPT-4o. We could have picked an earlier version of GPT that is still available, but we will go with GPT-4o since it is the latest model released by OpenAI at the time of writing this blog.

# Import the dataset from Google Drive.
drive.mount('/content/drive')

# Path to the dataset in Google Drive.
input_file = '/content/drive/MyDrive/Datasets/pmindia.v1.bn-en.tsv'

Helper Functions

To ensure there aren’t any duplicate examples (English-Bengali translation pairs) in the fine-tuning data, we will use the remove_duplicates function to remove the duplicates from the dataset and copy the set of unique examples into a new file in Google Drive for further use.

Note: The English-Bengali PMIndia corpus does not have duplicate examples, but many translation corpora / datasets do, and it is always a good idea to run a check for duplicate data. Since this particular dataset has no duplicates, this check is optional for our application.

Note that the dataset is a TSV file and we will use the tab character as the delimiter to read the file.

# Remove duplicate examples from the dataset, if any.
def remove_duplicates(input_file, output_file):
    # Read the dataset - a TSV file.
    with open(input_file, 'r', newline='', encoding='utf-8') as infile:
        reader = csv.reader(infile, delimiter='\t')
        unique_lines = set(tuple(line) for line in reader)

    # Write the unique lines to a new TSV file in Google Drive.
    with open(output_file, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        for line in unique_lines:
            writer.writerow(line)

# Path to the new TSV file with zero duplicates that we will create in Google Drive.
cleaned_file = '/content/drive/MyDrive/Datasets/Cleaned/pmindia.v1.bn-en_cleaned.tsv' # Change to the desired path.

remove_duplicates(input_file, cleaned_file)

We will now define the number of examples that we will use for fine-tuning the LLM. Ideally, the more the merrier; 1,000 would be a decent number. We will go ahead and use 100 examples for illustration purposes.

Additionally, we need to ensure that we have enough credits in our OpenAI account to use the OpenAI Chat Completions API. In general, beyond our fine-tuning use case (e.g., processing chat prompts with an OpenAI LLM, moderating chat prompts, generating images, etc.), it is a good idea to top up the account with $10 or $20 so that we have enough credit balance to experiment with the ChatGPT APIs. The Chat Completions API will return an error if there isn't enough credit balance to process a request.

Fine-tuning with 100 examples will not be costly. However, the costs can go up with more examples, since we use more of OpenAI's services. Specifically, the cost is based on the number of tokens processed. The OpenAI Pricing page provides more information on the costs incurred when using their models.
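
Since cost is driven by token counts, it can be handy to get a feel for how many tokens a piece of text amounts to. The snippet below is a small illustrative sketch using the tiktoken library (not used elsewhere in this blog; install it with pip install tiktoken), assuming a recent version that includes the o200k_base encoding used by gpt-4o.

# Illustrative token counting with tiktoken (assumes: pip install tiktoken).
import tiktoken

encoding = tiktoken.get_encoding("o200k_base")  # the encoding used by gpt-4o
sample_text = "The Prime Minister addressed the nation today."
print(f"Token count for the sample text: {len(encoding.encode(sample_text))}")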

Coming back to the point, we will use 100 examples to fine-tune gpt-4o. We will prompt the model to look at those examples and learn from them. Before that, we will randomly select 100 examples from our cleaned dataset and put them into a new file in Google Drive, from which the LLM will read the examples for fine-tuning. The following function achieves this.

# Randomly select 100 examples from the cleaned file and put those into a new file.
def select_random_examples(input_file, output_file, num_examples=100):
    # Read all examples from the cleaned dataset TSV file.
    with open(input_file, 'r', newline='', encoding='utf-8') as infile:
        reader = list(csv.reader(infile, delimiter='\t'))

        # Check if the file has fewer examples than the number to select.
        if len(reader) < num_examples:
            raise ValueError("The file has fewer examples than the number requested.")

        # Randomly select examples.
        selected_examples = random.sample(reader, num_examples)

    # Write the selected lines / examples to a new TSV file.
    with open(output_file, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.writer(outfile, delimiter='\t')
        writer.writerows(selected_examples)

# Path to the TSV file with 100 random examples that we will create in Google Drive.
output_file = '/content/drive/MyDrive/Datasets/Cleaned/pmindia.v1.bn-en_random.tsv'  # Change to the desired output path.

# Select 100 random examples.
select_random_examples(cleaned_file, output_file)
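
As a quick optional check (a sketch, not part of the original flow), we can confirm that the new file contains the expected number of examples.

# Optional: verify that the sampled file contains the expected number of examples.
with open(output_file, 'r', newline='', encoding='utf-8') as f:
    sampled_count = sum(1 for _ in csv.reader(f, delimiter='\t'))
print(f"Examples selected for fine-tuning: {sampled_count}")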

We will use the following function to read the examples from the output_file; the content will be embedded in the system prompt so that the LLM can use that knowledge when prompted to translate English into Bengali.

# Read the randomly selected examples.
def read_examples(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
    return content

# Read the file content.
train_examples = read_examples(output_file)
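
It can be useful to peek at what the model will actually see in the prompt. Here is a small optional check.

# Optional: preview the first few example pairs that will be embedded in the prompt.
for line in train_examples.splitlines()[:3]:
    print(line)
print(f"Total characters in the examples block: {len(train_examples)}")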

Translation using the LLM – GPT-4o

We will use the system prompt (the *system* role, just as the user input / prompt corresponds to the *user* role) to give the LLM any context it needs to be aware of. The system prompt will instruct the LLM to learn from the examples we will use for fine-tuning, translate the English text into Bengali, and handle any failure cases gracefully.

We will use a delimiter to make it easier for the LLM to understand the user’s prompt, which would be the English text to be translated into Bengali.

# Delimiter to use for our prompts.
delimiter = "####"

# Specify the role of the LLM and prompt it to
# use the fine-tuning examples to make the
# predictions.
system_message = f"""
You are a translation chatbot responsible for \
translating English text into Bengali. \
The user will provide the English text to \
be translated into Bengali, \
delimited with {delimiter} characters. \
You will perform this in two steps. \

Step 1: Study the examples of English-to-Bengali \
translation from the following file. \
\n{train_examples}\n\n \

Step 2: Use your own knowledge and the additional \
insights from <Step 1> to translate the English \
text provided by the user to Bengali. If you don't \
know the translation, return the message "Oops! \
I am not sure about the translation for this query. \
Please try a different text to translate." \
"""

Next, we will translate English into Bengali using OpenAI’s gpt-4o. max_tokens=150 limits the number of response tokens to 150, while a temperature of 0 ensures there is no randomness in the response from the LLM. messages represents the combined system and user prompts that are provided to the LLM for it to generate a response.

client = openai.OpenAI(api_key=openai.api_key)

# Translate English into Bengali using OpenAI's gpt-4o.
def translate_english_to_bengali(prompt, messages, model="gpt-4o"):
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        max_tokens=150,
        temperature=0
    )
    translation = response.choices[0].message.content
    return translation
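
Before wiring this into a UI, we can give the function a quick one-off try. This is a minimal sketch; the sample sentence is arbitrary.

# Optional quick test of the translation function with a sample sentence.
sample_prompt = "India is a land of diverse cultures and languages."
sample_messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': f"{delimiter}{sample_prompt}{delimiter}"},
]
print(translate_english_to_bengali(sample_prompt, sample_messages))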

The Translation UI

We will now implement a very basic UI, built with ipywidgets, to accept the user input in English and return a translated response in Bengali.

# Create the UI elements.
text_prompt = widgets.Label("Enter the English text to translate")
text_input = widgets.Text(placeholder='Enter English text here')
translate_button = widgets.Button(description="Translate")
output_label = widgets.Label("")

We will go ahead and implement the functionality of the button responsible for generating the translation.

We will first get the English input (the prompt for the LLM) from the user and the Translate button will cause the LLM to initiate the English-to-Bengali translation. The UI will then display the response from the LLM, i.e., the translated Bengali text.

# Define the function to handle button click.
def on_translate_button_clicked(b):
    clear_output(wait=True)
    display(text_prompt, text_input, translate_button, output_label)
    english_text = text_input.value

    prompt = f"""
    {english_text}
    """
    # Messages to send to the model.
    messages = [
        {'role': 'system',
         'content': system_message},
        {'role': 'user',
         'content': f"{delimiter}{prompt}{delimiter}"},
    ]
    if english_text:
        translation = translate_english_to_bengali(prompt, messages)
        output_label.value = f"Translated Bengali Text: {translation}"
    else:
        # Handle a blank input / prompt.
        output_label.value = "Please enter some text to translate."

# Attach the function to the button click event.
translate_button.on_click(on_translate_button_clicked)

# Display the UI.
display(text_prompt, text_input, translate_button, output_label)

Here is the link to the GitHub repo for this work.

Thank you for reading through! I genuinely hope you found the content useful. Feel free to reach out to us at [email protected] and share your feedback and thoughts to help us make it better for you next time.


Acronyms used in the blog that have not been defined earlier: (a) Machine Learning (ML), (b) Artificial Intelligence (AI), (c) Generative Pre-trained Transformer (GPT), (d) Application Programming Interface (API), and (e) User Interface (UI).