Bengali-to-English Translation using Neural Machine Translation with Attention

ML and AI Blogs | Issue# 5 [September 20, 2024]

What?

This tutorial demonstrates how to create and use a sequence-to-sequence (seq2seq) model for Neural Machine Translation (NMT), specifically, Bengali-to-English text translation. The work is based on TensorFlow’s tutorial on “Neural machine translation with attention”, which shows how to use seq2seq models for Spanish-to-English text translation.

We reuse content (text, images, and code) from the original TensorFlow tutorial in most parts. The authors of that tutorial encourage exploring translation between different language pairs as a next step, which is what we do here. We acknowledge that this tutorial is adapted from TensorFlow’s Spanish-to-English translation tutorial and extend our gratitude to the TensorFlow team and the original authors. We reuse and modify the code to achieve Bengali-to-English translation, and the text and comments from the original tutorial are retained as is to maintain consistency. The modifications made to adapt the original tutorial for Bengali-to-English translation are clarified below.

Note: Due to the advances in Natural Language Processing (NLP), Large Language Models (LLMs), and machine translation, the low-level NMT modeling approach used in the original TensorFlow tutorial is rarely used for translation tasks today, a point the original authors make as well. However, we would like to reiterate the effectiveness of the approach as a means to learn and understand how translation actually works under the hood of today’s technology. The original tutorial covers an implementation of NMT with seq2seq models using Bidirectional Recurrent Neural Networks (RNNs), specifically Gated Recurrent Units (GRUs), and multi-head attention. It goes into great depth while implementing NMT with attention and discusses ways to save and reload the final model for future use.

The code

Here is the Google Colab notebook for this tutorial. The tutorial explains tricky concepts and is not on the shorter side; therefore, we have not embedded the content (text, images, and code) in this blog, to avoid overwhelming the readers. We recommend viewing and using the code in the Colab notebook while going through this tutorial, so that it is easier to run the code and see the outputs as well.

Note: The Colab notebook for this tutorial contains all of the sections from this blog, except this one. The “Adapted sections” section has been modified in this blog to align with its flow of content.

Modifications

Prerequisites

We strongly recommend spending time to review TensorFlow’s tutorial on NMT with attention as a prerequisite to this Bengali-to-English text translation tutorial. Here is the Google Colab notebook and source code from GitHub for TensorFlow’s Spanish-to-English text translation tutorial. Additionally, here is the Google Colab notebook for this tutorial for a quick comparison. The modifications introduced in this Bengali-to-English text translation tutorial (see subsection “Changes”), in comparison to the original Spanish-to-English text translation tutorial, can be found in the Colab notebook for this tutorial.

Note: When referring to the hierarchy of the sections of content in this tutorial, we will use “section” and “subsection” for the top two levels and continue to use “subsection” for all deeper levels.

Changes

The following modifications have been made in this Bengali-to-English text translation tutorial (see the Google Colab notebook), in comparison to the original Spanish-to-English text translation tutorial.

  • The introductory part under the title “Bengali-to-English translation using Neural Machine Translation with attention” and the sections “Modifications”, “Summary of the text translation process”, “Adapted sections”, and “Thoughts” were not present in the original tutorial and have been added in this one. Additionally, the subsection “Download and prepare the dataset” from the original TensorFlow tutorial has been renamed to “Fetch and prepare the dataset” to align with the flow of this tutorial.
  • The latest version of the tensorflow-text library has been installed directly, instead of the condition-based installation via pip install "tensorflow-text>=2.11".
  • The images from the introduction and the section “The encoder/decoder” in the original TensorFlow tutorial that depict the architecture of the model (encoder / decoder with attention) have been removed, since they show Spanish inputs to the model while this article focuses on Bengali inputs. The readers are encouraged to review those images in the original tutorial, as the architecture remains the same in this tutorial.
  • The attention plot for the Spanish-to-English translation example in the introduction of the original tutorial has been removed.
  • The imports typing, Any, and Tuple were unused in the original TensorFlow tutorial and have been removed in this tutorial.
  • The introduction has been modified to ensure it refers to Bengali-to-English translation, as compared to references to Spanish-to-English translation in the original.
  • The English-Spanish dataset from Anki’s language datasets has been replaced by the English-Bengali dataset from the same source.
  • The original tutorial uses Google Cloud to store the dataset, which is downloaded and then processed and loaded using the function load_data. However, in this tutorial the dataset is stored on Google Drive, and minor modifications have been made to the load_data function in the way the data is processed and loaded.
  • Unlike the original tutorial, the lengths of the raw context and target variables (context_raw and target_raw, respectively) output by the function load_data and the first five sequences in each of those variables have been printed for review in this tutorial.
  • The dataset in the original TensorFlow tutorial is a text file which has the English text and the Spanish translation. However, the datasets available from Anki are tab-separated text files with content in the format: Language 1, Translation in Language 2, and Attribution. Therefore, this tutorial uses the clean_translation_file function to remove the attribution part from the downloaded English-Bengali dataset and stores the data in a new file for processing.
  • The original tutorial uses a boolean condition to split the data into training (80%) and validation (20%) sets in a random manner. This tutorial, however, makes use of the train_test_split function from scikit-learn to split the data into training (80%) and validation (20%) sets (a combined sketch of the cleaning, loading, and splitting steps appears after this list).
  • The original tutorial leveraged the fact that the variables (tensors) example_context_strings and example_target_strings, which are local to the containing for loop, were only used once (the for loop runs only one iteration) and could therefore be reused globally. In this tutorial, the global variables example_context and example_target have been introduced to replace them and serve the same purpose. The for loop has been modified to iterate through example_context and example_target, and numpy().decode() has been used to decode the byte strings in them into a human-readable format.
  • The decode function has been used at various places to convert the byte string (output of TensorFlow operations) into a regular Unicode string for proper human-readable text display. This is particularly useful for printing the raw Bengali characters, which are not similar to English characters.
  • The tf_lower_and_split_punct function used to process the text and implement text standardization in the original tutorial has been replaced with tf_lower_and_split_punct_bengali and tf_lower_and_split_punct_english to standardize the Bengali and English text sequences, respectively. The two standardization approaches have been separated because, unlike Spanish, Bengali uses a script that is entirely different from the Latin alphabet (a rough sketch of the Bengali standardization function appears after this list).
  • After text vectorization, the colormap and normalization of the token IDs / values of the example text data used in this tutorial have been customized for consistent color scaling and a colorbar has been used for informative visualization. In contrast, the original tutorial used the default colormap and normalization.
  • In addition to the data processing steps used in the original tutorial, the following functions have been used when loading data to ensure that I/O does not become blocking (see the input-pipeline sketch after this list). Moreover, AUTOTUNE has been declared as a separate variable for ease of use.
    1. cache() keeps data in memory after it’s loaded off disk. This will ensure the dataset does not become a bottleneck while training the model.
    2. prefetch() overlaps data preprocessing and model execution while training.
  • The validation loss and validation accuracy have been printed separately in this tutorial before and after fitting the model. The original tutorial does not print the aforementioned metrics.
  • The original tutorial had links to the subsections “Inference” and “The decoder” within the subsection “Translate”. The links have been removed in this tutorial (i.e., from the Google Colab notebook for this tutorial).
  • Noto Sans Bengali has been added to Matplotlib in order to generate the attention plots with Bengali characters. The font can be downloaded from the Google Fonts website. While the downloaded package contains multiple Noto Sans Bengali fonts, we particularly use the file NotoSansBengali-VariableFont_wdth,wght.ttf in this tutorial. The downloaded font is stored in Google Drive and fetched from there (see the font-registration sketch after this list).
  • The variable fontdict representing the Matplotlib dictionary used to specify the properties of text (font-related values) in plots has been updated in the plot_attention function to align with the Noto Sans Bengali font.
  • The Spanish text examples used to plot attention using the function plot_attention or for translation using the functions translate and translate_dynamic are replaced by Bengali text inputs.
  • The original authors used two versions of the translate function to facilitate the Spanish-to-English translation: one uses a Python for loop, while the other uses a more computationally efficient TensorFlow loop implemented for exporting the model using tf.function. We have renamed the optimized function translate_dynamic when using it for Bengali-to-English translation.
  • In the original tutorial, the optimized translate function uses ShapeChecker() to check the dimensions of the tensors next_token and tokens. The authors denote one of the dimensions of these tensors as t1. We have renamed this dimension to 1 in the translate_dynamic function in this tutorial, because throughout the process defined in the original tutorial, each next_token generated has the shape (batch_size, 1), i.e., the token dimension is always 1. Therefore, the accumulated tokens have a shape of (t, batch, 1).
  • The “Next steps” section has been completely modified to reflect the context and content of this tutorial.
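To make the data-preparation changes above concrete, here is a minimal sketch of how the attribution column can be stripped from the Anki file, how the cleaned file can be loaded into context and target arrays, and how the 80/20 split can be done with scikit-learn. The file names are hypothetical, and the bodies are simplified illustrations of the clean_translation_file and load_data functions described above, not the exact notebook code.

    import pathlib
    import numpy as np
    from sklearn.model_selection import train_test_split

    def clean_translation_file(input_path, output_path):
        # Anki files are tab-separated: English <tab> Bengali <tab> attribution.
        # Keep only the first two columns and drop the attribution.
        with open(input_path, encoding='utf-8') as src, \
             open(output_path, 'w', encoding='utf-8') as dst:
            for line in src:
                parts = line.rstrip('\n').split('\t')
                if len(parts) >= 2:
                    dst.write(parts[0] + '\t' + parts[1] + '\n')

    def load_data(path):
        # English sentences become the target, Bengali sentences the context.
        lines = pathlib.Path(path).read_text(encoding='utf-8').splitlines()
        pairs = [line.split('\t') for line in lines]
        target_raw = np.array([english for english, bengali in pairs])
        context_raw = np.array([bengali for english, bengali in pairs])
        return target_raw, context_raw

    clean_translation_file('ben.txt', 'ben_clean.txt')  # hypothetical file names
    target_raw, context_raw = load_data('ben_clean.txt')

    # Random 80% / 20% train / validation split using scikit-learn.
    context_train, context_val, target_train, target_val = train_test_split(
        context_raw, target_raw, test_size=0.2, random_state=42)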
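The Bengali standardization function might look roughly like the sketch below. The NFKC normalization form, the \p{Bengali} Unicode script class, and the punctuation set (including the Bengali danda “।”) are illustrative assumptions; the notebook’s actual regular expressions may differ, and the English counterpart can stay close to the original tutorial’s tf_lower_and_split_punct.

    import tensorflow as tf
    import tensorflow_text as tf_text

    def tf_lower_and_split_punct_bengali(text):
        # Normalize Unicode (Bengali uses combining vowel signs and conjuncts).
        text = tf_text.normalize_utf8(text, 'NFKC')
        # Bengali has no letter case, so lowering is a no-op kept for symmetry.
        text = tf.strings.lower(text)
        # Keep spaces, Bengali script characters, and select punctuation.
        text = tf.strings.regex_replace(text, r'[^ \p{Bengali}.?!,।]', '')
        # Add spaces around punctuation so each mark becomes its own token.
        text = tf.strings.regex_replace(text, r'[.?!,।]', r' \0 ')
        # Strip whitespace and add start/end markers.
        text = tf.strings.strip(text)
        text = tf.strings.join(['[START]', text, '[END]'], separator=' ')
        return text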
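The non-blocking input pipeline boils down to something like the following self-contained sketch (the toy sentence pair is only for illustration). It also shows the numpy().decode() pattern mentioned above for turning byte strings back into readable text.

    import tensorflow as tf

    AUTOTUNE = tf.data.AUTOTUNE

    # Illustrative stand-in for the (context, target) string pairs built earlier.
    pairs = (['আমি ভাত খাই।'], ['i eat rice .'])
    train_raw = tf.data.Dataset.from_tensor_slices(pairs).batch(1)

    train_ds = (train_raw
                .cache()              # keep data in memory after it is loaded off disk
                .prefetch(AUTOTUNE))  # overlap data preprocessing and model execution

    for context, target in train_ds.take(1):
        # numpy().decode() converts the byte strings into readable Unicode text.
        print(context.numpy()[0].decode('utf-8'), '->', target.numpy()[0].decode('utf-8'))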
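Registering Noto Sans Bengali with Matplotlib can be done roughly as shown below. The Google Drive path and the fontdict values are illustrative; the notebook may organize them differently.

    from matplotlib import font_manager
    import matplotlib.pyplot as plt

    # Illustrative path to the variable font stored on Google Drive.
    font_path = '/content/drive/MyDrive/fonts/NotoSansBengali-VariableFont_wdth,wght.ttf'

    # Register the font so Matplotlib can find it by family name.
    font_manager.fontManager.addfont(font_path)
    bengali_font = font_manager.FontProperties(fname=font_path)
    plt.rcParams['font.family'] = bengali_font.get_name()

    # A fontdict like this can then be passed to the tick labels in plot_attention.
    fontdict = {'fontsize': 14, 'fontfamily': bengali_font.get_name()}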

Summary of the text translation process

Here is a summary of how text translation has been implemented in this tutorial (and the original one). The Bengali sequences are referred to as the context in this tutorial, while the English translations are referred to as the target. Once the dataset is preprocessed, it is fed to the model, which consists of the following components.

  1. Text Vectorization: In text vectorization, the input text is first converted into tokens using a tokenizer. A vocabulary of tokens is created and each of the tokens is mapped to a unique index based on the vocabulary. This helps transform text sequences into sequences of token indices.
  2. Encoder: The encoder converts the token sequences into dense vector representations through embeddings. A Bidirectional GRU processes the vectors in both forward and backward directions, allowing the encoder to capture context from both past and future tokens in the sequence and distill the essential features of the input. The output of the GRU is a sequence of vectors that encapsulates both the immediate and global context of the input text. These vectors help the model understand the meaning and structure of the input and serve as the “context” for the translation process (see the sketch after this list).
  3. Decoder Cross-Attention: In the decoding phase, cross-attention is used to align the output sequence with the context from the encoder. This mechanism computes attention scores between the decoder’s current state and the encoder’s output (context), and the encoder’s output is weighted so that the model can focus on different parts of the context when generating each word in the translation. It ensures that the decoder pays attention to the most relevant parts of the input sequence at each step and generates translations that are contextually accurate. Internally, the attention mechanism computes Query (Q), Key (K), and Value (V) representations to determine the importance of each part of the input, helping the model align words between the source and target languages effectively.
  4. Decoder: The decoder takes the context from the encoder, and a GRU in the decoder generates the output sequence one token at a time. At each time step, it uses the previously generated token and its hidden state, together with the context from the encoder via cross-attention, to focus on the relevant parts of the context and produce the next token in the translation. The GRU captures dependencies within the target language sequence, helping maintain grammatical structure and coherence in the output.
  5. Translation: Finally, the translation process combines all of the aforementioned components. The input text is encoded into a meaningful context, attention mechanisms guide the decoder to focus on specific parts of the context, and the decoder generates the translated text, token by token. The output tokens are converted back into human-readable text using the reverse of the initial tokenization process. This step-by-step translation process allows for the generation of coherent and contextually appropriate translations.
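To make steps 2 and 3 above concrete, here is a compact sketch of the encoder and the cross-attention layer, loosely following the architecture used in the original tutorial. The layer sizes and class layout are illustrative, not the exact notebook code.

    import tensorflow as tf

    UNITS = 256  # illustrative hidden size

    class Encoder(tf.keras.layers.Layer):
        def __init__(self, vocab_size, units=UNITS):
            super().__init__()
            # Token IDs -> dense vectors; mask_zero lets padding be ignored downstream.
            self.embedding = tf.keras.layers.Embedding(vocab_size, units, mask_zero=True)
            # A Bidirectional GRU reads the sequence forwards and backwards.
            self.rnn = tf.keras.layers.Bidirectional(
                merge_mode='sum',
                layer=tf.keras.layers.GRU(units, return_sequences=True))

        def call(self, token_ids):
            return self.rnn(self.embedding(token_ids))  # (batch, seq_len, units)

    class CrossAttention(tf.keras.layers.Layer):
        def __init__(self, units=UNITS):
            super().__init__()
            self.mha = tf.keras.layers.MultiHeadAttention(num_heads=1, key_dim=units)
            self.add = tf.keras.layers.Add()
            self.norm = tf.keras.layers.LayerNormalization()

        def call(self, target_seq, context):
            # The decoder state queries the encoder output (the context).
            attn_output = self.mha(query=target_seq, value=context)
            return self.norm(self.add([target_seq, attn_output]))

    # Toy usage: encode a batch of four token IDs.
    encoder = Encoder(vocab_size=5000)
    print(encoder(tf.constant([[3, 17, 42, 0]])).shape)  # (1, 4, 256)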

Adapted sections

The sections between the sections “Adapted sections” and “Thoughts” in the Google Colab notebook for this tutorial reflect content (text, images, and code) adapted from the original TensorFlow tutorial. The code and text from the original tutorial have been used as is, except for the adaptations and modifications mentioned under the subsection “Changes” (in this blog or in the Google Colab notebook) that have been made to fit this tutorial’s requirements. As mentioned earlier, the content in the “Thoughts” and “Next steps” sections of this tutorial (see the Google Colab notebook or refer to those sections in this blog) has been created in a way that is completely specific to this tutorial. Again, as highlighted earlier, the Colab notebook for this tutorial houses all of the sections from this blog as well.

Note: While the original tutorial probably used a CPU for executing the code, the code in this tutorial has been run on a TPU. This is because the code uses a lot of low-level APIs and the shapes of the tensors vary quite a bit, so it is possible to run into issues concerning the incompatibility of tensor shapes. TPUs are inherently designed to work with tensors, and therefore, using a TPU ensures a much lower risk of running into issues related to mismatched tensor shapes. Additionally, TPUs help speed up the data processing, model training, and inference.

Thoughts

The model has an accuracy of about 65%, which may vary slightly across separate occasions of training. However, it is important to note that this is a quick-and-dirty implementation of a very simple text translation model, and there is a limited amount of training data at our disposal. Given those factors, the results are not too bad and can be considered about average, though there is definitely a decent scope for improvement.

An important point to note is that the model loses attention when working with long sequences and falters with the translations. This is attributed to the fact that the model only sees its previous predictions through the RNN state, so if the RNN loses track of the context, the translation cannot recover.

It can be observed that the model exhibits a high training accuracy and a lower validation accuracy. This is caused by overfitting, where the model essentially memorizes the training data and performs poorly on new, unseen data. This can in turn be attributed to the limited data available for training and inference. Overfitting can be controlled by leveraging more data (or by using data augmentation techniques) or by using a regularization technique, such as dropout or L2 regularization.
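As one hedged illustration (not part of the current notebook), dropout could be added directly to the recurrent layers, for example:

    import tensorflow as tf

    # Illustrative only: a Bidirectional GRU with dropout on the inputs and the
    # recurrent connections, one common way to curb overfitting in such models.
    rnn = tf.keras.layers.Bidirectional(
        merge_mode='sum',
        layer=tf.keras.layers.GRU(256,
                                  return_sequences=True,
                                  dropout=0.2,             # drop a fraction of the input units
                                  recurrent_dropout=0.2))  # drop recurrent connections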

Next steps

As indicated in the “Thoughts” section above, there is a definite scope for improvement in this work. There are multiple ways to further develop this quick-and-dirty implementation and improve the model’s performance and accuracy for Bengali-to-English text translation. Some of those are listed below.

  1. Modifying the Architecture: Adding more layers, increasing the number of units in each layer, or using more attention heads and mechanisms can help the model learn more complex patterns. Using transformers can improve the performance by allowing the decoder to look at its previous outputs.
  2. Vocabulary Size: A larger vocabulary can help capture more nuances and meanings.
  3. Hyperparameter Tuning: Experimenting with different learning rates, batch sizes, epochs, etc. can help.
  4. Regularization: Implementing dropout or L2 regularization can be useful in controlling overfitting.
  5. Enriching the Dataset: Getting more data for training or using data augmentation techniques can help.
  6. Pre-trained Embeddings: Using pre-trained word embeddings like FastText for Bengali and English can help improve the performance of the model.
  7. Fine-tuning Pre-trained LLMs: Using pre-trained LLMs trained on translation tasks with similar language pairs can be useful. Examples include BERT, GPT-4, or similar models from the Hugging Face Transformers library. Such models can significantly improve performance compared to word-based models.
  8. Hyperparameter Optimization: Using libraries such as Keras Tuner or Optuna to find the best hyperparameters for the model can help.

Here is the link to the GitHub repo for this work.

Thank you for reading through! I genuinely hope you found the content useful. Feel free to reach out to us at [email protected] and share your feedback and thoughts to help us make it better for you next time.


Acronyms used in the blog and the associated Google Colab notebook that have not been defined earlier: (a) Artificial Intelligence (AI), (b) Identity (ID), (c) Input/Output (I/O), (d) Central Processing Unit (CPU), (e) Application Programming Interface (API), (f) Tensor Processing Unit (TPU), and (g) American Standard Code for Information Interchange (ASCII).