AI Ecosystem Updates | Issue #10 [December 18, 2024]
What?
PaliGemma 2, Google’s recently released series of tunable vision-language models, marks a significant upgrade over its predecessor, PaliGemma, the first vision-language model in the Gemma family. The new iteration builds on the powerful Gemma 2 models by adding vision capabilities, providing more efficient and flexible AI for tasks that require both vision and language understanding, and making fine-tuning for strong performance easier. The models can see, understand, and interact with visual input. With an enhanced architecture and larger parameter counts, PaliGemma 2 offers powerful tools for developers and researchers to create cutting-edge AI applications. See the blog posts from Google and Hugging Face to learn more.
Key Features and Advancements
PaliGemma 2 builds on the success of PaliGemma by incorporating several notable features:
- Improved Scalability: The new version offers models in three parameter sizes – 3 billion (3B), 10 billion (10B), and 28 billion (28B) – and three image resolution options (224px, 448px, 896px). The 3B, 10B, and 28B variants are built on the Gemma 2 2B, 9B, and 27B language models, respectively, with the variant names accounting for the additional parameters of the (compact) image encoder. These options cater to a wide range of use cases, from general applications to more specific, resource-intensive tasks; in contrast, the original PaliGemma was available only as a 3B variant. (See the checkpoint naming sketch after this list.)
- Enhanced Captioning: PaliGemma 2 is capable of generating highly detailed captions that go beyond basic object identification. The models can describe images and scenes with more context, including emotions, actions, and overall narratives, offering a deeper level of visual comprehension.
- Expanding Capabilities: PaliGemma 2 demonstrates outstanding performance in tasks such as chemical formula recognition, music score interpretation, spatial reasoning, and chest X-ray report generation. These achievements push the boundaries of what AI can understand in both visual and textual domains.
- Improved Flexibility: The new models make fine-tuning for specific downstream tasks easy for developers. Whether generating captions or performing complex image analyses, users can adapt the models to their precise needs without major modifications to existing code.
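To make the size and resolution grid concrete, here is a quick sketch of how the pre-trained checkpoints are typically named on Hugging Face. The exact naming pattern (and the "pt" suffix for pre-trained weights) is an assumption based on the release, so double-check the model repositories before relying on it.

```python
# Rough sketch of the PaliGemma 2 size/resolution grid and an assumed
# Hugging Face checkpoint naming pattern ("pt" = pre-trained weights).
sizes = ["3b", "10b", "28b"]      # built on Gemma 2 2B / 9B / 27B respectively
resolutions = [224, 448, 896]     # input image resolution in pixels

for size in sizes:
    for res in resolutions:
        # e.g. google/paligemma2-3b-pt-224
        print(f"google/paligemma2-{size}-pt-{res}")
```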
Architecture and Training Data
Architecture
PaliGemma 2 builds on the latest advances in AI architecture. It pairs the powerful SigLIP image encoder with a Gemma 2 Large Language Model (LLM) as the text decoder, resulting in more effective processing of both visual and textual inputs. This hybrid architecture ensures that the models excel at both image recognition and Natural Language Processing (NLP) tasks. With multiple model sizes and input resolutions, PaliGemma 2 offers flexibility in fine-tuning and optimizing its pre-trained variants (3B, 10B, and 28B) for a wide range of downstream tasks and use cases.
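To see the hybrid design in code, the sketch below loads a checkpoint with Hugging Face Transformers and inspects its two main components, the SigLIP vision tower and the Gemma 2 language model. The attribute names (vision_tower, language_model) reflect the Transformers implementation at the time of writing, and the checkpoint name is an assumption; treat this as a rough sketch rather than an API reference.

```python
# Rough sketch: inspect the SigLIP encoder / Gemma 2 decoder split inside
# a PaliGemma 2 checkpoint (checkpoint and attribute names assumed).
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained("google/paligemma2-3b-pt-224")

def billions(module):
    # Total parameter count of a submodule, in billions.
    return sum(p.numel() for p in module.parameters()) / 1e9

print(type(model.vision_tower).__name__)    # SigLIP-based image encoder
print(type(model.language_model).__name__)  # Gemma 2 text decoder
print(f"vision tower:   {billions(model.vision_tower):.2f}B parameters")
print(f"language model: {billions(model.language_model):.2f}B parameters")
```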
Training Data
The models have been trained on diverse datasets, which enhances their ability to generalize across a wide range of visual-language understanding tasks, and enables fine-tuning on related tasks with fewer examples. This makes PaliGemma 2 suitable for applications in various industries, from healthcare to entertainment. More on the datasets below.
- WebLI: Web-scale multilingual image-text dataset built from the public web.
- CC3M-35L: Curated pairs of images and English alt (alternative) text collected from web pages, see more.
- Visual Question Generation with Question Answering Validation (VQ2A): Dataset of image question-answer pairs for question answering.
- OpenImages: Object detection and object-aware question answering on the OpenImages dataset created using handcrafted rules, see more.
- WIT: Dataset based on texts and images collected from Wikipedia.
The pre-trained models have been fine-tuned on various visual-language tasks, and the resulting benchmarks are included in the model card and the technical report.
Note: The PaliGemma 2 team used the Google Cloud Translation API to translate the CC3M-35L and VQ2A datasets into 34 additional languages.
Adoption and Fine-Tuning
For users of the original PaliGemma, upgrading to PaliGemma 2 is simple. The new version is designed to be a drop-in replacement, offering seamless integration without the need for extensive code changes. Additionally, the model’s architecture allows for straightforward fine-tuning, ensuring that users can quickly adapt it to specific datasets or downstream tasks.
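In practice, the upgrade usually amounts to swapping the checkpoint identifier while leaving the surrounding code untouched. A minimal sketch, assuming Hugging Face Transformers and the checkpoint names shown:

```python
# Minimal sketch of the drop-in upgrade: only the checkpoint id changes.
# Both checkpoint names are assumed for illustration.
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

# model_id = "google/paligemma-3b-pt-224"   # original PaliGemma
model_id = "google/paligemma2-3b-pt-224"    # PaliGemma 2 replacement

processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
# Existing preprocessing, generation, and decoding code can stay as-is.
```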
Expanding the Ecosystem
Since its launch, the Gemma family has flourished, with the Gemmaverse becoming a hub for developers building on Gemma models. The community has embraced PaliGemma’s vision-language capabilities, using it for innovative tasks such as visual document retrieval (example), real-time object tracking (example), and fine-tuning techniques (example). PaliGemma 2 gives developers greater flexibility through the new variants and improved pre-trained quality, enabling the creation of applications that address diverse AI use cases. Google encourages feedback from the community on PaliGemma 2 to help refine and enhance the models.
Note: To demonstrate the applicability and effectiveness of PaliGemma 2, Google has released two models fine-tuned on the DOCCI dataset: 3B and 10B variants at 448×448 resolution. These models excel at detailed captioning tasks, capturing text rendering, spatial relationships, and world knowledge. See the technical report and the Hugging Face blog for more.
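As a rough illustration of how one of these fine-tuned checkpoints might be used, the sketch below runs detailed captioning with the assumed 10B, 448×448 DOCCI variant. The checkpoint name and image URL are assumptions, and the larger generation budget simply gives the model room to produce a long, descriptive caption.

```python
# Sketch: detailed captioning with an assumed DOCCI fine-tuned checkpoint.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma2-10b-ft-docci-448"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Any RGB image works; this URL is a placeholder.
url = "https://example.com/street_scene.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(text="<image>caption en", images=image,
                   return_tensors="pt").to(torch.bfloat16).to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200)

# Decode only the newly generated tokens (skip the prompt).
caption = processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                           skip_special_tokens=True)
print(caption)
```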
Getting Started
To begin using PaliGemma 2, developers can access the models and code through platforms like Hugging Face and Kaggle. Comprehensive documentation and example notebooks are available to guide users in integrating PaliGemma 2 into their projects. The PaliGemma 2 series of models is compatible with popular AI tools and frameworks, such as Hugging Face Transformers, PyTorch, Keras, JAX, and Gemma.cpp, providing flexibility for developers working within their preferred environments.
The release includes open model repositories, Transformers integration, a technical report, as well as a fine-tuning script (notebook) and a demo created by the Hugging Face team for visual question answering on the VQAv2 dataset. For the demo, the Hugging Face team fine-tuned PaliGemma 2 3B at 448×448 resolution on a small portion of the VQAv2 dataset using Low-Rank Adaptation (LoRA); the fine-tuned model is available here. These resources provide an excellent starting point for exploring PaliGemma 2.
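For readers who want a feel for what such a LoRA fine-tuning run involves, below is a heavily condensed sketch assuming the peft and datasets libraries and a PaliGemma 2 3B 448px checkpoint. The Hugging Face notebook linked above is the authoritative reference; the dataset id, target modules, hyperparameters, and prompt format here are illustrative assumptions.

```python
# Condensed sketch of LoRA fine-tuning PaliGemma 2 on VQAv2-style data.
# Checkpoint name, dataset id, target modules, and hyperparameters are
# assumptions; see the Hugging Face fine-tuning notebook for the actual recipe.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoProcessor, PaliGemmaForConditionalGeneration,
                          Trainer, TrainingArguments)

model_id = "google/paligemma2-3b-pt-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# Attach low-rank adapters to the attention projections only (assumed choice).
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights train

# Small slice of a VQAv2-style dataset; each row has an image, a question,
# and a ground-truth answer.
dataset = load_dataset("HuggingFaceM4/VQAv2", split="train[:1%]")

def collate_fn(examples):
    # Prompt: "<image>" plus the question; the answer becomes the label suffix.
    texts = ["<image>answer en " + ex["question"] for ex in examples]
    labels = [ex["multiple_choice_answer"] for ex in examples]
    images = [ex["image"].convert("RGB") for ex in examples]
    batch = processor(text=texts, images=images, suffix=labels,
                      return_tensors="pt", padding="longest")
    # Match the model's bfloat16 weights for the image tensor.
    batch["pixel_values"] = batch["pixel_values"].to(torch.bfloat16)
    return batch

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    data_collator=collate_fn,
    args=TrainingArguments(
        output_dir="paligemma2-vqav2-lora",
        per_device_train_batch_size=2,
        num_train_epochs=1,
        learning_rate=2e-5,
        bf16=True,
        remove_unused_columns=False,  # keep raw image/question columns for the collator
    ),
)
trainer.train()
```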
Note: PaliGemma 2 is available under the Gemma license, permitting redistribution, commercial use, fine-tuning, and the creation of derivative models.
Thoughts
The competition for vision-language models is intensifying, and PaliGemma 2 represents a significant step forward in their development. Enhanced scalability, advanced image analysis and captioning capabilities, easier fine-tuning across diverse model sizes and input resolutions, and seamless integration with popular AI tools and frameworks enable PaliGemma 2 to expand the scope of AI applications in research and industry.
With its rich ecosystem and support for a wide range of tasks, PaliGemma 2 is poised to be a game-changer for those looking to build advanced visual AI systems. We are excited to see the innovative applications the Gemma developer community will create with PaliGemma 2. Lastly, feedback from the community will go a long way in improving PaliGemma 2 and in shaping its future iterations.
If you are interested in learning more about working with the PaliGemma 2 series of models and their performance, feel free to check out this blog post and this technical report by Google, and this blog post by Hugging Face.
Thank you for reading through! I genuinely hope you found the content useful. Feel free to reach out to us at [email protected] and share your feedback and thoughts to help us make it better for you next time.
Acronyms used in the blog that have not been defined earlier: (a) Artificial Intelligence (AI), (b) pixel (px), and (c) Sigmoid loss for Language-Image Pre-training (SigLIP).