Introducing T5Gemma 2: The Next Generation of Efficient, Multimodal AI
As AI developers, we’re constantly striving for models that are both powerful and accessible. Today, we’re excited to introduce T5Gemma 2, a significant leap forward in encoder-decoder models, building upon the foundation of Gemma 3. This isn’t just an iteration; it’s a reimagining of what’s possible with compact, versatile AI.
T5Gemma 2 marks the arrival of the first multimodal and long-context encoder-decoder models in our family. It’s designed to empower you with cutting-edge capabilities, whether you’re prototyping rapidly or deploying to on-device applications.
Why T5Gemma 2 Matters: A New Approach to Efficiency
With the original T5Gemma, we showed that adapting modern decoder-only models into an encoder-decoder architecture unlocks remarkable versatility. We bypassed the immense computational cost of training from scratch by leveraging pre-trained decoder weights and continued pre-training. T5Gemma 2 takes this success and expands it into the realm of vision-language understanding, incorporating key innovations from the Gemma 3 family.
But T5Gemma 2 is more than just a re-training. We’ve implemented substantial architectural changes to maximize efficiency without sacrificing performance.
Key Architectural Innovations
To deliver powerful capabilities in a smaller footprint, we focused on these core refinements:
* Tied Embeddings: We’ve tied the word embeddings between the encoder and decoder. This dramatically reduces the parameter count, allowing our new 270M-270M model to pack a significant punch.
* Merged Attention: The decoder now utilizes a merged attention mechanism. This combines self- and cross-attention into a single layer, reducing parameters and simplifying the architecture for improved parallelization and faster inference.
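Weight tying can be pictured as the encoder and decoder holding references to one shared vocabulary table, so it is stored (and counted) only once. A minimal sketch, not the actual T5Gemma 2 implementation, with illustrative sizes rather than the real model configuration:

```python
import numpy as np

# Illustrative dimensions only -- not the real T5Gemma 2 config.
VOCAB, DIM = 1000, 64

rng = np.random.default_rng(0)
shared_embedding = rng.normal(size=(VOCAB, DIM))  # one table for both sides

class Encoder:
    def __init__(self, embedding):
        self.embedding = embedding  # a reference, not a copy

    def embed(self, token_ids):
        return self.embedding[token_ids]

class Decoder:
    def __init__(self, embedding):
        self.embedding = embedding  # the same object as the encoder's

    def embed(self, token_ids):
        return self.embedding[token_ids]

encoder = Encoder(shared_embedding)
decoder = Decoder(shared_embedding)

# Both sides read from the identical array, so the embedding parameters
# are counted once instead of twice.
assert encoder.embedding is decoder.embedding
```

Because the vocabulary table typically dominates the parameter count at small model sizes, tying it is where most of the savings come from.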
These changes result in compact pre-trained models available in these sizes:
* 270M-270M (~370M total, excluding vision encoder)
* 1B-1B (~1.7B)
* 4B-4B (~7B)
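As a back-of-the-envelope check on why 270M + 270M lands near ~370M rather than 540M: if we borrow Gemma 3 270M-style dimensions of a 262,144-token vocabulary and a 640-dimensional hidden size (assumptions taken from the published Gemma 3 configuration; the actual T5Gemma 2 figures may differ), tying saves roughly one full copy of the embedding table:

```python
# Rough parameter budget for the 270M-270M model, under assumed
# Gemma 3 270M-style dimensions (vocab and hidden size are assumptions).
vocab_size = 262_144
hidden_dim = 640

embedding_params = vocab_size * hidden_dim    # ~168M parameters
untied_total = 270e6 + 270e6                  # two full 270M stacks
tied_total = untied_total - embedding_params  # one embedding copy saved

print(f"embedding table: {embedding_params / 1e6:.0f}M")  # 168M
print(f"untied total:    {untied_total / 1e6:.0f}M")      # 540M
print(f"tied total:      {tied_total / 1e6:.0f}M")        # 372M
```

Under these assumed dimensions the tied total comes out around 372M, consistent with the ~370M figure above.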
Unleashing Next-Generation Capabilities
T5Gemma 2 doesn’t just refine the architecture; it elevates the core capabilities, inheriting the strengths of Gemma 3. Here’s what you can expect:
* Multimodality: Imagine a model that can see and understand. T5Gemma 2 models process both images and text, enabling tasks like visual question answering and complex multimodal reasoning. This is achieved through a highly efficient vision encoder.
* Extended Long Context: We’ve dramatically increased the context window to up to 128K tokens. Leveraging Gemma 3’s alternating local and global attention, you can now process considerably longer documents and conversations.
* Massively Multilingual: T5Gemma 2 supports over 140 languages out of the box. This is thanks to training on a larger, more diverse dataset, making your applications truly global.
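The alternating local and global attention mentioned in the long-context bullet can be sketched with attention masks. A rough NumPy illustration, where the window size, layer pattern, and bidirectional masking are illustrative choices rather than the actual Gemma 3 configuration:

```python
import numpy as np

def local_mask(seq_len, window):
    """Each position attends only to tokens within `window` of itself."""
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window

def global_mask(seq_len):
    """Every position attends to every other position."""
    return np.ones((seq_len, seq_len), dtype=bool)

seq_len, window = 8, 2
# Illustrative pattern: one global layer after every three local layers.
layers = [
    global_mask(seq_len) if i % 4 == 3 else local_mask(seq_len, window)
    for i in range(8)
]

# A local layer's attention cost grows with the window size rather than
# the sequence length, which is what keeps very long contexts tractable.
print(layers[0].sum(), layers[3].sum())  # far fewer entries in local layers
```

The key point is that only the periodic global layers pay the full quadratic cost over the sequence; the local layers stay cheap no matter how long the context grows.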
Performance You Can Rely On
T5Gemma 2 sets a new benchmark for compact encoder-decoder models. You’ll experience strong performance across key areas, benefiting from the powerful multimodal and long-context features inherited from the Gemma 3 architecture.
We believe T5Gemma 2 empowers you to build smarter, more versatile, and more accessible AI applications. We’re excited to see what you create with it.
Learn more and get started with T5Gemma 2: https://arxiv.org/abs/2512.14856 and explore the original T5Gemma announcement: https://developers.googleblog.com/en/t5gemma/