
Building with Gemini Embedding 2: Agentic multimodal RAG and beyond

Patrick Löber, Member of the Technical Staff, Gemini API; Lucia Loher, Product Manager for the Gemini API; Roberto Santana, Product Manager Lead at Google Cloud; Mojtaba Seyedhosseini, Engineering Director, Google DeepMind

Last week, we launched the General Availability (GA) of Gemini Embedding 2 through the Gemini API and the Gemini Enterprise Agent Platform. This is the first embedding model in the Gemini API that projects text, images, video, audio, and documents into a unified embedding space while supporting more than 100 languages. In this post, we'll examine the wide range of use cases it enables, from agentic multimodal RAG to visual search, and show you exactly how to start building them.

The model can process a broad mix of inputs in one call: up to 8,192 text tokens, 6 images, 120 seconds of video, 143 seconds of audio, and 6 PDF pages. Because all modalities are aligned within a shared semantic space, developers can create rich experiences that can "see" and "hear" proprietary data.

The real strength of Gemini Embedding 2 lies in its capacity to handle interleaved inputs, such as mixtures of text and images, within a single request.
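To make the idea of a shared embedding space concrete, here is a minimal sketch of cross-modal retrieval. It assumes the embeddings have already been computed elsewhere. The `Item` class, the `rank` helper, the file names, and the four-dimensional toy vectors are all hypothetical and chosen for illustration; real embeddings have far more dimensions, and this code does not call the Gemini API. What it shows is that once every modality lives in one vector space, a single cosine-similarity ranking can compare a text query against images, audio, and documents alike.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    """A hypothetical indexed asset with a precomputed embedding."""
    label: str
    modality: str
    vector: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank(query: list[float], items: list[Item], top_k: int = 3):
    """Return the top_k (score, item) pairs, best match first."""
    scored = [(cosine_similarity(query, item.vector), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for real embeddings (illustrative values only).
corpus = [
    Item("product_photo.png", "image", [0.9, 0.1, 0.0, 0.1]),
    Item("support_call.mp3", "audio", [0.1, 0.8, 0.3, 0.0]),
    Item("manual.pdf", "document", [0.4, 0.1, 0.9, 0.1]),
]

# Stand-in for the embedding of a text query.
query_vector = [0.85, 0.15, 0.05, 0.1]

results = rank(query_vector, corpus, top_k=2)
for score, item in results:
    print(f"{score:.3f}  {item.modality:<8}  {item.label}")
```

In a real system, the vectors would come from the embedding model and the linear scan would typically be replaced by a vector database or an approximate nearest-neighbor index; the ranking logic stays the same.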

  Google Developers Blog