Leveraging the Gemini Pro Vision model for image understanding, multimodal prompts and accessibility

Explore how you can use the new Gemini Pro Vision model with the Gemini API to handle multimodal input, combining text and image prompts to receive a text result. In this solution, you will learn how to call the Gemini API with image and text data, explore a variety of example prompts that combine images and text with Gemini Pro Vision, and finally complete a codelab that applies the API to a real-world scenario involving accessibility and basic web development.


Leveraging the Gemini Pro Vision model for image understanding, multimodal prompts and accessibility

Video

Learn how to use the multimodal features of the Gemini model to analyze HTML documents and image files and add accessible descriptions to a webpage with a Node.js script.

Quickstart: Get started with the Gemini API in Node.js applications

Article

Learn how to generate text from multimodal text-and-image input using the Gemini Pro Vision model in Node.js.
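
As a taste of what the quickstart covers, a minimal sketch of a text-and-image call might look like the following. This assumes the `@google/generative-ai` package is installed, an API key is exported as `GEMINI_API_KEY`, and a local `photo.jpg` exists; the file name and prompt text are placeholders, not the quickstart's exact code.

```js
// Minimal sketch: a text-and-image prompt with Gemini Pro Vision in Node.js.
const fs = require("fs");
const { GoogleGenerativeAI } = require("@google/generative-ai");

// Convert a local image file into the inline-data part format the API expects.
function fileToGenerativePart(path, mimeType) {
  return {
    inlineData: {
      data: fs.readFileSync(path).toString("base64"),
      mimeType,
    },
  };
}

async function run() {
  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
  // gemini-pro-vision accepts mixed text and image input and returns text.
  const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });

  const prompt = "Describe what is happening in this photo."; // placeholder
  const imagePart = fileToGenerativePart("photo.jpg", "image/jpeg"); // placeholder path

  const result = await model.generateContent([prompt, imagePart]);
  const response = await result.response;
  console.log(response.text());
}

run();
```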

Multimodal text and image prompting

Article

Explore examples of how Gemini’s multimodal image and text inputs can be combined to produce text output about images across a range of use cases.
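
One pattern the article explores is passing several parts, text plus multiple images, in a single request. Below is an illustrative sketch of that pattern; the file names and prompt wording are assumptions for the example.

```js
// Sketch: comparing two images in one multimodal request.
// Assumes @google/generative-ai is installed and GEMINI_API_KEY is set;
// before.png and after.png are placeholder file names.
const fs = require("fs");
const { GoogleGenerativeAI } = require("@google/generative-ai");

function fileToGenerativePart(path, mimeType) {
  return {
    inlineData: { data: fs.readFileSync(path).toString("base64"), mimeType },
  };
}

async function compare() {
  const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
  const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });

  // A single request can interleave text and any number of image parts.
  const result = await model.generateContent([
    "What differences do you notice between these two photos?",
    fileToGenerativePart("before.png", "image/png"),
    fileToGenerativePart("after.png", "image/png"),
  ]);
  console.log(result.response.text());
}

compare();
```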

Prompting with images and text using the Gemini API for accessibility

Codelab

In this codelab, you will write a Node.js script that uses the Gemini Pro Vision model to analyze a local HTML document and generate accessible descriptions for the images on the page where needed. By leveraging Gemini, you can verify whether an existing description accurately matches its image and, if not, generate an entirely new one.
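
To preview the core idea, a hedged sketch of that workflow might look like the following. The file names, the naive regex-based tag scan, and the prompt wording are all illustrative assumptions, not the codelab's actual code; a real script might use a proper HTML parser instead.

```js
// Sketch: scan a local HTML file for <img> tags and ask Gemini Pro Vision
// to check or draft an accessible description for each image.
// Assumes @google/generative-ai is installed and GEMINI_API_KEY is set.
const fs = require("fs");
const path = require("path");
const { GoogleGenerativeAI } = require("@google/generative-ai");

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY);
const model = genAI.getGenerativeModel({ model: "gemini-pro-vision" });

function fileToGenerativePart(filePath, mimeType) {
  return {
    inlineData: { data: fs.readFileSync(filePath).toString("base64"), mimeType },
  };
}

async function describeImages(htmlPath) {
  const html = fs.readFileSync(htmlPath, "utf8");
  // Naive scan for image sources; real code would use an HTML parser.
  const matches = [...html.matchAll(/<img[^>]*src="([^"]+)"[^>]*>/g)];

  for (const [tag, src] of matches) {
    const imagePath = path.join(path.dirname(htmlPath), src);
    const existingAlt = (tag.match(/alt="([^"]*)"/) || [])[1];

    // Verify an existing description, or ask for a new one if none exists.
    const prompt = existingAlt
      ? `Does the alt text "${existingAlt}" accurately describe this image? ` +
        "If not, suggest a better one-sentence description."
      : "Write a concise, accessible one-sentence description of this image.";

    const result = await model.generateContent([
      prompt,
      fileToGenerativePart(imagePath, "image/png"), // assumes PNG images
    ]);
    console.log(`${src}: ${result.response.text()}`);
  }
}

describeImages("index.html"); // placeholder file name
```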