# Leveraging the Gemini Pro Vision model for image understanding, multimodal prompts, and accessibility
Explore how to use the Gemini Pro Vision model with the Gemini API to handle multimodal input (text and image prompts) and receive a text result. In this solution, you will learn how to call the Gemini API with image and text data, explore a variety of example prompts that combine images and text with Gemini Pro Vision, and complete a codelab that applies the API to a real-world scenario involving accessibility and basic web development.
Leveraging the Gemini Pro Vision model for image understanding, multimodal prompts and accessibility
Video
Learn how to use the multimodal features of the Gemini model to analyze HTML documents and image files in order to add accessible descriptions to a webpage from a Node.js script.
Quickstart: Get started with the Gemini API in Node.js applications
Learn how to generate text from multimodal text-and-image input data using the Gemini Pro Vision model in Node.js.
Multimodal text and image prompting
Article
Explore examples of how Gemini's multimodal image and text inputs can be combined to produce text output about images across a variety of use cases.
Prompting with images and text using the Gemini API for accessibility
Codelab
In this codelab, you will write a Node.js script that uses the Gemini Pro Vision model to analyze a local HTML document and generate accessible descriptions for the images on the page where needed. With Gemini, the script can verify whether an existing description is accurate for a given image and, if not, generate an entirely new one.
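The first step of such a script, before any Gemini call, is deciding which images even need a description. Below is a minimal sketch under the assumption that a regex scan of the HTML is acceptable (a production script would more likely parse the DOM with a library such as jsdom); the function name is illustrative.

```javascript
// Return the <img> tags that are missing an alt attribute or have an
// empty one -- the candidates we would send to Gemini for a description.
function findImagesNeedingAlt(html) {
  const imgTags = html.match(/<img\b[^>]*>/gi) || [];
  return imgTags.filter((tag) => {
    const alt = tag.match(/\balt\s*=\s*(?:"([^"]*)"|'([^']*)')/i);
    // Keep the tag if alt is absent or effectively empty.
    return !alt || (alt[1] || alt[2] || "").trim() === "";
  });
}

const page = `
  <img src="logo.png" alt="Company logo">
  <img src="chart.png" alt="">
  <img src="photo.jpg">
`;
console.log(findImagesNeedingAlt(page).length); // 2
```

Each tag returned here could then be paired with its image bytes and sent to Gemini with a prompt such as "Describe this image for a screen-reader user."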
# Multimodal text and image prompting

Multimodal prompts are prompts for large language models (LLMs) that combine multiple input formats, such as text and images. Multimodal prompts, and the LLMs that support them, enable a variety of interesting use cases including image classification, handwriting recognition, translation, and other creative scenarios.

In this document, we'll focus on the types of prompts that can be achieved when inputting text and images into the Gemini model and receiving a text-only response, by exploring 8 interesting examples.

**What is a multimodal prompt?**
--------------------------------

Today, Gemini can be prompted with a mixture of text and/or image input and return a text-only response. The text can provide context for the image, or it can ask the model to operate on, or generate a response about, the image.

For example, the following is a multimodal prompt:

***"What color is the cat?"***

The text in this prompt provides context for the image by clarifying that there is a cat in the image, that we are mainly interested in the cat, and specifically in its color.

3 simple multimodal prompts
---------------------------

The following are three very simple examples of providing an image along with text to an LLM and receiving a text-only response. We will explore recognising a type of entity in an image, classifying entities in an image without predetermined knowledge, and finally combining recognition with calculation.

### Classification

***"Does this image contain a cat? Respond with either true or false."***

Response from Gemini:
> True

This is a very simple example that highlights the LLM's ability to recognise whether something exists in an image and to respond in a boolean manner. This approach can be useful for detecting specific content for filtering or other purposes.

It is worth noting that dedicated image recognition models can be more effective at very specific tasks. However, creating custom ML models takes time and expertise, while solutions using Gemini can be built more rapidly. It is therefore important to compare the quality of responses available between different models and decide based on these trade-offs.

### Recognition

***"Give me a list of all the important things in this picture."***

Response from Gemini:
> 1. A Google notebook
> 2. A Google pen
> 3. A mug

The above example highlights that we can ask the LLM an open question about the content appearing in the image. Using Gemini, the image classification process does not require different models for different types of objects (e.g. books or mugs). Instead, a single model identifies all the significant elements in a single pass.

### Counting

***"Count the number of cats in this picture. Give me only the final number in the response. Only give me back a number."***

Response from Gemini:
> 3

In this example we restrict the request to a particular type of object and ask for it to be counted in the picture. Furthermore, we've asked that the response be a pure numerical value. However, depending on the number of objects in the image that need to be counted, the quality may vary: as Gemini is a general-purpose multimodal model, its precision beyond small numbers may be limited.

| **Note:** The above image was sourced from Wikimedia. [Link to original image.](https://commons.wikimedia.org/wiki/File:Cats_Family.JPG)

5 advanced multimodal prompts
-----------------------------

Next we'll walk through some more advanced ways that multimodal prompting with text and image input can be used.

### Text Recognition, Reasoning & Calculation

***"How much money did we make today total? Explain your reasoning."***

Response from Gemini:
> The total amount of money made today is $100.
>
> This is because the note says that 5 calendars were sold at $20 each.
> 5 x $20 = $100.

This more advanced prompt and image demonstrate several features of Gemini at once. The first is handwriting recognition: the model recognised that the handwritten English text referred to how many calendars were sold and at what price each was sold. Furthermore, with our prompt indicating the concept of tallying numbers, the model recognised the notation for a 5-point tally (i.e. 4 vertical lines and a single horizontal strikethrough). Finally, with our prompt guiding it through the steps of calculation, we receive a walkthrough of how the revenue was calculated.

### World Context & Reasoning

***"What time of day is it in this picture in 24 hour time?"***

Response from Gemini:
> 19:30

This example highlights the model's ability to recognise a scene's lighting and infer approximate temporal information from the imagery.

### Interpretation & Creativity

***"Write a haiku about this photo."***

Response from Gemini:
> A bench by the lake,
>
> A view of the mountains beyond,
>
> A moment of peace.

Here we see the power of Gemini for storytelling and more creative uses of generative AI. By providing an image as the starting point, we can be surprised by where the AI takes the storyline or which features of the image it chooses to integrate into the story. Pure text prompting for story generation is often limited to our own initial idea, but a real-world photo can serve as a powerful starting point for the AI's thought process.

| **Note:** The above image was sourced from Pexels. [Link to original image.](https://www.pexels.com/photo/wooden-bench-beside-a-lake-6016611/)

### Logical Progression

***"What comes next? Explain your reasoning."***

Response from Gemini:
> The triangle has 3 sides, the square has 4 sides, and the pentagon has 5 sides. The number of sides increases by 1 for each shape. Therefore, the next shape should have 6 sides, which is a hexagon.

This example again combines multiple skills. Not only does Gemini recognise the shapes, it understands that the shapes, while somewhat crudely drawn, are intended to be distinct geometric shapes with specific attributes (e.g. 3 sides, 4 sides, 5 sides).

On top of this, the presence of the question mark does not confuse Gemini in its interpretation of the logical progression. Instead, Gemini "sees" that this is a progression from 3 to 4 to 5 sides, concludes that the next shape would have 6 sides, and fittingly proposes a hexagon.

### World Interpretation & Understanding

***"Which of these would be better for writing on? Which would be better for making into a scarf?"***

Response from Gemini:
> The paper would be better for writing on because it is a smooth surface and the ink would not bleed through. The fabric would be better for making into a scarf because it is soft and would keep you warm.

Here we are leveraging Gemini in a very inventive way. The model not only understands the imagery but infers attributes and real-world physicality from what it sees. The reasoning is based on a visual understanding of the objects' materials and how those materials suit the purposes outlined in the prompt.

**Conclusion**
--------------

These are just a few examples of the ways that multimodal prompts can be used. With LLMs, you can use multimodal prompts not only to understand text and images but to form new ideas or extract understanding from them. We encourage you to experiment with different types of multimodal prompts and see what you can create.
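The classification and counting prompts above deliberately constrain the model's reply ("Respond with either true or false", "Only give me back a number"). When wiring such prompts into an application, it is still worth parsing the reply defensively, since model output can carry stray whitespace or casing. A minimal sketch (the function names are mine):

```javascript
// Normalize a reply to a true/false classification prompt into a boolean.
function parseBooleanResponse(text) {
  const normalized = text.trim().toLowerCase();
  if (normalized === "true") return true;
  if (normalized === "false") return false;
  throw new Error(`Unexpected model response: ${text}`);
}

// Normalize a reply to a "give me only a number" counting prompt.
function parseCountResponse(text) {
  const n = Number.parseInt(text.trim(), 10);
  if (Number.isNaN(n)) throw new Error(`Expected a number, got: ${text}`);
  return n;
}

console.log(parseBooleanResponse(" True\n")); // true
console.log(parseCountResponse("3"));         // 3
```

Throwing on unexpected output, rather than guessing, makes it easy to retry the prompt or fall back to a default when the model drifts from the requested format.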