Gemini Vision Skill

This skill has been merged into AI Multimodal, which provides comprehensive multimedia capabilities.

Redirecting to AI Multimodal

The AI Multimodal skill now covers:

  • Image analysis - Captioning, OCR, object detection, visual Q&A, segmentation
  • Audio processing - Transcription and summarization of recordings up to 9.5 hours long
  • Video understanding - Scene detection and temporal analysis of videos up to 6 hours long
  • Document extraction - PDF tables, forms, charts, diagrams
  • Image generation - Text-to-image with Imagen 4
  • Video generation - Text-to-video with Veo 3

Quick Examples

"Analyze this product image and extract name, color, condition"
"Extract text from invoice.jpg and return as JSON"
"Compare these before/after photos and list differences"
"Detect all objects in this image with bounding boxes"
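
This notice does not describe how AI Multimodal turns these prompts into requests. As a rough illustration only, and assuming a prompt ends up as a Gemini API `generateContent` call, the sketch below builds a request body that pairs a text prompt with an inline image. The helper name `build_image_request` and the placeholder bytes are hypothetical; nothing is sent over the network.

```python
import base64
import json


def build_image_request(prompt: str, image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build a JSON-serializable body pairing a text prompt with an inline image.

    Hypothetical helper for illustration; it is not part of AI Multimodal.
    """
    return {
        "contents": [
            {
                "parts": [
                    # The natural-language instruction, as in the examples above.
                    {"text": prompt},
                    # The image travels as base64 text alongside its MIME type.
                    {
                        "inline_data": {
                            "mime_type": mime_type,
                            "data": base64.b64encode(image_bytes).decode("ascii"),
                        }
                    },
                ]
            }
        ]
    }


# Placeholder bytes stand in for the real contents of invoice.jpg (assumption).
image_bytes = b"\xff\xd8\xff\xe0 placeholder image bytes"
body = build_image_request("Extract text from invoice.jpg and return as JSON", image_bytes)

# Serialize to the JSON string an HTTP client would send as the request body.
payload = json.dumps(body)
```

In practice the real image bytes would be read from disk, and the payload would be posted with an API key to a model's `generateContent` endpoint. The AI Multimodal documentation remains the reference for how the skill actually does this.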

Go to AI Multimodal for full documentation.


Key Takeaway

Use AI Multimodal for all Gemini-powered image, audio, video, and document processing.