Gemini Vision Skill
This skill has been merged into AI Multimodal, which provides comprehensive multimedia capabilities.
Redirecting to AI Multimodal
The AI Multimodal skill now covers:
- Image analysis - Captioning, OCR, object detection, visual Q&A, segmentation
- Audio processing - Transcription and summarization of audio up to 9.5 hours long
- Video understanding - Scene detection and temporal analysis of video up to 6 hours long
- Document extraction - PDF tables, forms, charts, diagrams
- Image generation - Text-to-image with Imagen 4
- Video generation - Text-to-video with Veo 3
Quick Examples
"Analyze this product image and extract name, color, condition"
"Extract text from invoice.jpg and return as JSON"
"Compare these before/after photos and list differences"
"Detect all objects in this image and return their bounding boxes"
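The bounding-box prompt usually returns coordinates that need post-processing before they can be drawn on the image. Below is a minimal sketch of that step. It assumes boxes come back as `[ymin, xmin, ymax, xmax]` on a 0-1000 scale, the convention described in Gemini's image-understanding documentation; check the AI Multimodal documentation before relying on it. The helper name `to_pixel_box`, the `label` and `box_2d` keys, and the sample response are hypothetical and exist only for illustration.

```python
import json


def to_pixel_box(box_2d, width, height):
    """Convert an assumed [ymin, xmin, ymax, xmax] box on a 0-1000 scale
    to pixel coordinates (left, top, right, bottom)."""
    ymin, xmin, ymax, xmax = box_2d
    return (
        round(xmin * width / 1000),
        round(ymin * height / 1000),
        round(xmax * width / 1000),
        round(ymax * height / 1000),
    )


# Hypothetical model output, made up for illustration (not a real API response).
sample_response = '[{"label": "mug", "box_2d": [100, 250, 500, 750]}]'

# Pair each label with its pixel-space box for an 800x600 image.
detections = [
    (item["label"], to_pixel_box(item["box_2d"], width=800, height=600))
    for item in json.loads(sample_response)
]
print(detections)
```

The resulting tuples use the left, top, right, bottom order that most drawing libraries expect for rectangles.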
→ Go to AI Multimodal for full documentation.
Key Takeaway
Use AI Multimodal for all Gemini-powered image, audio, video, and document processing.