AI Multimodal Skill
Process audio, images, videos, documents, and generate images/videos using Google Gemini’s multimodal API. Supports analysis up to 9.5 hours of audio, 6 hours of video, and 1000-page PDFs with context windows up to 2M tokens.
When to Use
- Analyzing audio files (transcription, summarization, music analysis)
- Understanding images (captioning, OCR, object detection, design extraction)
- Processing videos (scene detection, Q&A, temporal analysis, YouTube URLs)
- Extracting from documents (PDF tables, forms, charts, diagrams)
- Generating images (text-to-image with Imagen 4)
- Generating videos (text-to-video with Veo 3, 8-second clips with audio)
Setup
export GEMINI_API_KEY="your-key" # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow
Verify setup:
python scripts/check_setup.py
Quick Start
Analyze Media
# Using Gemini CLI (if available)
echo "Describe this image" | gemini -y -m gemini-2.5-flash
# Using scripts
python scripts/gemini_batch_process.py --files image.png --task analyze
python scripts/gemini_batch_process.py --files audio.mp3 --task transcribe
python scripts/gemini_batch_process.py --files document.pdf --task extract
Generate Content
# Generate image with Imagen 4
python scripts/gemini_batch_process.py --task generate --prompt "A futuristic city at sunset"
# Generate video with Veo 3
python scripts/gemini_batch_process.py --task generate-video --prompt "Ocean waves at golden hour"
Stdin Support
# Pipe files directly (auto-detects PNG/JPG/PDF/WAV/MP3)
cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"
Models
| Task | Model | Notes |
|---|---|---|
| Analysis | gemini-2.5-flash | Recommended for speed |
| Analysis | gemini-2.5-pro | Advanced reasoning |
| Image Gen | imagen-4.0-generate-001 | Standard quality |
| Image Gen | imagen-4.0-ultra-generate-001 | Best quality |
| Image Gen | imagen-4.0-fast-generate-001 | Fastest |
| Video Gen | veo-3.1-generate-preview | 8s clips with audio |
Scripts
| Script | Purpose |
|---|---|
gemini_batch_process.py | CLI orchestrator for transcribe/analyze/extract/generate tasks |
media_optimizer.py | Compress/resize media to stay within Gemini limits |
document_converter.py | Convert PDFs/images to markdown |
check_setup.py | Verify API key and dependencies |
Limits
| Format | Limits |
|---|---|
| Audio | WAV/MP3/AAC, up to 9.5 hours |
| Images | PNG/JPEG/WEBP, up to 3600 images |
| Video | MP4/MOV, up to 6 hours |
| Up to 1000 pages | |
| Size | 20MB inline, 2GB via File API |
Use Cases
Design Extraction
Extract design guidelines from screenshots:
python scripts/gemini_batch_process.py \
--files screenshot.png \
--task analyze \
--prompt "Extract: colors (hex), typography, spacing, layout patterns"
Video Transcription
python scripts/gemini_batch_process.py \
--files meeting.mp4 \
--task transcribe \
--prompt "Include speaker labels and timestamps"
Batch Processing
python scripts/gemini_batch_process.py \
--files images/*.png \
--task analyze \
--prompt "Describe each image"
Integration with ClaudeKit
The ai-multimodal skill integrates with:
- frontend-design: Extract design guidelines from screenshots before implementation
- media-processing: Optimize media files before Gemini analysis
- document-skills: Convert extracted content to structured formats
Best Practices
- Use appropriate models:
gemini-2.5-flashfor speed,gemini-2.5-profor complex analysis - Optimize media first: Use
media_optimizer.pyfor large files - Batch when possible: Process multiple files in one call
- Structure prompts: Be specific about output format (JSON, markdown, etc.)
Resources
Key Takeaway
The AI Multimodal skill provides comprehensive multimedia processing through Google Gemini, handling everything from audio transcription to AI-generated images and videos with support for massive context windows.