AI Multimodal Skill

Process audio, images, videos, and documents, and generate images and videos, using Google Gemini’s multimodal API. Supports analysis of up to 9.5 hours of audio, 6 hours of video, and 1000-page PDFs, with context windows of up to 2M tokens.

When to Use

  • Analyzing audio files (transcription, summarization, music analysis)
  • Understanding images (captioning, OCR, object detection, design extraction)
  • Processing videos (scene detection, Q&A, temporal analysis, YouTube URLs)
  • Extracting from documents (PDF tables, forms, charts, diagrams)
  • Generating images (text-to-image with Imagen 4)
  • Generating videos (text-to-video with Veo 3, 8-second clips with audio)

Setup

export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow

Verify setup:

python scripts/check_setup.py
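
The source of check_setup.py is not shown here, so the following is only a minimal sketch of the kind of check such a script might perform, not its actual implementation. It assumes the three pip packages above are imported as google.genai, dotenv, and PIL; the function names are hypothetical.

```python
import importlib.util
import os

# Import names assumed for the three pip packages in the install step above.
REQUIRED_MODULES = ["google.genai", "dotenv", "PIL"]


def missing_modules(names):
    """Return the module names that cannot be located on this interpreter."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises when a parent package (e.g. "google") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


def check_setup(env=None):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    if not env.get("GEMINI_API_KEY"):
        problems.append("GEMINI_API_KEY is not set")
    problems += [f"module not installed: {m}" for m in missing_modules(REQUIRED_MODULES)]
    return problems


if __name__ == "__main__":
    for problem in check_setup():
        print(problem)
```

This sketch only confirms that the key is present and the modules are importable; it does not contact the API, so it cannot tell whether the key is valid.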

Quick Start

Analyze Media

# Using Gemini CLI (if available)
echo "Describe the image at ./image.png" | gemini -y -m gemini-2.5-flash

# Using scripts
python scripts/gemini_batch_process.py --files image.png --task analyze
python scripts/gemini_batch_process.py --files audio.mp3 --task transcribe
python scripts/gemini_batch_process.py --files document.pdf --task extract
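
For comparison, the same kind of analysis can be done directly from Python with the google-genai package installed during setup. This is a rough sketch, not how gemini_batch_process.py is implemented; the helper names guess_mime and analyze_file are hypothetical. It sends the file inline, so it only suits files within the 20MB inline limit listed under Limits.

```python
import mimetypes
import os
from pathlib import Path


def guess_mime(path):
    """Guess a MIME type from the file extension."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None:
        raise ValueError(f"cannot determine MIME type for {path}")
    return mime


def analyze_file(path, prompt, model="gemini-2.5-flash"):
    """Send one file plus a text prompt to Gemini and return the reply text."""
    # Imported lazily so guess_mime stays usable without the SDK installed.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
    part = types.Part.from_bytes(
        data=Path(path).read_bytes(),
        mime_type=guess_mime(path),
    )
    response = client.models.generate_content(model=model, contents=[part, prompt])
    return response.text


if __name__ == "__main__":
    print(analyze_file("image.png", "Describe this image"))
```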

Generate Content

# Generate image with Imagen 4
python scripts/gemini_batch_process.py --task generate --prompt "A futuristic city at sunset"

# Generate video with Veo 3
python scripts/gemini_batch_process.py --task generate-video --prompt "Ocean waves at golden hour"

Stdin Support

# Pipe files directly (auto-detects PNG/JPG/PDF/WAV/MP3)
cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"
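
Piped input has no filename, so the type has to be inferred from the bytes themselves. One common way to do this is to check the leading "magic bytes". The sketch below illustrates that technique for the five formats named above; it is an assumption about how auto-detection could work, not the script's actual code, and sniff_mime is a hypothetical name.

```python
def sniff_mime(data: bytes):
    """Return a MIME type based on leading magic bytes, or None if unknown."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    # WAV is a RIFF container whose form type at bytes 8..11 is "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    # MP3 files start with an ID3v2 tag or directly with an MPEG frame sync
    # (eleven set bits: 0xFF followed by a byte whose top three bits are set).
    if data.startswith(b"ID3"):
        return "audio/mpeg"
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None
```

A caller would read the piped bytes with sys.stdin.buffer.read() and pass them to sniff_mime before building the request. The JPEG check runs before the MP3 frame-sync check because both formats begin with 0xFF.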

Models

Task       Model                          Notes
Analysis   gemini-2.5-flash               Recommended for speed
Analysis   gemini-2.5-pro                 Advanced reasoning
Image Gen  imagen-4.0-generate-001        Standard quality
Image Gen  imagen-4.0-ultra-generate-001  Best quality
Image Gen  imagen-4.0-fast-generate-001   Fastest
Video Gen  veo-3.1-generate-preview       8s clips with audio
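
If the model is chosen in code rather than by hand, the table can be expressed as a small lookup. This is an illustrative sketch only: the model IDs come from the table above, while the task and tier labels ("fast", "standard", "best") and the pick_model name are assumptions introduced here.

```python
# Model IDs are taken from the table above; the (task, tier) keys are
# hypothetical labels chosen for this sketch.
MODELS = {
    ("analysis", "fast"): "gemini-2.5-flash",
    ("analysis", "best"): "gemini-2.5-pro",
    ("image", "fast"): "imagen-4.0-fast-generate-001",
    ("image", "standard"): "imagen-4.0-generate-001",
    ("image", "best"): "imagen-4.0-ultra-generate-001",
    ("video", "standard"): "veo-3.1-generate-preview",
}


def pick_model(task, tier):
    """Return the model ID for a task/tier pair, or raise ValueError."""
    try:
        return MODELS[(task, tier)]
    except KeyError:
        raise ValueError(f"no model listed for task={task!r}, tier={tier!r}") from None
```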

Scripts

Script                   Purpose
gemini_batch_process.py  CLI orchestrator for transcribe/analyze/extract/generate tasks
media_optimizer.py       Compress/resize media to stay within Gemini limits
document_converter.py    Convert PDFs/images to markdown
check_setup.py           Verify API key and dependencies

Limits

Format  Limits
Audio   WAV/MP3/AAC, up to 9.5 hours
Images  PNG/JPEG/WEBP, up to 3600 images
Video   MP4/MOV, up to 6 hours
PDF     Up to 1000 pages
Size    20MB inline, 2GB via File API
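
The size row implies a simple decision: small files can be sent inline, larger ones go through the File API, and anything above 2GB needs compressing first. A minimal sketch of that decision follows. It assumes binary units (1MB = 1024 * 1024 bytes); the function name and return labels are hypothetical. In practice it is safer to leave headroom below the inline threshold, since the prompt also counts toward request size.

```python
INLINE_LIMIT = 20 * 1024 * 1024   # 20MB, assuming binary megabytes
FILE_API_LIMIT = 2 * 1024 ** 3    # 2GB, assuming binary gigabytes


def upload_strategy(size_bytes):
    """Classify a file size as 'inline', 'file_api', or 'too_large'."""
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= INLINE_LIMIT:
        return "inline"
    if size_bytes <= FILE_API_LIMIT:
        return "file_api"
    # Beyond the File API limit: compress or split first,
    # for example with media_optimizer.py.
    return "too_large"
```

For a file on disk, the size can be obtained with os.path.getsize(path).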

Use Cases

Design Extraction

Extract design guidelines from screenshots:

python scripts/gemini_batch_process.py \
  --files screenshot.png \
  --task analyze \
  --prompt "Extract: colors (hex), typography, spacing, layout patterns"

Video Transcription

python scripts/gemini_batch_process.py \
  --files meeting.mp4 \
  --task transcribe \
  --prompt "Include speaker labels and timestamps"

Batch Processing

python scripts/gemini_batch_process.py \
  --files images/*.png \
  --task analyze \
  --prompt "Describe each image"

Integration with ClaudeKit

The ai-multimodal skill integrates with:

  • frontend-design: Extract design guidelines from screenshots before implementation
  • media-processing: Optimize media files before Gemini analysis
  • document-skills: Convert extracted content to structured formats

Best Practices

  1. Use appropriate models: gemini-2.5-flash for speed, gemini-2.5-pro for complex analysis
  2. Optimize media first: Use media_optimizer.py for large files
  3. Batch when possible: Process multiple files in one call
  4. Structure prompts: Be specific about output format (JSON, markdown, etc.)
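
As an illustration of point 4, a prompt can name the exact JSON keys it expects, and the reply can then be parsed defensively. The sketch below assumes the model may wrap its JSON in a markdown code fence, which is common but not guaranteed; json_prompt and parse_json_reply are hypothetical helpers, not part of the skill's scripts.

```python
import json
import re

FENCE = "`" * 3  # markdown code fence marker
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)


def json_prompt(instruction, keys):
    """Append an explicit output-format requirement to an instruction."""
    return f"{instruction}\nRespond with only a JSON object with keys: {', '.join(keys)}."


def parse_json_reply(text):
    """Parse a reply as JSON, tolerating a surrounding markdown code fence."""
    match = _FENCED.search(text)
    payload = match.group(1) if match else text
    return json.loads(payload.strip())
```

The string returned by json_prompt can be passed as the --prompt value, for example json_prompt("Extract design tokens", ["colors", "typography", "spacing"]). parse_json_reply raises json.JSONDecodeError when the reply is not valid JSON, so callers can retry or fall back.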

Key Takeaway

The AI Multimodal skill wraps Google Gemini’s multimodal API in a small set of scripts that cover audio transcription, image, video, and document analysis, and image and video generation with Imagen 4 and Veo 3, with context windows of up to 2M tokens.