init
This commit is contained in:
387
.opencode/skills/ai-multimodal/references/audio-processing.md
Normal file
@@ -0,0 +1,387 @@
# Audio Processing Reference

Comprehensive guide for audio analysis and speech generation using Gemini API.

## Audio Understanding

### Supported Formats

| Format | MIME Type | Best Use |
|--------|-----------|----------|
| WAV | `audio/wav` | Uncompressed, highest quality |
| MP3 | `audio/mp3` | Compressed, widely compatible |
| AAC | `audio/aac` | Compressed, good quality |
| FLAC | `audio/flac` | Lossless compression |
| OGG Vorbis | `audio/ogg` | Open format |
| AIFF | `audio/aiff` | Apple format |

### Specifications

- **Maximum length**: 9.5 hours per request
- **Multiple files**: Unlimited count, combined max 9.5 hours
- **Token rate**: 32 tokens/second (1 minute = 1,920 tokens)
- **Processing**: Auto-downsampled to 16 Kbps mono
- **File size limits**:
  - Inline: 20 MB max total request
  - File API: 2 GB per file, 20 GB project quota
  - Retention: 48 hours auto-delete
- **Important:** Transcripts of audio longer than 15 minutes are often truncated by the output token limit of the Gemini API response. For a complete transcript, split the audio into chunks of at most 15 minutes and transcribe each chunk separately.

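The 15-minute chunking rule above can be sketched as a small helper that computes chunk boundaries in seconds. This is a minimal sketch: `chunk_ranges` and its parameters are hypothetical names, and cutting the audio itself (for example with FFmpeg) is left out.

```python
def chunk_ranges(total_seconds, max_chunk_seconds=15 * 60):
    """Return (start, end) second offsets that cover the audio in chunks
    no longer than max_chunk_seconds."""
    if total_seconds < 0 or max_chunk_seconds <= 0:
        raise ValueError("duration must be non-negative and chunk size positive")
    ranges = []
    start = 0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        ranges.append((start, end))
        start = end
    return ranges


# A 40-minute recording: two full chunks plus a 10-minute remainder
print(chunk_ranges(40 * 60))  # [(0, 900), (900, 1800), (1800, 2400)]
```

Each range can then be transcribed separately and the partial transcripts concatenated in order.
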
## Transcription

### Basic Transcription

```python
from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Upload audio
myfile = client.files.upload(file='meeting.mp3')

# Transcribe
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate a transcript of the speech.', myfile]
)
print(response.text)
```

### With Timestamps

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Generate transcript with timestamps in MM:SS format.', myfile]
)
```

### Multi-Speaker Identification

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe with speaker labels. Format: [Speaker 1], [Speaker 2], etc.', myfile]
)
```

### Segment-Specific Transcription

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe only the segment from 02:30 to 05:15.', myfile]
)
```

## Audio Analysis

### Summarization

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize key points in 5 bullets with timestamps.', myfile]
)
```

### Non-Speech Audio Analysis

```python
# Music analysis
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify the musical instruments and genre.', myfile]
)

# Environmental sounds
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify all sounds: voices, music, ambient noise.', myfile]
)

# Birdsong identification
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Identify bird species based on their calls.', myfile]
)
```

### Timestamp-Based Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['What is discussed from 10:30 to 15:45? Provide key points.', myfile]
)
```

## Input Methods

### File Upload (>20MB or Reuse)

```python
# Upload once, use multiple times
myfile = client.files.upload(file='large-audio.mp3')

# First query
response1 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this', myfile]
)

# Second query (reuses same file)
response2 = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this', myfile]
)
```

### Inline Data (<20MB)

```python
from google.genai import types

with open('small-audio.mp3', 'rb') as f:
    audio_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this audio',
        types.Part.from_bytes(data=audio_bytes, mime_type='audio/mp3')
    ]
)
```

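The choice between the two input methods above can be sketched as a size check. This is an illustrative helper only: `choose_input_method` is a hypothetical name, and treating 20 MB as 20 × 1024 × 1024 bytes is an assumption. Because the inline limit applies to the whole request, a real implementation may want to leave headroom for the prompt.

```python
import os

# Assumption: 20 MB interpreted as 20 * 1024 * 1024 bytes
INLINE_LIMIT_BYTES = 20 * 1024 * 1024


def choose_input_method(size_bytes, reuse=False):
    """Return 'file_api' for large or reused files, otherwise 'inline'."""
    if reuse or size_bytes >= INLINE_LIMIT_BYTES:
        return 'file_api'
    return 'inline'


def choose_for_path(path, reuse=False):
    """Same decision, based on the size of a file on disk."""
    return choose_input_method(os.path.getsize(path), reuse)
```
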
## Speech Generation (TTS)

### Available Models

| Model | Quality | Speed | Cost/1M tokens |
|-------|---------|-------|----------------|
| `gemini-2.5-flash-native-audio-preview-09-2025` | High | Fast | $10 |
| `gemini-2.5-pro` TTS mode | Premium | Slower | $20 |

### Basic TTS

```python
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents='Generate audio: Welcome to today\'s episode.'
)

# Save audio
with open('output.wav', 'wb') as f:
    f.write(response.audio_data)
```

### Controllable Voice Style

```python
# Professional tone
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents='Generate audio in a professional, clear tone: Welcome to our quarterly earnings call.'
)

# Casual and friendly
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents='Generate audio in a friendly, conversational tone: Hey there! Let\'s dive into today\'s topic.'
)

# Narrative style
response = client.models.generate_content(
    model='gemini-2.5-flash-native-audio-preview-09-2025',
    contents='Generate audio in a narrative, storytelling tone: Once upon a time, in a land far away...'
)
```

### Voice Control Parameters

- **Style**: Professional, casual, narrative, conversational
- **Pace**: Slow, normal, fast
- **Tone**: Friendly, serious, enthusiastic
- **Accent**: Natural language control (e.g., "British accent", "Southern drawl")

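Since the voice controls above are expressed in natural language, one way to keep prompts consistent is to assemble them from optional parameters. The helper below is a sketch under that assumption; `build_tts_prompt` and its wording are hypothetical and not part of any SDK.

```python
def build_tts_prompt(text, style=None, pace=None, tone=None, accent=None):
    """Compose a natural-language TTS instruction from optional voice controls."""
    traits = [t for t in (style, tone) if t]
    parts = []
    if traits:
        parts.append("in a " + ", ".join(traits) + " tone")
    if pace:
        parts.append(f"at a {pace} pace")
    if accent:
        parts.append(f"with a {accent}")
    instruction = "Generate audio"
    if parts:
        instruction += " " + " ".join(parts)
    return f"{instruction}: {text}"


print(build_tts_prompt("Welcome back.", style="professional", pace="slow"))
# Generate audio in a professional tone at a slow pace: Welcome back.
```
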
## Best Practices

### File Management

1. Use File API for files >20MB
2. Use File API for repeated queries (saves tokens)
3. Files auto-delete after 48 hours
4. Clean up manually when done:
   ```python
   client.files.delete(name=myfile.name)
   ```

### Prompt Engineering

**Effective prompts**:
- "Transcribe from 02:30 to 03:29 in MM:SS format"
- "Identify speakers and extract dialogue with timestamps"
- "Summarize key points with relevant timestamps"
- "Transcribe and analyze sentiment for each speaker"

**Context improves accuracy**:
- "This is a medical interview - use appropriate terminology"
- "Transcribe this legal deposition with precise terminology"
- "This is a technical podcast about machine learning"

**Combined tasks**:
- "Transcribe and summarize in bullet points"
- "Extract key quotes with timestamps and speaker labels"
- "Transcribe and identify action items with timestamps"

### Cost Optimization

**Token calculation**:
- 1 minute audio = 1,920 tokens
- 1 hour audio = 115,200 tokens
- 9.5 hours = 1,094,400 tokens

**Model selection**:
- Use `gemini-2.5-flash` ($1/1M tokens) for most tasks
- Upgrade to `gemini-2.5-pro` ($3/1M tokens) for complex analysis
- For high-volume: `gemini-1.5-flash` ($0.70/1M tokens)

**Reduce costs**:
- Process only relevant segments using timestamps
- Use lower-quality audio when possible
- Batch multiple short files in one request
- Cache context for repeated queries

### Error Handling

```python
import time

def transcribe_with_retry(file_path, max_retries=3):
    """Transcribe audio with exponential backoff retry"""
    for attempt in range(max_retries):
        try:
            myfile = client.files.upload(file=file_path)
            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=['Transcribe with timestamps', myfile]
            )
            return response.text
        except Exception as e:
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt
            print(f"Retry {attempt + 1} after {wait_time}s")
            time.sleep(wait_time)
```

## Common Use Cases

### 1. Meeting Transcription

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Transcribe this meeting with:
        1. Speaker labels
        2. Timestamps for topic changes
        3. Action items highlighted
        ''',
        myfile
    ]
)
```

### 2. Podcast Summary

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Create podcast summary with:
        1. Main topics with timestamps
        2. Key quotes from each speaker
        3. Recommended episode highlights
        ''',
        myfile
    ]
)
```

### 3. Interview Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze interview:
        1. Questions asked with timestamps
        2. Key responses from interviewee
        3. Overall sentiment and tone
        ''',
        myfile
    ]
)
```

### 4. Content Verification

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Verify audio content:
        1. Check for specific keywords or phrases
        2. Identify any compliance issues
        3. Note any concerning statements with timestamps
        ''',
        myfile
    ]
)
```

### 5. Multilingual Transcription

```python
# Gemini auto-detects language
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Transcribe this audio and translate to English if needed.', myfile]
)
```

## Token Costs

**Audio Input** (32 tokens/second):
- 1 minute = 1,920 tokens
- 10 minutes = 19,200 tokens
- 1 hour = 115,200 tokens
- 9.5 hours = 1,094,400 tokens

**Example costs** (Gemini 2.5 Flash at $1/1M):
- 1 hour audio: 115,200 tokens = $0.12
- Full day podcast (8 hours): 921,600 tokens = $0.92

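The arithmetic behind these figures can be reproduced with a short estimator. It is a sketch that only uses the 32 tokens/second rate and the $1 per 1M token price quoted above; the function names are hypothetical, and output tokens are not counted.

```python
TOKENS_PER_SECOND = 32  # audio input token rate quoted above


def audio_tokens(duration_seconds):
    """Input tokens consumed by audio of the given duration."""
    return int(duration_seconds * TOKENS_PER_SECOND)


def audio_cost_usd(duration_seconds, price_per_million=1.00):
    """Input cost in USD at the given price per one million tokens."""
    return audio_tokens(duration_seconds) * price_per_million / 1_000_000


print(audio_tokens(3600))                   # 115200
print(round(audio_cost_usd(3600), 2))       # 0.12
print(round(audio_cost_usd(8 * 3600), 2))   # 0.92
```
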
## Limitations

- Maximum 9.5 hours per request
- Auto-downsampled to 16 Kbps mono (quality loss)
- Files expire after 48 hours
- No real-time streaming support
- Non-speech audio less accurate than speech

---

## Related References

**Current**: Audio Processing

**Related Capabilities**:
- [Video Analysis](./video-analysis.md) - Extract audio from videos
- [Video Generation](./video-generation.md) - Generate videos with native audio
- [Image Understanding](./vision-understanding.md) - Analyze audio with visual context

**Back to**: [AI Multimodal Skill](../SKILL.md)
1002
.opencode/skills/ai-multimodal/references/image-generation.md
Normal file
File diff suppressed because it is too large
141
.opencode/skills/ai-multimodal/references/minimax-generation.md
Normal file
@@ -0,0 +1,141 @@
# MiniMax Generation Reference

## Overview

MiniMax provides image, video (Hailuo), speech (TTS), and music generation APIs.
Base URL: `https://api.minimax.io/v1` | Auth: `Bearer {MINIMAX_API_KEY}`

## Image Generation

**Endpoint**: `POST /image_generation`
**Models**: `image-01` (standard), `image-01-live` (enhanced)
**Rate**: 10 RPM | **Cost**: ~$0.03/image

```json
{
  "model": "image-01",
  "prompt": "A girl looking into the distance",
  "aspect_ratio": "16:9",
  "n": 2,
  "response_format": "url",
  "prompt_optimizer": true,
  "subject_reference": [{"type": "character", "image_file": "url", "weight": 0.8}]
}
```

**Aspect ratios**: 1:1, 16:9, 4:3, 3:2, 2:3, 3:4, 9:16, 21:9
**Custom dims**: 512-2048px (divisible by 8)
**Batch**: 1-9 images per request

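The constraints listed above (allowed aspect ratios, 1-9 images per request, custom dimensions of 512-2048px divisible by 8) can be checked before a request is sent. The following is a minimal sketch; `build_image_request` and `valid_custom_dimension` are hypothetical helpers, and the payload keys simply mirror the JSON example above.

```python
ASPECT_RATIOS = {"1:1", "16:9", "4:3", "3:2", "2:3", "3:4", "9:16", "21:9"}


def build_image_request(prompt, model="image-01", aspect_ratio="1:1", n=1):
    """Build a request body, rejecting values outside the documented limits."""
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"unsupported aspect ratio: {aspect_ratio}")
    if not 1 <= n <= 9:
        raise ValueError("n must be between 1 and 9")
    return {
        "model": model,
        "prompt": prompt,
        "aspect_ratio": aspect_ratio,
        "n": n,
        "response_format": "url",
    }


def valid_custom_dimension(pixels):
    """True if a custom width or height is within 512-2048 and divisible by 8."""
    return 512 <= pixels <= 2048 and pixels % 8 == 0
```
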
## Video Generation (Hailuo)

**Endpoints**: POST `/video_generation` → GET `/query/video_generation` → GET `/files/retrieve`
**Async workflow**: Submit task → poll every 10s → download file (URL valid 9h)

### Models
| Model | Features | Resolution |
|-------|----------|-----------|
| `MiniMax-Hailuo-2.3` | Text/image-to-video | 720p/1080p |
| `MiniMax-Hailuo-2.3-Fast` | Same, 50% faster+cheaper | 720p/1080p |
| `MiniMax-Hailuo-02` | First+last frame mode | 720p |
| `S2V-01` | Subject reference | 720p |

**Rate**: 5 RPM | **Cost**: $0.25 (6s/768p), $0.52 (10s/768p)

```json
// Text-to-video
{"prompt": "A dancer", "model": "MiniMax-Hailuo-2.3", "duration": 6, "resolution": "1080P"}

// Image-to-video
{"prompt": "Scene desc", "first_frame_image": "url", "model": "MiniMax-Hailuo-2.3", "duration": 6}

// First+last frame
{"prompt": "Transition", "first_frame_image": "url", "last_frame_image": "url", "model": "MiniMax-Hailuo-02"}

// Subject reference
{"prompt": "Scene with character", "subject_reference": [{"type": "character", "image": ["url"]}], "model": "S2V-01"}
```

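The submit, poll, download workflow above can be sketched with the polling step isolated from the HTTP calls. This is an illustration under stated assumptions: `wait_for_video` is a hypothetical helper, `query_status` stands for whatever function calls `GET /query/video_generation`, and the `status` values (`"Success"`, `"Fail"`) and the `file_id` key are assumed response fields that should be checked against the MiniMax API reference.

```python
import time
from typing import Callable


def wait_for_video(query_status: Callable[[], dict],
                   poll_interval: float = 10.0,
                   max_polls: int = 60,
                   sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the task reports success and return its file id."""
    for _ in range(max_polls):
        result = query_status()
        status = result.get("status")  # assumed field name
        if status == "Success":
            return result["file_id"]   # assumed field name
        if status == "Fail":
            raise RuntimeError("video generation failed")
        sleep(poll_interval)           # the 10s interval suggested above
    raise TimeoutError("video task did not finish in time")
```

Passing `sleep` as a parameter keeps the loop testable without real delays; the returned file id would then be used with `GET /files/retrieve`.
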
## Speech/TTS

**Endpoint**: `POST /speech/speech_t2a_input`
**Models**: `speech-2.8-hd` (best), `speech-2.8-turbo` (fast), `speech-2.6-hd/turbo`, `speech-02-hd/turbo`
**Rate**: 60 RPM | **Cost**: $30-50/1M chars

```json
{
  "model": "speech-2.8-hd",
  "text": "Your text here",
  "voice": "English_Warm_Bestie",
  "emotion": "happy",
  "rate": 1.0,
  "volume": 1.0,
  "pitch": 1.0,
  "output_format": "mp3"
}
```

**Voices**: 300+ system voices, 40+ languages
**Emotions**: happy, sad, angry, fearful, disgusted, surprised, neutral
**Formats**: mp3, wav, pcm, flac
**Text limit**: 10,000 chars

### Voice Cloning
```json
POST /voice_clone
{"audio_url": "https://sample.wav", "clone_name": "my_voice"}
```
Requires 10+ seconds of reference audio. Rate: 60 RPM.

## Music Generation

**Endpoint**: `POST /music_generation`
**Models**: `music-2.5` (latest, vocals+accompaniment, 4min songs)
**Rate**: 120 RPM | **Cost**: $0.03-0.075/generation

```json
{
  "model": "music-2.5",
  "lyrics": "Verse 1\nLine one\n\n[Chorus]\nChorus line",
  "prompt": "Upbeat pop with electronic elements",
  "output_format": "url",
  "audio_setting": {"sample_rate": 44100, "bitrate": 128000, "format": "mp3"}
}
```

**Lyrics**: 1-3500 chars, supports structure tags ([Verse], [Chorus], etc.)
**Prompt**: 0-2000 chars, style/mood description
**Sample rates**: 16000, 24000, 32000, 44100 Hz
**Bitrates**: 32000, 64000, 128000, 256000 bps

## Error Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1002 | Rate limit exceeded |
| 1008 | Insufficient balance |
| 2013 | Invalid parameters |

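A client can turn these codes into exceptions and decide which ones are worth retrying. The sketch below is an assumption-laden illustration: `MiniMaxError`, `check_status`, and `is_retryable` are hypothetical names, and how the code is extracted from a response body is left to the caller.

```python
class MiniMaxError(Exception):
    """Raised for a non-zero MiniMax status code."""


ERROR_MEANINGS = {
    1002: "Rate limit exceeded",
    1008: "Insufficient balance",
    2013: "Invalid parameters",
}

# Assumption: only rate limiting is worth retrying automatically
RETRYABLE = {1002}


def check_status(code):
    """Return None on success (code 0); raise MiniMaxError otherwise."""
    if code == 0:
        return None
    meaning = ERROR_MEANINGS.get(code, "Unknown error")
    raise MiniMaxError(f"{code}: {meaning}")


def is_retryable(code):
    """True if a request failing with this code may succeed on retry."""
    return code in RETRYABLE
```
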
## CLI Examples

```bash
# Image
python minimax_cli.py --task generate --prompt "A cyberpunk city" --model image-01 --aspect-ratio 16:9

# Video
python minimax_cli.py --task generate-video --prompt "A dancer" --model MiniMax-Hailuo-2.3 --duration 6

# Speech
python minimax_cli.py --task generate-speech --text "Hello world" --model speech-2.8-hd --voice English_Warm_Bestie --emotion happy

# Music
python minimax_cli.py --task generate-music --lyrics "La la la\nOh yeah" --prompt "upbeat pop" --model music-2.5
```

## References

- [API Overview](https://platform.minimax.io/docs/api-reference/api-overview)
- [Video Guide](https://platform.minimax.io/docs/guides/video-generation)
- [Speech API](https://platform.minimax.io/docs/api-reference/speech-t2a-intro)
- [Music API](https://platform.minimax.io/docs/api-reference/music-generation)
311
.opencode/skills/ai-multimodal/references/music-generation.md
Normal file
@@ -0,0 +1,311 @@
# Music Generation Reference

Real-time music generation using Lyria RealTime via WebSocket API.

## Core Capabilities

- **Real-time streaming**: Bidirectional WebSocket for continuous generation
- **Dynamic control**: Modify music in real-time during generation
- **Style steering**: Genre, mood, instrumentation guidance
- **Audio output**: 48kHz stereo 16-bit PCM

## Model

**Lyria RealTime** (Experimental)
- WebSocket-based streaming
- Real-time parameter adjustment
- Instrumental only (no vocals)
- Watermarked output

## Quick Start

### Python

```python
from google import genai
import asyncio
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

async def generate_music():
    async with client.aio.live.music.connect() as session:
        # Set style prompts with weights (0.0-1.0)
        await session.set_weighted_prompts([
            {"prompt": "Upbeat corporate background music", "weight": 0.8},
            {"prompt": "Modern electronic elements", "weight": 0.5}
        ])

        # Configure generation parameters
        await session.set_music_generation_config(
            guidance=4.0,    # Prompt adherence (0.0-6.0)
            bpm=120,         # Tempo (60-200)
            density=0.6,     # Note density (0.0-1.0)
            brightness=0.5   # Tonal quality (0.0-1.0)
        )

        # Start playback and collect audio
        await session.play()

        audio_chunks = []
        async for chunk in session:
            audio_chunks.append(chunk.audio_data)

        return b''.join(audio_chunks)
```

### JavaScript

```javascript
const client = new GenaiClient({ apiKey: process.env.GEMINI_API_KEY });

async function generateMusic() {
  const audioBuffer = []; // Collected PCM chunks
  const session = await client.live.music.connect();

  await session.setWeightedPrompts([
    { prompt: "Calm ambient background", weight: 0.9 },
    { prompt: "Nature sounds influence", weight: 0.3 }
  ]);

  await session.setMusicGenerationConfig({
    guidance: 3.5,
    bpm: 80,
    density: 0.4,
    brightness: 0.6
  });

  session.onAudio((audioChunk) => {
    // Process 48kHz stereo PCM audio
    audioBuffer.push(audioChunk);
  });

  await session.play();
}
```

## Configuration Parameters

| Parameter | Range | Default | Description |
|-----------|-------|---------|-------------|
| `guidance` | 0.0-6.0 | 4.0 | Prompt adherence (higher = stricter) |
| `bpm` | 60-200 | 120 | Tempo in beats per minute |
| `density` | 0.0-1.0 | 0.5 | Note/sound density |
| `brightness` | 0.0-1.0 | 0.5 | Tonal quality (higher = brighter) |
| `scale` | 12 keys | C Major | Musical key |
| `mute_bass` | bool | false | Remove bass elements |
| `mute_drums` | bool | false | Remove drum elements |
| `mode` | enum | QUALITY | QUALITY, DIVERSITY, VOCALIZATION |
| `temperature` | 0.0-2.0 | 1.0 | Sampling randomness |
| `top_k` | int | 40 | Sampling top-k |
| `seed` | int | random | Reproducibility seed |

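The numeric ranges in the table above can be checked locally before a configuration is sent to the session. This is a minimal sketch: `validate_music_config` is a hypothetical helper, and it only covers the parameters with an explicit numeric range.

```python
# Ranges copied from the parameter table above
RANGES = {
    "guidance": (0.0, 6.0),
    "bpm": (60, 200),
    "density": (0.0, 1.0),
    "brightness": (0.0, 1.0),
    "temperature": (0.0, 2.0),
}


def validate_music_config(config):
    """Return a list of problems; an empty list means all ranged values fit."""
    problems = []
    for name, value in config.items():
        if name in RANGES:
            low, high = RANGES[name]
            if not low <= value <= high:
                problems.append(f"{name}={value} outside {low}-{high}")
    return problems


print(validate_music_config({"bpm": 240, "density": 0.4}))
# ['bpm=240 outside 60-200']
```
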
## Weighted Prompts

Control generation direction with weighted prompts:

```python
await session.set_weighted_prompts([
    {"prompt": "Main style description", "weight": 1.0},  # Primary
    {"prompt": "Secondary influence", "weight": 0.5},     # Supporting
    {"prompt": "Subtle element", "weight": 0.2}           # Accent
])
```

**Weight guidelines**:
- 0.8-1.0: Dominant influence
- 0.5-0.7: Secondary contribution
- 0.2-0.4: Subtle accent
- 0.0-0.1: Minimal effect

## Style Prompts by Use Case

### Corporate/Marketing

```python
prompts = [
    {"prompt": "Professional corporate background music, modern", "weight": 0.9},
    {"prompt": "Uplifting, optimistic mood", "weight": 0.6},
    {"prompt": "Clean production, minimal complexity", "weight": 0.5}
]
config = {"bpm": 100, "brightness": 0.6, "density": 0.5}
```

### Social Media/Short-form

```python
prompts = [
    {"prompt": "Trending pop electronic beat", "weight": 0.9},
    {"prompt": "Energetic, catchy rhythm", "weight": 0.7},
    {"prompt": "Bass-heavy, punchy", "weight": 0.5}
]
config = {"bpm": 128, "brightness": 0.7, "density": 0.7}
```

### Emotional/Cinematic

```python
prompts = [
    {"prompt": "Cinematic orchestral underscore", "weight": 0.9},
    {"prompt": "Emotional, inspiring", "weight": 0.7},
    {"prompt": "Building tension and release", "weight": 0.5}
]
config = {"bpm": 70, "brightness": 0.4, "density": 0.4}
```

### Ambient/Background

```python
prompts = [
    {"prompt": "Calm ambient soundscape", "weight": 0.9},
    {"prompt": "Minimal, atmospheric", "weight": 0.6},
    {"prompt": "Lo-fi textures", "weight": 0.4}
]
config = {"bpm": 80, "brightness": 0.4, "density": 0.3}
```

## Real-time Transitions

Smoothly transition between styles during generation:

```python
async def dynamic_music_generation():
    async with client.aio.live.music.connect() as session:
        # Start with intro style
        await session.set_weighted_prompts([
            {"prompt": "Soft ambient intro", "weight": 0.9}
        ])
        await session.play()

        # Collect intro (about 4 seconds, depending on chunk size)
        intro_chunks = []
        for _ in range(192):  # ~4 seconds only if each chunk holds ~1,000 frames (assumed)
            chunk = await session.__anext__()
            intro_chunks.append(chunk.audio_data)

        # Transition to main section
        await session.set_weighted_prompts([
            {"prompt": "Building energy", "weight": 0.7},
            {"prompt": "Full beat drop", "weight": 0.5}
        ])

        # Continue with new style...
```

## Output Specifications

- **Format**: Raw 16-bit PCM
- **Sample Rate**: 48,000 Hz
- **Channels**: 2 (stereo)
- **Bit Depth**: 16 bits
- **Watermarking**: Always enabled (SynthID)

### Save to WAV

```python
import wave

def save_pcm_to_wav(pcm_data, filename):
    with wave.open(filename, 'wb') as wav_file:
        wav_file.setnchannels(2)      # Stereo
        wav_file.setsampwidth(2)      # 16-bit
        wav_file.setframerate(48000)  # 48kHz
        wav_file.writeframes(pcm_data)
```

### Convert to MP3

```bash
# Using FFmpeg
ffmpeg -f s16le -ar 48000 -ac 2 -i input.pcm output.mp3
```

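Given the output format above (48,000 Hz, 2 channels, 16 bits per sample), the relationship between byte counts and playback time follows directly. The sketch below works it out; the function names are hypothetical, and it measures collected bytes rather than counting chunks, since the chunk size is not specified here.

```python
SAMPLE_RATE = 48_000      # frames per second
CHANNELS = 2              # stereo
BYTES_PER_SAMPLE = 2      # 16-bit
BYTES_PER_SECOND = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE  # 192,000


def pcm_duration_seconds(pcm_data):
    """Playback length of a raw PCM buffer in seconds."""
    return len(pcm_data) / BYTES_PER_SECOND


def bytes_for_duration(seconds):
    """Bytes needed for the given duration, rounded down to whole frames."""
    frames = int(seconds * SAMPLE_RATE)
    return frames * CHANNELS * BYTES_PER_SAMPLE


print(pcm_duration_seconds(b"\x00" * 384_000))  # 2.0
print(bytes_for_duration(0.5))                  # 96000
```

When collecting audio for a fixed duration, stopping once the accumulated byte count reaches `bytes_for_duration(seconds)` avoids relying on a chunk size estimate.
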
## Integration with Video Production

### Generate Background Music for Video

```python
async def generate_video_background(duration_seconds, mood):
    """Generate background music matching video length"""

    # Configure for video background
    prompts = [
        {"prompt": f"{mood} background music for video", "weight": 0.9},
        {"prompt": "Non-distracting, supportive underscore", "weight": 0.6}
    ]

    async with client.aio.live.music.connect() as session:
        await session.set_weighted_prompts(prompts)
        await session.set_music_generation_config(
            guidance=4.0,
            density=0.4,  # Keep sparse for background
            brightness=0.5
        )
        await session.play()

        # Calculate chunks needed (48kHz stereo = 192000 bytes/second)
        total_chunks = duration_seconds * 48000 // 512  # Chunk size estimate

        audio_data = []
        chunks_received = 0
        # enumerate() does not work on async iterators, so count manually
        async for chunk in session:
            audio_data.append(chunk.audio_data)
            chunks_received += 1
            if chunks_received >= total_chunks:
                break

        return b''.join(audio_data)
```

### Sync with Storyboard Timing

```python
async def generate_scene_music(scenes):
    """Generate music with transitions matching scene changes"""

    all_audio = []

    async with client.aio.live.music.connect() as session:
        for scene in scenes:
            # Update style for each scene
            await session.set_weighted_prompts([
                {"prompt": scene['mood'], "weight": 0.9},
                {"prompt": scene['style'], "weight": 0.5}
            ])

            if scene['index'] == 0:
                await session.play()

            # Collect audio for scene duration
            chunks = int(scene['duration'] * 48000 / 512)
            for _ in range(chunks):
                chunk = await session.__anext__()
                all_audio.append(chunk.audio_data)

    return b''.join(all_audio)
```

## Limitations

- **Instrumental only**: No vocal/singing generation
- **WebSocket required**: Real-time streaming connection
- **Safety filtering**: Prompts undergo safety review
- **Watermarking**: All output contains SynthID watermark
- **Experimental**: API may change

## Best Practices

1. **Buffer audio**: Implement robust buffering for smooth playback
2. **Gradual transitions**: Avoid drastic prompt changes mid-stream
3. **Sparse for backgrounds**: Lower density for video backgrounds
4. **Test prompts**: Iterate on prompt combinations
5. **Cross-fade transitions**: Blend audio at style changes
6. **Match video mood**: Align music tempo/energy with visuals

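The cross-fade practice above can be illustrated with a linear fade over two raw PCM buffers. This is a sketch, not a production mixer: `crossfade` is a hypothetical helper, it assumes 16-bit samples in the host's native byte order (little-endian on most machines, matching `s16le`), and it uses a plain linear ramp rather than an equal-power curve.

```python
from array import array


def crossfade(pcm_a, pcm_b, fade_frames, channels=2):
    """Linearly cross-fade the tail of pcm_a into the head of pcm_b."""
    a = array('h')
    a.frombytes(pcm_a)
    b = array('h')
    b.frombytes(pcm_b)
    n = fade_frames * channels  # number of samples in the overlap
    if n > len(a) or n > len(b):
        raise ValueError("fade is longer than one of the buffers")
    mixed = array('h')
    for i in range(n):
        # Ramp position for this frame, strictly between 0 and 1
        t = (i // channels + 1) / (fade_frames + 1)
        sample = a[len(a) - n + i] * (1 - t) + b[i] * t
        mixed.append(int(round(sample)))
    return (a[:len(a) - n] + mixed + b[n:]).tobytes()
```

The result is shorter than the two inputs combined by `fade_frames` frames, because the overlapping region is mixed into one.
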
## Resources

- [Lyria RealTime Docs](https://ai.google.dev/gemini-api/docs/music-generation)
- [Audio Processing Guide](./audio-processing.md)
- [Video Generation](./video-generation.md)

---

**Related**: [Audio Processing](./audio-processing.md) | [Video Generation](./video-generation.md)

**Back to**: [AI Multimodal Skill](../SKILL.md)
515
.opencode/skills/ai-multimodal/references/video-analysis.md
Normal file
@@ -0,0 +1,515 @@
# Video Analysis Reference

Comprehensive guide for video understanding, temporal analysis, and YouTube processing using Gemini API.

> **Note**: This guide covers video *analysis* (understanding existing videos). For video *generation* (creating new videos), see [Video Generation Reference](./video-generation.md).

## Core Capabilities

- **Video Summarization**: Create concise summaries
- **Question Answering**: Answer specific questions about content
- **Transcription**: Audio transcription with visual descriptions
- **Timestamp References**: Query specific moments (MM:SS format)
- **Video Clipping**: Process specific segments
- **Scene Detection**: Identify scene changes and transitions
- **Multiple Videos**: Compare up to 10 videos (2.5+)
- **YouTube Support**: Analyze YouTube videos directly
- **Custom Frame Rate**: Adjust FPS sampling

## Supported Formats

- MP4, MPEG, MOV, AVI, FLV, MPG, WebM, WMV, 3GPP

## Model Selection

### Gemini 3 Series (Latest)
- **gemini-3-pro-preview**: Latest, agentic workflows, 1M context, dynamic thinking

### Gemini 2.5 Series (Recommended)
- **gemini-2.5-pro**: Best quality, 1M-2M context
- **gemini-2.5-flash**: Balanced, 1M-2M context (recommended)

### Context Windows
- **2M token models**: ~2 hours (default) or ~6 hours (low-res)
- **1M token models**: ~1 hour (default) or ~3 hours (low-res)

## Basic Video Analysis

### Local Video

```python
from google import genai
import os
import time

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Upload video (File API for >20MB)
myfile = client.files.upload(file='video.mp4')

# Wait for processing
while myfile.state.name == 'PROCESSING':
    time.sleep(1)
    myfile = client.files.get(name=myfile.name)

if myfile.state.name == 'FAILED':
    raise ValueError('Video processing failed')

# Analyze
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Summarize this video in 3 key points', myfile]
)
print(response.text)
```

### YouTube Video

```python
from google.genai import types

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Summarize the main topics discussed',
        types.Part.from_uri(
            file_uri='https://www.youtube.com/watch?v=VIDEO_ID',
            mime_type='video/mp4'
        )
    ]
)
```

### Inline Video (<20MB)

```python
with open('short-clip.mp4', 'rb') as f:
    video_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'What happens in this video?',
        types.Part.from_bytes(data=video_bytes, mime_type='video/mp4')
    ]
)
```

## Advanced Features
|
||||
|
||||
### Video Clipping
|
||||
|
||||
```python
# Analyze specific time range
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Summarize this segment',
        # Clip offsets travel in video_metadata alongside the file reference
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
            video_metadata=types.VideoMetadata(
                start_offset='40s',
                end_offset='80s'
            )
        )
    ]
)
```
|
||||
|
||||
### Custom Frame Rate
|
||||
|
||||
```python
# Lower FPS for static content (saves tokens)
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this presentation',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
            video_metadata=types.VideoMetadata(fps=0.5)  # Sample every 2 seconds
        )
    ]
)

# Higher FPS for fast-moving content
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze rapid movements in this sports video',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
            video_metadata=types.VideoMetadata(fps=5)  # Sample 5 times per second
        )
    ]
)
```
|
||||
|
||||
### Multiple Videos (2.5+)
|
||||
|
||||
```python
import time

videos = [
    client.files.upload(file='demo1.mp4'),
    client.files.upload(file='demo2.mp4'),
]

# Wait for processing; store each refreshed file back into the list
for i, video in enumerate(videos):
    while video.state.name == 'PROCESSING':
        time.sleep(1)
        video = client.files.get(name=video.name)
    videos[i] = video

response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Compare these two product demos. Which explains features better?',
        *videos
    ]
)
```
|
||||
|
||||
## Temporal Understanding
|
||||
|
||||
### Timestamp-Based Questions
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'What happens at 01:15 and how does it relate to 02:30?',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Timeline Creation
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Create a timeline with timestamps:
|
||||
- Key events
|
||||
- Scene changes
|
||||
- Important moments
|
||||
Format: MM:SS - Description
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
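
The `MM:SS - Description` format requested above lends itself to post-processing. The following is a minimal sketch, assuming the model returns one event per line in exactly that format; `parse_timeline` and the sample text are hypothetical helpers for illustration, not part of the Gemini SDK.

```python
import re

# Matches lines such as "01:15 - Product demo begins"
TIMELINE_LINE = re.compile(r'^\s*(\d{1,3}):([0-5]\d)\s*-\s*(.+?)\s*$')

def parse_timeline(text):
    """Return a list of (seconds, description) tuples from MM:SS lines."""
    events = []
    for line in text.splitlines():
        match = TIMELINE_LINE.match(line)
        if match:  # skip headings, blank lines, and anything off-format
            minutes, seconds, description = match.groups()
            events.append((int(minutes) * 60 + int(seconds), description))
    return events

# Hypothetical model output used only to exercise the parser
sample = """00:00 - Intro
01:15 - Product demo begins
not a timeline line
02:30 - Q&A"""

print(parse_timeline(sample))
```

In practice you would pass `response.text` instead of `sample`. Lines the model formats differently (for example with a leading bullet) are silently skipped, so it is worth checking how many events were parsed.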
|
||||
|
||||
### Scene Detection
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Identify all scene changes with timestamps and describe each scene',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Transcription
|
||||
|
||||
### Basic Transcription
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Transcribe the audio from this video',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### With Visual Descriptions
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Transcribe with visual context:
|
||||
- Audio transcription
|
||||
- Visual descriptions of important moments
|
||||
- Timestamps for salient events
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Speaker Identification
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Transcribe with speaker labels and timestamps',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. Video Summarization
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Summarize this video:
|
||||
1. Main topic and purpose
|
||||
2. Key points with timestamps
|
||||
3. Conclusion or call-to-action
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Educational Content
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Create educational materials:
|
||||
1. List key concepts taught
|
||||
2. Create 5 quiz questions with answers
|
||||
3. Provide timestamp for each concept
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Action Detection
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'List all actions performed in this tutorial with timestamps',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 4. Content Moderation
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Review video content:
|
||||
1. Identify any problematic content
|
||||
2. Note timestamps of concerns
|
||||
3. Provide content rating recommendation
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 5. Interview Analysis
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'''Analyze interview:
|
||||
1. Questions asked (timestamps)
|
||||
2. Key responses
|
||||
3. Candidate body language and demeanor
|
||||
4. Overall assessment
|
||||
''',
|
||||
myfile
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### 6. Sports Analysis
|
||||
|
||||
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze sports video:
1. Key plays with timestamps
2. Player movements and positioning
3. Game strategy observations
''',
        types.Part(
            file_data=types.FileData(file_uri=myfile.uri, mime_type=myfile.mime_type),
            video_metadata=types.VideoMetadata(fps=5)  # Higher FPS for fast action
        )
    ]
)
```
|
||||
|
||||
## YouTube Specific Features
|
||||
|
||||
### Public Video Requirements
|
||||
|
||||
- Video must be public (not private or unlisted)
|
||||
- No age-restricted content
|
||||
- Valid video ID required
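
A quick client-side check can catch malformed URLs before a request is sent. This is a rough sketch: the helper name `extract_video_id` is hypothetical, and the 11-character ID pattern is an assumption based on commonly observed IDs, not a documented guarantee. It cannot tell whether a video is public or age-restricted.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed ID shape: 11 characters from letters, digits, "_" and "-"
VIDEO_ID = re.compile(r'^[A-Za-z0-9_-]{11}$')

def extract_video_id(url):
    """Return the video ID from a watch or youtu.be URL, or None."""
    parsed = urlparse(url)
    host = (parsed.hostname or '').lower()
    if host in ('www.youtube.com', 'youtube.com', 'm.youtube.com'):
        candidate = parse_qs(parsed.query).get('v', [''])[0]
    elif host == 'youtu.be':
        candidate = parsed.path.lstrip('/')
    else:
        return None
    return candidate if VIDEO_ID.match(candidate) else None

print(extract_video_id('https://www.youtube.com/watch?v=dQw4w9WgXcQ'))
```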
|
||||
|
||||
### Usage Example
|
||||
|
||||
```python
|
||||
# YouTube URL
|
||||
youtube_uri = 'https://www.youtube.com/watch?v=dQw4w9WgXcQ'
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Create chapter markers with timestamps',
|
||||
types.Part.from_uri(uri=youtube_uri, mime_type='video/mp4')
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Rate Limits
|
||||
|
||||
- **Free tier**: 8 hours of YouTube video per day
|
||||
- **Paid tier**: No length-based limits
|
||||
- Public videos only
|
||||
|
||||
## Token Calculation
|
||||
|
||||
Video tokens depend on resolution and FPS:
|
||||
|
||||
**Default resolution** (~300 tokens/second):
|
||||
- 1 minute = 18,000 tokens
|
||||
- 10 minutes = 180,000 tokens
|
||||
- 1 hour = 1,080,000 tokens
|
||||
|
||||
**Low resolution** (~100 tokens/second):
|
||||
- 1 minute = 6,000 tokens
|
||||
- 10 minutes = 60,000 tokens
|
||||
- 1 hour = 360,000 tokens
|
||||
|
||||
**Context windows**:
|
||||
- 2M tokens ≈ 2 hours (default) or 6 hours (low-res)
|
||||
- 1M tokens ≈ 1 hour (default) or 3 hours (low-res)
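
The arithmetic above can be wrapped in a small estimator. This is a sketch using only the approximate rates quoted in this section (300 and 100 tokens per second); the function names are hypothetical, and actual usage should be confirmed against the token counts the API reports.

```python
# Approximate rates quoted in this section (tokens per second of video)
DEFAULT_TOKENS_PER_SECOND = 300
LOW_RES_TOKENS_PER_SECOND = 100

def estimate_video_tokens(duration_seconds, low_res=False):
    """Estimate input tokens for a video of the given length."""
    rate = LOW_RES_TOKENS_PER_SECOND if low_res else DEFAULT_TOKENS_PER_SECOND
    return int(duration_seconds * rate)

def fits_in_context(duration_seconds, context_tokens, low_res=False):
    """Check the estimate against a context window, ignoring prompt tokens."""
    return estimate_video_tokens(duration_seconds, low_res) <= context_tokens

print(estimate_video_tokens(60))                        # 1 minute, default
print(estimate_video_tokens(3600, low_res=True))        # 1 hour, low-res
print(fits_in_context(3600, 1_000_000))                 # 1 hour, default, 1M window
print(fits_in_context(3600, 1_000_000, low_res=True))   # 1 hour, low-res, 1M window
```

Note that a full hour at the default rate (1,080,000 tokens) slightly exceeds a 1,000,000-token window, which is why the "≈ 1 hour" figure should be read as an upper bound rather than a safe target.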
|
||||
|
||||
## Best Practices
|
||||
|
||||
### File Management
|
||||
|
||||
1. Use File API for videos >20MB (most videos)
|
||||
2. Wait for ACTIVE state before analysis
|
||||
3. Files auto-delete after 48 hours
|
||||
4. Clean up manually:
|
||||
```python
|
||||
client.files.delete(name=myfile.name)
|
||||
```
|
||||
|
||||
### Optimization Strategies
|
||||
|
||||
**Reduce token usage**:
|
||||
- Process specific segments using start/end offsets
|
||||
- Use lower FPS for static content
|
||||
- Use low-resolution mode for long videos
|
||||
- Split very long videos into chunks
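
Splitting a long video into chunks comes down to generating start/end offset pairs. The helper below is a minimal sketch with a hypothetical name (`chunk_offsets`) and an arbitrary 10-minute default; it produces offset strings in the same `'40s'` style used for clipping in this reference and assumes whole-second durations.

```python
def chunk_offsets(total_seconds, chunk_seconds=600):
    """Yield (start_offset, end_offset) strings such as ('0s', '600s')."""
    if chunk_seconds <= 0:
        raise ValueError('chunk_seconds must be positive')
    start = 0
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        yield f'{start}s', f'{end}s'
        start = end

# A 25-minute video split into 10-minute segments
for start_offset, end_offset in chunk_offsets(1500):
    print(start_offset, end_offset)
```

Each pair can then be sent as the start and end offset of a separate request, and the per-chunk results merged afterwards.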
|
||||
|
||||
**Improve accuracy**:
|
||||
- Provide context in prompts
|
||||
- Use higher FPS for fast-moving content
|
||||
- Use Pro model for complex analysis
|
||||
- Be specific about what to extract
|
||||
|
||||
### Prompt Engineering
|
||||
|
||||
**Effective prompts**:
|
||||
- "Summarize key points with timestamps in MM:SS format"
|
||||
- "Identify all scene changes and describe each scene"
|
||||
- "Extract action items mentioned with timestamps"
|
||||
- "Compare these two videos on: X, Y, Z criteria"
|
||||
|
||||
**Structured output**:
|
||||
```python
from pydantic import BaseModel
from typing import List

class VideoEvent(BaseModel):
    timestamp: str  # MM:SS format
    description: str
    category: str

class VideoAnalysis(BaseModel):
    summary: str
    events: List[VideoEvent]
    duration: str

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this video', myfile],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=VideoAnalysis
    )
)
```
|
||||
|
||||
### Error Handling
|
||||
|
||||
```python
import time

def upload_and_process_video(file_path, max_wait=300):
    """Upload video and wait for processing"""
    myfile = client.files.upload(file=file_path)

    # Poll every 5 seconds until processing ends or max_wait is reached
    elapsed = 0
    while myfile.state.name == 'PROCESSING' and elapsed < max_wait:
        time.sleep(5)
        myfile = client.files.get(name=myfile.name)
        elapsed += 5

    if myfile.state.name == 'FAILED':
        raise ValueError(f'Video processing failed: {myfile.state.name}')

    if myfile.state.name == 'PROCESSING':
        raise TimeoutError(f'Processing timeout after {max_wait}s')

    return myfile
```
|
||||
|
||||
## Cost Optimization
|
||||
|
||||
**Token costs** (Gemini 2.5 Flash at $1/1M):
|
||||
- 1 minute video (default): 18,000 tokens = $0.018
|
||||
- 10 minute video: 180,000 tokens = $0.18
|
||||
- 1 hour video: 1,080,000 tokens = $1.08
|
||||
|
||||
**Strategies**:
|
||||
- Use video clipping for specific segments
|
||||
- Lower FPS for static content
|
||||
- Use low-resolution mode for long videos
|
||||
- Batch related queries on same video
|
||||
- Use context caching for repeated queries
|
||||
|
||||
## Limitations
|
||||
|
||||
- Maximum 6 hours (low-res) or 2 hours (default)
|
||||
- YouTube videos must be public
|
||||
- No live streaming analysis
|
||||
- Files expire after 48 hours
|
||||
- Processing time varies by video length
|
||||
- No real-time processing
|
||||
- Limited to 10 videos per request (2.5+)
|
||||
|
||||
---
|
||||
|
||||
## Related References
|
||||
|
||||
**Current**: Video Analysis
|
||||
|
||||
**Related Capabilities**:
|
||||
- [Video Generation](./video-generation.md) - Creating videos from text/images
|
||||
- [Audio Processing](./audio-processing.md) - Extract and analyze audio tracks
|
||||
- [Image Understanding](./vision-understanding.md) - Analyze individual frames
|
||||
|
||||
**Back to**: [AI Multimodal Skill](../SKILL.md)
|
||||
457
.opencode/skills/ai-multimodal/references/video-generation.md
Normal file
@@ -0,0 +1,457 @@
|
||||
# Video Generation Reference
|
||||
|
||||
Comprehensive guide for video creation using Veo models via Gemini API.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
- **Text-to-Video**: Generate 8-second videos from text prompts
|
||||
- **Image-to-Video**: Animate images with text direction
|
||||
- **Video Extension**: Continue previously generated videos
|
||||
- **Frame Control**: Precise camera movements and effects
|
||||
- **Native Audio**: Synchronized audio generation
|
||||
- **Multiple Resolutions**: 720p and 1080p output
|
||||
- **Aspect Ratios**: 16:9, 9:16, 1:1
|
||||
|
||||
## Models
|
||||
|
||||
### Veo 3.1 Preview (Latest)
|
||||
|
||||
**veo-3.1-generate-preview** - Latest with advanced controls
|
||||
- Frame-specific generation
|
||||
- Up to 3 reference images for image-to-video
|
||||
- Video extension capability
|
||||
- Native audio generation
|
||||
- Resolution: 720p, 1080p
|
||||
- Duration: 8 seconds at 24fps
|
||||
- Status: Preview (API may change)
|
||||
- Updated: September 2025
|
||||
|
||||
**veo-3.1-fast-generate-preview** - Speed-optimized
|
||||
- Optimized for business use cases
|
||||
- Programmatic ad creation
|
||||
- Social media content
|
||||
- Same features as standard but faster
|
||||
- Status: Preview
|
||||
- Updated: September 2025
|
||||
|
||||
### Veo 3.0 Stable
|
||||
|
||||
**veo-3.0-generate-001** - Production-ready
|
||||
- Native audio generation
|
||||
- Text-to-video and image-to-video
|
||||
- 720p and 1080p (16:9 only)
|
||||
- 8 seconds at 24fps
|
||||
- Status: Stable
|
||||
- Updated: July 2025
|
||||
|
||||
**veo-3.0-fast-generate-001** - Stable fast variant
|
||||
- Speed-optimized stable version
|
||||
- Same reliability as 3.0
|
||||
- Status: Stable
|
||||
- Updated: July 2025
|
||||
|
||||
## Model Comparison
|
||||
|
||||
| Model | Speed | Features | Audio | Status | Best For |
|
||||
|-------|-------|----------|-------|--------|----------|
|
||||
| veo-3.1-preview | Medium | All | ✓ | Preview | Latest features |
|
||||
| veo-3.1-fast | Fast | All | ✓ | Preview | Business/speed |
|
||||
| veo-3.0-001 | Medium | Standard | ✓ | Stable | Production |
|
||||
| veo-3.0-fast | Fast | Standard | ✓ | Stable | Production/speed |
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Text-to-Video
|
||||
|
||||
```python
from google import genai
from google.genai import types
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Basic generation
response = client.models.generate_video(
    model='veo-3.1-generate-preview',
    prompt='A serene beach at sunset with gentle waves rolling onto the shore',
    config=types.VideoGenerationConfig(
        resolution='1080p',
        aspect_ratio='16:9'
    )
)

# Save video
with open('output.mp4', 'wb') as f:
    f.write(response.video.data)
```
|
||||
|
||||
### Image-to-Video
|
||||
|
||||
```python
|
||||
import PIL.Image
|
||||
|
||||
# Load reference image
|
||||
ref_image = PIL.Image.open('beach.jpg')
|
||||
|
||||
# Animate the image
|
||||
response = client.models.generate_video(
|
||||
model='veo-3.1-generate-preview',
|
||||
prompt='Camera slowly pans across the scene from left to right',
|
||||
reference_images=[ref_image],
|
||||
config=types.VideoGenerationConfig(
|
||||
resolution='1080p'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### Multiple Reference Images
|
||||
|
||||
```python
|
||||
# Use up to 3 reference images for complex scenes
|
||||
img1 = PIL.Image.open('foreground.jpg')
|
||||
img2 = PIL.Image.open('background.jpg')
|
||||
img3 = PIL.Image.open('subject.jpg')
|
||||
|
||||
response = client.models.generate_video(
|
||||
model='veo-3.1-generate-preview',
|
||||
prompt='Combine these elements into a cohesive animated scene',
|
||||
reference_images=[img1, img2, img3],
|
||||
config=types.VideoGenerationConfig(
|
||||
resolution='1080p',
|
||||
aspect_ratio='16:9'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Video Extension
|
||||
|
||||
```python
|
||||
# Continue from previously generated video
|
||||
previous_video = open('part1.mp4', 'rb').read()
|
||||
|
||||
response = client.models.extend_video(
|
||||
model='veo-3.1-generate-preview',
|
||||
video=previous_video,
|
||||
prompt='The scene transitions to nighttime with stars appearing'
|
||||
)
|
||||
```
|
||||
|
||||
### Frame Control
|
||||
|
||||
```python
|
||||
# Precise camera movements
|
||||
response = client.models.generate_video(
|
||||
model='veo-3.1-generate-preview',
|
||||
prompt='A mountain landscape',
|
||||
config=types.VideoGenerationConfig(
|
||||
resolution='1080p',
|
||||
camera_motion='zoom_in', # Options: zoom_in, zoom_out, pan_left, pan_right, tilt_up, tilt_down, static
|
||||
motion_speed='slow' # Options: slow, medium, fast
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Prompt Engineering
|
||||
|
||||
### Effective Video Prompts
|
||||
|
||||
**Structure**:
|
||||
1. **Subject**: What's in the scene
|
||||
2. **Action**: What's happening
|
||||
3. **Camera**: How it's filmed
|
||||
4. **Style**: Visual treatment
|
||||
5. **Timing**: Pacing details
|
||||
|
||||
**Example**:
|
||||
```
|
||||
"A hummingbird [subject] hovers near a red flower, then flies away [action].
|
||||
Slow-motion close-up shot [camera] with vibrant colors and soft focus background [style].
|
||||
Gentle, peaceful pacing [timing]."
|
||||
```
|
||||
|
||||
### Action Verbs
|
||||
|
||||
**Movement**:
|
||||
- "walks", "runs", "flies", "swims", "dances"
|
||||
- "rotates", "spins", "rolls", "bounces"
|
||||
- "emerges", "disappears", "transforms"
|
||||
|
||||
**Camera**:
|
||||
- "zoom in on", "pull back from", "follow"
|
||||
- "orbit around", "track alongside"
|
||||
- "tilt up to reveal", "pan across"
|
||||
|
||||
**Transitions**:
|
||||
- "gradually changes from... to..."
|
||||
- "morphs into", "dissolves into"
|
||||
- "cuts to", "fades to"
|
||||
|
||||
### Timing Control
|
||||
|
||||
```python
|
||||
# Explicit timing in prompt
|
||||
prompt = '''
|
||||
0-2s: Close-up of a seed in soil
|
||||
2-4s: Time-lapse of sprout emerging
|
||||
4-6s: Growing into a small plant
|
||||
6-8s: Zoom out to show garden context
|
||||
'''
|
||||
```
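
Timed prompts like the one above can also be assembled from data, which makes it easy to check that the segments are contiguous and fill the clip. This is an illustrative sketch: `build_timed_prompt` is a hypothetical helper, and `CLIP_SECONDS` assumes the fixed 8-second duration described in this reference.

```python
CLIP_SECONDS = 8  # fixed clip length stated in this reference

def build_timed_prompt(segments):
    """Build a timed prompt from (start_s, end_s, description) tuples."""
    previous_end = 0
    lines = []
    for start, end, text in segments:
        # Each segment must begin where the previous one ended
        if start != previous_end or end <= start:
            raise ValueError(f'segment {start}-{end}s is not contiguous')
        lines.append(f'{start}-{end}s: {text}')
        previous_end = end
    if previous_end != CLIP_SECONDS:
        raise ValueError(f'segments must cover exactly {CLIP_SECONDS}s')
    return '\n'.join(lines)

prompt = build_timed_prompt([
    (0, 2, 'Close-up of a seed in soil'),
    (2, 4, 'Time-lapse of sprout emerging'),
    (4, 6, 'Growing into a small plant'),
    (6, 8, 'Zoom out to show garden context'),
])
print(prompt)
```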
|
||||
|
||||
## Configuration Options
|
||||
|
||||
### Resolution
|
||||
|
||||
```python
|
||||
config = types.VideoGenerationConfig(
|
||||
resolution='1080p' # Options: 720p, 1080p
|
||||
)
|
||||
```
|
||||
|
||||
**Considerations**:
|
||||
- 1080p: Higher quality, longer generation time, larger file
|
||||
- 720p: Faster generation, smaller file, good for drafts
|
||||
|
||||
### Aspect Ratios
|
||||
|
||||
```python
|
||||
config = types.VideoGenerationConfig(
|
||||
aspect_ratio='16:9' # Options: 16:9, 9:16, 1:1
|
||||
)
|
||||
```
|
||||
|
||||
**Use Cases**:
|
||||
- 16:9: Landscape, YouTube, traditional video
|
||||
- 9:16: Mobile, TikTok, Instagram Stories
|
||||
- 1:1: Square, Instagram feed, versatile
|
||||
|
||||
### Audio Control
|
||||
|
||||
```python
|
||||
config = types.VideoGenerationConfig(
|
||||
include_audio=True # Default: True
|
||||
)
|
||||
```
|
||||
|
||||
Native audio is generated automatically and synchronized with video content.
|
||||
|
||||
## Best Practices
|
||||
|
||||
### 1. Prompt Quality
|
||||
|
||||
**Be specific**:
|
||||
- ❌ "A person walking"
|
||||
- ✅ "A young woman in a red coat walking through a park in autumn"
|
||||
|
||||
**Include motion**:
|
||||
- ❌ "A city street"
|
||||
- ✅ "A busy city street with cars passing and people crossing"
|
||||
|
||||
**Specify camera**:
|
||||
- ❌ "A mountain"
|
||||
- ✅ "Aerial drone shot slowly ascending over a snow-capped mountain"
|
||||
|
||||
### 2. Reference Images
|
||||
|
||||
**Quality**:
|
||||
- Use high-resolution images (1080p+)
|
||||
- Clear, well-lit subjects
|
||||
- Minimal motion blur
|
||||
|
||||
**Composition**:
|
||||
- Match desired final aspect ratio
|
||||
- Leave room for motion/movement
|
||||
- Consider camera angle in prompt
|
||||
|
||||
### 3. Performance Optimization
|
||||
|
||||
**Generation Time**:
|
||||
- 720p: ~30-60 seconds
|
||||
- 1080p: ~60-120 seconds
|
||||
- Fast models: 30-50% faster
|
||||
|
||||
**Strategies**:
|
||||
- Use 720p for iteration/drafts
|
||||
- Use fast models for rapid feedback
|
||||
- Batch multiple requests
|
||||
- Use async processing for UI responsiveness
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### 1. Product Demos
|
||||
|
||||
```python
|
||||
response = client.models.generate_video(
|
||||
model='veo-3.0-fast-generate-001',
|
||||
prompt='''
|
||||
Professional product video:
|
||||
- Sleek smartphone rotating on a pedestal
|
||||
- Clean white background with soft shadows
|
||||
- Slow 360-degree rotation
|
||||
- Spotlight highlighting premium design
|
||||
- Modern, minimalist aesthetic
|
||||
''',
|
||||
config=types.VideoGenerationConfig(
|
||||
resolution='1080p',
|
||||
aspect_ratio='1:1'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### 2. Social Media Content
|
||||
|
||||
```python
|
||||
response = client.models.generate_video(
|
||||
model='veo-3.1-fast-generate-preview',
|
||||
prompt='''
|
||||
Trendy social media clip:
|
||||
- Text overlay "NEW ARRIVAL" appears
|
||||
- Fashion product showcase
|
||||
- Quick cuts and dynamic camera
|
||||
- Vibrant colors, high energy
|
||||
- Upbeat pacing
|
||||
''',
|
||||
config=types.VideoGenerationConfig(
|
||||
resolution='1080p',
|
||||
aspect_ratio='9:16' # Mobile
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
### 3. Explainer Animations
|
||||
|
||||
```python
|
||||
response = client.models.generate_video(
|
||||
model='veo-3.1-generate-preview',
|
||||
prompt='''
|
||||
Educational animation:
|
||||
- Simple diagram illustrating data flow
|
||||
- Arrows and icons animating in sequence
|
||||
- Clean, clear visual hierarchy
|
||||
- Smooth transitions between steps
|
||||
- Professional corporate style
|
||||
''',
|
||||
config=types.VideoGenerationConfig(
|
||||
resolution='720p',
|
||||
aspect_ratio='16:9'
|
||||
)
|
||||
)
|
||||
```
|
||||
|
||||
## Safety & Content Policy
|
||||
|
||||
### Safety Settings
|
||||
|
||||
```python
|
||||
config = types.VideoGenerationConfig(
|
||||
safety_settings=[
|
||||
types.SafetySetting(
|
||||
category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
|
||||
threshold=types.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Prohibited Content
|
||||
|
||||
- Violence, gore, harm
|
||||
- Sexually explicit content
|
||||
- Hate speech, harassment
|
||||
- Copyrighted characters/brands
|
||||
- Real people (without consent)
|
||||
- Misleading/deceptive content
|
||||
|
||||
## Limitations
|
||||
|
||||
- **Duration**: Fixed 8 seconds (as of Sept 2025)
|
||||
- **Frame Rate**: 24fps only
|
||||
- **File Size**: ~5-20MB per video
|
||||
- **Generation Time**: 30s-2min depending on resolution
|
||||
- **Reference Images**: Max 3 images
|
||||
- **Preview Status**: API may change (3.1 models)
|
||||
- **Audio**: Cannot upload custom audio (native only)
|
||||
- **No real-time**: Pre-generation required
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Long Generation Times
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
# Track generation progress
|
||||
start = time.time()
|
||||
response = client.models.generate_video(...)
|
||||
duration = time.time() - start
|
||||
print(f"Generated in {duration:.1f}s")
|
||||
```
|
||||
|
||||
**Expected times**:
|
||||
- Fast models + 720p: 30-45s
|
||||
- Standard models + 720p: 45-90s
|
||||
- Fast models + 1080p: 45-60s
|
||||
- Standard models + 1080p: 60-120s
|
||||
|
||||
### Safety Filter Blocking
|
||||
|
||||
```python
try:
    response = client.models.generate_video(...)
except Exception as e:
    if 'safety' in str(e).lower():
        print("Video blocked by safety filters")
        # Modify prompt and retry
```
|
||||
|
||||
### Quota Exceeded
|
||||
|
||||
```python
# Implement exponential backoff
import time

def generate_with_retry(model, prompt, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.models.generate_video(model=model, prompt=prompt)
        except Exception as e:
            if '429' in str(e):  # Rate limit
                wait = 2 ** attempt  # 1s, 2s, 4s, ...
                print(f"Rate limited, waiting {wait}s...")
                time.sleep(wait)
            else:
                raise
    raise Exception("Max retries exceeded")
```
|
||||
|
||||
## Cost Estimation
|
||||
|
||||
**Pricing**: TBD (preview models)
|
||||
|
||||
**Estimated based on compute**:
|
||||
- Fast + 720p: ~$0.05-$0.10 per video
|
||||
- Standard + 1080p: ~$0.15-$0.25 per video
|
||||
|
||||
**Monitor**: https://ai.google.dev/pricing
|
||||
|
||||
## Resources
|
||||
|
||||
- [Veo API Docs](https://ai.google.dev/gemini-api/docs/video)
|
||||
- [Video Generation Guide](https://ai.google.dev/gemini-api/docs/video#model-versions)
|
||||
- [Content Policy](https://ai.google.dev/gemini-api/docs/safety)
|
||||
- [Get API Key](https://aistudio.google.com/apikey)
|
||||
|
||||
---
|
||||
|
||||
## Related References
|
||||
|
||||
**Current**: Video Generation
|
||||
|
||||
**Related Capabilities**:
|
||||
- [Video Analysis](./video-analysis.md) - Understanding existing videos
|
||||
- [Image Generation](./image-generation.md) - Creating static images
|
||||
- [Image Understanding](./vision-understanding.md) - Analyzing reference images
|
||||
|
||||
**Back to**: [AI Multimodal Skill](../SKILL.md)
|
||||
@@ -0,0 +1,492 @@
|
||||
# Vision Understanding Reference
|
||||
|
||||
Comprehensive guide for image analysis, object detection, and visual understanding using Gemini API.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
- **Captioning**: Generate descriptive text for images
|
||||
- **Classification**: Categorize and identify content
|
||||
- **Visual Q&A**: Answer questions about images
|
||||
- **Object Detection**: Locate objects with bounding boxes (2.0+)
|
||||
- **Segmentation**: Create pixel-level masks (2.5+)
|
||||
- **Multi-image**: Compare up to 3,600 images
|
||||
- **OCR**: Extract text from images
|
||||
- **Document Understanding**: Process PDFs with vision
|
||||
|
||||
## Supported Formats
|
||||
|
||||
- **Images**: PNG, JPEG, WEBP, HEIC, HEIF
|
||||
- **Documents**: PDF (up to 1,000 pages)
|
||||
- **Size Limits**:
|
||||
- Inline: 20MB max total request
|
||||
- File API: 2GB per file
|
||||
- Max images: 3,600 per request
|
||||
|
||||
## Model Selection
|
||||
|
||||
### Gemini 2.5 Series
|
||||
- **gemini-2.5-pro**: Best quality, segmentation + detection
|
||||
- **gemini-2.5-flash**: Fast, efficient, all features
|
||||
- **gemini-2.5-flash-lite**: Lightweight, all features
|
||||
|
||||
### Feature Requirements
|
||||
- **Segmentation**: Requires 2.5+ models
|
||||
- **Object Detection**: Requires 2.0+ models
|
||||
- **Multi-image**: All models (up to 3,600 images)
|
||||
|
||||
## Basic Image Analysis
|
||||
|
||||
### Image Captioning
|
||||
|
||||
```python
from google import genai
import os

client = genai.Client(api_key=os.getenv('GEMINI_API_KEY'))

# Local file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

# Reusable image part; later examples refer to it as img_part
img_part = genai.types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Describe this image in detail',
        img_part
    ]
)
print(response.text)
```
|
||||
|
||||
### Image Classification
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Classify this image. Provide category and confidence level.',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### Visual Question Answering
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'How many people are in this image and what are they doing?',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Advanced Features
|
||||
|
||||
### Object Detection (2.0+)
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Detect all objects in this image and provide bounding boxes',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
|
||||
# Returns bounding box coordinates: [ymin, xmin, ymax, xmax]
|
||||
# Normalized to [0, 1000] range
|
||||
```
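
Because the coordinates are normalized to a 0-1000 range, they need to be scaled to the image's pixel dimensions before drawing or cropping. The helper below is a minimal sketch; `to_pixel_box` is a hypothetical name, and it assumes the box has already been parsed from the response into a list of four integers.

```python
def to_pixel_box(box, image_width, image_height):
    """Convert [ymin, xmin, ymax, xmax] on a 0-1000 scale to
    (left, top, right, bottom) in pixels."""
    ymin, xmin, ymax, xmax = box
    # Integer arithmetic avoids floating-point rounding surprises
    left = xmin * image_width // 1000
    top = ymin * image_height // 1000
    right = xmax * image_width // 1000
    bottom = ymax * image_height // 1000
    return left, top, right, bottom

# Example box on a 1920x1080 image
print(to_pixel_box([250, 100, 750, 900], image_width=1920, image_height=1080))
# (192, 270, 1728, 810)
```

The returned tuple uses the (left, top, right, bottom) order that Pillow's `Image.crop` expects, so it can be passed to it directly.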
|
||||
|
||||
### Segmentation (2.5+)
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Create a segmentation mask for all people in this image',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
|
||||
# Returns pixel-level masks for requested objects
|
||||
```
|
||||
|
||||
### Multi-Image Comparison
|
||||
|
||||
```python
|
||||
import PIL.Image
|
||||
|
||||
img1 = PIL.Image.open('photo1.jpg')
|
||||
img2 = PIL.Image.open('photo2.jpg')
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Compare these two images. What are the differences?',
|
||||
img1,
|
||||
img2
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
### OCR and Text Extraction
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Extract all visible text from this image',
|
||||
img_part
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Input Methods
|
||||
|
||||
### Inline Data (<20MB)
|
||||
|
||||
```python
from google.genai import types

# From file
with open('image.jpg', 'rb') as f:
    img_bytes = f.read()

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Analyze this image',
        types.Part.from_bytes(data=img_bytes, mime_type='image/jpeg')
    ]
)
```
|
||||
|
||||
### PIL Image
|
||||
|
||||
```python
|
||||
import PIL.Image
|
||||
|
||||
img = PIL.Image.open('photo.jpg')
|
||||
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['What is in this image?', img]
|
||||
)
|
||||
```
|
||||
|
||||
### File API (>20MB or Reuse)
|
||||
|
||||
```python
|
||||
# Upload once
|
||||
myfile = client.files.upload(file='large-image.jpg')
|
||||
|
||||
# Use multiple times
|
||||
response1 = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['Describe this image', myfile]
|
||||
)
|
||||
|
||||
response2 = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=['What colors dominate this image?', myfile]
|
||||
)
|
||||
```
|
||||
|
||||
### URL (Public Images)
|
||||
|
||||
```python
|
||||
response = client.models.generate_content(
|
||||
model='gemini-2.5-flash',
|
||||
contents=[
|
||||
'Analyze this image',
|
||||
types.Part.from_uri(
|
||||
uri='https://example.com/image.jpg',
|
||||
mime_type='image/jpeg'
|
||||
)
|
||||
]
|
||||
)
|
||||
```
|
||||
|
||||
## Token Calculation
|
||||
|
||||
Images consume tokens based on size:
|
||||
|
||||
**Small images** (≤384px both dimensions): 258 tokens
|
||||
|
||||
**Large images**: Tiled into 768×768 chunks, 258 tokens each
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
crop_unit = floor(min(width, height) / 1.5)
|
||||
tiles = (width / crop_unit) × (height / crop_unit)
|
||||
total_tokens = tiles × 258
|
||||
```
|
||||
|
||||
**Examples**:
|
||||
- 256×256: 258 tokens (small)
|
||||
- 512×512: 258 tokens (small)
|
||||
- 960×540: 6 tiles = 1,548 tokens
|
||||
- 1920×1080: 6 tiles = 1,548 tokens
|
||||
- 3840×2160 (4K): 24 tiles = 6,192 tokens
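
The formula above can be expressed as a short function. This is a rough sketch under stated assumptions: it rounds each tile count up (the formula does not say how fractions are handled), and `estimate_image_tokens` is a hypothetical helper, not an SDK call. It reproduces the 960×540 and 1920×1080 figures; the 512×512 and 4K figures listed above imply additional rules (such as a cap on tile size) that this sketch does not model, so treat the result as an estimate and confirm with the token counts the API reports.

```python
import math

TOKENS_PER_TILE = 258
SMALL_IMAGE_MAX = 384  # both dimensions at or below this count as one tile

def estimate_image_tokens(width, height):
    """Estimate image tokens using the crop-unit formula from this section."""
    if width <= SMALL_IMAGE_MAX and height <= SMALL_IMAGE_MAX:
        return TOKENS_PER_TILE
    crop_unit = math.floor(min(width, height) / 1.5)
    # Assumption: partial tiles are rounded up to whole tiles
    tiles = math.ceil(width / crop_unit) * math.ceil(height / crop_unit)
    return tiles * TOKENS_PER_TILE

print(estimate_image_tokens(256, 256))
print(estimate_image_tokens(960, 540))
print(estimate_image_tokens(1920, 1080))
```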
|
||||
|
||||
## Structured Output
|
||||
|
||||
### JSON Schema Output
|
||||
|
||||
```python
from pydantic import BaseModel
from typing import List

class ObjectDetection(BaseModel):
    object_name: str
    confidence: float
    bounding_box: List[int]  # [ymin, xmin, ymax, xmax]

class ImageAnalysis(BaseModel):
    description: str
    objects: List[ObjectDetection]
    scene_type: str

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze this image', img_part],
    config=genai.types.GenerateContentConfig(
        response_mime_type='application/json',
        response_schema=ImageAnalysis
    )
)

# Validate the JSON response against the schema
result = ImageAnalysis.model_validate_json(response.text)
```
|
||||
|
||||
## Multi-Image Analysis

### Batch Processing

```python
import PIL.Image

# Load image0.jpg through image9.jpg
images = [
    PIL.Image.open(f'image{i}.jpg')
    for i in range(10)
]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=['Analyze these images and find common themes'] + images
)
```

### Image Comparison

```python
import PIL.Image

before = PIL.Image.open('before.jpg')
after = PIL.Image.open('after.jpg')

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Compare before and after. List all visible changes.',
        before,
        after
    ]
)
```

### Visual Search

```python
import PIL.Image

reference = PIL.Image.open('target.jpg')
candidates = [PIL.Image.open(f'option{i}.jpg') for i in range(5)]

response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Find which candidate images contain objects similar to the reference',
        reference
    ] + candidates
)
```

## Best Practices

### Image Quality

1. **Resolution**: Use clear, non-blurry images
2. **Rotation**: Verify correct orientation
3. **Lighting**: Ensure good contrast and lighting
4. **Size optimization**: Balance quality vs token cost
5. **Format**: JPEG for photos, PNG for graphics

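For the size-optimization point, one way to pick target dimensions before resizing is sketched below. The helper name `fit_within` and the 1920px cap are illustrative assumptions, not values from the Gemini documentation. Downscaling reliably shrinks the upload; whether it lowers the token count depends on how the image is tiled.

```python
from typing import Tuple


def fit_within(width: int, height: int, max_side: int) -> Tuple[int, int]:
    """Return dimensions scaled down so the longer side is at most max_side."""
    if min(width, height, max_side) <= 0:
        raise ValueError("width, height and max_side must be positive")
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough; never upscale
    scale = max_side / longest
    # Keep the aspect ratio and make sure neither side collapses to zero.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(fit_within(3840, 2160, 1920))
# → (1920, 1080)
```

With Pillow, the result can be passed to `img.resize(...)`; `img.thumbnail((1920, 1920))` achieves a similar in-place downscale.
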
### Prompt Engineering

**Specific instructions**:
- "Identify all vehicles with their colors and positions"
- "Count people wearing blue shirts"
- "Extract text from the sign in the top-left corner"

**Output format**:
- "Return results as JSON with fields: category, count, description"
- "Format as markdown table"
- "List findings as numbered items"

**Few-shot examples**:
```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Example: For an image of a cat on a sofa, respond: "Object: cat, Location: sofa"',
        'Now analyze this image:',
        img_part
    ]
)
```

### File Management

1. Use the File API for images larger than 20 MB (the inline request limit)
2. Use the File API for repeated queries (upload once and reuse the file reference; tokens are still counted on every request)
3. Files auto-delete after 48 hours
4. Clean up manually:
```python
client.files.delete(name=myfile.name)
```

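The first rule can be expressed as a small size check. This is a sketch under assumptions: the helper name is hypothetical, 20 MB is taken as 20 × 1024 × 1024 bytes, and the extra encoding overhead of inline data is ignored, so leave some headroom in practice.

```python
INLINE_REQUEST_LIMIT_BYTES = 20 * 1024 * 1024  # assumed reading of the 20 MB limit


def needs_file_api(image_bytes: int, other_request_bytes: int = 0) -> bool:
    """Return True when the image plus the rest of the request exceeds the inline limit."""
    if image_bytes < 0 or other_request_bytes < 0:
        raise ValueError("sizes must be non-negative")
    return image_bytes + other_request_bytes > INLINE_REQUEST_LIMIT_BYTES


print(needs_file_api(25 * 1024 * 1024))  # → True
print(needs_file_api(300 * 1024))        # → False
```

In a script, `os.path.getsize(path)` supplies `image_bytes`.
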
### Cost Optimization

**Token-efficient strategies**:
- Resize large images before upload
- Upload once via the File API when querying the same image repeatedly
- Batch multiple related images into one request
- Use the appropriate model (Flash is cheaper than Pro)

**Token costs** (illustrative, assuming $1 per 1M input tokens; check current Gemini 2.5 Flash pricing):
- Small image (258 tokens): $0.000258
- HD image (1,548 tokens): $0.001548
- 24-tile image (6,192 tokens): $0.006192

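The per-image costs above are plain arithmetic: tokens multiplied by the price per token. A minimal sketch follows, with the rate passed in as a parameter because prices change; the function name is hypothetical and the $1 rate is only the illustrative figure used in this section.

```python
def estimate_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Return the input cost in USD for a given token count and rate."""
    if tokens < 0 or usd_per_million_tokens < 0:
        raise ValueError("tokens and rate must be non-negative")
    return tokens * usd_per_million_tokens / 1_000_000


for label, tokens in [("small", 258), ("HD", 1548)]:
    print(f"{label}: ${estimate_cost_usd(tokens, 1.0):.6f}")
# → small: $0.000258
# → HD: $0.001548
```
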
## Common Use Cases

### 1. Product Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this product image:
        1. Identify the product
        2. List visible features
        3. Assess condition
        4. Estimate value range
        ''',
        img_part
    ]
)
```

### 2. Screenshot Analysis

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract all text and UI elements from this screenshot',
        img_part
    ]
)
```

### 3. Medical Imaging (Informational Only)

```python
response = client.models.generate_content(
    model='gemini-2.5-pro',
    contents=[
        'Describe visible features in this medical image. Note: This is for informational purposes only.',
        img_part
    ]
)
```

### 4. Chart/Graph Reading

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        'Extract data from this chart and format as JSON',
        img_part
    ]
)
```

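When JSON is requested through the prompt alone, as in the chart example, the reply may arrive wrapped in a Markdown code fence. A minimal sketch of tolerant parsing follows; the helper name `parse_json_reply` is hypothetical, and setting `response_mime_type='application/json'` as in the Structured Output section is the more robust option where it is available.

```python
import json
from typing import Any

FENCE = "`" * 3  # a Markdown code-fence marker


def parse_json_reply(text: str) -> Any:
    """Parse JSON from a model reply, tolerating a surrounding Markdown code fence."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE) and cleaned.endswith(FENCE):
        first_newline = cleaned.find("\n")
        if first_newline != -1:
            # Drop the opening fence line (which may carry a language tag)
            # and the closing fence.
            cleaned = cleaned[first_newline + 1 : -len(FENCE)]
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(cleaned)


reply = FENCE + 'json\n{"series": [1, 2, 3]}\n' + FENCE
print(parse_json_reply(reply))
# → {'series': [1, 2, 3]}
```
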
### 5. Scene Understanding

```python
response = client.models.generate_content(
    model='gemini-2.5-flash',
    contents=[
        '''Analyze this scene:
        1. Location type
        2. Time of day
        3. Weather conditions
        4. Activities happening
        5. Mood/atmosphere
        ''',
        img_part
    ]
)
```

## Error Handling

```python
import mimetypes
import time

def analyze_image_with_retry(image_path, prompt, max_retries=3):
    """Analyze an image, retrying failed API calls with exponential backoff."""
    # Read the file once, outside the retry loop: a missing file should fail immediately
    with open(image_path, 'rb') as f:
        img_bytes = f.read()
    # Guess the MIME type from the file extension, falling back to JPEG
    mime_type = mimetypes.guess_type(image_path)[0] or 'image/jpeg'

    for attempt in range(max_retries):
        try:
            response = client.models.generate_content(
                model='gemini-2.5-flash',
                contents=[
                    prompt,
                    genai.types.Part.from_bytes(
                        data=img_bytes,
                        mime_type=mime_type
                    )
                ]
            )
            return response.text
        except Exception as e:
            # Re-raise once the final attempt has failed
            if attempt == max_retries - 1:
                raise
            wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Retry {attempt + 1} after {wait_time}s: {e}")
            time.sleep(wait_time)
```

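The function above retries on every exception, including ones that will never succeed, such as an invalid API key. A hedged sketch of a more selective variant follows: the caller supplies a predicate that decides which errors are transient, and random jitter spreads out concurrent retries. The names `retry_with_backoff` and `is_retryable` are hypothetical, and which SDK errors count as transient is left to the caller.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    is_retryable: Callable[[Exception], bool],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying retryable failures with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_retries - 1
            if last_attempt or not is_retryable(exc):
                raise
            # The base delay doubles each attempt; jitter adds up to base_delay more.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    raise ValueError("max_retries must be at least 1")
```

A call might look like `retry_with_backoff(lambda: analyze(path), lambda e: isinstance(e, TimeoutError))`, where `analyze` stands in for any function that issues the request. The `sleep` parameter exists so tests can run without waiting.
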
## Limitations

- Maximum 3,600 images per request
- OCR accuracy varies with text quality
- Object detection requires 2.0+ models
- Segmentation requires 2.5+ models
- No video frame extraction (use video API)
- Regional restrictions on child images (EEA, CH, UK)

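To stay under the per-request image limit, a large collection can be split into batches before calling the API. A minimal sketch follows; the helper name `chunk_images` is hypothetical, and in practice the request-size and token limits will usually force much smaller batches than 3,600.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")
MAX_IMAGES_PER_REQUEST = 3600  # limit listed above


def chunk_images(images: Sequence[T], size: int = MAX_IMAGES_PER_REQUEST) -> Iterator[List[T]]:
    """Yield consecutive batches of at most `size` images."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(images), size):
        yield list(images[start:start + size])


print(list(chunk_images(list(range(10)), size=4)))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Each batch can then be appended to the prompt in its own `generate_content` call.
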
---

## Related References

**Current**: Image Understanding

**Related Capabilities**:
- [Image Generation](./image-generation.md) - Create and edit images
- [Video Analysis](./video-analysis.md) - Analyze video frames
- [Video Generation](./video-generation.md) - Reference images for video generation

**Back to**: [AI Multimodal Skill](../SKILL.md)