Gemini Multimodal Understanding

Gemini models can understand several content modalities, including images, video, and audio. Media is sent as an additional part alongside the text prompt in the same request.
POST /v1beta/models/{model}:generateContent

Image Understanding

curl "https://crazyrouter.com/v1beta/models/gemini-2.5-flash:generateContent?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {"text": "Describe the content of this image in detail"},
          {
            "inlineData": {
              "mimeType": "image/jpeg",
              "data": "/9j/4AAQSkZJRgABAQAA..."
            }
          }
        ]
      }
    ]
  }'
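The `data` field in the request above holds the base64-encoded file contents. As a minimal sketch of how that body could be assembled in Python, assuming the same JSON shape as the curl example (the helper name `build_inline_request` is hypothetical, not part of any SDK):

```python
import base64
import json


def build_inline_request(prompt: str, media_bytes: bytes, mime_type: str) -> dict:
    """Build a generateContent body with one text part and one inlineData part.

    Hypothetical helper: it only mirrors the JSON shape of the curl example.
    """
    encoded = base64.b64encode(media_bytes).decode("ascii")
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {"inlineData": {"mimeType": mime_type, "data": encoded}},
                ],
            }
        ]
    }


# Four bytes that begin a JPEG file, standing in for a real image.
body = build_inline_request(
    "Describe the content of this image in detail",
    b"\xff\xd8\xff\xe0",
    "image/jpeg",
)
print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized with `json.dumps` and sent as the POST body with whichever HTTP client is in use.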

Video Understanding

Send video as inline data or by file URI. The example below uses inline data:
curl "https://crazyrouter.com/v1beta/models/gemini-2.5-flash:generateContent?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {"text": "Describe the content of this video and list the key scenes"},
          {
            "inlineData": {
              "mimeType": "video/mp4",
              "data": "AAAAIGZ0eXBpc29t..."
            }
          }
        ]
      }
    ]
  }'
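The file URI variant is not shown above. A sketch of what that request body might look like, assuming the `fileData` part uses the `mimeType` and `fileUri` field names from Google's Gemini REST schema; the helper name and the example URL are placeholders, and which URLs this endpoint will actually fetch is not stated in this document:

```python
import json


def build_file_uri_request(prompt: str, file_uri: str, mime_type: str) -> dict:
    """Build a generateContent body that references media by URI.

    Hypothetical helper; field names follow the Gemini REST schema (assumed).
    """
    return {
        "contents": [
            {
                "role": "user",
                "parts": [
                    {"text": prompt},
                    {"fileData": {"mimeType": mime_type, "fileUri": file_uri}},
                ],
            }
        ]
    }


# Placeholder URL, for illustration only.
body = build_file_uri_request(
    "Describe the content of this video and list the key scenes",
    "https://example.com/clip.mp4",
    "video/mp4",
)
print(json.dumps(body, indent=2))
```

Because no media bytes travel in the body, this form keeps the request small regardless of the video's size.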

Audio Understanding

curl "https://crazyrouter.com/v1beta/models/gemini-2.5-flash:generateContent?key=YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "contents": [
      {
        "role": "user",
        "parts": [
          {"text": "Transcribe this audio and summarize the main points"},
          {
            "inlineData": {
              "mimeType": "audio/mp3",
              "data": "SUQzBAAAAAAAI1RTU0..."
            }
          }
        ]
      }
    ]
  }'
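The transcript or summary comes back inside the JSON response. The sketch below pulls the text out, assuming the response follows the usual Gemini `generateContent` shape (`candidates[0].content.parts[].text`); this page does not show a response, so treat the shape and the sample data as assumptions:

```python
def extract_text(response: dict) -> str:
    """Concatenate the text parts of the first candidate.

    Returns an empty string when the response has no candidates.
    """
    candidates = response.get("candidates") or []
    if not candidates:
        return ""
    parts = candidates[0].get("content", {}).get("parts", [])
    # Parts without a "text" key contribute nothing.
    return "".join(part.get("text", "") for part in parts)


# Hand-written sample in the assumed response shape, not real API output.
sample_response = {
    "candidates": [
        {
            "content": {
                "role": "model",
                "parts": [{"text": "Hello, "}, {"text": "world."}],
            }
        }
    ]
}
print(extract_text(sample_response))
```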

Multi-Image Comparison

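Python
from pathlib import Path
import google.generativeai as genai  # assumes the google-generativeai SDK
genai.configure(api_key="YOUR_API_KEY")  # pointing the SDK at crazyrouter.com is not shown here
model = genai.GenerativeModel("gemini-2.5-flash")
# Placeholder file names; each value holds the raw JPEG bytes of one image
image1_data, image2_data, image3_data = (Path(f"product{i}.jpg").read_bytes() for i in (1, 2, 3))
response = model.generate_content([
    "Compare these three product images and analyze the design features, pros, and cons of each",
    {"mime_type": "image/jpeg", "data": image1_data},
    {"mime_type": "image/jpeg", "data": image2_data},
    {"mime_type": "image/jpeg", "data": image3_data}
])
print(response.text)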
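The same comparison can be expressed against the REST endpoint by placing several `inlineData` parts after the prompt. A rough sketch, assuming the JSON shape from the curl examples; the helper names and file names are hypothetical, and the MIME type is guessed from the file extension with the standard library:

```python
import base64
import mimetypes


def image_part(filename: str, data: bytes) -> dict:
    """Build one inlineData part, inferring the MIME type from the file name."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {filename!r}")
    return {
        "inlineData": {
            "mimeType": mime_type,
            "data": base64.b64encode(data).decode("ascii"),
        }
    }


def build_comparison_request(prompt: str, images: list[tuple[str, bytes]]) -> dict:
    """Build a body with the prompt first, followed by one part per image."""
    parts = [{"text": prompt}] + [image_part(name, data) for name, data in images]
    return {"contents": [{"role": "user", "parts": parts}]}


# Tiny stand-in byte strings; real code would read the image files from disk.
body = build_comparison_request(
    "Compare these product images",
    [("a.jpg", b"\xff\xd8\xff"), ("b.png", b"\x89PNG")],
)
print(len(body["contents"][0]["parts"]))
```

Keeping the prompt as the first part matches the ordering used in the examples above.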

Supported Media Formats

| Type | Supported Formats |
|------|-------------------|
| Image | JPEG, PNG, GIF, WebP, BMP |
| Video | MP4, AVI, MOV, MKV, WebM |
| Audio | MP3, WAV, FLAC, AAC, OGG |

When video or audio is sent as inline data, the file size is capped by the maximum request body size. For large files, upload the file to an accessible URL first and reference it via fileData.
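Base64 encoding inflates the payload by about one third, so a file needs to be noticeably smaller than the body limit to fit inline. The sketch below estimates the encoded size; the 20 MiB limit and the 1 KiB allowance for the surrounding JSON are illustrative assumptions, since this document does not state the actual limit:

```python
def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    # Every group of up to 3 input bytes becomes 4 output characters.
    return 4 * ((raw_bytes + 2) // 3)


def fits_inline(raw_bytes: int, max_body_bytes: int, overhead_bytes: int = 1024) -> bool:
    """Check whether the encoded file plus JSON overhead fits in the body limit.

    max_body_bytes and overhead_bytes are caller-supplied assumptions.
    """
    return base64_size(raw_bytes) + overhead_bytes <= max_body_bytes


MIB = 1024 * 1024
ASSUMED_LIMIT = 20 * MIB  # hypothetical limit, for illustration only

print(base64_size(15 * MIB))                 # encoded size of a 15 MiB file
print(fits_inline(15 * MIB, ASSUMED_LIMIT))  # False: the encoded file alone is 20 MiB
print(fits_inline(14 * MIB, ASSUMED_LIMIT))  # True
```

Under these assumptions, a file above roughly three quarters of the limit is a candidate for fileData instead.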
Video and audio processing consumes far more tokens than plain text. One minute of video can consume thousands of tokens.
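For a rough sense of scale, the arithmetic can be sketched as follows. The per-second rates are approximate figures from Google's Gemini token-counting documentation (about 263 tokens per second of video and 32 per second of audio); they are assumptions here and may differ by model, media resolution, or how this endpoint bills usage:

```python
import math

# Approximate rates from Google's Gemini documentation; treat as assumptions.
VIDEO_TOKENS_PER_SECOND = 263
AUDIO_TOKENS_PER_SECOND = 32


def estimate_media_tokens(duration_seconds: float, tokens_per_second: int) -> int:
    """Estimate the input tokens for a clip, rounded up to a whole token."""
    return math.ceil(duration_seconds * tokens_per_second)


one_minute_video = estimate_media_tokens(60, VIDEO_TOKENS_PER_SECOND)
one_minute_audio = estimate_media_tokens(60, AUDIO_TOKENS_PER_SECOND)
print(one_minute_video)  # 15780
print(one_minute_audio)  # 1920
```

The usage metadata returned with an actual response, where available, is a more reliable figure than this estimate.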