Image Transcription¶

Let a vision model describe images so that text-only models can understand them.

What It Does¶

When you send an image to a text-only model, Agora can use a separate vision model to generate a text description of the image first. This description is then included in the prompt sent to your main model.

This lets you use images with any model, even ones that don't natively support vision.

Setup¶

Go to Settings → Image Transcription
Choose a Transcription Model — this should be a vision-capable model (e.g., GPT-4o, Gemini Flash, Qwen-VL)
Add models to Enabled Models — these are the text-only models that will receive image descriptions
Adjust Batch Size if you send many images at once (how many images to describe per API call)

Local Vision Models

You can use a local vision model (with mmproj) as the transcription model. This keeps image processing on-device.

How It Works¶

You attach an image to your message
Agora detects that your current model doesn't support vision
The image is sent to the transcription model first
The transcription model generates a text description
This description is prepended to your message text
The combined text is sent to your main model

Batch Size¶

Controls how many images are described per API call to the transcription model.

1 — Describe one image at a time (more API calls, more accurate)
5–10 — Describe multiple images per call (fewer API calls, may lose detail)

Default is device-dependent. Lower values give better results but cost more.

Model Selection¶

Transcription Model¶

This is the vision model that generates image descriptions. Choose the most capable vision model available to you.

Enabled Models¶

These are the text-only models that will use image transcription. Only models in this list will receive transcribed image descriptions. Other models will receive images directly (if they support them) or not at all.