Vision
Image understanding with vision-capable models.
Vision-capable models automatically understand images attached to Discord messages. No extra configuration needed. Attach an image, mention the bot, and it processes the visual content alongside your text.
How it works
When a user sends a message with an image attachment, the plugin checks whether the configured engine supports vision via engine.supportedFeatures().vision. If it does, the image URLs are included as structured content objects alongside the text. If it doesn't, the images are silently ignored and the model processes the text only.
Supported models
The OpenAI engine detects vision capability by matching the model name against known patterns:
- gpt-4o and variants (e.g., gpt-4o-mini)
- gpt-4.1 and variants
- gpt-5 and variants (including gpt-5-codex)
- o1 and o3 series
- Any model with vision or omni in the name
This detection happens through the isVisionCapableModel() helper. Custom engines can override supportedFeatures() to signal vision support regardless of model naming.
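The name-based rules listed above can be approximated with a handful of regular expressions. This is a minimal sketch only, not the plugin's actual isVisionCapableModel() implementation: the function name looksVisionCapable and the exact patterns are assumptions chosen to mirror the list.

```typescript
// Hypothetical approximation of the name-based detection described above.
// The real isVisionCapableModel() helper may use different patterns.
const VISION_PATTERNS: RegExp[] = [
	/^gpt-4o/, // gpt-4o and variants such as gpt-4o-mini
	/^gpt-4\.1/, // gpt-4.1 and variants
	/^gpt-5/, // gpt-5 and variants, including gpt-5-codex
	/^o[13](-|$)/, // o1 and o3 series
	/vision|omni/ // any model with "vision" or "omni" in the name
]

function looksVisionCapable(model: string): boolean {
	// Normalize so that casing and stray whitespace do not affect matching.
	const name = model.trim().toLowerCase()
	return VISION_PATTERNS.some((pattern) => pattern.test(name))
}

console.log(looksVisionCapable('gpt-4o-mini')) // true
console.log(looksVisionCapable('gpt-3.5-turbo')) // false
```

A custom engine would not need a check like this, since it can report vision support from supportedFeatures() directly.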
Image processing
When vision is supported, the plugin transforms a standard text message into a multimodal content array:
// Internal representation (simplified)
{
	role: 'user',
	content: [
		{ type: 'text', text: 'alice: What is this?' },
		{ type: 'image_url', image_url: 'https://cdn.discordapp.com/attachments/.../photo.png' }
	]
}
Multiple attachments in a single message are all included. The user's display name is prefixed to the text for context.
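The transformation described above might look roughly like the following. This is an illustrative sketch, not the plugin's internal code: buildUserMessage, its parameters, and the ContentPart and UserMessage types are hypothetical names, and falling back to a plain string when vision is unavailable is an assumption about how text-only messages are represented.

```typescript
// Shapes mirror the simplified internal representation shown above.
type ContentPart =
	| { type: 'text'; text: string }
	| { type: 'image_url'; image_url: string }

interface UserMessage {
	role: 'user'
	content: string | ContentPart[]
}

// Hypothetical helper: not part of the plugin's public API.
function buildUserMessage(
	displayName: string,
	text: string,
	imageUrls: string[],
	visionSupported: boolean
): UserMessage {
	// The display name is prefixed to the text for context.
	const prefixed = `${displayName}: ${text}`

	// Assumption: without vision support (or without images), images are
	// dropped and the message stays a plain string.
	if (!visionSupported || imageUrls.length === 0) {
		return { role: 'user', content: prefixed }
	}

	// One text part, followed by one image_url part per attachment.
	const images = imageUrls.map((url): ContentPart => ({ type: 'image_url', image_url: url }))
	return { role: 'user', content: [{ type: 'text', text: prefixed }, ...images] }
}
```

Passing the result of engine.supportedFeatures().vision as visionSupported would reproduce the gating behavior: images are included for vision-capable engines and ignored otherwise.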
Example conversation
A typical vision interaction in Discord:
- User attaches an image of a chart and types: @Sage what trends do you see here?
- The plugin converts the attachment URL into an image_url content object.
- The model receives both the text prompt and the image.
- The response describes the chart's trends in plain text.
No special syntax or commands required. It's the same mention or reply flow as any other conversation.
Programmatic usage
Use AI.chatSync() directly with image content for custom integrations:
import { AI } from '@robojs/ai'

export default async () => {
	const reply = await AI.chatSync(
		[
			{
				role: 'user',
				content: [
					{ type: 'text', text: 'What do you see in this image?' },
					{ type: 'image_url', image_url: 'https://example.com/photo.jpg' }
				]
			}
		],
		{ showTyping: false }
	)
	return reply.text
}
The content field accepts either a plain string or an array of content objects. When using the array form, include at least one text entry alongside any image_url entries.
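For custom integrations, it can help to check the array form before sending it. The guard below is a minimal sketch under stated assumptions: hasTextEntry and the ContentEntry and MessageContent types are hypothetical and not exported by @robojs/ai; they only encode the rule that an array should contain at least one text entry.

```typescript
// Local types for illustration; the plugin's own type names may differ.
type ContentEntry =
	| { type: 'text'; text: string }
	| { type: 'image_url'; image_url: string }

type MessageContent = string | ContentEntry[]

// Hypothetical guard: returns true when the content is a plain string,
// or an array that includes at least one text entry.
function hasTextEntry(content: MessageContent): boolean {
	if (typeof content === 'string') {
		return true
	}
	return content.some((entry) => entry.type === 'text')
}
```

Running a check like this before the chat call lets an image-only payload be caught early and given a default prompt.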
