# Multimodal Support
SWE-agent supports multimodal AI models that can process both text and images. This enables the agent to work with visual context from GitHub issues, such as screenshots, diagrams, and UI mockups.
## Overview
The multimodal implementation automatically:

- Detects multimodal instances from SWE-bench Multimodal datasets
- Downloads images from GitHub issue URLs
- Converts images to base64 markdown format for AI model consumption
- Handles download errors gracefully by falling back to text-only processing
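The download-and-convert step can be sketched as follows. This is a minimal illustration using `urllib` from the standard library; `image_to_markdown` is a hypothetical helper, not SWE-agent's actual API, and the real implementation differs in naming and error handling:

```python
import base64
import mimetypes
import urllib.request


def image_to_markdown(url: str) -> str:
    """Download an image and embed it as a base64 data URI in markdown."""
    # Guess the MIME type from the URL extension; default to PNG.
    mime_type = mimetypes.guess_type(url)[0] or "image/png"
    try:
        with urllib.request.urlopen(url, timeout=30) as response:
            encoded = base64.b64encode(response.read()).decode("ascii")
    except OSError:
        # Graceful fallback: keep a plain link so text-only processing still works.
        return f"[image unavailable]({url})"
    return f"![issue image](data:{mime_type};base64,{encoded})"
```

The data-URI form lets the image travel inline with the problem statement text, so no separate attachment handling is needed downstream.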
## Supported Image Categories
Currently, SWE-agent processes images from the `problem_statement` category, which includes:
- Screenshots showing bugs or issues
- UI mockups and design specifications
- Diagrams explaining the problem
- Error screenshots and visual evidence
### Design Choice

Only `problem_statement` images are processed: they provide the visual context needed to understand the task while preserving the agent's autonomy in determining a solution approach. Images from the `patch` and `test_patch` categories may contain solution hints and are therefore not processed.
## Configuration
### Basic Multimodal Setup
Use the pre-configured multimodal setup:
```bash
sweagent run-batch \
    --config config/default_mm_with_images.yaml \
    --instances.type swe_bench \
    --instances.subset multimodal \
    --instances.split dev
```
### Disabling Image Processing
You can disable image processing globally:
```yaml
# config/your_config.yaml
agent:
  templates:
    disable_image_processing: true
```
Or for specific instances:
```python
from sweagent.agent.problem_statement import SWEBenchMultimodalProblemStatement

problem_statement = SWEBenchMultimodalProblemStatement(
    text="Fix the rendering issue",
    issue_images=["https://example.com/screenshot.png"],
    disable_image_processing=True,  # Skip image processing
)
```
## Supported Models
Multimodal support works with any vision-capable model, including:
- Claude Sonnet 4
- o3 and o4-mini
- Gemini 2.5 models
Example model configuration:
```yaml
# model_configs/claude-sonnet-4-20250514_mm.yaml
model:
  name: claude-sonnet-4-20250514
  # Vision capabilities are detected automatically
```
## Image Processing Details
### Supported Formats
- PNG, JPEG, WebP images
- Maximum size: 10MB per image
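These constraints can be expressed as a small validation helper. This is illustrative only; `is_supported_image` and the constant names are hypothetical, not part of SWE-agent:

```python
import os

# Format and size limits listed above (hypothetical constants).
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}
MAX_IMAGE_BYTES = 10 * 1024 * 1024  # 10 MB per image


def is_supported_image(filename: str, size_bytes: int) -> bool:
    """Check a candidate image against the format and size limits."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in ALLOWED_EXTENSIONS and size_bytes <= MAX_IMAGE_BYTES
```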
## Example Usage
### Automatic Detection
When loading SWE-bench instances, multimodal support is automatic:
```json
{
  "instance_id": "example__repo-123",
  "problem_statement": "Fix the chart rendering bug...",
  "image_assets": {
    "problem_statement": ["http://example.com/chart.png"]
  }
}
```
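Detection amounts to checking for a non-empty `problem_statement` entry under `image_assets`. A sketch based on the field names in the JSON above; the helper name is hypothetical:

```python
def is_multimodal_instance(instance: dict) -> bool:
    """Return True when the instance carries problem_statement images.

    Field names follow the SWE-bench instance JSON shown above.
    """
    images = instance.get("image_assets", {}).get("problem_statement", [])
    return bool(images)
```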
### Direct Usage
```python
from sweagent.agent.problem_statement import SWEBenchMultimodalProblemStatement

problem_statement = SWEBenchMultimodalProblemStatement(
    text="Fix the rendering issue shown in the screenshots",
    issue_images=[
        "https://example.com/before.png",
        "https://example.com/after.png",
    ],
)

# This downloads the images and converts them to base64 markdown
processed_text = problem_statement.get_problem_statement()
```
## Configuration Options
To enable multimodal processing, update the following configuration options:
### History Processing
Enable image parsing in your configuration:
```yaml
agent:
  history_processors:
    - type: image_parsing  # Parse base64-encoded images in observations
```
### Tool Bundles
Include image and browser tools for visual tasks:
```yaml
agent:
  tools:
    bundles:
      - path: tools/image_tools  # includes the open_image tool so models can open image files
      - path: tools/web_browser  # includes 17 browser automation tools (click_mouse, open_site, etc.)
```
The `web_browser` bundle provides tools for:

- Opening websites (`open_site`)
- Taking screenshots (`screenshot_site`)
- Interacting with web pages (`click_mouse`, `type_text`, `scroll_on_page`)
- Executing JavaScript (`execute_script_on_page`)
- And more: see the configuration guide for the full list
### Templates Configuration
Multimodal processing is enabled automatically when `--instances.type=swe_bench --instances.subset=multimodal` are set. To disable this behavior, set `--templates.disable_image_processing=true`.