Competitive runs
Running swe-agent competitively on benchmarks
This page contains information on our competitive runs on SWE-bench, as well as tips and tricks for evaluating on large batches.
- Please make sure you're familiar with the command line basics and the batch mode.
- The default examples execute code in a Docker sandbox, so make sure you have Docker installed (docker troubleshooting).
Current competitive configurations
We recently used two configurations for SWE-bench submissions:
- 250225_anthropic_filemap_simple_review.yaml: This is our current default one-attempt config. It uses claude-3-7-sonnet-20250219.
- 250212_sweagent_heavy_sbl.yaml: This config runs 5 attempts with slightly different configurations using claude-3-5-sonnet-20241022, then uses o1 to discriminate between them. This is a very expensive configuration. If you use it, also make sure to use Claude 3.7 instead of Claude 3.5.
Retry configurations and command line arguments
Note that configurations for agents that run multiple attempts are structured differently from those of the default agent. In particular, supplying options like --agent.model.name will cause (potentially confusing) error messages. Take a look at the multi-attempt configuration file above to see the structure!
You can find the command with which to run each config at the top of the config file.
To run on multiple workers with Claude, you need to use multiple API keys in order to have enough cache break points. For this, set the following environment variable before running:
# concatenate your keys
export CLAUDE_API_KEY_ROTATION="KEY1:::KEY2:::KEY3"
See our notes on Claude for more details.
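As a minimal sketch of the format used in the export line above, the following Python snippet joins a list of keys with the `:::` separator and sets the variable for child processes. The helper names `build_key_rotation` and `split_key_rotation` are hypothetical and only illustrate the string format; they are not part of swe-agent, and how swe-agent consumes the variable internally is not shown here.

```python
import os

SEPARATOR = ":::"  # separator taken from the export example above


def build_key_rotation(keys):
    """Join API keys into one rotation string (hypothetical helper)."""
    cleaned = [k.strip() for k in keys if k.strip()]
    for key in cleaned:
        if SEPARATOR in key:
            raise ValueError("API keys must not contain the separator ':::'")
    return SEPARATOR.join(cleaned)


def split_key_rotation(value):
    """Split a rotation string back into keys (illustration only)."""
    return [k for k in value.split(SEPARATOR) if k]


rotation = build_key_rotation(["KEY1", "KEY2", "KEY3"])
# Only processes started from this Python process will see the variable.
os.environ["CLAUDE_API_KEY_ROTATION"] = rotation
print(rotation)
```

Building the string programmatically can help when the keys come from a secrets store rather than being typed by hand.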
Memory consumption
We run our configuration on a machine with 32GB memory and 8 cores. To avoid out-of-memory (OOM) situations, we recommend setting --instances.deployment.docker_args=--memory=10g, which limits the maximum amount of memory per worker.
In our case, this completely prevented OOM situations.
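To reason about the memory cap, here is a rough worst-case calculation in Python. It assumes every worker hits its cap at the same time, which is a conservative bound because the cap is a ceiling per container rather than typical usage. The function name and the `reserve_gb` parameter are hypothetical, and the source does not state how many workers were actually used.

```python
def max_workers_worst_case(total_gb, per_worker_cap_gb, reserve_gb=0.0):
    """Number of workers that fit if each one uses its full memory cap.

    reserve_gb is memory kept free for the rest of the system (assumption).
    """
    if per_worker_cap_gb <= 0:
        raise ValueError("per_worker_cap_gb must be positive")
    usable_gb = total_gb - reserve_gb
    return max(0, int(usable_gb // per_worker_cap_gb))


# 32GB machine with a 10g cap per worker, as in the text above.
print(max_workers_worst_case(32, 10))                # 3
# Same machine, keeping 5GB free for everything else (hypothetical reserve).
print(max_workers_worst_case(32, 10, reserve_gb=5))  # 2
```

Running more workers than this bound can still be fine when most instances stay well below the cap, but it does mean the cap alone no longer guarantees that the machine cannot run out of memory.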
However, OOM situations can potentially lock you out of the server, so you might want to use a script like the following as a second layer of defense that kills any process that hogs too much memory (note that this will affect every process on the machine, not just swe-agent):
Memory sentinel
#!/usr/bin/env python3

"""
Memory Sentinel Script

This script monitors the system's RAM usage and kills the process with the highest
memory consumption if available RAM drops below 5GB.

Usage:
    python memory_sentinel.py

The script runs continuously, checking memory usage every second.

Script was generated by Claude 3.7 with the following prompt:

I'm working on a server and I have one script that sometimes consumes so much memory that the entire server becomes unresponsive. This is a huge problem because then I cannot log in to it anymore.

Could you write a sentinel script in python that does the following:

    Check the total available RAM of the system (disregarding swap)
    Check the currently used RAM of the system
    If we have more than 5G left, do nothing
    Else, find the process with the highest RAM consumption and kill it. Note: Use the SIGKILL command
    Check every second
"""

import logging
import os
import signal
import time

import psutil

# Set up logging
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    filename="memory_sentinel.log",
    filemode="a",
)

# Memory threshold in GB
MEMORY_THRESHOLD_GB = 5
# Convert to bytes for comparison
MEMORY_THRESHOLD_BYTES = MEMORY_THRESHOLD_GB * 1024 * 1024 * 1024


def get_available_ram():
    """Get the available RAM in bytes, excluding swap."""
    return psutil.virtual_memory().available


def get_total_ram():
    """Get the total RAM in bytes, excluding swap."""
    return psutil.virtual_memory().total


def get_used_ram():
    """Get the used RAM in bytes."""
    return psutil.virtual_memory().used


def get_process_with_highest_memory():
    """Find the process with the highest memory consumption."""
    processes = []
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        try:
            processes.append((proc.info["pid"], proc.info["name"], proc.info["memory_info"].rss))
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass

    # Sort by memory usage (descending)
    processes.sort(key=lambda x: x[2], reverse=True)

    if processes:
        return processes[0]
    return None


def kill_process(pid):
    """Kill a process using SIGKILL."""
    try:
        os.kill(pid, signal.SIGKILL)
        return True
    except OSError as e:
        logging.error(f"Failed to kill process {pid}: {e}")
        return False


def format_bytes(bytes_value):
    """Format bytes to a human-readable string."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if bytes_value < 1024.0:
            return f"{bytes_value:.2f} {unit}"
        bytes_value /= 1024.0
    return f"{bytes_value:.2f} PB"


def main():
    """Main function that runs the memory monitoring loop."""
    logging.info("Memory Sentinel started")
    logging.info(f"Memory threshold set to {MEMORY_THRESHOLD_GB}GB")

    try:
        while True:
            available_ram = get_available_ram()
            total_ram = get_total_ram()
            used_ram = get_used_ram()

            logging.debug(
                f"Total RAM: {format_bytes(total_ram)}, "
                f"Used RAM: {format_bytes(used_ram)}, "
                f"Available RAM: {format_bytes(available_ram)}"
            )

            if available_ram < MEMORY_THRESHOLD_BYTES:
                logging.warning(
                    f"Available RAM ({format_bytes(available_ram)}) below threshold of {MEMORY_THRESHOLD_GB}GB"
                )

                process = get_process_with_highest_memory()
                if process:
                    pid, name, memory = process
                    logging.warning(f"Killing process {pid} ({name}) using {format_bytes(memory)}")

                    if kill_process(pid):
                        logging.info(f"Successfully killed process {pid} ({name})")
                    else:
                        logging.error(f"Failed to kill process {pid} ({name})")
                else:
                    logging.warning("No process found to kill")

            # Sleep for 1 second
            time.sleep(1)

    except KeyboardInterrupt:
        logging.info("Memory Sentinel stopped by user")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
        raise


if __name__ == "__main__":
    main()
If swe-agent dies or you frequently abort it, you might have leftover Docker containers (they are cleaned up when swe-agent terminates normally but can be left behind if it is killed). You can use a sentinel script like the following to clean them up periodically (note that this will affect any long-running container, not just those from swe-agent):
Container sentinel
#!/bin/bash

while true; do
    echo "Checking for long-running containers..."
    # List all running containers with their uptime
    docker ps --format "{{.ID}} {{.RunningFor}}" | while read -r id running_for; do
        # Extract the number and unit from the running time
        if [[ $running_for =~ ([0-9]+)\ (hour|hours) ]]; then
            hours=${BASH_REMATCH[1]}
            if (( hours >= 2 )); then
                echo "Killing container $id (running for $running_for)..."
                docker kill "$id"
            fi
        elif [[ $running_for =~ ([0-9]+)\ (day|days|week|weeks|month|months|year|years) ]]; then
            # Anything reported in days or a larger unit is definitely over 2 hours
            echo "Killing container $id (running for $running_for)..."
            docker kill "$id"
        fi
    done

    echo "Sleeping for 10 minutes..."
    sleep 600  # Wait 600 seconds (10 minutes) before running again
done
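The age check in the container sentinel can be sketched as a small, testable Python function. This is an illustration rather than a replacement for the script: it assumes `docker ps` reports the running time as strings such as "3 hours ago" or "2 days ago", and it treats any uptime reported in days, weeks, months, or years as exceeding the limit. The function name `exceeds_limit` is hypothetical.

```python
import re

# Units that always mean "longer than a couple of hours".
LARGE_UNITS = ("day", "week", "month", "year")


def exceeds_limit(running_for, max_hours=2):
    """Return True if a 'RunningFor' string describes at least max_hours of uptime."""
    match = re.search(r"(\d+) ([a-z]+)", running_for)
    if match is None:
        # Strings without a number (e.g. "About an hour ago") are under the limit.
        return False
    amount = int(match.group(1))
    unit = match.group(2).rstrip("s")  # "hours" -> "hour"
    if unit == "hour":
        return amount >= max_hours
    return unit in LARGE_UNITS


for sample in ["45 minutes ago", "About an hour ago", "3 hours ago", "2 days ago"]:
    print(f"{sample!r}: {exceeds_limit(sample)}")
```

Keeping the parsing in a pure function makes it easy to check the threshold logic without having any containers running.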
Tradeoffs between resolution rate and cost
- Multi-attempt configurations are always very expensive. Don't use them if cost matters.
- The simplest way to keep cost in check is the per-instance cost limit or the turn limit. Without any limit, the average cost is unbounded, because the agent may never stop iterating. With Claude 3.7, a cost-conservative setting would be a per-instance cost limit of $1 or lower and a turn limit of 50. For our SWE-bench submissions we use slightly higher limits (see the configs above).
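To illustrate why the limits matter, the following toy simulation models an agent that would never stop on its own and shows how a cost limit and a turn limit bound the run. It is a sketch under stated assumptions, not swe-agent's actual accounting: the per-turn costs, the function names, and the instance count of 500 are all hypothetical.

```python
import itertools


def run_with_limits(step_costs, cost_limit, turn_limit):
    """Consume per-turn costs until either limit is reached.

    Returns (turns_taken, total_cost). The limits are checked before each
    turn, so the final turn can push the total slightly past cost_limit.
    """
    turns = 0
    cost = 0.0
    for step_cost in step_costs:
        if turns >= turn_limit or cost >= cost_limit:
            break
        cost += step_cost
        turns += 1
    return turns, cost


def approximate_budget(n_instances, cost_limit):
    """Approximate upper bound on total spend for a batch (single attempt)."""
    return n_instances * cost_limit


# An agent that never stops by itself: every turn costs $0.25 (hypothetical).
print(run_with_limits(itertools.repeat(0.25), cost_limit=1.0, turn_limit=50))  # (4, 1.0)

# Cheap turns: here the turn limit of 50 is what ends the run.
turns, cost = run_with_limits(itertools.repeat(0.01), cost_limit=1.0, turn_limit=50)
print(turns)

# Rough budget for 500 instances (hypothetical count) at a $1 limit.
print(approximate_budget(500, 1.0))  # 500.0
```

With both limits set, whichever is reached first ends the run, so the worst-case spend for a batch scales with the number of instances instead of being unbounded.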