Competitive runs

Running swe-agent competitively on benchmarks

This page contains information on our competitive runs on SWE-bench, as well as tips and tricks for evaluating on large batches.

Current competitive configurations

We recently used two configurations for SWE-bench submissions:

  • 250225_anthropic_filemap_simple_review.yaml: This is our current default one-attempt config. It uses claude-3-7-sonnet-20250219.
  • 250212_sweagent_heavy_sbl.yaml: This config runs 5 attempts with slightly different configurations using claude-3-5-sonnet-20241022, then uses o1 to discriminate between them. This is a very expensive configuration. If you use it, also make sure to use Claude 3.7 instead of Claude 3.5.

Retry configurations and command line arguments

Note that the structure of configurations with agents that run multiple attempts is different from that of the default agent. In particular, supplying options like --agent.model.name will cause (potentially confusing) error messages. Take a look at the configuration file above to see the structure!

You can find the command with which to run each config at the top of the config file.
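
As a rough illustration of what such a batch run looks like, a single-attempt invocation is sketched below. The flags are assumptions based on the standard batch-mode interface and may differ between versions; the command at the top of each config file is authoritative.

# illustrative sketch only; copy the exact command from the config file header
sweagent run-batch \
    --config <path-to>/250225_anthropic_filemap_simple_review.yaml \
    --instances.type swe_bench \
    --instances.subset verified \
    --instances.split test \
    --num_workers 8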

In order to run with multiple workers with Claude, you need to use multiple API keys so that there are enough cache breakpoints. For this, please set the following environment variable before running:

# concatenate your keys
export CLAUDE_API_KEY_ROTATION="KEY1:::KEY2:::KEY3"

See our notes on Claude for more details.

Memory consumption

We run our configuration on a machine with 32GB memory and 8 cores. To avoid out-of-memory (OOM) situations, we recommend setting

--instances.deployment.docker_args=--memory=10g

which limits the maximum amount of memory available to each worker.

In our case, this completely avoided OOM incidents.
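
To spot-check that the limit is actually applied to the running worker containers, standard Docker commands suffice:

# show current memory usage and the enforced limit for each running container
docker stats --no-stream --format "table {{.Name}}\t{{.MemUsage}}"

# print the configured memory limit (in bytes) for a single container
docker inspect --format '{{.HostConfig.Memory}}' <container-id>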

However, OOM situations can potentially lock you out of the server, so you might want to use a script like the following as a second layer of defense to kill any process that hogs too much memory (note that this will affect any process on the machine, not just swe-agent):

Memory sentinel
#!/usr/bin/env python3
"""
Memory Sentinel Script

This script monitors the system's RAM usage and kills the process with the highest
memory consumption if available RAM drops below 5GB.

Usage:
    python memory_sentinel.py

The script runs continuously, checking memory usage every second.

Script was generated by Claude 3.7 with the following prompt:

I'm working on a server and I have one script that sometimes consumes so much memory that the entire server becomes unresponsive. This is a huge problem because then I cannot log in to it anymore.

Could you write a sentinel script in python that does the following:

Check the total available RAM of the system (disregarding swap)
Check the currently used RAM of the system
If we have more than 5G left, do nothing
Else, find the process with the highest RAM consumption and kill it. Note: Use the SIGKILL command
Check every second
"""

import logging
import os
import signal
import time

import psutil

# Set up logging
logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s", filename="memory_sentinel.log", filemode="a"
)

# Memory threshold in GB
MEMORY_THRESHOLD_GB = 5
# Convert to bytes for comparison
MEMORY_THRESHOLD_BYTES = MEMORY_THRESHOLD_GB * 1024 * 1024 * 1024


def get_available_ram():
    """Get the available RAM in bytes, excluding swap."""
    return psutil.virtual_memory().available


def get_total_ram():
    """Get the total RAM in bytes, excluding swap."""
    return psutil.virtual_memory().total


def get_used_ram():
    """Get the used RAM in bytes."""
    return psutil.virtual_memory().used


def get_process_with_highest_memory():
    """Find the process with the highest memory consumption."""
    processes = []
    for proc in psutil.process_iter(["pid", "name", "memory_info"]):
        try:
            processes.append((proc.info["pid"], proc.info["name"], proc.info["memory_info"].rss))
        except (psutil.NoSuchProcess, psutil.AccessDenied, psutil.ZombieProcess):
            pass

    # Sort by memory usage (descending)
    processes.sort(key=lambda x: x[2], reverse=True)

    if processes:
        return processes[0]
    return None


def kill_process(pid):
    """Kill a process using SIGKILL."""
    try:
        os.kill(pid, signal.SIGKILL)
        return True
    except OSError as e:
        logging.error(f"Failed to kill process {pid}: {e}")
        return False


def format_bytes(bytes_value):
    """Format bytes to a human-readable string."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if bytes_value < 1024.0:
            return f"{bytes_value:.2f} {unit}"
        bytes_value /= 1024.0
    return f"{bytes_value:.2f} PB"


def main():
    """Main function that runs the memory monitoring loop."""
    logging.info("Memory Sentinel started")
    logging.info(f"Memory threshold set to {MEMORY_THRESHOLD_GB}GB")

    try:
        while True:
            available_ram = get_available_ram()
            total_ram = get_total_ram()
            used_ram = get_used_ram()

            logging.debug(
                f"Total RAM: {format_bytes(total_ram)}, "
                + f"Used RAM: {format_bytes(used_ram)}, "
                + f"Available RAM: {format_bytes(available_ram)}"
            )

            if available_ram < MEMORY_THRESHOLD_BYTES:
                logging.warning(
                    f"Available RAM ({format_bytes(available_ram)}) " + f"below threshold of {MEMORY_THRESHOLD_GB}GB"
                )

                process = get_process_with_highest_memory()
                if process:
                    pid, name, memory = process
                    logging.warning(f"Killing process {pid} ({name}) " + f"using {format_bytes(memory)}")

                    if kill_process(pid):
                        logging.info(f"Successfully killed process {pid} ({name})")
                    else:
                        logging.error(f"Failed to kill process {pid} ({name})")
                else:
                    logging.warning("No process found to kill")

            # Sleep for 1 second
            time.sleep(1)

    except KeyboardInterrupt:
        logging.info("Memory Sentinel stopped by user")
    except Exception as e:
        logging.error(f"Unexpected error: {e}")
        raise


if __name__ == "__main__":
    main()

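Assuming the script is saved as memory_sentinel.py (the filename is arbitrary) and psutil is installed, one way to keep it running in the background is:

# install the only third-party dependency
pip install psutil

# run detached; root privileges let it kill processes it does not own
sudo nohup python3 memory_sentinel.py >/dev/null 2>&1 &

# follow the log file that the script writes
tail -f memory_sentinel.log
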
If swe-agent dies or you frequently abort it, you might have leftover Docker containers (they are cleaned up on normal termination of swe-agent, but can be left over if it is killed). You can use a sentinel script like the following to clean them up periodically (note that this will affect any long-running container, not just those from swe-agent):

Container sentinel
#!/bin/bash

while true; do
    echo "Checking for long-running containers..."
    # List all running containers with their uptime
    docker ps --format "{{.ID}} {{.RunningFor}}" | while read -r id running_for; do
        # Extract the number and unit from the running time
        if [[ $running_for =~ ([0-9]+)\ (hour|hours) ]]; then
            hours=${BASH_REMATCH[1]}
            if (( hours >= 2 )); then
                echo "Killing container $id (running for $running_for)..."
                docker kill "$id"
            fi
        elif [[ $running_for =~ ([0-9]+)\ (day|days) ]]; then
            # If it's running for at least a day, it's definitely over 2 hours
            echo "Killing container $id (running for $running_for)..."
            docker kill "$id"
        fi
    done
    echo "Sleeping for 10 minutes..."
    sleep 600  # Wait 600 seconds (10 minutes) before running again
done
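
Assuming the script above is saved as container_sentinel.sh (again, the name is arbitrary), it can be left running in the background in the same way:

chmod +x container_sentinel.sh
nohup ./container_sentinel.sh > container_sentinel.log 2>&1 &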

Tradeoffs between resolution rate and cost

  • Running multi-attempt configurations will always be very expensive. Don't use them if cost is a concern.
  • The simplest way to keep cost in check is the per-instance cost limit or turn limit. Without such a limit, the average cost grows without bound, because the agent may never stop iterating. With Claude 3.7, a cost-conservative setting would be a per-instance cost limit of $1 or lower and a turn limit of 50. For our SWE-bench submissions we use slightly higher limits (see the configs above); a sketch of setting these limits on the command line follows below.
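
The option names in this sketch are assumptions based on typical model-limit settings and may differ across SWE-agent versions, so double-check them against your config schema. As noted above, such command-line overrides only work for single-attempt configurations, not for the multi-attempt ones.

# assumed option names; verify against your configuration before relying on them
sweagent run-batch \
    --config <your-config>.yaml \
    --agent.model.per_instance_cost_limit 1.00 \
    --agent.model.per_instance_call_limit 50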