Benchmarking
Benchmarking on SWE-bench
This page describes benchmarking on SWE-bench to measure the software engineering capabilities of SWE-agent. Benchmarking for the other modes (programming challenges, cybersecurity) is coming soon.
There are two steps to the SWE-agent/SWE-bench pipeline. First, SWE-agent takes an input GitHub issue and returns a pull request that attempts to fix it; we call this step inference. The second step (currently only available for issues in the SWE-bench benchmark) is to evaluate the pull request to verify that it has indeed fixed the issue.
Architectures
At the moment, there are known issues with a small number of repositories that don't install properly on `arm64`/`aarch64` architecture machines. We're working on a fix, but if you'd like to run and evaluate on the entirety of SWE-bench, the easiest way is to use an `x86` machine.
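If you're not sure which architecture your machine reports, you can check with `uname`; it typically prints `x86_64` on Intel/AMD machines and `arm64` or `aarch64` on ARM machines:
uname -m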
👩‍💻 Inference
Run SWE-agent on SWE-bench Lite and generate patches.
python run.py --model_name gpt4 \
--per_instance_cost_limit 2.00 \
--config_file ./config/default.yaml
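The generated patches are collected into an `all_preds.jsonl` file, which is what the evaluation step below consumes. As a minimal sketch, assuming the run output lands under a `trajectories/` directory (the exact location may differ in your setup), you can locate it with:
# Locate the predictions file produced by the inference run (assumed output directory)
find trajectories -name all_preds.jsonl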
If you'd like to run on a single issue from SWE-bench, use the `--instance_filter` option as follows:
python run.py --model_name gpt4 \
--instance_filter marshmallow-code__marshmallow-1359
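If you're unsure which instance IDs exist, one quick sketch (assuming the Hugging Face `datasets` library is installed and that the IDs live in the dataset's `instance_id` column) is to list a few directly from the dataset:
python -c "from datasets import load_dataset; print(load_dataset('princeton-nlp/SWE-bench_Lite', split='dev')['instance_id'][:5])"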
The above examples use the default value of `--data_path` (`princeton-nlp/SWE-bench_Lite`, which will be looked up from Hugging Face). You can specify any other Hugging Face dataset as well, or supply the path to a pre-downloaded dataset. By default, SWE-agent evaluates on the `dev` split of that dataset. You can change that by supplying the `--split` argument to the above commands (obviously you shouldn't tune your model on the `test` split).
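For example, to benchmark on the full `test` split of SWE-bench Lite, you could combine the flags shown above (a sketch reusing the same flags and default dataset documented on this page):
python run.py --model_name gpt4 \
    --data_path princeton-nlp/SWE-bench_Lite \
    --split test \
    --per_instance_cost_limit 2.00 \
    --config_file ./config/default.yaml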
🧪 Evaluation
Previous `evaluation/` scripts
We have removed the scripts that were previously in the `evaluation/` subfolder. They were relatively thin wrappers around `swe-bench`. After the `swe-bench` 2.0 update, we recommend using `swe-bench` directly.
You can run SWE-bench directly to evaluate the SWE-agent results. After installing SWE-bench, run `run_evaluation` as follows:
python -m swebench.harness.run_evaluation \
--predictions_path /path/to/all_preds.jsonl \
--max_workers 1 \
--run_id test \
--split dev
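If `swe-bench` later complains about missing instances (see the note on the default split below), a quick sanity check is to count how many predictions are actually in the file; each line of the `.jsonl` file holds one prediction:
wc -l /path/to/all_preds.jsonl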
Head over to SWE-bench for details.
Default split
When running `swe-agent`, the SWE-bench Lite `dev` split is used by default (i.e., when not specifying `--data_path` or `--split`). However, `swe-bench` assumes the SWE-bench Lite `test` split by default. If you get warnings from `swe-bench` about missing instances, make sure you specify `--split test` or `--split dev` as appropriate.
- Something broken? Report bug
- Something unclear? Ask question