princeton-nlp / swe-bench

[ICLR 2024] SWE-Bench: Can Language Models Resolve Real-world Github Issues?

Home Page: https://www.swebench.com

License: MIT License

Python 97.43% Shell 1.08% Jupyter Notebook 1.49%
benchmark language-model software-engineering

swe-bench's Introduction

(Logo: Kawi the SWE-Llama)


Code and data for our ICLR 2024 paper SWE-bench: Can Language Models Resolve Real-World GitHub Issues?


Please refer to our website for the public leaderboard, and to the change log for information on the latest updates to the SWE-bench benchmark.

📰 News

  • [Apr. 15, 2024]: SWE-bench has gone through major improvements to resolve issues with the evaluation harness. Read more in our report.
  • [Apr. 2, 2024]: We have released SWE-agent, which sets the state-of-the-art on the full SWE-bench test set! (Tweet 🔗)
  • [Jan. 16, 2024]: SWE-bench has been accepted to ICLR 2024 as an oral presentation! (OpenReview 🔗)

👋 Overview

SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.

🚀 Set Up

To build SWE-bench from source, follow these steps:

  1. Clone this repository locally
  2. cd into the repository.
  3. Run conda env create -f environment.yml to create a conda environment named swe-bench
  4. Activate the environment with conda activate swe-bench

💽 Usage

You can download the SWE-bench dataset directly (dev, test sets) or from HuggingFace.
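
For example, the dataset can be loaded directly from HuggingFace (a minimal sketch; the dataset id and field names follow the usage examples further down this page):

from datasets import load_dataset

# Load the SWE-bench dataset from the HuggingFace Hub
dataset = load_dataset("princeton-nlp/SWE-bench")
print(dataset)                              # available splits (e.g. dev, test)
print(dataset["test"][0]["instance_id"])    # one task instance identifier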

To use SWE-bench, you can:

  • Train your own models on our pre-processed datasets
  • Run inference on existing models (either models you have on-disk like LLaMA, or models you have access to through an API like GPT-4). The inference step is where you get a repo and an issue and have the model try to generate a fix for it.
  • Evaluate models against SWE-bench. This is where you take a SWE-bench task and a model-proposed solution and evaluate the solution's correctness (a minimal predictions-file sketch follows this list).
  • Run SWE-bench's data collection procedure on your own repositories, to make new SWE-bench tasks.
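
For reference, a predictions file for the evaluation harness is a JSON list of entries like the sketch below. This is a hypothetical example based on the predictions shown in the issues further down this page; the key names (e.g. prediction vs. model_patch) may differ between harness versions.

import json

# Hypothetical predictions file: one entry per task instance to evaluate.
predictions = [
    {
        "instance_id": "matplotlib__matplotlib-24971",       # task instance being patched
        "prediction": "<model-generated unified diff here>",  # the proposed patch
        "model": "my-model",                                  # model name, used to group logs
    }
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)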

⬇️ Downloads

Datasets:

  • 🤗 SWE-bench
  • 🤗 "Oracle" Retrieval
  • 🤗 BM25 Retrieval 13K
  • 🤗 BM25 Retrieval 27K
  • 🤗 BM25 Retrieval 40K
  • 🤗 BM25 Retrieval 50K (Llama tokens)

Models:

  • 🦙 SWE-Llama 13b
  • 🦙 SWE-Llama 13b (PEFT)
  • 🦙 SWE-Llama 7b
  • 🦙 SWE-Llama 7b (PEFT)

🍎 Tutorials

We've also written the following blog posts on how to use different parts of SWE-bench. If you'd like to see a post about a particular topic, please let us know via an issue.

  • [Nov 1. 2023] Collecting Evaluation Tasks for SWE-Bench (🔗)
  • [Nov 6. 2023] Evaluating on SWE-bench (🔗)

💫 Contributions

We would love to hear from the broader NLP, Machine Learning, and Software Engineering research communities, and we welcome any contributions, pull requests, or issues! To do so, please either file a new pull request or issue and fill in the corresponding templates accordingly. We'll be sure to follow up shortly!

Contact person: Carlos E. Jimenez and John Yang (Email: {carlosej, jy1682}@princeton.edu).

✍️ Citation

If you find our work helpful, please use the following citation.

@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}

🪪 License

MIT. Check LICENSE.md.


swe-bench's Issues

Availability of ground truth PRs/commits?

Hi,

Thank you for creating and releasing this benchmark. It looks to be super useful for evaluating LLM based code assistance tools.

As per my understanding, in the current dataset, each example consists of the repo and the issue to solve, along with the tests that need to pass.

Since this dataset was mined from actual PRs/Commits on these github repositories, I'm assuming that for each example, there would be a PR that solved the issue in the actual repo.

Would it be possible to also release a reference to the PR/commit that solves each example? This could be either a link or a PR/commit ID. This would make it easier to gauge the kind of code edits the model would need to make in order to be successful at the task.
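
In the meantime, a link can be reconstructed from existing fields, under the assumption that the numeric suffix of instance_id is the upstream pull-request number (e.g. django__django-11299 -> PR #11299); this is a hedged sketch, not an official utility:

def pr_url(instance):
    # instance: one SWE-bench task instance (a dict-like row from the dataset)
    repo = instance["repo"]                                 # e.g. "django/django"
    pr_number = instance["instance_id"].rsplit("-", 1)[-1]  # assumed to be the PR number
    return f"https://github.com/{repo}/pull/{pr_number}"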

Thanks again for your work.

Clarification Needed on Evaluation Code for Retrieval with BM25 and Context Length-Specific Recall Metrics

Hello,

I'm exploring how BM25's retrieval effectiveness is evaluated in your repository, particularly regarding recall metrics across different context lengths. However, I can't locate the code segment for this evaluation process.

Could you guide me to the code sections for:

  1. BM25 Retrieval Evaluation: Specifically, the section that handles the calculation of recall metrics across context lengths.

  2. Context Inclusion Criteria: What criteria determine if a file is part of the context? Are partial inclusions considered, and how does this affect recall metrics? (Is a file considered for evaluation if it is not fully included in the context length?)

  3. Context Definition: Is the context limited to files, or does it include prompts, readme_docs, etc.?

  4. Additionally, regarding file encoding methods under DOCUMENT_ENCODING_FUNCTIONS:

  • file_name_and_contents
  • file_name_and_documentation
  • file_name_and_docs_jedi

Which encoding method is used, and are there performance differences among them?

  5. For evaluating BM25 retrieval with the context limitation, do you use the tokenizer "togethercomputer/LLaMA-2-7B-32K"?

Thanks for your help!

Key does not exist

It seems that there is no key "text" in the uploaded dataset. Is it "hints_text"? This is in the file inference/run_model.py.

Failed to install low versions of Python with Conda

Thank you for crafting this benchmark.

When trying to run evaluation, I got errors about failures to create Conda environments with low Python versions, like 3.5 or 3.6. This seems to be because Conda has removed Python versions below 3.8 from its default channel.

I've tried to look for solutions in the issues as well as on the Internet, but found nothing valuable. Since nobody mentioned this in the issues, I think I may have omitted some important configuration steps. Could you please give some instructions on how to fix this error? Thanks!

2024-04-02 20:20:54,146 - INFO - [Testbed] Installing dependencies for django__django__4.0; Command: source /Volumes/SSD/SWE-bench/harness/testbed/gpt-4/django__django/4.0/tmp6oeue7m9/miniconda3/bin/activate django__django__4.0 && echo 'activate successful' && pip install -r /Volumes/SSD/SWE-bench/harness/testbed/gpt-4/django__django/4.0/tmpq1ztuo7q/requirements.txt
2024-04-02 20:20:55,411 - ERROR - Error: Command '['/Volumes/SSD/SWE-bench/harness/testbed/gpt-4/django__django/2.2/tmpejnlkakl/miniconda3/bin/conda', 'create', '-n', 'django__django__2.2', 'python=3.5', '-y']' returned non-zero exit status 1.
2024-04-02 20:20:55,411 - ERROR - Error stdout: Channels:
 - defaults
Platform: osx-arm64
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed

2024-04-02 20:20:55,411 - ERROR - Error stderr:
PackagesNotFoundError: The following packages are not available from current channels:

  - python=3.5*

Current channels:

  - defaults

To search for alternate channels that may provide the conda package you're
looking for, navigate to

    https://anaconda.org

and use the search bar at the top of the page.

Unable to create a mirror of a repository

(swe-bench) zyw@Ziyues-MacBook-Pro SWE-bench % ./collect/make_repo/make_repo.sh ziyuewang25/toyexamples
GraphQL: ZiyueWang25 does not have the correct permissions to execute `CreateRepository` (createRepository)
Failed to create the repository.

Installing flash-attn during conda create

Gives the following error:

Pip subprocess error:
error: subprocess-exited-with-error

× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [6 lines of output]
Traceback (most recent call last):
File "", line 2, in
File "", line 34, in
File "/tmp/pip-install-9d7tfde7/flash-attn_350bfb51814e4efb95855234be5ab4a3/setup.py", line 8, in
from packaging.version import parse, Version
ModuleNotFoundError: No module named 'packaging'
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

(This happens on very different machines, so I'd assume it is something general.)

logs are unusable with multiple test instances

harness/run_evaluation.py takes a --log_dir argument, but if there are multiple predictions for a single model and test in the --predictions_path file, they all write to a single file in the --log_dir and clobber each other.

Only a few samples can pass all test cases, and metrics problem

I used gpt-3.5-turbo-16k-0613 to do the inference. After I execute run_evaluation.sh, I only get 5 logs that pass all test cases.

And when I try to use convert_log_to_ground_truth to get ground_truth dict, this error occurred:
Traceback (most recent call last):
  File "m.py", line 5, in <module>
    convert_log_to_ground_truth("/home/scruple/SWE-bench/harness/log/gpt-3.5-turbo-16k-0613/django__django-14267.gpt-3.5-turbo-16k-0613.eval.log")
  File "/home/scruple/SWE-bench/metrics/conversion.py", line 36, in convert_log_to_ground_truth
    sms, found = log_path_to_sms(log_fp, log_parser)
ValueError: Log file could not be parsed properly (Before, After Logs not found)
I use the following log, in which I omit some parts.

Task Metadata:
	- Instance ID: django__django-14267
	- Testbed: /home/scruple/SWE-bench/harness/testbed/gpt-3.5-turbo-16k-0613/django__django/3.2/tmpcuuifh1n/django__django__3.2
	- Virtual Env.: django__django__3.2
	- Evaluation Model: gpt-3.5-turbo-16k-0613
>>>>> Patch Apply Failed; (pred_try)
Output:
error: corrupt patch at line 12
>>>>> Applied Patch (pred_minimal_try)
>>>>> Applied Patch (pred_minimal_try)
Installation Command:....

Std. Error: 

>>>>> Init Succeeded
>>>>> Applied Patch (test)
>>>>> Applied Patch (pred_minimal)
.........

>>>>> All Tests Passed

Although the authors provide explanations of the metrics files, I am still confused about the metrics. Could you give some examples of how to obtain the results reported in the paper? Thanks!

Request for Baseline Model Testing Log Files for Research Purposes

I would be extremely grateful if you could share the log files from the baseline model testing process discussed in your study. Access to these logs, specifically those with names akin to 'astropy__astropy-14907.version.eval.log', would be immensely beneficial for my research. It would significantly contribute to a more thorough understanding of the topic.

I deeply appreciate any help you can provide.

Unable to test for Scikit-Learn

Using the docker container in #56, we do:

We first download the swe-bench test set:

# download_swebench_test.py
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("princeton-nlp/SWE-bench")
test = dataset["test"].to_pandas()
test.to_json("data/processed/swe-bench-test.json", orient="records")

mkdir -p data/processed
python3 download_swebench_test.py

# following https://gist.github.com/sorendunn/9f1f1fade59f986b4925b6633f9ff165
mkdir -p data/predictions
curl -o "data/predictions/scikit-learn-133282.jsonl" "https://gist.githubusercontent.com/sorendunn/2ac4579ea6d6e593a597786f4f4f349a/raw/d4be3d7408ebf913e39059a50d54bacc112528a7/scikit-learn-133282.jsonl"

Using the fork that applies #31, with an added fix for matplotlib (#56):

PREDICTIONS=data/predictions/scikit-learn-133282.jsonl
TASKS_FILE=data/processed/swe-bench-test.json

LOG_DIR=data/logs
TESTBED_DIR=data/testbeds
mkdir -p $LOG_DIR
mkdir -p $TESTBED_DIR

python harness/run_evaluation.py \
    --predictions_path $PREDICTIONS \
    --swe_bench_tasks $TASKS_FILE \
    --log_dir $LOG_DIR \
    --testbed $TESTBED_DIR \
    --skip_existing \
    --timeout 900 \
    --verbose

We will get the following errors:

2024-03-20 09:10:23,664 - INFO - Found 1 predictions across 1 model(s) in predictions file
2024-03-20 09:10:23,664 - INFO - [claude-2/scikit-learn__scikit-learn/0.21] # of predictions to evaluate: 1 (0 already evaluated)
2024-03-20 09:10:23,665 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6
2024-03-20 09:10:23,665 - INFO - [Testbed] Using working directory /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmp7egmpmi9 for testbed
2024-03-20 09:10:23,665 - INFO - [Testbed] Repo scikit-learn/scikit-learn: 1 versions
2024-03-20 09:10:23,665 - INFO - [Testbed]      Version 0.21: 1 instances
2024-03-20 09:10:23,665 - INFO - No conda path provided, creating temporary install in /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6/miniconda3...
2024-03-20 09:10:32,365 - INFO - [Testbed] Using conda path /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6/miniconda3
2024-03-20 09:10:32,860 - INFO - [Testbed] Setting up testbed for scikit-learn__scikit-learn__0.21
2024-03-20 09:10:42,252 - INFO - [Testbed] Cloned scikit-learn/scikit-learn to /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmp7egmpmi9/scikit-learn__scikit-learn__0.21
2024-03-20 09:10:42,253 - INFO - [Testbed] Creating environment scikit-learn__scikit-learn__0.21; Command: /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6/miniconda3/bin/conda create -n scikit-learn__scikit-learn__0.21 python=3.6 numpy scipy cython pytest pandas matplotlib -y
/swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmp7egmpmi9/scikit-learn__scikit-learn__0.21: 1 instances
2024-03-20 09:11:10,482 - INFO - [scikit-learn__scikit-learn__0.21] [scikit-learn__scikit-learn-13328] Reset task environment to 37b0e66c871e8fb032a9c7086b2a1d5419838154
2024-03-20 09:11:10,486 - INFO - [scikit-learn__scikit-learn__0.21] [scikit-learn__scikit-learn-13328] Apply patch successful (pred_try)
2024-03-20 09:11:10,490 - INFO - [scikit-learn__scikit-learn__0.21] [scikit-learn__scikit-learn-13328] Revert patch successful (pred_try)
2024-03-20 09:11:10,490 - INFO - [scikit-learn__scikit-learn__0.21] [scikit-learn__scikit-learn-13328] Installing with command: . /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6/miniconda3/bin/activate scikit-learn__scikit-learn__0.21 && echo 'activate successful' && pip install -v --no-use-pep517 --no-build-isolation -e .
2024-03-20 09:11:11,281 - ERROR - Error: Command '. /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6/miniconda3/bin/activate scikit-learn__scikit-learn__0.21 && echo 'activate successful' && pip install -v --no-use-pep517 --no-build-isolation -e .' returned non-zero exit status 1.
2024-03-20 09:11:11,281 - ERROR - Error stdout: activate successful
Using pip 23.3.1 from /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6/miniconda3/lib/python3.11/site-packages/pip (python 3.11)
Obtaining file:///swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmp7egmpmi9/scikit-learn__scikit-learn__0.21
  Preparing metadata (setup.py): started
  Preparing metadata (setup.py): finished with status 'error'

2024-03-20 09:11:11,282 - ERROR - Error stderr:   Running command python setup.py egg_info
  /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmp7egmpmi9/scikit-learn__scikit-learn__0.21/setup.py:12: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html
    from pkg_resources import parse_version
  Partial import of sklearn during the build process.
  Traceback (most recent call last):
    File "<string>", line 2, in <module>
    File "<pip-setuptools-caller>", line 34, in <module>
    File "/swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmp7egmpmi9/scikit-learn__scikit-learn__0.21/setup.py", line 139, in <module>
      from numpy.distutils.command.build_ext import build_ext  # noqa
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  ModuleNotFoundError: No module named 'numpy'
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> See above for output.
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
  full command: /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6/miniconda3/bin/python -c '
  exec(compile('"'"''"'"''"'"'
  # This is <pip-setuptools-caller> -- a caller that pip uses to run setup.py
  #
  # - It imports setuptools before invoking setup.py, to enable projects that directly
  #   import from `distutils.core` to work with newer packaging standards.
  # - It provides a clear error message when setuptools is not installed.
  # - It sets `sys.argv[0]` to the underlying `setup.py`, when invoking `setup.py` so
  #   setuptools doesn'"'"'t think the script is `-c`. This avoids the following warning:
  #     manifest_maker: standard file '"'"'-c'"'"' not found".
  # - It generates a shim setup.py, for handling setup.cfg-only projects.
  import os, sys, tokenize
  
  try:
      import setuptools
  except ImportError as error:
      print(
          "ERROR: Can not execute `setup.py` since setuptools is not available in "
          "the build environment.",
          file=sys.stderr,
      )
      sys.exit(1)
  
  __file__ = %r
  sys.argv[0] = __file__
  
  if os.path.exists(__file__):
      filename = __file__
      with tokenize.open(__file__) as f:
          setup_py_code = f.read()
  else:
      filename = "<auto-generated setuptools caller>"
      setup_py_code = "from setuptools import setup; setup()"
  
  exec(compile(setup_py_code, filename, "exec"))
  '"'"''"'"''"'"' % ('"'"'/swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmp7egmpmi9/scikit-learn__scikit-learn__0.21/setup.py'"'"',), "<pip-setuptools-caller>", "exec"))' egg_info --egg-base /tmp/pip-pip-egg-info-nsx1cti6
  cwd: /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmp7egmpmi9/scikit-learn__scikit-learn__0.21/
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

2024-03-20 09:11:11,283 - ERROR - Error traceback: Traceback (most recent call last):
  File "/swe-bench/harness/context_manager.py", line 49, in __call__
    output = subprocess.run(cmd, **combined_args)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/swe-bench/miniconda3/lib/python3.12/subprocess.py", line 571, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '. /swe-bench/data/testbeds/claude-2/scikit-learn__scikit-learn/0.21/tmpw5ebtgx6/miniconda3/bin/activate scikit-learn__scikit-learn__0.21 && echo 'activate successful' && pip install -v --no-use-pep517 --no-build-isolation -e .' returned non-zero exit status 1.

2024-03-20 09:11:11,283 - ERROR - [scikit-learn__scikit-learn__0.21] [scikit-learn__scikit-learn-13328] Installation failed

Upper bound score by skilled human?

Have you made any attempt to characterize a "skilled human upper bound" score?

I randomly sampled a dozen SWE-bench tasks myself, and found that many were basically impossible for a skilled human to “solve”. Mainly because the tasks were underspecified with respect to the hidden test cases that determine passing. The tests were checking implementation-specific details from the repo’s PR that weren't actually stated requirements of the task.

I asked this question in the current HN discussion about SWE agent, but it doesn't look like you are participating there.

https://news.ycombinator.com/item?id=39910452

Unable to replicate basic results

I am attempting to run swe-bench on a small 20-sample subset of the test dataset. The instance IDs in question are:

django__django-11299
django__django-11618
django__django-12148
django__django-13347
django__django-14109
django__django-14334
django__django-15572
django__django-16873
matplotlib__matplotlib-23562
psf__requests-2873
pylint-dev__pylint-6556
scikit-learn__scikit-learn-10377
scikit-learn__scikit-learn-11315
scikit-learn__scikit-learn-12938
sphinx-doc__sphinx-9260
sympy__sympy-12881
sympy__sympy-13744
sympy__sympy-15542
sympy__sympy-15599

I ran the evaluation using OpenDevin's Dockerfile and noticed a number of build issues in the logs including the one mentioned in issue #57. The final report obtained by the metric script is shown below.

gold_patch_test Evaluation Report:
        None:      0
        Generated: 20
        With Logs: 20
        Applied:   17
        Resolved:  0

Why is it the case that, for the gold patches provided by the dataset itself, only 17 were able to be applied and zero were counted as resolved?

about inference environment and swe-bench virtual environment

The Python version in the swe-bench virtual environment is different from the Python version required in inference/environment.yml. Do we still need to perform inference under the swe-bench environment? When should we use the swe-bench virtual environment? Thank you~

Inference on CPU

Would it be possible to run inference on a CPU?
It seems that run_llama.py requires a GPU.

python run_llama.py --dataset_name_or_path princeton-nlp/SWE-bench_oracle --model_name_or_path princeton-nlp/SWE-Llama-13b --output_dir ./outputs --temperature 0

Difficulty Reproducing BM25 Results

I've encountered issues while trying to reproduce the BM25 results mentioned in the documentation. I've faced the following challenges:

  • How does the script handle files with more context than the tokenizer can support? Is there a filtering mechanism in place to manage such instances?
  • Could you provide more details on how the parameter k is utilized in the script and its impact on the results?

I would appreciate any guidance or suggestions on how to address these issues to achieve the expected BM25 results.

Moreover, the tokenizer is being created for each instance, rather than being kept in memory. This seems to be inefficient and could potentially affect performance.
Also, the tokenization process does not appear to be parallelized. As a result, processing is slow, and when running the test dataset overnight, the scores achieved are lower than expected.
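
As an illustration of the tokenizer-reuse point (a sketch only, not the repository's code; the model name is taken from an earlier issue on this page and is an assumption here), the tokenizer could be created once and reused across instances:

from functools import lru_cache
from transformers import AutoTokenizer

@lru_cache(maxsize=1)
def get_tokenizer(name="togethercomputer/LLaMA-2-7B-32K"):
    # Created once and cached, instead of being rebuilt for every instance.
    return AutoTokenizer.from_pretrained(name)

def context_length(text):
    # Number of tokens a candidate document would occupy in the prompt.
    return len(get_tokenizer()(text, add_special_tokens=False)["input_ids"])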

UnboundLocalError: local variable 'env_name' referenced before assignment in harness/utils.py line 56

Thanks for making it easy to run your benchmark!

I do experience an error which prevents run_evaluation.py from running correctly:

Traceback (most recent call last):
  File "/home/benjamin/.conda/envs/autodev-swebench/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/benjamin/.conda/envs/autodev-swebench/lib/python3.9/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/benjamin/SWE-bench/harness/engine_evaluation.py", line 168, in main
    setup_testbed(data_groups[0])
  File "/home/benjamin/SWE-bench/harness/engine_validation.py", line 90, in setup_testbed
    with TestbedContextManager(
  File "/home/benjamin/SWE-bench/harness/context_manager.py", line 227, in __enter__
    env_list = get_conda_env_names(exec_cmd, shellenv)
  File "/home/benjamin/SWE-bench/harness/utils.py", line 56, in get_conda_env_names
    env_names.append(env_name)
UnboundLocalError: local variable 'env_name' referenced before assignment

I will submit a PR to fix this minor bug.
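
For reference, one plausible guard (a sketch only, not necessarily what the PR does) is to append an environment name only when a line of `conda env list` output actually parses to one:

import subprocess

def get_conda_env_names(conda_exec):
    # List the environment names known to the given conda executable.
    output = subprocess.run(
        [conda_exec, "env", "list"], capture_output=True, text=True, check=True
    ).stdout
    env_names = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip header and blank lines
        parts = line.split()
        # Unnamed environments are listed with only a path; skip those.
        if parts and "/" not in parts[0] and "\\" not in parts[0]:
            env_names.append(parts[0])
    return env_names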

Sometimes the gold_patch cannot pass the tests

This is a very challenging benchmark, I have learned a lot from it. Thank you for the effort you have put into this.

I tested using the swe-llama13b you provided and found that the number of tasks that can be successfully solved is 0. Then I changed the KEY_PREDICTION from model_patch to patch, which is the target value of the prediction, and found that there are still a large number of tasks that cannot pass the test. I am using a Mac system, and I only made a modification in one place, which is changing sed -i 's/pytest/pytest -rA/' tox.ini to sed -i '' 's/pytest/pytest -rA/' tox.ini, and did not make other modifications beyond this.

For example, below are the results of pytest-dev__pytest-5103 and pylint-dev__pylint-8281 respectively.

(screenshots omitted)

errors related to subprocesses in run_evaluation.py

Thank you all for your amazing work here. This benchmark seems extremely useful, and I am excited to work on improving performance on it. Unfortunately, I have been encountering some difficulties running the evaluation harness. I have tried running run_evaluation.py on several different platforms (Windows, Mac, Google Colab, an Amazon AWS EC2 Ubuntu instance) and continually encounter errors similar to those described in Issue #6.

To attempt to fix these errors, I adapted the approach suggested by yuexihang in that issue and changed the command executions in context_manager.py from source ... to bash -c 'source ...'. This allows me to run executions without errors; however, some gold patches from the dataset seem to fail to apply, and the generations that do run successfully do not log the passed test cases in the log file (though they do include >>>>> All Tests Passed).

The exact modifications I made to the code, as well as two of the example problems I am evaluating it on, are under the issue branch of my fork of your repository (to run the code with the modifications, you can simply rename context_manager_new.py to context_manager.py).

The matplotlib example given there is the example given in your paper which passes the test cases. It also appears to pass the test cases under the modified code but is an example of a case where the passed test cases are not logged in the log file. The django example appears to be failing to apply but isn't throwing any errors which interrupt execution. Thanks for your help with this error!

Something wrong in build_dataset.py

When I run
python build_dataset.py "D:\code\swe-bench\data\path_prs\curve-contract-prs.jsonl" "D:\code\swe-bench\data\path_tasks\curve-contract-task-instances.jsonl" --token <my_token>
I get the following error message:

2024-01-05 20:15:31,027 - main - INFO - 0 instance_ids previously recorded
2024-01-05 20:15:31,027 - main - INFO - [curvefi/curve-contract] ( Up to 0 checked ) 0 valid, 0 with tests.
Traceback (most recent call last):
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 790, in urlopen
response = self._make_request(
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 491, in _make_request
raise new_e
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 1096, in _validate_conn
conn.connect()
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connection.py", line 642, in connect
sock_and_verified = _ssl_wrap_socket_and_match_hostname(
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connection.py", line 782, in ssl_wrap_socket_and_match_hostname
ssl_sock = ssl_wrap_socket(
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\util\ssl
.py", line 470, in ssl_wrap_socket
ssl_sock = ssl_wrap_socket_impl(sock, context, tls_in_tls, server_hostname)
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\util\ssl
.py", line 514, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "D:\anaconda\envs\swe-bench\lib\ssl.py", line 501, in wrap_socket
return self.sslsocket_class._create(
File "D:\anaconda\envs\swe-bench\lib\ssl.py", line 1074, in _create
self.do_handshake()
File "D:\anaconda\envs\swe-bench\lib\ssl.py", line 1343, in do_handshake
self._sslobj.do_handshake()
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\adapters.py", line 486, in send
resp = conn.urlopen(
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 844, in urlopen
retries = retries.increment(
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\util\retry.py", line 470, in increment
raise reraise(type(error), error, _stacktrace)
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\util\util.py", line 38, in reraise
raise value.with_traceback(tb)
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 790, in urlopen
response = self._make_request(
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 491, in _make_request
raise new_e
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 467, in _make_request
self._validate_conn(conn)
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connectionpool.py", line 1096, in _validate_conn
conn.connect()
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connection.py", line 642, in connect
sock_and_verified = _ssl_wrap_socket_and_match_hostname(
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\connection.py", line 782, in ssl_wrap_socket_and_match_hostname
ssl_sock = ssl_wrap_socket(
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\util\ssl
.py", line 470, in ssl_wrap_socket
ssl_sock = ssl_wrap_socket_impl(sock, context, tls_in_tls, server_hostname)
File "D:\anaconda\envs\swe-bench\lib\site-packages\urllib3\util\ssl
.py", line 514, in _ssl_wrap_socket_impl
return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
File "D:\anaconda\envs\swe-bench\lib\ssl.py", line 501, in wrap_socket
return self.sslsocket_class._create(
File "D:\anaconda\envs\swe-bench\lib\ssl.py", line 1074, in _create
self.do_handshake()
File "D:\anaconda\envs\swe-bench\lib\ssl.py", line 1343, in do_handshake
self._sslobj.do_handshake()
urllib3.exceptions.ProtocolError: ('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.', None, 10060, None))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "D:\code\swe-bench\SWE-bench-main\SWE-bench-main\collect\build_dataset.py", line 185, in
main(**vars(args))
File "D:\code\swe-bench\SWE-bench-main\SWE-bench-main\collect\build_dataset.py", line 161, in main
instance = create_instance(repo, pull)
File "D:\code\swe-bench\SWE-bench-main\SWE-bench-main\collect\build_dataset.py", line 28, in create_instance
patch, test_patch = extract_patches(pull, repo)
File "D:\code\swe-bench\SWE-bench-main\SWE-bench-main\collect\utils.py", line 313, in extract_patches
patch = requests.get(pull["diff_url"]).text
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\api.py", line 73, in get
return request("get", url, params=params, **kwargs)
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\api.py", line 59, in request
return session.request(method=method, url=url, **kwargs)
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\sessions.py", line 725, in send
history = [resp for resp in gen]
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\sessions.py", line 725, in
history = [resp for resp in gen]
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\sessions.py", line 266, in resolve_redirects
resp = self.send(
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
File "D:\anaconda\envs\swe-bench\lib\site-packages\requests\adapters.py", line 501, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', TimeoutError(10060, 'A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.', None, 10060, None))

I'm sure there's nothing wrong with my internet connection. What can I do to fix this? Or am I using the wrong command?

FileNotFoundError in `TaskEnvContextManager`

Running the harness throws a FileNotFoundError in the TaskEnvContextManager.
In the current implementation of the TaskEnvContextManager's __enter__ method, there is an os.chdir call. If the log_file is provided as a relative path, this causes an issue and crashes, but the code runs fine if I provide an absolute path.

Can a caveat be added to the README that an absolute path is needed for the log_dir command-line argument?
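
In the meantime, the workaround is simply to resolve the path up front before invoking the harness, e.g. (a sketch with a hypothetical path):

import os

log_dir = os.path.abspath("evaluation_logs")  # hypothetical relative path
os.makedirs(log_dir, exist_ok=True)
# Pass log_dir to harness/run_evaluation.py via --log_dir so the later os.chdir() is harmless.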

Argument `--swe_bench_tasks` is unclear

Could you clarify the --swe_bench_tasks argument in /harness/run_evaluation.py?
How can I access the SWE-bench task instances when directly importing the dataset from HuggingFace? Should I write the instances to a JSON file when running inference?

Reason for collecting hints_text only before initial commit?

I'm curious if there's a reason why the hints text is only collected before the initial PR commit?

In my experience, the initial commit is often not the patch that gets merged in, so subsequent hints and review feedback could be helpful in providing guidance for the eventual solution.


inference/environment.yml fails to install

When I try to install the dependencies from inference/environment.yml, I get the following error:

Pip subprocess error:
ERROR: Ignored the following versions that require a different python version: 0.0.1 Requires-Python >=3.9,<4.0; 1.25.0 Requires-Python >=3.9; 1.25.0rc1 Requires-Python >=3.9; 1.25.1 Requires-Python >=3.9; 1.25.2 Requires-Python >=3.9; 1.26.0 Requires-Python <3.13,>=3.9; 1.26.0b1 Requires-Python <3.13,>=3.9; 1.26.0rc1 Requires-Python <3.13,>=3.9; 1.26.1 Requires-Python <3.13,>=3.9; 2.1.0 Requires-Python >=3.9; 2.1.0rc0 Requires-Python >=3.9; 2.1.1 Requires-Python >=3.9; 3.2 Requires-Python >=3.9; 3.2rc0 Requires-Python >=3.9; 8.13.1 Requires-Python >=3.9; 8.13.2 Requires-Python >=3.9; 8.14.0 Requires-Python >=3.9; 8.15.0 Requires-Python >=3.9; 8.16.0 Requires-Python >=3.9; 8.16.1 Requires-Python >=3.9
ERROR: Could not find a version that satisfies the requirement pytorch-triton==2.1.0+e6216047b8 (from versions: 0.0.1)
ERROR: No matching distribution found for pytorch-triton==2.1.0+e6216047b8

failed

CondaEnvException: Pip failed

After commenting out
# - python-json-logger==2.0.7
# - pytorch-triton==2.1.0+e6216047b8
# - tomli==2.0.1
# - torch==2.1.0.dev20230821+cu121
I can install the rest of the pip packages, but I need to install torch separately.
I don't know if this has some unspecified impact.

SWE-bench GitHub org doesn't have forks of dev set repos

Hello! I noticed that SWE-bench's evaluation and validation code (in the harness subdirectory of this repo) assumes that the SWE-bench GitHub org contains forks of all the open-source projects on which task instances are based (e.g. this fork of Astropy: https://github.com/swe-bench/astropy__astropy).

Here's the code that assumes this: https://github.com/princeton-nlp/SWE-bench/blob/main/harness/utils.py#L238-L242

However, the SWE-bench org doesn't have forks of the projects used for the dev set. E.g. https://github.com/swe-bench/sqlfluff__sqlfluff doesn't exist.

Do you plan to create these forks? If not, I could submit a PR that changes the evaluation and validation code to fall back to the open-source project's original GitHub repo if no fork exists.

Thanks for your time.

conda problems

Problem 1:
When I evaluate this JSON file:

[
    {
        "instance_id": "matplotlib__matplotlib-24971",
        "prediction": "\n--- a/lib/matplotlib/_tight_bbox.py\n+++ b/lib/matplotlib/_tight_bbox.py\n@@ -59,7 +59,7 @@ def adjust_bbox(fig, bbox_inches, fixed_dpi=None):\n     fig.bbox_inches = Bbox.from_bounds(0, 0, *bbox_inches.size)\n     x0, y0 = tr.transform(bbox_inches.p0)\n     w1, h1 = fig.bbox.size * dpi_scale\n-    fig.transFigure._boxout = Bbox.from_bounds(-x0, -y0, w1, h1)\n+    fig.transFigure._boxout = Bbox.from_bounds(-x0, -y0, w1, h1, transform=fig.transFigure)\n     fig.transFigure.invalidate()\n \n     fig.bbox = TransformedBbox(fig.bbox_inches, tr)\n",
        "model": "gpt-3.5-turbo-16k-0613"
    }
]

This error occurred.

subprocess.CalledProcessError: Command '/home/scruple/SWE-bench/harness/testbed/gpt-3.5-turbo-16k-0613/matplotlib__matplotlib/3.6/tmpbl661dda/miniconda3/bin/conda env list' returned non-zero exit status 1.

Problem 2:
Also, I want to know if there is a way to avoid repeatedly creating a conda env each time (the log says "No conda path provided, creating temporary install in ...").

Besides, I think it would be convenient if you could provide a Docker image.
Thank you very much.
@john-b-yang

Code to collate the results

Hi,

The harness runs the tests and writes the logs; however, is there any utility to aggregate the results from those logs and produce the numbers reported in the paper?

Thank you!

How to run evaluation with swe-bench_lite?

I have managed to run evaluation according to the tutorial with swe-bench.json. But it does not work with the lite version of SWE-bench, because the benchmark files for the lite version are stored in *.arrow format instead of *.json. So, how can I run evaluation for swe-bench_lite?
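
One workaround (a sketch; the HuggingFace dataset id princeton-nlp/SWE-bench_Lite is assumed) is to materialize the lite test split as a JSON file that --swe_bench_tasks can read, mirroring the conversion used for the full test set in an earlier issue on this page:

from datasets import load_dataset

# Convert the lite test split from .arrow to the .json format the harness expects.
lite = load_dataset("princeton-nlp/SWE-bench_Lite")
lite["test"].to_pandas().to_json("swe-bench-lite-test.json", orient="records")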

Unreliability when generating patches with `diff` format?

My experience with using LLMs for codegen is that they struggle to generate valid diff patches (I even tweeted about it), thus I was surprised to see them used here:

PATCH_EXAMPLE = """--- a/file.py
+++ b/file.py
@@ -1,27 +1,35 @@
def euclidean(a, b):
- while b:
- a, b = b, a % b
- return a
+ if b == 0:
+ return a
+ return euclidean(b, a % b)
def bresenham(x0, y0, x1, y1):
points = []
dx = abs(x1 - x0)
dy = abs(y1 - y0)
- sx = 1 if x0 < x1 else -1
- sy = 1 if y0 < y1 else -1
- err = dx - dy
+ x, y = x0, y0
+ sx = -1 if x0 > x1 else 1
+ sy = -1 if y0 > y1 else 1
- while True:
- points.append((x0, y0))
- if x0 == x1 and y0 == y1:
- break
- e2 = 2 * err
- if e2 > -dy:
+ if dx > dy:
+ err = dx / 2.0
+ while x != x1:
+ points.append((x, y))
err -= dy
- x0 += sx
- if e2 < dx:
- err += dx
- y0 += sy
+ if err < 0:
+ y += sy
+ err += dx
+ x += sx
+ else:
+ err = dy / 2.0
+ while y != y1:
+ points.append((x, y))
+ err -= dx
+ if err < 0:
+ x += sx
+ err += dy
+ y += sy
+ points.append((x, y))
return points"""

I am curious what your experience with this is, given that diffs are used here.

On my tweet I got a reply suggesting that Claude was better, but still not very good.

Due to this unreliability, I adopted another format, the git conflict-marker style, which models seem to generate pretty reliably. I've since found it to be seemingly the most popular way of generating patches in codegen tools (used by my own gptme, gpt-engineer, Cursor, etc.).

Here's an example:

<<<<<<< HEAD
print("hello world!")
=======
name = input("what is your name?")
print(f"hello {name}!")
>>>>>>> updated

Haven't dug super deep or tried running the SWE-bench code yet, but I plan to.

Error when running evaluation script - ModuleNotFoundError: No module named 'swebench'

When I try to run the run_evaluation.sh script (or the python run_evaluation.py --all --the --args --... script for that matter) from the swebench/harness directory, I get the following error:

Traceback (most recent call last):
  File "/swebench_workspace/SWE-bench/swebench/harness/run_evaluation.py", line 15, in <module>
    from swebench.harness.constants import (
ModuleNotFoundError: No module named 'swebench'

Adding the SWE-bench directory to my PYTHONPATH as below before running the script solves the issue:

PYTHONPATH=/swebench_workspace/SWE-bench:$PYTHONPATH  ./run_evaluation.sh

Issues setting up environment -- Possible [BUG]

Hello!

For the specific case of instance id matplotlib__matplotlib-17810, the requirements file that is installed is requirements.txt, which tries to install git+git://github.com/sphinx-gallery/sphinx-gallery@b41e328#egg=sphinx-gallery and fails. How can I get around this issue?

Thank you!

Unable to run evaluation. Testbed is having trouble creating a specified conda environment

First, kudos for open-sourcing such an inspiring body of work; your scripts are incredibly easy to run.
But I've encountered a problem, and I'm not quite sure what is wrong. Could you please help me reproduce your results?

The following error was thrown when I tried to execute run_evaluation.sh using the generated results from the GPT models you uploaded in a previous issue:

raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command 'source /home/yangzhiyu/workspace/SWE-bench/testbed/gpt-4-32k-0613/django__django/4.1/tmpkaujz6pc/miniconda3/bin/activate django__django__4.1 && echo 'activate successful' && pip install -r /home/yangzhiyu/workspace/SWE-bench/testbed/gpt-4-32k-0613/django__django/4.1/tmp9l993fhu/requirements.txt' returned non-zero exit status 127.

It seems like the testbed is having trouble creating a specified conda environment, but I have faithfully followed your instructions on creating the swe-bench environment. Please help.

run_evaluation.py throws subprocess.CalledProcessError (non-zero exit status 127)

Thanks again for sharing this very interesting benchmark. I am trying to evaluate a model-generated patch via the harness/run_evaluation.py script. Unfortunately, when I call the script with any kind of patch, I consistently get a subprocess.CalledProcessError (non-zero exit status 127) after a while.

Steps to reproduce the error

  1. I install and activate the main conda environment from the ./environment.yaml file (with the torch, flash-attn, and transformers libraries commented out due to incompatibility).
  2. I created the following minimal patch file minimal_patch.json inside a human folder
    [
        {
            "instance_id": "matplotlib__matplotlib-13989",
            "prediction": "<patch>\n</patch>",
            "model": "humanoid"
        }
    ]
    
  3. I run the command python ../harness/run_evaluation.py --predictions_path minimal_patch.json --log_dir evaluation_outputs --swe_bench_tasks ../swe-bench.json --testbed eval-artifacts-deleteme --verbose
  4. I get the following output and error
    $ python ../harness/run_evaluation.py --predictions_path minimal_patch.json --log_dir evaluation_outputs --swe_bench_tasks ../swe-bench.json --testbed eval-artifacts-deleteme --verbose
    2023-10-23 15:33:18,847 - INFO - Found 1 predictions across 1 model(s) in predictions file
    2023-10-23 15:33:18,847 - INFO - [humanoid/matplotlib__matplotlib/3.0] # of predictions to evaluate: 1
    2023-10-23 15:33:18,879 - INFO - [Testbed] Using conda path eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmps9w_ms71
    2023-10-23 15:33:18,879 - INFO - [Testbed] Using working directory eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmpu8oa_lhv for testbed
    2023-10-23 15:33:18,879 - INFO - [Testbed] Repo matplotlib/matplotlib: 1 versions
    2023-10-23 15:33:18,879 - INFO - [Testbed]      Version 3.0: 1 instances
    2023-10-23 15:33:18,880 - INFO - No conda path provided, creating temporary install in eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmps9w_ms71/miniconda3...
    2023-10-23 15:33:33,648 - INFO - [Testbed] Using conda path eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmps9w_ms71/miniconda3
    2023-10-23 15:33:34,693 - INFO - [Testbed] Setting up testbed for matplotlib__matplotlib__3.0
    2023-10-23 15:34:13,333 - INFO - [Testbed] Cloned matplotlib/matplotlib to eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmpu8oa_lhv/matplotlib__matplotlib__3.0
    2023-10-23 15:34:13,333 - INFO - [Testbed] Creating environment matplotlib__matplotlib__3.0; Command: /workspaces/SWE-bench/human/eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmps9w_ms71/miniconda3/bin/conda create -n matplotlib__matplotlib__3.0 python=3.7 -y
    2023-10-23 15:34:30,139 - INFO - [Testbed] Installing dependencies for matplotlib__matplotlib__3.0; Command: source /workspaces/SWE-bench/human/eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmps9w_ms71/miniconda3/bin/activate matplotlib__matplotlib__3.0 && pip install -r eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmpu8oa_lhv/requirements.txt
    multiprocessing.pool.RemoteTraceback: 
    """
    Traceback (most recent call last):
      File "/workspaces/SWE-bench/env/lib/python3.8/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "/workspaces/SWE-bench/env/lib/python3.8/multiprocessing/pool.py", line 48, in mapstar
        return list(map(*args))
      File "/workspaces/SWE-bench/harness/engine_evaluation.py", line 164, in main
        setup_testbed(data_groups[0])
      File "/workspaces/SWE-bench/harness/engine_validation.py", line 90, in setup_testbed
        with TestbedContextManager(
      File "/workspaces/SWE-bench/harness/context_manager.py", line 215, in __enter__
        subprocess.run(cmd, shell=True, **self.subprocess_args)
      File "/workspaces/SWE-bench/env/lib/python3.8/subprocess.py", line 516, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command 'source /workspaces/SWE-bench/human/eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmps9w_ms71/miniconda3/bin/activate matplotlib__matplotlib__3.0 && pip install -r eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmpu8oa_lhv/requirements.txt' returned non-zero exit status 127.
    """
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "../harness/run_evaluation.py", line 169, in <module>
        main(**vars(args))
      File "../harness/run_evaluation.py", line 149, in main
        pool.map(eval_engine, eval_args)
      File "/workspaces/SWE-bench/env/lib/python3.8/multiprocessing/pool.py", line 364, in map
        return self._map_async(func, iterable, mapstar, chunksize).get()
      File "/workspaces/SWE-bench/env/lib/python3.8/multiprocessing/pool.py", line 771, in get
        raise self._value
    subprocess.CalledProcessError: Command 'source /workspaces/SWE-bench/human/eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmps9w_ms71/miniconda3/bin/activate matplotlib__matplotlib__3.0 && pip install -r eval-artifacts-deleteme/humanoid/matplotlib__matplotlib/3.0/tmpu8oa_lhv/requirements.txt' returned non-zero exit status 127.
    

The error appears to come from one of the subprocess calls in harness/context_manager.py. I am not sure how to fix this, I tried adapting the context manager script but with no success.

OS Details

My details for reproducing the above

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"

EDIT: I used a default GitHub codespace when running into this error. This should make reproducing the platform easier.

Thanks a lot for your help with this error!

Issues in the test case parsing logic from the logs

Firstly, thanks for this useful dataset.

Some of the pytest logs have a format such as this:

PASSED sklearn/feature_extraction/tests/test_text.py::test_callable_analyzer_error[file-AttributeError-'str' object has no attribute 'read'-CountVectorizer]
PASSED sklearn/feature_extraction/tests/test_text.py::test_callable_analyzer_error[file-AttributeError-'str' object has no attribute 'read'-TfidfVectorizer]

While parsing which test cases passed, the current code splits the line by spaces and takes the second token. As a result:

  • test_callable_analyzer_error[file-AttributeError-'str' object has no attribute 'read'-CountVectorizer maps to sklearn/feature_extraction/tests/test_text.py::test_callable_analyzer_error[file-AttributeError-'str'
  • test_callable_analyzer_error[file-AttributeError-'str' object has no attribute 'read'-TfidfVectorizer maps to sklearn/feature_extraction/tests/test_text.py::test_callable_analyzer_error[file-AttributeError-'str'

These erroneous mappings have also been captured in the dataset such as for scikit-learn__scikit-learn-14430, scikit-learn__scikit-learn-13554 and several others.

The correction requires very minor changes in the code. Posting here for others using the dataset.
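
For reference, a sketch of such a correction (assumed, not the exact patch): strip only the PASSED prefix so that bracketed parameters containing spaces stay attached to the test id.

def parse_passed_test(line):
    # Return the full test id from a pytest "PASSED <test id>" log line, or None.
    prefix = "PASSED "
    if line.startswith(prefix):
        return line[len(prefix):].strip()
    return None

# Example from above: the bracketed parameter set is kept intact.
line = ("PASSED sklearn/feature_extraction/tests/test_text.py::"
        "test_callable_analyzer_error[file-AttributeError-'str' object has no attribute 'read'-CountVectorizer]")
assert parse_passed_test(line).endswith("-CountVectorizer]")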

Tests that should fail don't fail.

Thank you for providing a great data set.
I have tested it in my environment and confirmed that the FAIL_TO_PASS tests do not fail in the following instances, although they should fail before the gold patch is applied.

psf__requests-1689
psf__requests-1724
psf__requests-1888
psf__requests-2153
psf__requests-2617
psf__requests-2674
psf__requests-2821
psf__requests-774
psf__requests-863

I have applied the test patch.

Could you please confirm this?

PASS_TO_PASS tests failing on original program

Thank you for creating the dataset!

I was running scripts on the benchmark, and realized that sometimes the PASS_TO_PASS tests fail on the original version of the subject when no patch was applied.

For example, in matplotlib__matplotlib-24334, upon checking out the base_commit, entering the conda env, and performing install commands, executing the test_cmd pytest --no-header -rA --tb=no -p no:cacheprovider lib/matplotlib/tests/test_axes.py results in some test failures on my system (ubuntu 20.04 LTS). The following tests failed and are in the PASS_TO_PASS list of this instance:

FAILED lib/matplotlib/tests/test_axes.py::test_hist2d[png] - matplotlib.testing.exceptions.ImageComparisonFailure: images not close (RMS 5.559):
FAILED lib/matplotlib/tests/test_axes.py::test_hist2d[pdf] - matplotlib.testing.exceptions.ImageComparisonFailure: images not close (RMS 142.950):
FAILED lib/matplotlib/tests/test_axes.py::test_hist2d[svg] - matplotlib.testing.exceptions.ImageComparisonFailure: images not close (RMS 4.493):
FAILED lib/matplotlib/tests/test_axes.py::test_hist2d_transpose[pdf] - matplotlib.testing.exceptions.ImageComparisonFailure: images not close (RMS 155.125):

These tests seem to check histogram generation. I am not sure why they fail, but maybe some system-level dependencies were missing or had the wrong version on my system.

May I know whether these tests could pass on your system? Since some of the projects in the benchmark require quite a number of system-level dependencies, it might be good to have a Docker environment. Also, for the tests that are not very related to the target issue but have unstable behavior due to the host environment, would you consider removing them from the benchmark?

Thank you for looking into this!

Updated Docker Container

Thank you for all your hard work on this benchmark. I saw that in Issue #15 it was mentioned that there was work being done on a Docker container, and a work-in-progress Dockerfile was shared. However, when I attempt to run evaluation in this Docker container, no repositories seem to be properly installed for evaluation. Would you be able to share an updated version of the Docker container, along with the repositories for which evaluation is currently working with it?

Thanks!

Regarding the inconsistent results of the apply ratio

In Table 5 of the paper, under the condition of oracle retrieval, the apply ratio of GPT-4 is 13.2%. However, in Table 14, the number of applies is 150, the number of generations is 472, and the ratio is 31.8%. These numbers are inconsistent. I look forward to your response. Thank you.
