nus-apr / auto-code-rover Goto Github PK

A project structure aware autonomous software engineer aiming for autonomous program improvement. Resolved 30.67% tasks (pass@1) in SWE-bench lite and 38.40% tasks (pass@1) in SWE-bench verified with each task costs less than $0.7.

License: Other

Dockerfile 0.19% Python 95.35% CSS 0.02% TypeScript 4.33% JavaScript 0.06% Shell 0.06%

auto-code-rover's Introduction

AutoCodeRover: Autonomous Program Improvement

ArXiv Paper Website Discord server

Note

This is a public version of the AutoCodeRover project. Check the latest results on our website.

📣 Updates

[August 14, 2024] On the SWE-bench Verified dataset released by OpenAI, AutoCodeRover(v20240620) achieves 38.40% efficacy, and AutoCodeRover(v20240408) achieves 28.8% efficacy. More details in the blog post from OpenAI and SWE-bench leaderboard.
[July 18, 2024] AutoCodeRover now supports a new mode that outputs the list of potential fix locations.
[June 20, 2024] AutoCodeRover(v20240620) now achieves 30.67% efficacy (pass@1) on SWE-bench-lite!
[June 08, 2024] Added support for Gemini, Groq (thank you KasaiHarcore for the contribution!) and Anthropic models through AWS Bedrock (thank you JGalego for the contribution!).
[April 29, 2024] Added support for Claude and Llama models. Find the list of supported models here! Support for more models coming soon.
[April 19, 2024] AutoCodeRover now supports running on GitHub issues and local issues! Feel free to try it out and we welcome your feedback!

Discord - server for general discussion, questions, and feedback.

👋 Overview

AutoCodeRover is a fully automated approach for resolving GitHub issues (bug fixing and feature addition) where LLMs are combined with analysis and debugging capabilities to prioritize patch locations ultimately leading to a patch.

[Update on June 20, 2024] AutoCodeRover(v20240620) now resolves 30.67% of issues (pass@1) in SWE-bench lite! AutoCodeRover achieved this efficacy while being economical - each task costs less than $0.7 and is completed within 7 mins!

[April 08, 2024] First release of AutoCodeRover(v20240408) resolves 19% of issues in SWE-bench lite (pass@1), improving over the current state-of-the-art efficacy of AI software engineers.

AutoCodeRover works in two stages:

🔎 Context retrieval: The LLM is provided with code search APIs to navigate the codebase and collect relevant context.
💊 Patch generation: The LLM tries to write a patch, based on retrieved context.

✨ Highlights

AutoCodeRover has two unique features:

Code search APIs are Program Structure Aware. Instead of searching over files by plain string matching, AutoCodeRover searches for relevant code context (methods/classes) in the abstract syntax tree.
When a test suite is available, AutoCodeRover can take advantage of test cases to achieve an even higher repair rate, by performing statistical fault localization.

🗎 arXiv Paper

AutoCodeRover: Autonomous Program Improvement [arXiv 2404.05427]

For referring to our work, please cite and mention:

@misc{zhang2024autocoderover,
      title={AutoCodeRover: Autonomous Program Improvement},
      author={Yuntong Zhang and Haifeng Ruan and Zhiyu Fan and Abhik Roychoudhury},
      year={2024},
      eprint={2404.05427},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

✔️ Example: Django Issue #32347

As an example, AutoCodeRover successfully fixed issue #32347 of Django. See the demo video for the full process:

acr-final.mp4

Enhancement: leveraging test cases

AutoCodeRover can resolve even more issues, if test cases are available. See an example in the video:

acr_enhancement-final.mp4

🚀 Setup & Running

Setup API key and environment

We recommend running AutoCodeRover in a Docker container.

Set the OPENAI_KEY env var to your OpenAI key:

export OPENAI_KEY=sk-YOUR-OPENAI-API-KEY-HERE

For Anthropic model, Set the ANTHROPIC_API_KEY env var can be found here

export ANTHROPIC_API_KEY=sk-ant-api...

The same with GROQ_API_KEY

Build and start the docker image:

docker build -f Dockerfile -t acr .
docker run -it -e OPENAI_KEY="${OPENAI_KEY:-OPENAI_API_KEY}" -p 3000:3000 -p 5000:5000 acr

Alternatively, you can use Dockerfile.scratch which supports arm64 (Apple silicon) and ppc in addition to amd64. Dockerfile.scratch will build both SWE-bench (from https://github.com/yuntongzhang/SWE-bench.git) and ACR.

docker build -f Dockerfile.scratch -t acr .

There are build args for customizing the build in Dockerfile.scratch like this:

docker build --build-arg [email protected] --build-arg GIT_NAME=your_id \
       --build-arg SWE_BENCH_REPO=https://github.com/your_id/SWE-bench.git \
       -f Dockerfile.scratch -t acr .

After setting up, we can run ACR in three modes:

GitHub issue mode: Run ACR on a live GitHub issue by providing a link to the issue page.
Local issue mode: Run ACR on a local repository and a file containing the issue description.
SWE-bench mode: Run ACR on SWE-bench task instances.

[GitHub issue mode] Set up and run on new GitHub issues

If you want to use AutoCodeRover for new GitHub issues in a project, prepare the following:

Link to clone the project (used for git clone ...).
Commit hash of the project version for AutoCodeRover to work on (used for git checkout ...).
Link to the GitHub issue page.

Then, in the docker container (or your local copy of AutoCodeRover), run the following commands to set up the target project and generate patch:

cd /opt/auto-code-rover
conda activate auto-code-rover
PYTHONPATH=. python app/main.py github-issue --output-dir output --setup-dir setup --model gpt-4o-2024-05-13 --model-temperature 0.2 --task-id <task id> --clone-link <link for cloning the project> --commit-hash <any version that has the issue> --issue-link <link to issue page>

Here is an example command for running ACR on an issue from the langchain GitHub issue tracker:

PYTHONPATH=. python app/main.py github-issue --output-dir output --setup-dir setup --model gpt-4o-2024-05-13 --model-temperature 0.2 --task-id langchain-20453 --clone-link https://github.com/langchain-ai/langchain.git --commit-hash cb6e5e5 --issue-link https://github.com/langchain-ai/langchain/issues/20453

The <task id> can be any string used to identify this issue.

If patch generation is successful, the path to the generated patch will be printed in the end.

Web UI is also provided for visualization of the issue fixing process. In the docker shell, run the following command:

cd /opt/auto-code-rover/demo_vis/
bash run.sh

then open the url localhost:3000 in the web explorer.

[Local issue mode] Set up and run on local repositories and local issues

Instead of cloning a remote project and run ACR on an online issue, you can also prepare the local repository and issue beforehand, if that suits the use case.

For running ACR on a local issue and local codebase, prepare a local codebase and write an issue description into a file, and run the following commands:

cd /opt/auto-code-rover
conda activate auto-code-rover
PYTHONPATH=. python app/main.py local-issue --output-dir output --model gpt-4o-2024-05-13 --model-temperature 0.2 --task-id <task id> --local-repo <path to the local project repository> --issue-file <path to the file containing issue description>

If patch generation is successful, the path to the generated patch will be printed in the end.

[SWE-bench mode] Set up and run on SWE-bench tasks

This mode is for running ACR on existing issue tasks contained in SWE-bench.

Set up

In the docker container, we need to first set up the tasks to run in SWE-bench (e.g., django__django-11133). The list of all tasks can be found in conf/swe_lite_tasks.txt.

The tasks need to be put in a file, one per line:

cd /opt/SWE-bench
echo django__django-11133 > tasks.txt

Or if running on arm64 (e.g. Apple silicon), try this one which doesn't depend on Python 3.6 (which isn't supported in this env):

echo django__django-16041 > tasks.txt

Then, set up these tasks by running:

cd /opt/SWE-bench
conda activate swe-bench
python harness/run_setup.py --log_dir logs --testbed testbed --result_dir setup_result --subset_file tasks.txt

Once the setup for this task is completed, the following two lines will be printed:

setup_map is saved to setup_result/setup_map.json
tasks_map is saved to setup_result/tasks_map.json

The testbed directory will now contain the cloned source code of the target project. A conda environment will also be created for this task instance.

If you want to set up multiple tasks together, put their ids in tasks.txt and follow the same steps.

Run a single task in SWE-bench

Before running the task (django__django-11133 here), make sure it has been set up as mentioned above.

cd /opt/auto-code-rover
conda activate auto-code-rover
PYTHONPATH=. python app/main.py swe-bench --model gpt-4o-2024-05-13 --setup-map ../SWE-bench/setup_result/setup_map.json --tasks-map ../SWE-bench/setup_result/tasks_map.json --output-dir output --task django__django-11133

The output of the run can then be found in output/. For example, the patch generated for django__django-11133 can be found at a location like this: output/applicable_patch/django__django-11133_yyyy-MM-dd_HH-mm-ss/extracted_patch_1.diff (the date-time field in the directory name will be different depending on when the experiment was run).

Run multiple tasks in SWE-bench

First, put the id's of all tasks to run in a file, one per line. Suppose this file is tasks.txt, the tasks can be run with

cd /opt/auto-code-rover
conda activate auto-code-rover
PYTHONPATH=. python app/main.py swe-bench --model gpt-4o-2024-05-13 --setup-map ../SWE-bench/setup_result/setup_map.json --tasks-map ../SWE-bench/setup_result/tasks_map.json --output-dir output --task-list-file /opt/SWE-bench/tasks.txt

NOTE: make sure that the tasks in tasks.txt have all been set up in SWE-bench. See the steps above.

Using a config file

Alternatively, a config file can be used to specify all parameters and tasks to run. See conf/vanilla-lite.conf for an example. Also see EXPERIMENT.md for the details of the items in a conf file. A config file can be used by:

python scripts/run.py conf/vanilla-lite.conf

Using a different model

AutoCodeRover works with different foundation models. You can set the foundation model to be used with the --model command line argument.

The current list of supported models:

	Model	AutoCodeRover cmd line argument
OpenAI	gpt-4o-2024-08-06	--model gpt-4o-2024-08-06
	gpt-4o-2024-05-13	--model gpt-4o-2024-05-13
	gpt-4-turbo-2024-04-09	--model gpt-4-turbo-2024-04-09
	gpt-4-0125-preview	--model gpt-4-0125-preview
	gpt-4-1106-preview	--model gpt-4-1106-preview
	gpt-3.5-turbo-0125	--model gpt-3.5-turbo-0125
	gpt-3.5-turbo-1106	--model gpt-3.5-turbo-1106
	gpt-3.5-turbo-16k-0613	--model gpt-3.5-turbo-16k-0613
	gpt-3.5-turbo-0613	--model gpt-3.5-turbo-0613
	gpt-4-0613	--model gpt-4-0613
Anthropic	Claude 3.5 Sonnet	--model claude-3-5-sonnet-20240620
	Claude 3 Opus	--model claude-3-opus-20240229
	Claude 3 Sonnet	--model claude-3-sonnet-20240229
	Claude 3 Haiku	--model claude-3-haiku-20240307
Meta	Llama 3 70B	--model llama3:70b
	Llama 3 8B	--model llama3
AWS	Claude 3 Opus	--model bedrock/anthropic.claude-3-opus-20240229-v1:0
	Claude 3 Sonnet	--model bedrock/anthropic.claude-3-sonnet-20240229-v1:0
	Claude 3 Haiku	--model bedrock/anthropic.claude-3-haiku-20240307-v1:0
Groq	Llama 3 8B	--model groq/llama3-8b-8192
	Llama 3 70B	--model groq/llama3-70b-8192
	Llama 2 70B	--model groq/llama2-70b-4096
	Mixtral 8x7B	--model groq/mixtral-8x7b-32768
	Gemma 7B	--model groq/gemma-7b-it

Note

Using the Groq models on a free plan can cause the context limit to be exceeded, even on simple issues.

Note

Some notes on running ACR with local models such as llama3:

Before using the llama3 models, please install ollama and download the corresponding models with ollama (e.g. ollama pull llama3).
You can run ollama server on the host machine, and ACR in its container. ACR will attempt to communicate to the ollama server on host.
If your setup is ollama in host + ACR in its container, we recommend installing Docker Desktop on the host, in addition to the Docker Engine.
- Docker Desktop contains Docker Engine, and also has a virtual machine which makes it easier to access the host ports from within a container. With Docker Desktop, this setup will work without additional effort.
- When the docker installation is only Docker Engine, you may need to add either --net=host or --add-host host.docker.internal=host-gateway to the docker run command when starting the ACR container, so that ACR can communicate with the ollama server on the host machine.

Experiment Replication

Please refer to EXPERIMENT.md for information on experiment replication.

✉️ Contacts

For any queries, you are welcome to open an issue.

Alternatively, contact us at: {yuntong,hruan,zhiyufan}@comp.nus.edu.sg.

Acknowledgements

This work was partially supported by a Singapore Ministry of Education (MoE) Tier 3 grant "Automated Program Repair", MOE-MOET32021-0001.

auto-code-rover's People

Contributors

Stargazers

Watchers

Forkers

karbon0x mrgsub everhusk touristshaun mivanovitch ram-parthiban veryvanya 1jsingh ytbryan rahulchhabra07 azure-arc-0 maddyonline mz0in a43501 rkunnamp segmond wjkennedy mkrupczak3 polya20 aiexandr eltociear id-2 eugenepyvovarov theycallmeloki bermanboris krish240574 ashishbijlani johndpope tomchapin hasokeric nlile mdwoicke yacineali74 mindcrime-forks corner-3-partners wizkid1968 plurigrid johnunnaki autominds shivamms andriy-safe-ai andersonamaral2 codeaudit xc0r qml-coder octag0no stophobia jaedukseo guillermo-ayala moxmoussa hp2413 paul-pham-157 sikkgit tecworks-dev cedrickchee primedeviation vikyw89 grimmcrows xiechengmude beimingmaster techthiyanes xcytxs sizzles aidev-pt lee-b cocobeach apollohuang1 adamkane f00lg0ldl0af cygwynd svorwerk-flextg dagelf jiyangzhang dependify kumar045 tmendoza darksidesfear jjhw gitdakky flyingbearhk filipeborges1993 alberto-codes kezekwem codeplexer sunwood-ai-labs martianbandit paperwave elysia090 savvibrax zzmjohn asdlei99 johnny-rice autonoma-pocs zuxfoucault thenornet58 haemate63 sporksenet-horatorbr i-bundersl nicsysca69 humanshangcottonhope

auto-code-rover's Issues

Adding a support for Cohere Command-R, Anthropic Claude, and Gemini APIs

Hey, I would like to suggest support for integrating additional language model APIs beyond just OpenAI. Specifically, it would be very helpful to have the ability to use:

Cohere API (including recent Command-R model with amazing retrieval-augmented generation)
Anthropic API (Claude model)
Google Gemini API
Ollama Local LLMs (for the sake of those, who can't share the code cuz of the NDA)

These models rank among the top 10 AI language models according to benchmarks like https://chat.lmsys.org/ and provide capabilities complementary to OpenAI's models.

The recent Command-R model from Cohere is particularly compelling for its strong retrieval-augmented capabilities using its embeddings. And the Claude model from Anthropic has received acclaim for its coherence and abilities to code.

Having this flexibility would be incredibly valuable. Would be amazing if you consider adding it!

Unable to add issue from github

Hi so I added a custom issue (the one not in the conf/swe_lite_tasks.txt file ) from GitHub and was getting this error

vercel__next.js-64413

2024-04-12 17:11:58,722 - INFO - env_name for all setup entries: [] 2024-04-12 17:11:58,722 - INFO - No setup needed..

So what should I do?

docker install error

This is a very good project, but when I install docker according to the readme, it doesn't work properly. It's a pity. Is there any solution?

Make it easier to install

Can you make something like SWE-Agent where you can just run a simple command inside WSL with the neccessary infos like Model, API Key and Link to the Issue on Github?

Request for Explanation on Restriction of OpenAI Parallel Tool Calls

Hello,

I have noticed that in the code, the project restricts the use of OpenAI parallel tool calls. Specifically, when using the OpenAI function calling, the agent can only make one function call at a time. Could you please provide some insight into the reason behind this restriction?

Thank you for your time and assistance.

A WORD TO THE WISE IS ENOUGH

PLEASE IMPLEMENT LLM INFERENCE USING LITE LLM OR THE PROJECT CANNOT GROW EFFICIENTLY.

Comments on PullRequestBenchmark?

Hello!

I am the author of PullRequestBenchmark and I am wondering if you have any thoughts on that?

Best Regards

How to replay output?

In this example.mp4 file, it replay the planing and reasoning trajectories, but I can't find that mode in main execution file.

Can you show me some pointers to replay it?

Thanks

Ollama support issue

When testing the llama3 model and ollama, I encountered an error indicating that communication with the ollama server is unreachable:

httpx.ConnectError: [Errno 111] Connection refused

This issue arises because ollama.chat(model=self.name, messages=[]) invokes chat = _client.chat (located in site-packages/ollama/init.py), where _client = Client(). The Client() constructor defaults to 'http://localhost:11434', which, within a Docker container, refers to the container itself rather than the host machine, while I install ollama in the host.

To resolve this, I propose two options:

Update the README: Suggest that ollama should be installed within the same Docker container as the agent. This approach requires users to configure a GPU environment within the container if they wish to utilize GPU capabilities for running llama3, which might be cumbersome.
Host Installation with Custom Client Configuration: Recommend installing ollama on the host machine. Use client.chat where client = Client(host='http://host.docker.internal:11434'). Here, host.docker.internal points to the host within the Docker network.

I hope the maintainers acknowledge this issue. Considering that llama3 is a cost-effective option, its popularity is likely to increase, potentially affecting many users with this connectivity problem.

How to make it work without anaconda

Hi , Anaconda is blocked for my department, can I get any solution which can help me to use it just using python.

[Fresh issue mode] Support local codebase and local issue file

Currently, the fresh issue mode only supports cloning a remote project and working on issues from GitHub links. Sometimes one may want to pre-download the codebase and write the issue description in a file instead.

open source the latest version of the code

Great work! Are there any plans to open source the latest version of the code? Thanks!

Can AutoCodeRover execute the code files?

I am really interested to know if AutoCodeRover can execute the files once new patch is created?
Can it commit code?

Add easy web based usage

A lower barrier to entry would enhance the adoption and usefulness of the project. Consider the following scenarios that a repo owner might want:

Adding a link to the repo that users could follow to an instance pre-configured for working on the repo.
Adding a link to an issue that users could follow to an instance configured to work the issue.

These links could be added manually or by bots.

There are lots of potential options for where instances might run -- Replit, Colab, Github, ...
The important thing is that more people will have the time and ability to use SWE-agent for working issues if doing so is as simple as possible. That is true even if they need to have a paid account somewhere to use the link.

Some requests

if possible, please add the way to evaluate swe-bench-verified.
did you use other model like LLama 3.1 or Claude as backbone, if so, could you please release the result
Thanks a lot !

Context window limitation

One of the major issues with ai programming is the context window. Agents perform better with 128k+ window models compared to 4k or 8k. Does ACR have the same issue as well? Gemma-2-27b-it is great and performs as well as gpt-4 variants. However, it has an 8k context window. 30% score on swe-bench lite is with the gpt-4o, which is a 128k model. Is it possible to get +- 25% score level performance with Gemma an 8k model?

Expose openai.base_url for access to more models

It would be great if openai.base_url was exposed so that nearly any model could be loaded in via lm-studio or other OpenAI-like APIs.

Create a Discord group （创建一个 Discord 讨论组）

Great job! How about creating a Discord discussion group so the community can have real-time discussions?
Just like https://github.com/princeton-nlp/SWE-agent?tab=readme-ov-file#-contributions-

Question about Auto Code Rover SWE-bench data

I am planning on writing an article on Auto Code Rover and I was wondering if you could tell me about the format of the SWE-bench test results in: https://github.com/nus-apr/auto-code-rover/tree/main/results/swe-agent-results
How am I to interpret the results in this directory? Specifically for Devin they formatted diffs for their SWE-bench run into separate pass/fail directories: https://github.com/CognitionAI/devin-swebench-results/tree/main/output_diffs
How is this done for your results? Thanks in advance and thanks for publishing your work.

-Harry

missing rich library

missing rich library , when I use pip install rich in container, it's solved .

PYTHONPATH=. python app/main.py swe-bench --model gpt-4-0125-preview --setup-map ../SWE-bench/setup_result/setup_map.json --tasks-map ../SWE-bench/setup_result/tasks_map.json --output-dir output --task django__django-11133
Traceback (most recent call last):
File "/opt/auto-code-rover/app/main.py", line 16, in
from app import globals, globals_mut, inference, log
File "/opt/auto-code-rover/app/inference.py", line 11, in
from app.api.manage import ProjectApiManager
File "/opt/auto-code-rover/app/api/manage.py", line 12, in
from app import log
File "/opt/auto-code-rover/app/log.py", line 5, in
from rich.console import Console
ModuleNotFoundError: No module named 'rich'
(auto-code-rover) root@aa1d1cf79120:/opt/auto-code-rover# pip install rich

Why have you implemented your tool from scratch instead of using existing frameworks like AutoGPT or Baby AGI?

I noticed that the AutoCodeRover has been implemented from scratch. There are several existing frameworks, such as AutoGPT and Baby AGI, that provide robust functionality for creating LLM-based agents. These frameworks could potentially save development time and leverage existing solutions for common challenges.

Could you please provide more details on the rationale behind the decision to develop this from scratch? Specifically, I am curious to know:

What specific requirements or goals led you to choose a custom implementation over existing frameworks?
Were there any limitations or shortcomings in AutoGPT or Baby AGI that influenced this decision?
How does the custom implementation of SW-Agent compare to these frameworks in terms of performance, scalability, and maintainability?
Are there any plans to integrate features from these frameworks in the future?

Understanding these points would be really helpful in appreciating the design choices and the potential advantages of the custom implementation.

Thank you!

Possible underestimated pass@3 results

I have evaluated your predictions using my Docker based swe-bench evaluator. I achieve 26% on pass@3 compared to the 22% you reported. It might be worthwhile to review the logs for the failed benchmarks to see if your agent can actually achieve even better results :D

You find the logs and report here

And here's a sheet I use to compare the results.

Leveraging Compile/Build/Type Errors to inform and improve success metrics

Just wanted to say amazing work here! ACR is one of the few projects I saw which implemented the capability of leveraging test cases which I think is incredible and crucial to having truly autonomous code generation & tooling. I did have some questions/thoughts and wanted to know what the future looks like for ACR.

Questions

What languages are supported for the AST? Is it just python or is there planned support for other languages as well? For reference repo maps where AST are constructed is done with aider https://aider.chat/2023/10/22/repomap.html which uses tree-sitter for supporting many languages
Any plans for discriminators/MCTS?
- https://github.com/zhentingqi/rStar
- Actor Critic Regenerator pattern https://huggingface.co/LoneStriker/HelixNet-critic-4.0bpw-h6-exl2
- Tree of thought patterns via Chain of Thought https://x.com/Arcturus_f/status/1739762147859525835/photo/1 combined with Actor-Generator-Discriminators for Graph of Thought?
- From 10 mo ago but "reflecting" between two models improved quality https://www.reddit.com/r/LocalLLaMA/comments/180uz42/today_is_the_first_day_im_getting_results/
Any plans for selecting a model based on language? It appears that some languages are better than others for certain languages (Gemma is quite good a typescript despite being a much smaller model than others)

Project Organization

I had a few questions/ideas and I was wondering if this is on the roadmap, if maybe it would be possible to see the roadmap/if there is one on Github (can be done via github projects https://docs.github.com/en/issues/planning-and-tracking-with-projects/customizing-views-in-your-project/customizing-the-roadmap-layout )

Requests

Can options be added to ensure that any code added (feature or fix) includes corresponding generated test cases? (Maybe generate tests based on spec and then code is generated on spec and matched against the code?)
There are tools in the case of javascript where if you run jest test --coverage (coverage in python does the same https://coverage.readthedocs.io/en/7.6.1/ ) you get a coverage report of all code where the lines not covered by tests are surfaced automatically. Can this be used as an input to increase test coverage of the codebase? (i.e. it can see all uncovered lines of code and then generate test cases, ensuring this way any new code added cannot introduce unknown bugs)
In the case of python mypy can be used to add types and running mypy can ensure if code adheres to the type spec provided. This also exists within Typescript. Can the outputs of running mypy (for python) or eventually typescript/biome errors be used as inputs alongside the tests to improve code return quality?
Build/compile time errors used as inputs as well whether it's Swift/Rust etc? For anyone not familiar in many languages such as Swift or Rust, in the former build errors are provided by the language server/xcode/compiler that tells you why something won't work with an error alongside suggested changes to fix it.

Really appreciate all the work done here and excited to see if there is any way to contribute. If and when autocoderover advances more....it can be used to improve autocoderover itself :)

Evaluation new models

Hello, I see you added new supported models. Can you provide an evaluation of them on SWE-bench so that it can be compared with the evaluations already done?

Thank you

how to interpret results for Auto Code Rover SWE-bench?

I am trying to understand results for Auto Code Rover and SWE-Agent.

Can you please let me know the format of the SWE-Agent test results in:
https://github.com/nus-apr/auto-code-rover/tree/main/results/swe-agent-results

What are all these cost_2_1, cost_2_2, and cost_2_3?

How can I to understand the results in this directory?

Also for Auto Code Reover, I see acr-run-1, acr-run-2, acr-run-3. Which one should I take? Which result are you reporting in the paper?

Docker image fails to build (M1 Mac)

I am trying to get ACR running on my local machine but the Docker image (Dockerfile.scratch since I am on Apple Silicon) will not build.

First error:

$ docker build -f Dockerfile.scratch -t acr .
(...)
2.048 E: Package 'python-tk' has no installation candidate
------
Dockerfile.scratch:10
--------------------
   9 |     
  10 | >>> RUN apt update && apt install -y \
  11 | >>>     git wget vim \
  12 | >>>     libffi-dev python3-pytest pkg-config build-essential libssl-dev \
  13 | >>>     libfreetype6-dev libqhull-dev \
  14 | >>>     texlive cm-super dvipng python-tk ffmpeg \
  15 | >>>     imagemagick fontconfig ghostscript inkscape graphviz \
  16 | >>>     optipng fonts-comic-neue  python3-pikepdf
  17 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c apt update && apt install -y     git wget vim     libffi-dev python3-pytest pkg-config build-essential libssl-dev     libfreetype6-dev libqhull-dev     texlive cm-super dvipng python-tk ffmpeg     imagemagick fontconfig ghostscript inkscape graphviz     optipng fonts-comic-neue  python3-pikepdf" did not complete successfully: exit code: 100

On Apple Silicon it seems tkinter is no longer installable via pip, but is bundled with Python (unless I'm misunderstanding something). I verified that I do have python-tk on my machine, so I removed the dependency from the apt install hoping that would fix the issue.

In any case, I get a different error now:

$ docker build -f Dockerfile.scratch -t acr .
(...)
=> [ 7/11] RUN conda env create -f environment.yml                                                                                               29.4s 
 => [ 8/11] RUN ln -sf /bin/bash /bin/sh                                                                                                           0.3s 
 => [ 9/11] COPY . /opt/auto-code-rover                                                                                                            6.2s 
 => [10/11] WORKDIR /opt/auto-code-rover                                                                                                           0.0s 
 => ERROR [11/11] RUN conda env create -f environment.yml                                                                                          9.6s 
------                                                                                                                                                  
 > [11/11] RUN conda env create -f environment.yml:                                                                                                     
0.663 Channels:                                                                                                                                         
0.663  - conda-forge                                                                                                                                    
0.663  - defaults                                                                                                                                       
0.663 Platform: linux-aarch64                                                                                                                           
0.663 Collecting package metadata (repodata.json): ...working... done
4.591 Solving environment: ...working... failed
5.119 Channels:
5.119  - conda-forge
5.119  - defaults
5.119 Platform: linux-aarch64
5.119 Collecting package metadata (repodata.json): ...working... done
9.014 Solving environment: ...working... failed
9.533 
9.533 LibMambaUnsatisfiableError: Encountered problems while solving:
9.533   - package unidiff-0.7.5-py38he3eb160_0 requires python >=3.8,<3.9.0a0 *_cpython, but none of the providers can be installed
9.533 
9.533 Could not solve for environment specs
9.533 The following packages are incompatible
9.533 ├─ libuuid 1.41.5**  is requested and can be installed;
9.533 ├─ python 3.11.7**  is installable with the potential options
9.533 │  ├─ python [3.10.11|3.10.12|...|3.9.19] would require
9.533 │  │  └─ libuuid >=2.38.1,<3.0a0 , which conflicts with any installable versions previously reported;
9.533 │  └─ python 3.11.7, which can be installed;
9.533 ├─ unidiff 0.7.5**  is installable with the potential options
9.533 │  ├─ unidiff 0.7.5 would require
9.533 │  │  └─ python >=3.10,<3.11.0a0 *_cpython but there are no viable options
9.533 │  │     ├─ python [3.10.0|3.10.1|...|3.9.9] would require
9.533 │  │     │  └─ libuuid >=2.32.1,<3.0a0 , which conflicts with any installable versions previously reported;
9.533 │  │     └─ python [3.10.11|3.10.12|...|3.9.19], which cannot be installed (as previously explained);
9.533 │  ├─ unidiff 0.7.5 would require
9.533 │  │  └─ python >=3.11,<3.12.0a0 *_cpython with the potential options
9.533 │  │     ├─ python [3.10.0|3.10.1|...|3.9.9], which cannot be installed (as previously explained);
9.533 │  │     ├─ python [3.10.11|3.10.12|...|3.9.19], which cannot be installed (as previously explained);
9.533 │  │     └─ python 3.11.0 would require
9.533 │  │        └─ xz >=5.2.6,<5.3.0a0 , which can be installed;
9.533 │  ├─ unidiff 0.7.5 would require
9.533 │  │  └─ python >=3.12,<3.13.0a0 *_cpython, which cannot be installed (as previously explained);
9.533 │  ├─ unidiff 0.7.5 would require
9.533 │  │  └─ python >=3.8,<3.9.0a0 *_cpython but there are no viable options
9.533 │  │     ├─ python [3.8.10|3.8.12|...|3.8.8] conflicts with any installable versions previously reported;
9.533 │  │     ├─ python [3.10.0|3.10.1|...|3.9.9], which cannot be installed (as previously explained);
9.533 │  │     └─ python [3.10.11|3.10.12|...|3.9.19], which cannot be installed (as previously explained);
9.533 │  ├─ unidiff 0.7.5 would require
9.533 │  │  └─ python_abi 3.9 *_pypy39_pp73, which requires
9.533 │  │     └─ python 3.9.* *_73_pypy, which conflicts with any installable versions previously reported;
9.533 │  └─ unidiff 0.7.5 would require
9.533 │     └─ python >=3.9,<3.10.0a0  with the potential options
9.533 │        ├─ python [3.10.0|3.10.1|...|3.9.9], which cannot be installed (as previously explained);
9.533 │        ├─ python [3.10.11|3.10.12|...|3.9.19], which cannot be installed (as previously explained);
9.533 │        ├─ python [3.9.0|3.9.1|...|3.9.7] conflicts with any installable versions previously reported;
9.533 │        └─ python 3.9.19 would require
9.533 │           └─ xz >=5.4.6,<6.0a0 , which can be installed;
9.533 └─ xz 5.4.5**  is not installable because it conflicts with any installable versions previously reported.
9.533 
------
Dockerfile.scratch:30
--------------------
  28 |     COPY . /opt/auto-code-rover
  29 |     WORKDIR /opt/auto-code-rover
  30 | >>> RUN conda env create -f environment.yml
  31 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c conda env create -f environment.yml" did not complete successfully: exit code: 1

I am not sure what the problem is. I have Python 3.12.2 installed as my default version, but the ACR README doesn't specify a requirement on a particular Python version?

Anyway, any help would be most appreciated.

How to select the api？

What exactly should I do to select the api, is it via command line or shortcut?

Why not generate the JSON with api calls directly in the first inference?

Is there a reason why you are doing 2 inferences instead of only 1 to analyze and generate the json with api calls?
With a good prompt, we could do both together and save costs and noise.
Do you want me to send a PR?

Broken image yuntongzhang/swe-bench:latest

Not possible to install libs to the image

$DPKG_HOOK_ACTION" = remove-architecture; } && test -x /usr/share/pkg-config-dpkghook; then /usr/share/pkg-config-dpkghook update; fi', exit code 32512
E: Sub-process /usr/bin/dpkg returned an error code (2)
E: Problem executing scripts DPkg::Post-Invoke 'if [ -d /var/lib/update-notifier ]; then touch /var/lib/update-notifier/dpkg-run-stamp; fi; /usr/lib/update-notifier/update-motd-updates-available 2>/dev/null || true'
E: Sub-process returned an error code
The command '/bin/sh -c apt install -y vim build-essential libssl-dev' returned a non-zero code: 100

License issue

Great work on your agent. What do you think about changing the license to something like MIT or Apache 2.0? Unfortunately, a lot of people will not be able to use this due to the GPL license.

Question on how to obtain different repo versions

Thank you for developing and maintaining this inspiring project!

I'm using harness/run_setup.py to obtain different versions of a repository (e.g., Django) for testing but noticed the clone_repo function in harness/utils.py doesn't switch to specific branches/tags. This results in always getting the latest version of the codebase. Is there a way to clone a repo's specific versions (e.g., tags) using the current setup, or did I miss something?

I am looking forward to your help and thanks again!

Not working

(base) root@26c024020254:/opt/auto-code-rover# cd /opt/SWE-bench
(base) root@26c024020254:/opt/SWE-bench# echo opendevin__ssh-connection-issue-911 > tasks.txt
(base) root@26c024020254:/opt/SWE-bench# conda activate swe-bench
(swe-bench) root@26c024020254:/opt/SWE-bench# python harness/run_setup.py --log_dir logs --testbed testbed --result_dir setup_result --subset_file tasks.txt
2024-04-09 09:11:13,566 - INFO - env_name for all setup entries: []
2024-04-09 09:11:13,566 - INFO - No setup needed.
(swe-bench) root@26c024020254:/opt/SWE-bench# cd /opt/auto-code-rover
(swe-bench) root@26c024020254:/opt/auto-code-rover# conda activate auto-code-rover
(auto-code-rover) root@26c024020254:/opt/auto-code-rover# PYTHONPATH=. python app/main.py --enable-layered --model gpt-4-0125-preview --setup-map /opt/SWE-bench/setup_result/setup_map.json --tasks-map /opt/SWE-bench/setup_result/tasks_map.json --output-dir /mnt/c/Users/pierr/output --task opendevin__ssh-connection-issue-911
Traceback (most recent call last):
File "/opt/auto-code-rover/app/main.py", line 477, in
main()
File "/opt/auto-code-rover/app/main.py", line 399, in main
with open(setup_map_file, "r") as f:
^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/opt/SWE-bench/setup_result/setup_map.json'
(auto-code-rover) root@26c024020254:/opt/auto-code-rover#

swe_agent_rep

Hello,
In your paper, how do you run swe agent in your docker env? I saw the comparison between your docker env and theirs.
Thank you

Why not listed in the swe-bench leaderboard

See: https://github.com/swe-bench/experiments/tree/main

Docker compose file needed

For easy and better selfhosting docker compose is needed for this awesome project ;-))

PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: '...\\info.log'

On Win32 shutil.move() fails because the logger handlers are kept open:

(auto-code-rover) C:\Users\kripp\source\repos\auto-code-rover>python app/main.py --enable-layered --model gpt-4-0125-preview --setup-map ../SWE-bench/setup_result/setup_map.json --tasks-map ../SWE-bench/setup_result/tasks_map.json --output-dir output --task django__django-11133

[2024-04-13 09:39:49] Total number of tasks: 1

[2024-04-13 09:39:49] Total number of processes: 1

[2024-04-13 09:39:49] Task group info: (number of groups: 1)

[2024-04-13 09:39:49]   setup_django__django__3.0: 1 tasks

[2024-04-13 09:39:49] Running in single process mode.

[2024-04-13 09:39:49] ============= Running task django__django-11133 =============
Error running command: ['git', 'apply', 'C:\\Users\\kripp\\source\\repos\\SWE-bench\\testbed\\django__django\\setup_django__django__3.0\\swe_bench_tests.patch'], Command '['git', 'apply', 'C:\\Users\\kripp\\source\\repos\\SWE-bench\\testbed\\django__django\\setup_django__django__3.0\\swe_bench_tests.patch']' returned non-zero exit status 1.

[2024-04-13 09:39:50] Finished all tasks sequentially.

[2024-04-13 09:39:50] Post-processing completed experiment results.

[2024-04-13 09:39:50] SWE-Bench input file created: C:\Users\kripp\source\repos\auto-code-rover\output\predictions_for_swebench.json

(auto-code-rover) C:\Users\kripp\source\repos\auto-code-rover>python app/main.py --enable-layered --model gpt-4-0125-preview --setup-map ../SWE-bench/setup_result/setup_map.json --tasks-map ../SWE-bench/setup_result/tasks_map.json --output-dir output --task django__django-11133

[2024-04-13 09:40:14] Total number of tasks: 1

[2024-04-13 09:40:14] Total number of processes: 1

[2024-04-13 09:40:14] Task group info: (number of groups: 1)

[2024-04-13 09:40:14]   setup_django__django__3.0: 1 tasks

[2024-04-13 09:40:14] Running in single process mode.

[2024-04-13 09:40:14] ============= Running task django__django-11133 =============
Error running command: ['git', 'apply', 'C:\\Users\\kripp\\source\\repos\\SWE-bench\\testbed\\django__django\\setup_django__django__3.0\\swe_bench_tests.patch'], Command '['git', 'apply', 'C:\\Users\\kripp\\source\\repos\\SWE-bench\\testbed\\django__django\\setup_django__django__3.0\\swe_bench_tests.patch']' returned non-zero exit status 1.

[2024-04-13 09:40:15] Finished all tasks sequentially.

[2024-04-13 09:40:15] Post-processing completed experiment results.
Traceback (most recent call last):
  File "C:\Users\kripp\miniconda3\envs\auto-code-rover\Lib\shutil.py", line 886, in move
    os.rename(src, real_dst)
PermissionError: [WinError 5] Access is denied: 'C:\\Users\\kripp\\source\\repos\\auto-code-rover\\output\\django__django-11133_2024-04-13_09-40-14' -> 'C:\\Users\\kripp\\source\\repos\\auto-code-rover\\output\\no_patch\\django__django-11133_2024-04-13_09-40-14'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\kripp\source\repos\auto-code-rover\app\main.py", line 477, in <module>
    main()
  File "C:\Users\kripp\source\repos\auto-code-rover\app\main.py", line 472, in main
    swe_input_file = organize_and_form_input(globals.output_dir)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kripp\source\repos\auto-code-rover\app\post_process.py", line 477, in organize_and_form_input
    organize_experiment_results(expr_dir)
  File "C:\Users\kripp\source\repos\auto-code-rover\app\post_process.py", line 275, in organize_experiment_results
    shutil.move(task_dir, corresponding_dir)
  File "C:\Users\kripp\miniconda3\envs\auto-code-rover\Lib\shutil.py", line 904, in move
    rmtree(src)
  File "C:\Users\kripp\miniconda3\envs\auto-code-rover\Lib\shutil.py", line 820, in rmtree
    return _rmtree_unsafe(path, onexc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\kripp\miniconda3\envs\auto-code-rover\Lib\shutil.py", line 648, in _rmtree_unsafe
    onexc(os.unlink, fullname, err)
  File "C:\Users\kripp\miniconda3\envs\auto-code-rover\Lib\shutil.py", line 646, in _rmtree_unsafe
    os.unlink(fullname)
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'C:\\Users\\kripp\\source\\repos\\auto-code-rover\\output\\django__django-11133_2024-04-13_09-40-14\\info.log'

Add a new mode that only outputs fix locations

Sometimes we do not want to run the entire workflow to generate a patch.

Add a new mode to exit after context retrieval stage, and output the visited locations (files, classes, methods) in a single file for users to inspect.

In addition, this mode should allow specifying a max context retrieval round number. Once that number is hit, the workflow should also be aborted. This is to avoid running too many rounds and to control the cost.

nus-apr / auto-code-rover Goto Github PK

auto-code-rover's Introduction

AutoCodeRover: Autonomous Program Improvement

📣 Updates

Discord - server for general discussion, questions, and feedback.

👋 Overview

✨ Highlights

🗎 arXiv Paper

AutoCodeRover: Autonomous Program Improvement [arXiv 2404.05427]

✔️ Example: Django Issue #32347

Enhancement: leveraging test cases

🚀 Setup & Running

Setup API key and environment

[GitHub issue mode] Set up and run on new GitHub issues

[Local issue mode] Set up and run on local repositories and local issues

[SWE-bench mode] Set up and run on SWE-bench tasks

Set up

Run a single task in SWE-bench

Run multiple tasks in SWE-bench

Using a config file

Using a different model

Experiment Replication

✉️ Contacts

Acknowledgements

auto-code-rover's People

Contributors

Stargazers

Watchers

Forkers

auto-code-rover's Issues

Recommend Projects

Recommend Topics

Recommend Org