Comments (5)
I'm down to 2 failing tests now in pydata/xarray 0.12. I probably need to compare to logs from a successful run to fix those effectively.
I'm also testing the testbeds for the regular (full) dataset using the check-harness predictions now.
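For anyone doing the same comparison, here's a minimal sketch of diffing a failing run's log against a passing run's log with Python's difflib. The paths are placeholders, not the harness's actual log layout:

```python
import difflib
from pathlib import Path

# Placeholder paths: point these at the per-instance evaluation logs
# your harness writes (names here are illustrative only).
good_log = Path("logs/pydata__xarray-3364.passing.eval.log")
bad_log = Path("logs/pydata__xarray-3364.failing.eval.log")

# Produce a unified diff so differing test output and setup steps
# stand out between the two runs.
diff = difflib.unified_diff(
    good_log.read_text().splitlines(keepends=True),
    bad_log.read_text().splitlines(keepends=True),
    fromfile=str(good_log),
    tofile=str(bad_log),
)
print("".join(diff))
```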
I'll chime in that @aorwall's Docker images and `run_evaluation.py` script have worked very well for me. I was able to run ~all of the "lite" tests without problems, whereas with the original conda testbeds, most tests of the gold patches were failing to build or pass.
Also, the Docker testbeds launch and execute very quickly compared to re-building the conda testbeds.
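For reference, a minimal sketch of how such a run can be driven from Python. The flag names and dataset id below are assumptions based on my reading of the SWE-bench-docker README and may have changed; verify them with `python run_evaluation.py --help` before relying on this:

```python
import subprocess

# Sketch of invoking SWE-bench-docker's run_evaluation.py.
# All flag names and values here are assumptions, not a verified API.
cmd = [
    "python", "run_evaluation.py",
    "--predictions_path", "predictions.json",             # your model's predictions
    "--swe_bench_tasks", "princeton-nlp/SWE-bench_Lite",  # assumed dataset id
    "--namespace", "aorwall",                             # Docker Hub namespace for the images
    "--log_dir", "logs/",
]
subprocess.run(cmd, check=True)
```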
"~all" of the lite tests, meaning not quite all? I've been struggling to get much to run.
I got all except for pydata__xarray-4094 and pydata__xarray-4493 to run.
@PandelisZ sorry, I should have been clearer: I got 298 out of 300 test cases to work out of the box with @aorwall's dockerized SWE-bench-docker tooling. The 2 that fail are known not to work, so that was expected.
With the original/official conda testbeds, I only got a few of the test cases to work, after half a day of trying.
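If you want to reproduce that 298/300 run while skipping the two known-failing instances, here's a minimal sketch that filters them out of a predictions file, assuming the usual SWE-bench predictions format (a JSON list of dicts keyed by instance_id):

```python
import json

# The two instances known not to work, per the comments above.
SKIP = {"pydata__xarray-4094", "pydata__xarray-4493"}

# Assumes predictions are a JSON list of dicts with an "instance_id"
# key; adapt the loading if yours is JSONL instead.
with open("predictions.json") as f:
    preds = json.load(f)

kept = [p for p in preds if p["instance_id"] not in SKIP]
print(f"kept {len(kept)} of {len(preds)} predictions")

with open("predictions.filtered.json", "w") as f:
    json.dump(kept, f, indent=2)
```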
Related Issues (20)
- Dataset field & set up reliable environment
- swe-bench eval stops running after a point
- improve eval performance by caching per-repo/version conda environments
- get_eval_refs doesn't work with a dataset that's been `save_to_disk`'d
- environment is lost when running pip install
- Reproducer Docker image
- Why AutoCodeRover not mentioned?
- Is it possible to evaluate the train set?
- run_live.py: clone_repo() takes 3 positional arguments but 5 were given
- swe-bench eval stops running after a point
- Has anyone successfully ran an eval on patches against early versions of astropy, sympy, scipy etc? I'm really struggling to run things from earlier python versions
- Using `uv pip` instead of `pip` for significant speedup
- How can one participate in the SWE-bench leaderboard?
- What's the best way to browse the SWE-bench dataset?
- how to download one task instance from SWE-bench dataset?
- what's the difference between environment_setup_commit and base_commit?
- `model_name_or_path` is None when running models without adapters, causing an error in `run_evaluation.py`
- Install failed on instances from astropy__astropy
- Clarification Needed on Removal of Instances with Error Message Checks in SWE-bench Lite Dataset