This repository hosts the replication package for our research paper, "CommitBench: A Benchmark for Commit Message Generation". It includes the code, datasets, and tools needed to reproduce our dataset and to explore it further.
- Dataset: Available on Zenodo and Huggingface
- Docker Environment: Available on DockerHub
- Pre-Print: Available on arXiv
- Docker: Ensure Docker is installed on your system.
We provide a prebuilt image on DockerHub, which console.sh uses automatically. To build the Docker image yourself, run:
docker build -t maxscha/commitbench:latest .
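The choice between the published image and a local build can be sketched as a small shell helper. This is only a sketch and not part of the repository: the function names `have_cmd` and `ensure_image` are hypothetical, and it assumes the image tag from the build command above and a Dockerfile in the current directory.

```shell
#!/usr/bin/env sh
# Sketch only: helper names are hypothetical and not part of the repository.

IMAGE="maxscha/commitbench:latest"  # tag taken from the build command above

# Return 0 if the given command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Pull the published image; fall back to a local build if the pull fails.
# Assumes a Dockerfile in the current directory, as the build command implies.
ensure_image() {
  if ! have_cmd docker; then
    echo "Docker is not installed or not on PATH" >&2
    return 1
  fi
  docker pull "$IMAGE" || docker build -t "$IMAGE" .
}
```

Calling `ensure_image` from the repository root would then leave a usable image in place either way, assuming Docker is installed.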
Execute bash console.sh to initiate the container and enter an interactive shell environment with the current working directory mounted into /workspace.
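For readers who want to see roughly what such a launcher involves, the following is a minimal sketch of a container start with the current directory mounted into /workspace. It is an approximation based on the description above, not a copy of console.sh: the function name `start_console` and the exact flags (`--rm`, `-it`, `-w`) are assumptions.

```shell
#!/usr/bin/env sh
# Sketch only: approximates the behavior described for console.sh.
# The real script may use different flags.

IMAGE="maxscha/commitbench:latest"  # tag from the docker build command
MOUNT_POINT="/workspace"            # mount target named in the text

# Start an interactive shell in the container with the current directory
# mounted at MOUNT_POINT. Any arguments are used as a command prefix,
# so `start_console echo` prints the docker command instead of running it.
start_console() {
  "$@" docker run --rm -it \
    -v "$PWD:$MOUNT_POINT" -w "$MOUNT_POINT" \
    "$IMAGE" bash
}
```

Running `start_console echo` first is a cheap way to inspect the command before starting the container with a plain `start_console`.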
Run run_pipeline.sh to generate both the standard and extended versions of the dataset. It will produce the short version and the long version.
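Because a full pipeline run depends on external repositories, it can help to keep a log of each run. The wrapper below is a hypothetical convenience, not something the repository ships: the function name `run_logged` and the log file naming are assumptions, and it assumes the script can be started with bash from the project root.

```shell
#!/usr/bin/env sh
# Sketch only: a hypothetical logging wrapper, not part of the repository.

# Run a script with bash, capture stdout and stderr in a log file,
# and return the script's exit status.
#   $1: script to run (default: run_pipeline.sh)
#   $2: log file (default: a timestamped name, chosen for illustration)
run_logged() {
  script="${1:-run_pipeline.sh}"
  log="${2:-pipeline_$(date +%Y%m%d_%H%M%S).log}"
  status=0
  bash "$script" >"$log" 2>&1 || status=$?
  echo "exit status $status, log written to $log"
  return "$status"
}
```

Inside the container, `run_logged` with no arguments would run the pipeline script and leave a timestamped log next to it.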
While we try to make everything as replicable as possible, repositories change or are deleted over time, which makes it challenging to recreate the exact dataset. We provide the underlying scraped data on request.
Scripts and resources are organized into specific folders, and they are intended to be executed from the root directory of the project.
- analyze: Contains code for analyzing various stages of the dataset.
- enhance: Includes code for improving the dataset, such as adding more information.
- filter: Includes code for filtering the dataset.
- exporter: Scripts for converting the dataset into formats compatible with different models.
- importer: Scripts for importing common dataset formats to standardize the base across different datasets.
- prepare: Contains scripts for downloading CodeSearchNet.
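Since the scripts are meant to be run from the project root, a quick check that the folders listed above are present can catch a wrong working directory early. The helper below is a sketch under that assumption; the function name `check_project_root` is hypothetical and not part of the repository.

```shell
#!/usr/bin/env sh
# Sketch only: a hypothetical sanity check, not part of the repository.

# Return non-zero and list the missing folders if the given directory
# (default: current directory) lacks any folder named in the list above.
check_project_root() {
  root="${1:-.}"
  missing=""
  for d in analyze enhance filter exporter importer prepare; do
    [ -d "$root/$d" ] || missing="$missing $d"
  done
  if [ -n "$missing" ]; then
    echo "Missing folders in $root:$missing" >&2
    return 1
  fi
}
```

A call such as `check_project_root && bash run_pipeline.sh` would then refuse to start the pipeline from the wrong directory.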
To contact the authors, reach out to "[email protected]".
The official proceedings citation is forthcoming and will be updated once available.
Coming soon