Giter Site home page Giter Site logo

sun-wendy / dafnybench Goto Github PK

View Code? Open in Web Editor NEW
10.0 1.0 1.0 5.76 MB

DafnyBench: A Benchmark for Formal Software Verification

License: Apache License 2.0

Dafny 90.48% Python 0.84% Shell 0.05% Jupyter Notebook 8.64%
benchmark dafny formal-verification language-model program-verification

dafnybench's Introduction

DafnyBench: A Benchmark for Formal Software Verification

Dataset & code for our paper DafnyBench: A Benchmark for Formal Software Verification

Dataset is also available for download on ๐Ÿค— Hugging Face.

Overview ๐Ÿ“Š

DafnyBench is the largest benchmark of its kind for training and evaluating machine learning systems for formal software verification, with over 750 Dafny programs.

Usage ๐Ÿ’ป

  • Dataset: The dataset for DafnyBench (with 782 programs) could be found in the DafnyBench directory, which contains the ground_truth set & the hints_removedset (with compiler hints, i.e. annoataions, removed).
  • Evaluation: Evaluate LLMs on DafnyBench by asking models to fill in missing hints in a test file from the hints_removed set and checking if the reconstructed program could be verified by Dafny. Please refer to the eval directory.



Set Up for Evaluation ๐Ÿ”ง

  1. Install Dafny on your machine by following this tutorial
  2. Clone & cd into this repository
  3. Set up environment by running the following lines:
python -m venv stats
source stats/bin/activate
pip install -r requirements.txt
cd eval
  1. Set up environment variable for the root directory:
export DAFNYBENCH_ROOT=
  1. Set up environment variable for path to Dafny executable on your machine (for example, /opt/homebrew/bin/Dafny):
export DAFNY_PATH=
  1. If you're evaluating an LLM through API access, set up API key. For example:
export OPENAI_API_KEY=
  1. You can choose to evaluate an LLM on a single test program, such as:
python fill_hints.py --model "gpt-4o" --test_file "Clover_abs_no_hints.dfy" --feedback_turn 3 --dafny_path "$DAFNY_PATH"

or evaluate on the entire dataset:

export model_to_eval='gpt-4o'
./run_eval.sh

Contents ๐Ÿ“

  • DafnyBench
    • A collection of 782 Dafny programs. Each program has a ground_truth version that is fully verified with Dafny & a hints_removed version that has hints (i.e. annotations) removed
  • eval
    • Contains scripts to evaluate LLMs on DafnyBench
  • results
    • results_summary - Dataframes that summarize LLMs' success on every test program
    • reconstructed_files - LLM outputs with hints filled back in
    • analysis - Contains a notebook for analyzing the results



Citation ๐Ÿ“Ž

@article{loughridge2024dafnybench,
         title={DafnyBench: A Benchmark for Formal Software Verification}, 
         author={Chloe Loughridge and Qinyi Sun and Seth Ahrenbach and Federico Cassano and Chuyue Sun and Ying Sheng and Anish Mudide and Md Rakib Hossain Misu and Nada Amin and Max Tegmark},
         year={2024},
         journal={arXiv preprint arXiv:2406.08467}
}

dafnybench's People

Contributors

sun-wendy avatar

Stargazers

Stanislav Alekseev avatar westtide avatar Changjie Wang avatar Ekaterina Verbitskaia avatar  avatar Ying Sheng avatar Chloe avatar Zhuobin Huang avatar  avatar LiviaSun avatar

Watchers

 avatar

Forkers

mrhmisu

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.