Giter Site home page Giter Site logo

putnam's Introduction

PutnamBench

PutnamBench is a benchmark for evaluation of theorem-proving algorithms on competition mathematics problems sourced from the William Lowell Putnam Mathematical Competition years 1962 - 2023. Our formalizations currently support three formal languages : Lean 4 $\land$ Isabelle $\land$ Coq. PutnamBench comprises of 1697 manually-crafted formalizations, aggregated over all languages.

PutnamBench aims to support research in automated mathematical reasoning by providing a multilingual benchmark for evaluating theorem-proving algorithms. It is released under permissive licenses (Apache 2.0 for Lean 4 and Isabelle, MIT for Coq). The informal statements are also available with permission from the MAA.

PutnamBench includes factored solutions for problems which require exhibiting a numerical answer in addition to its proof of correctness. For these problems, one can attempt two tasks: proving the problem with the numerical answer written into the theorem statement, or additionally producing the answer along with the proof.

We are hosting a leaderboard and will readily receive evaluation results which are accompanied by a preprint or publication. Do not include proofs as confirmation in any public setting. Please reach out privately at [email protected] with any requests for additions to the leaderboard.

We strongly encourage community feedback! Please let us know if you have any comments for improving PutnamBench. If you notice any mistakes, please raise an issue on the repository and we will address it. We kindly ask that you do not write formal proofs for any of the problems in an effort to reduce contamination.

Statistics

Language Count
Lean 4 640
Isabelle 640
Coq 417

We also report the number of problems in a certain category. Note that some problems fall under multiple categories. While the categories are intended to capture general features of the problems, we also note that there is a high variance of problems inside an individual category.

Category Total Quantity
Algebra 253
Analysis 226
Number Theory 107
Geometry 68
Linear Algebra 51
Abstract Algebra 28
Combinatorics 26
Probability 9
Set Theory 8

Versioning

  • Version: v0
  • In preliminary release to allow for initial community feedback. We seek to release an official first version following several weeks of discussion with the community.

Citation

The associated paper for PutnamBench is available at this link. Please consider including the following citation if you find PutnamBench useful.

@misc{tsoukalas2024putnambenchevaluatingneuraltheoremprovers,
      title={PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition}, 
      author={George Tsoukalas and Jasper Lee and John Jennings and Jimmy Xin and Michelle Ding and Michael Jennings and Amitayush Thakur and Swarat Chaudhuri},
      year={2024},
      eprint={2407.11214},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.11214}, 
}

putnam's People

Contributors

georgetsoukalas avatar michaelhenryjennings avatar jxin31415 avatar michelled01 avatar johnedwardjennings avatar leejasper851 avatar amit9oct avatar eric-wieser avatar yashj1579 avatar

Stargazers

Vincent V avatar Bartosz Piotrowski avatar  avatar Pavel Chen avatar Nicolas Martel avatar Xiaohu Zhu avatar fzyzcjy avatar loopchen avatar William avatar Junyan Xu avatar ahuoguo avatar Bastien Le Chenadec avatar  avatar iJava avatar  avatar Monish Singhal avatar Jocelyn Chen avatar  avatar westtide avatar 爱可可-爱生活 avatar Zhaoyu Li (李照宇) avatar xuefengli avatar Jason Zhang avatar Zheng Yuan avatar Zengzhi Wang avatar Kaiyu Yang avatar

Watchers

Junyan Xu avatar  avatar Swarat Chaudhuri avatar  avatar

putnam's Issues

Evaluation of Formal Proofs

Many of the theorem statements does not have formal proofs(they have the sorry tags). If I have a generated proof for a theorem how should I evaluate it, even if we have a predicted proof and ground truth proof, how do I go for semantic evaluation between these two.
Someone please help me with this.

Putnam 2009 B1 issues

The Coq formalization of 2009 B1 (which is highlighted in the paper) has an error:

Theorem putnam_2009_b1
(Factl := fix factl (l : list nat) : list nat :=
match l with
| nil => nil
| h :: t => fact h :: t
end)
: forall (q: Q), q > 0 -> exists (n d: list nat), (forall x, (In x n \/ In x d)-> prime (Z.of_nat x)) /\
inject_Z (Z.of_nat (fold_left Nat.mul (Factl n) 1%nat)) / inject_Z (Z.of_nat (fold_left Nat.mul (Factl d) 1%nat)) = q.

The helper function Factl never recurses. More generally, this formalization looks rather non-idiomatic to me (which also makes me worry about the rest of them). Factl could be written using map, In should probably be replaced by Forall, and you might consider using some coercions. If you want to keep to the standard library, I'd go for something closer to this:

Require Import List QArith Znumtheory Reals Arith.
Open Scope Q.

Local Coercion inject_Z : Z >-> Q.
Local Coercion Z_of_nat : nat >-> Z.

Theorem putnam_2009_b1 :
  let fact_prod ls := fold_left Nat.mul (map fact ls) 1%nat in
  forall (q: Q), q > 0 -> exists (n d: list nat),
  @Forall nat prime (n ++ d) /\ fact_prod n / fact_prod d = q.
Proof.
Admitted.

However, generally speaking, for these kinds of problems you are almost certainly better off using Mathcomp instead of the standard library. It has a much richer development around primes, such as a prime decomposition function that will very likely be needed to prove this theorem. The following is a formalization in Mathcomp. Hopefully it is reasonably idiomatic, but I'm not a Mathcomp expert. I strongly recommend that you avoid the stdlib for these kinds of problems.

From mathcomp Require Import ssrbool seq ssrnat prime rat ssralg ssrnum ssrint.
Local Open Scope ring_scope.

Theorem putnam_2009_b1 :
  let fact_prod ls := (\prod_(i <- ls) i`!)%:Q in
  forall q : rat, q > 0 -> exists n d,
  all prime (n ++ d) /\ fact_prod n / fact_prod d = q.
Proof.
Admitted.

Finally, the Lean version seems to be implemented using functions on finite numbers instead of lists. Why this difference?

theorem putnam_2009_b1
(isquotprodprimefact : ℚ → Prop :=
fun q => (∃ (k m : ℕ) (a : Fin k → ℕ) (b : Fin m → ℕ),
(∀ i : Fin k, Nat.Prime (a i)) ∧ (∀ j : Fin m, Nat.Prime (b j))
∧ (q = (∏ i : Fin k, Nat.factorial (a i))/(∏ j : Fin m, Nat.factorial (b j)))
))
: ∀ q : ℚ, q > 0 → isquotprodprimefact q :=

Isabelle uses yet another approach using functions from nat to nat, with a separately encoded maximum indices k and m:

theorem putnam_2009_b1:
fixes isquotprodprimefact :: "rat \<Rightarrow> bool"
defines "isquotprodprimefact \<equiv> (\<lambda>q::rat. (\<exists>(k::nat)(m::nat)(a::nat\<Rightarrow>nat)(b::nat\<Rightarrow>nat).
(\<forall>i::nat\<in>{0..(k-1)}. prime (a i)) \<and> (\<forall>j::nat\<in>{0..(m-1)}. prime (b j))
\<and> q = (\<Prod>i::nat=0..(k-1). fact (a i)) / (\<Prod>j::nat=0..(m-1). fact (b j))))"
shows "\<forall>q::rat. (q > 0 \<longrightarrow> isquotprodprimefact q)"

Some lean files do not compile

From the CI logs:

error: > LEAN_PATH=./.lake/packages/std/.lake/build/lib:./.lake/packages/Qq/.lake/build/lib:./.lake/packages/aesop/.lake/build/lib:./.lake/packages/proofwidgets/.lake/build/lib:./.lake/packages/Cli/.lake/build/lib:./.lake/packages/importGraph/.lake/build/lib:./.lake/packages/mathlib/.lake/build/lib:./.lake/build/lib LD_LIBRARY_PATH=./.lake/build/lib /home/runner/.elan/toolchains/leanprover--lean4---v4.7.0-rc2/bin/lean ./././src/putnam_1972_b5.lean -R ././. -o ./.lake/build/lib/src/putnam_1972_b5.olean -i ./.lake/build/lib/src/putnam_1972_b5.ilean -c ./.lake/build/ir/src/putnam_1972_b5.c
error: stdout:
./././src/putnam_1972_b5.lean:10:2: error: failed to synthesize
  T2Space (EuclideanSpace ℝ (Fin 3))
(deterministic) timeout at 'typeclass', maximum number of heartbeats (20000) has been reached (use 'set_option synthInstance.maxHeartbeats <num>' to set the limit)
./././src/putnam_1972_b5.lean:10:23: error: failed to synthesize
  T2Space (EuclideanSpace ℝ (Fin 3))
(deterministic) timeout at 'typeclass', maximum number of heartbeats (20000) has been reached (use 'set_option synthInstance.maxHeartbeats <num>' to set the limit)
./././src/putnam_1972_b5.lean:10:44: error: failed to synthesize
  T2Space (EuclideanSpace ℝ (Fin 3))
(deterministic) timeout at 'typeclass', maximum number of heartbeats (20000) has been reached (use 'set_option synthInstance.maxHeartbeats <num>' to set the limit)
./././src/putnam_1972_b5.lean:10:65: error: failed to synthesize
  T2Space (EuclideanSpace ℝ (Fin 3))
(deterministic) timeout at 'typeclass', maximum number of heartbeats (20000) has been reached (use 'set_option synthInstance.maxHeartbeats <num>' to set the limit)
error: external command `/home/runner/.elan/toolchains/leanprover--lean4---v4.7.0-rc2/bin/lean` exited with code 1
error: > LEAN_PATH=./.lake/packages/std/.lake/build/lib:./.lake/packages/Qq/.lake/build/lib:./.lake/packages/aesop/.lake/build/lib:./.lake/packages/proofwidgets/.lake/build/lib:./.lake/packages/Cli/.lake/build/lib:./.lake/packages/importGraph/.lake/build/lib:./.lake/packages/mathlib/.lake/build/lib:./.lake/build/lib LD_LIBRARY_PATH=./.lake/build/lib /home/runner/.elan/toolchains/leanprover--lean4---v4.7.0-rc2/bin/lean ./././src/putnam_1967_b1.lean -R ././. -o ./.lake/build/lib/src/putnam_1967_b1.olean -i ./.lake/build/lib/src/putnam_1967_b1.ilean -c ./.lake/build/ir/src/putnam_1967_b1.c
error: stdout:
./././src/putnam_1967_b1.lean:13:26: error: unexpected token ')'; expected ','
error: external command `/home/runner/.elan/toolchains/leanprover--lean4---v4.7.0-rc2/bin/lean` exited with code 1

Putnam 2017 B2 issues

Another one from the paper:

Theorem putnam_2017_b2
(mina : nat)
(posMin : mina > 0)
(A : nat -> nat -> nat := fun a k => Z.to_nat (floor (sum_n (fun i => Raxioms.INR (a + (i + 1))) k)))
(p : nat -> nat -> Prop := fun N k => exists (a: nat), a > 0 /\ A a k = N)
(q : nat -> Prop := fun N => p N 2017 /\ forall (k: nat), k > 1 -> k <> 2017 -> ~ p N k)
(hmina : q (A mina 2017))
(hminalb : (forall (a: nat), a > 0 /\ q (A a 2017) -> mina <= a))
: mina = putnam_2017_b2_solution.

  • Why does a problem related only to natural numbers involve real numbers? Seems convoluted and makes proving it more difficult.
  • The way the theorem is stated seems very awkward. It says: "If you give me a number for which you can prove the requirement from the problem, then that number must be 16." I would expect more something much more straightforward like "the number 16 has the required properties":
Require Import Nat ZArith Reals.
Definition putnam_2017_b2_solution := 16.
Theorem putnam_2017_b2 :
  let seq a k := sum_nat a (a + k - 1) in
  let valid a := a > 0 /\ forall b k, b > 0 -> k > 1 -> seq a 2017 = seq b k -> k = 2017 in
  valid putnam_2017_b2_solution /\ forall a, valid a -> putnam_2017_b2_solution <= a.
Proof. Admitted.

Not only is this simpler, but the difference really matters. You can prove the original statement assuming this new statement, but not the other way around...

I'm seeing these kinds of patterns happen in other problems as well. My guess is that this happens because of the desire to inline all 'definitions' into the theorem. Why do this? Is there a problem with separate definitions?

unsound usage of optParams in Lean problems

putnam_2008_b5 and many other problems specify predicates as default values for optional parameters. It seems that this was done to avoid the need to add extra top-level definitions. Unfortunately, it leads to false theorems, because within the theorem body, we are not allowed to assume that the optional parameter always has its default value.

To show that putnam_2008_b5 is false as currently stated, here is an example of how to use it to prove False:

import Mathlib
open BigOperators

open Filter Topology Set Nat

abbrev putnam_2008_b5_solution : Set (ℝ → ℝ) := {fun (x : ℝ) => x + n | n : ℤ} ∪ {fun (x : ℝ) => -x + n | n : ℤ}

theorem putnam_2008_b5
(fqsat : (ℝ → ℝ) → ℚ → Prop :=
  fun f q => ContDiff ℝ 1 f ∧ (∃ p : ℚ, p = f q ∧ p.den = q.den))
(fsat : (ℝ → ℝ) → Prop := fun f => ∀ q : ℚ, fqsat f q)
: ∀ f : (ℝ → ℝ), fsat f ↔ f ∈ putnam_2008_b5_solution := by
  sorry

theorem explode : False := by
  have h1 :=
    putnam_2008_b5 (fun _ _ ↦ False) (fun _ ↦ False) (fun x ↦ x + 1)
  rw [false_iff] at h1
  contrapose! h1
  simp
  left
  use 1
  norm_cast

Sum elaboration issues on v4.7.0

./././src/putnam_2004_b1.lean:12:37: error: elaboration function for 'BigOperators.bigsum' has not been implemented
  ∑ j ∈ Finset.range (i + 1), (P.coeff (n - j) * r ^ (i + 1 - j))
./././src/putnam_2005_b2.lean:10:73: error: elaboration function for 'BigOperators.bigsum' has not been implemented
  ∑ i ∈ Finset.range n, k i
./././src/putnam_1995_a4.lean:10:16: error: elaboration function for 'BigOperators.bigsum' has not been implemented
  ∑ i ∈ Finset.range n, necklace i
./././src/putnam_1995_a4.lean:11:48: error: elaboration function for 'BigOperators.bigsum' has not been implemented
  ∑ i ≤ k, necklace ((cut + i) % n)

Putname 1980 A5 issues

Again a problem from the paper:

Theorem putnam_1980_a5
(n : nat)
(npos : gt n 0)
(coeff : nat -> R)
(hcoeff : coeff n <> 0)
(p : R -> R := fun x => sum_n (fun i => coeff i * x ^ i) (S n))
(h1 : nat -> Prop := fun a => RInt (fun x => p x * sin x) 0 (INR a) = 0)
(h2 : nat -> Prop := fun a => RInt (fun x => p x * cos x) 0 (INR a) = 0)
: exists (m: nat), forall (b: nat), h1 b /\ h2 b -> lt b m.

  • The integrals are parameterized on a nat but should be on a real.
  • The theorem states that the largest parameter for which the integrals are zero is bounded. It should instead be bounding the number of parameters where the integral is zero.
  • Is the successor function in p really needed? Not sure, but this might be a off-by-one error.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.