trishullab / putnam Goto Github PK

An evaluation benchmark for undergraduate competition math in Lean4, Isabelle, Coq, and natural language.

Lean 31.45% Python 1.97% Isabelle 44.64% Coq 21.88% Shell 0.06%

putnam's Introduction

PutnamBench

PutnamBench is a benchmark for evaluation of theorem-proving algorithms on competition mathematics problems sourced from the William Lowell Putnam Mathematical Competition years 1962 - 2023. Our formalizations currently support three formal languages : Lean 4 $\land$ Isabelle $\land$ Coq. PutnamBench comprises of 1697 manually-crafted formalizations, aggregated over all languages.

PutnamBench aims to support research in automated mathematical reasoning by providing a multilingual benchmark for evaluating theorem-proving algorithms. It is released under permissive licenses (Apache 2.0 for Lean 4 and Isabelle, MIT for Coq). The informal statements are also available with permission from the MAA.

PutnamBench includes factored solutions for problems which require exhibiting a numerical answer in addition to its proof of correctness. For these problems, one can attempt two tasks: proving the problem with the numerical answer written into the theorem statement, or additionally producing the answer along with the proof.

We are hosting a leaderboard and will readily receive evaluation results which are accompanied by a preprint or publication. Do not include proofs as confirmation in any public setting. Please reach out privately at [email protected] with any requests for additions to the leaderboard.

We strongly encourage community feedback! Please let us know if you have any comments for improving PutnamBench. If you notice any mistakes, please raise an issue on the repository and we will address it. We kindly ask that you do not write formal proofs for any of the problems in an effort to reduce contamination.

Statistics

Language	Count
Lean 4	640
Isabelle	640
Coq	417

We also report the number of problems in a certain category. Note that some problems fall under multiple categories. While the categories are intended to capture general features of the problems, we also note that there is a high variance of problems inside an individual category.

Category	Total Quantity
Algebra	253
Analysis	226
Number Theory	107
Geometry	68
Linear Algebra	51
Abstract Algebra	28
Combinatorics	26
Probability	9
Set Theory	8

Versioning

Version: v0
In preliminary release to allow for initial community feedback. We seek to release an official first version following several weeks of discussion with the community.

Citation

The associated paper for PutnamBench is available at this link. Please consider including the following citation if you find PutnamBench useful.

@misc{tsoukalas2024putnambenchevaluatingneuraltheoremprovers,
      title={PutnamBench: Evaluating Neural Theorem-Provers on the Putnam Mathematical Competition}, 
      author={George Tsoukalas and Jasper Lee and John Jennings and Jimmy Xin and Michelle Ding and Michael Jennings and Amitayush Thakur and Swarat Chaudhuri},
      year={2024},
      eprint={2407.11214},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2407.11214}, 
}

putnam's People

Contributors

Stargazers

Watchers

Forkers

adarsh321123 eric-wieser pitmonticone

putnam's Issues

Euclidean.dist should probably be avoided

Some fairly low priority feedback; this formalization

PutnamBench/lean4/src/putnam_1962_a3.lean

Lines 11 to 13 in 0d6e021

    
           (hA' : A' ∈ segment ℝ B C ∧ Euclidean.dist C A' / Euclidean.dist A' B = k) 
        
           (hB' : B' ∈ segment ℝ C A ∧ Euclidean.dist A B' / Euclidean.dist B' C = k) 
        
           (hC' : C' ∈ segment ℝ A B ∧ Euclidean.dist B C' / Euclidean.dist C' A = k)

should probably not be using Euclidean.dist, which doesn't do what you'd expect.

Evaluation of Formal Proofs

Many of the theorem statements does not have formal proofs(they have the sorry tags). If I have a generated proof for a theorem how should I evaluate it, even if we have a predicted proof and ground truth proof, how do I go for semantic evaluation between these two.
Someone please help me with this.

InternLM-Math solves 4 problems with answers in LEAN 4.

How do I update our results to your benchmark?

Putnam 2009 B1 issues

The Coq formalization of 2009 B1 (which is highlighted in the paper) has an error:

PutnamBench/coq/src/putnam_2009_b1.v

Lines 3 to 10 in adb10d6

    
           Theorem putnam_2009_b1 
        
               (Factl := fix factl (l : list nat) : list nat := 
        
                   match l with 
        
                   | nil => nil 
        
                   | h :: t => fact h :: t 
        
               end) 
        
               : forall (q: Q), q > 0 -> exists (n d: list nat), (forall x, (In x n \/ In x d)-> prime (Z.of_nat x)) /\  
        
               inject_Z (Z.of_nat (fold_left Nat.mul (Factl n) 1%nat)) / inject_Z (Z.of_nat (fold_left Nat.mul (Factl d) 1%nat)) = q.

The helper function Factl never recurses. More generally, this formalization looks rather non-idiomatic to me (which also makes me worry about the rest of them). Factl could be written using map, In should probably be replaced by Forall, and you might consider using some coercions. If you want to keep to the standard library, I'd go for something closer to this:

Require Import List QArith Znumtheory Reals Arith.
Open Scope Q.

Local Coercion inject_Z : Z >-> Q.
Local Coercion Z_of_nat : nat >-> Z.

Theorem putnam_2009_b1 :
  let fact_prod ls := fold_left Nat.mul (map fact ls) 1%nat in
  forall (q: Q), q > 0 -> exists (n d: list nat),
  @Forall nat prime (n ++ d) /\ fact_prod n / fact_prod d = q.
Proof.
Admitted.

However, generally speaking, for these kinds of problems you are almost certainly better off using Mathcomp instead of the standard library. It has a much richer development around primes, such as a prime decomposition function that will very likely be needed to prove this theorem. The following is a formalization in Mathcomp. Hopefully it is reasonably idiomatic, but I'm not a Mathcomp expert. I strongly recommend that you avoid the stdlib for these kinds of problems.

From mathcomp Require Import ssrbool seq ssrnat prime rat ssralg ssrnum ssrint.
Local Open Scope ring_scope.

Theorem putnam_2009_b1 :
  let fact_prod ls := (\prod_(i <- ls) i`!)%:Q in
  forall q : rat, q > 0 -> exists n d,
  all prime (n ++ d) /\ fact_prod n / fact_prod d = q.
Proof.
Admitted.

Finally, the Lean version seems to be implemented using functions on finite numbers instead of lists. Why this difference?

PutnamBench/lean4/src/putnam_2009_b1.lean

Lines 6 to 12 in adb10d6

    
           theorem putnam_2009_b1 
        
           (isquotprodprimefact : ℚ → Prop := 
        
           fun q => (∃ (k m : ℕ) (a : Fin k → ℕ) (b : Fin m → ℕ), 
        
           (∀ i : Fin k, Nat.Prime (a i)) ∧ (∀ j : Fin m, Nat.Prime (b j)) 
        
           ∧ (q = (∏ i : Fin k, Nat.factorial (a i))/(∏ j : Fin m, Nat.factorial (b j))) 
        
           )) 
        
           : ∀ q : ℚ, q > 0 → isquotprodprimefact q :=

Isabelle uses yet another approach using functions from nat to nat, with a separately encoded maximum indices k and m:

PutnamBench/isabelle/putnam_2009_b1.thy

Lines 6 to 11 in adb10d6

    
           theorem putnam_2009_b1: 
        
             fixes isquotprodprimefact :: "rat \<Rightarrow> bool" 
        
             defines "isquotprodprimefact \<equiv> (\<lambda>q::rat. (\<exists>(k::nat)(m::nat)(a::nat\<Rightarrow>nat)(b::nat\<Rightarrow>nat). 
        
               (\<forall>i::nat\<in>{0..(k-1)}. prime (a i)) \<and> (\<forall>j::nat\<in>{0..(m-1)}. prime (b j)) 
        
               \<and> q = (\<Prod>i::nat=0..(k-1). fact (a i)) / (\<Prod>j::nat=0..(m-1). fact (b j))))" 
        
             shows "\<forall>q::rat. (q > 0 \<longrightarrow> isquotprodprimefact q)"

Some lean files do not compile

From the CI logs:

error: > LEAN_PATH=./.lake/packages/std/.lake/build/lib:./.lake/packages/Qq/.lake/build/lib:./.lake/packages/aesop/.lake/build/lib:./.lake/packages/proofwidgets/.lake/build/lib:./.lake/packages/Cli/.lake/build/lib:./.lake/packages/importGraph/.lake/build/lib:./.lake/packages/mathlib/.lake/build/lib:./.lake/build/lib LD_LIBRARY_PATH=./.lake/build/lib /home/runner/.elan/toolchains/leanprover--lean4---v4.7.0-rc2/bin/lean ./././src/putnam_1972_b5.lean -R ././. -o ./.lake/build/lib/src/putnam_1972_b5.olean -i ./.lake/build/lib/src/putnam_1972_b5.ilean -c ./.lake/build/ir/src/putnam_1972_b5.c
error: stdout:
./././src/putnam_1972_b5.lean:10:2: error: failed to synthesize
  T2Space (EuclideanSpace ℝ (Fin 3))
(deterministic) timeout at 'typeclass', maximum number of heartbeats (20000) has been reached (use 'set_option synthInstance.maxHeartbeats <num>' to set the limit)
./././src/putnam_1972_b5.lean:10:23: error: failed to synthesize
  T2Space (EuclideanSpace ℝ (Fin 3))
(deterministic) timeout at 'typeclass', maximum number of heartbeats (20000) has been reached (use 'set_option synthInstance.maxHeartbeats <num>' to set the limit)
./././src/putnam_1972_b5.lean:10:44: error: failed to synthesize
  T2Space (EuclideanSpace ℝ (Fin 3))
(deterministic) timeout at 'typeclass', maximum number of heartbeats (20000) has been reached (use 'set_option synthInstance.maxHeartbeats <num>' to set the limit)
./././src/putnam_1972_b5.lean:10:65: error: failed to synthesize
  T2Space (EuclideanSpace ℝ (Fin 3))
(deterministic) timeout at 'typeclass', maximum number of heartbeats (20000) has been reached (use 'set_option synthInstance.maxHeartbeats <num>' to set the limit)
error: external command `/home/runner/.elan/toolchains/leanprover--lean4---v4.7.0-rc2/bin/lean` exited with code 1

error: > LEAN_PATH=./.lake/packages/std/.lake/build/lib:./.lake/packages/Qq/.lake/build/lib:./.lake/packages/aesop/.lake/build/lib:./.lake/packages/proofwidgets/.lake/build/lib:./.lake/packages/Cli/.lake/build/lib:./.lake/packages/importGraph/.lake/build/lib:./.lake/packages/mathlib/.lake/build/lib:./.lake/build/lib LD_LIBRARY_PATH=./.lake/build/lib /home/runner/.elan/toolchains/leanprover--lean4---v4.7.0-rc2/bin/lean ./././src/putnam_1967_b1.lean -R ././. -o ./.lake/build/lib/src/putnam_1967_b1.olean -i ./.lake/build/lib/src/putnam_1967_b1.ilean -c ./.lake/build/ir/src/putnam_1967_b1.c
error: stdout:
./././src/putnam_1967_b1.lean:13:26: error: unexpected token ')'; expected ','
error: external command `/home/runner/.elan/toolchains/leanprover--lean4---v4.7.0-rc2/bin/lean` exited with code 1

Putnam 2017 B2 issues

Another one from the paper:

PutnamBench/coq/src/putnam_2017_b2.v

Lines 3 to 11 in 6ddcaac

    
           Theorem putnam_2017_b2 
        
               (mina : nat) 
        
               (posMin : mina > 0) 
        
               (A : nat -> nat -> nat := fun a k => Z.to_nat (floor (sum_n (fun i => Raxioms.INR (a + (i + 1))) k))) 
        
               (p : nat -> nat -> Prop := fun N k => exists (a: nat), a > 0 /\ A a k = N)  
        
               (q : nat -> Prop := fun N => p N 2017 /\ forall (k: nat), k > 1 -> k <> 2017 -> ~ p N k) 
        
               (hmina : q (A mina 2017))  
        
               (hminalb : (forall (a: nat), a > 0 /\ q (A a 2017) -> mina <= a)) 
        
               : mina = putnam_2017_b2_solution.

Why does a problem related only to natural numbers involve real numbers? Seems convoluted and makes proving it more difficult.
The way the theorem is stated seems very awkward. It says: "If you give me a number for which you can prove the requirement from the problem, then that number must be 16." I would expect more something much more straightforward like "the number 16 has the required properties":

Require Import Nat ZArith Reals.
Definition putnam_2017_b2_solution := 16.
Theorem putnam_2017_b2 :
  let seq a k := sum_nat a (a + k - 1) in
  let valid a := a > 0 /\ forall b k, b > 0 -> k > 1 -> seq a 2017 = seq b k -> k = 2017 in
  valid putnam_2017_b2_solution /\ forall a, valid a -> putnam_2017_b2_solution <= a.
Proof. Admitted.

Not only is this simpler, but the difference really matters. You can prove the original statement assuming this new statement, but not the other way around...

I'm seeing these kinds of patterns happen in other problems as well. My guess is that this happens because of the desire to inline all 'definitions' into the theorem. Why do this? Is there a problem with separate definitions?

unsound usage of optParams in Lean problems

putnam_2008_b5 and many other problems specify predicates as default values for optional parameters. It seems that this was done to avoid the need to add extra top-level definitions. Unfortunately, it leads to false theorems, because within the theorem body, we are not allowed to assume that the optional parameter always has its default value.

To show that putnam_2008_b5 is false as currently stated, here is an example of how to use it to prove False:

import Mathlib
open BigOperators

open Filter Topology Set Nat

abbrev putnam_2008_b5_solution : Set (ℝ → ℝ) := {fun (x : ℝ) => x + n | n : ℤ} ∪ {fun (x : ℝ) => -x + n | n : ℤ}

theorem putnam_2008_b5
(fqsat : (ℝ → ℝ) → ℚ → Prop :=
  fun f q => ContDiff ℝ 1 f ∧ (∃ p : ℚ, p = f q ∧ p.den = q.den))
(fsat : (ℝ → ℝ) → Prop := fun f => ∀ q : ℚ, fqsat f q)
: ∀ f : (ℝ → ℝ), fsat f ↔ f ∈ putnam_2008_b5_solution := by
  sorry

theorem explode : False := by
  have h1 :=
    putnam_2008_b5 (fun _ _ ↦ False) (fun _ ↦ False) (fun x ↦ x + 1)
  rw [false_iff] at h1
  contrapose! h1
  simp
  left
  use 1
  norm_cast

Sum elaboration issues on v4.7.0

./././src/putnam_2004_b1.lean:12:37: error: elaboration function for 'BigOperators.bigsum' has not been implemented
  ∑ j ∈ Finset.range (i + 1), (P.coeff (n - j) * r ^ (i + 1 - j))

./././src/putnam_2005_b2.lean:10:73: error: elaboration function for 'BigOperators.bigsum' has not been implemented
  ∑ i ∈ Finset.range n, k i

./././src/putnam_1995_a4.lean:10:16: error: elaboration function for 'BigOperators.bigsum' has not been implemented
  ∑ i ∈ Finset.range n, necklace i
./././src/putnam_1995_a4.lean:11:48: error: elaboration function for 'BigOperators.bigsum' has not been implemented
  ∑ i ≤ k, necklace ((cut + i) % n)

MeasureTheory.volume is incorrectly used on 2d sets in 3D

Unfortunately, in this line

PutnamBench/lean4/src/putnam_2013_a5.lean

Line 15 in 84beb3b

    
           (harea3 : ∀ a b c : Fin 3 → ℝ, area3 a b c = (MeasureTheory.volume (convexHull ℝ {a, b, c})).toReal)

the volume of the provided set is zero, as volume in 3d does not refer to area.
See Zulip.

Putname 1980 A5 issues

Again a problem from the paper:

PutnamBench/coq/src/putnam_1980_a5.v

Lines 2 to 10 in 6ddcaac

    
           Theorem putnam_1980_a5 
        
               (n : nat) 
        
               (npos : gt n 0) 
        
               (coeff : nat -> R) 
        
               (hcoeff : coeff n <> 0) 
        
               (p : R -> R := fun x => sum_n (fun i => coeff i * x ^ i) (S n)) 
        
               (h1 : nat -> Prop := fun a => RInt (fun x => p x * sin x) 0 (INR a) = 0) 
        
               (h2 : nat -> Prop := fun a => RInt (fun x => p x * cos x) 0 (INR a) = 0) 
        
               : exists (m: nat), forall (b: nat), h1 b /\ h2 b -> lt b m.

The integrals are parameterized on a nat but should be on a real.
The theorem states that the largest parameter for which the integrals are zero is bounded. It should instead be bounding the number of parameters where the integral is zero.
Is the successor function in p really needed? Not sure, but this might be a off-by-one error.

	(hA' : A' ∈ segment ℝ B C ∧ Euclidean.dist C A' / Euclidean.dist A' B = k)
	(hB' : B' ∈ segment ℝ C A ∧ Euclidean.dist A B' / Euclidean.dist B' C = k)
	(hC' : C' ∈ segment ℝ A B ∧ Euclidean.dist B C' / Euclidean.dist C' A = k)

	Theorem putnam_2009_b1
	(Factl := fix factl (l : list nat) : list nat :=
	match l with
	\| nil => nil
	\| h :: t => fact h :: t
	end)
	: forall (q: Q), q > 0 -> exists (n d: list nat), (forall x, (In x n \/ In x d)-> prime (Z.of_nat x)) /\
	inject_Z (Z.of_nat (fold_left Nat.mul (Factl n) 1%nat)) / inject_Z (Z.of_nat (fold_left Nat.mul (Factl d) 1%nat)) = q.

	theorem putnam_2009_b1
	(isquotprodprimefact : ℚ → Prop :=
	fun q => (∃ (k m : ℕ) (a : Fin k → ℕ) (b : Fin m → ℕ),
	(∀ i : Fin k, Nat.Prime (a i)) ∧ (∀ j : Fin m, Nat.Prime (b j))
	∧ (q = (∏ i : Fin k, Nat.factorial (a i))/(∏ j : Fin m, Nat.factorial (b j)))
	))
	: ∀ q : ℚ, q > 0 → isquotprodprimefact q :=

	theorem putnam_2009_b1:
	fixes isquotprodprimefact :: "rat \<Rightarrow> bool"
	defines "isquotprodprimefact \<equiv> (\<lambda>q::rat. (\<exists>(k::nat)(m::nat)(a::nat\<Rightarrow>nat)(b::nat\<Rightarrow>nat).
	(\<forall>i::nat\<in>{0..(k-1)}. prime (a i)) \<and> (\<forall>j::nat\<in>{0..(m-1)}. prime (b j))
	\<and> q = (\<Prod>i::nat=0..(k-1). fact (a i)) / (\<Prod>j::nat=0..(m-1). fact (b j))))"
	shows "\<forall>q::rat. (q > 0 \<longrightarrow> isquotprodprimefact q)"

	Theorem putnam_2017_b2
	(mina : nat)
	(posMin : mina > 0)
	(A : nat -> nat -> nat := fun a k => Z.to_nat (floor (sum_n (fun i => Raxioms.INR (a + (i + 1))) k)))
	(p : nat -> nat -> Prop := fun N k => exists (a: nat), a > 0 /\ A a k = N)
	(q : nat -> Prop := fun N => p N 2017 /\ forall (k: nat), k > 1 -> k <> 2017 -> ~ p N k)
	(hmina : q (A mina 2017))
	(hminalb : (forall (a: nat), a > 0 /\ q (A a 2017) -> mina <= a))
	: mina = putnam_2017_b2_solution.

	Theorem putnam_1980_a5
	(n : nat)
	(npos : gt n 0)
	(coeff : nat -> R)
	(hcoeff : coeff n <> 0)
	(p : R -> R := fun x => sum_n (fun i => coeff i * x ^ i) (S n))
	(h1 : nat -> Prop := fun a => RInt (fun x => p x * sin x) 0 (INR a) = 0)
	(h2 : nat -> Prop := fun a => RInt (fun x => p x * cos x) 0 (INR a) = 0)
	: exists (m: nat), forall (b: nat), h1 b /\ h2 b -> lt b m.