Abstract
The CADE ATP System Competition (CASC) is the annual evaluation of fully automatic, classical logic, Automated Theorem Proving (ATP) systems. CASC-J11 was the twenty-seventh competition in the CASC series. Twenty-four ATP systems competed in the various competition divisions. This paper presents an outline of the competition design and a commentated summary of the results.
Introduction
Automated Theorem Proving (ATP) deals with the task of proving theorems from axioms – the derivation of conclusions that follow inevitably from known facts [33]. The converse task of disproving conjectures is another facet of interest [5,11]. The axioms and conjecture are written in an appropriately expressive logic, and the proofs are often similarly written in logic [46]. The CADE ATP System Competition (CASC) [42] is the annual evaluation of fully automatic, classical logic, ATP systems – the world championship for such systems. One purpose of CASC is to provide a public evaluation of the relative capabilities of ATP systems. Additionally, CASC aims to stimulate ATP research, motivate development and implementation of robust ATP systems that can be easily and usefully deployed in applications, provide an inspiring environment for personal interaction between ATP researchers, and expose ATP systems within and beyond the ATP community. CASC-J11 was the twenty-seventh competition in the CASC series; see [47] and citations therein, and the CASC web site
CASC was originally devised by Christian Suttner and the first author, after CADE-12 in 1994. The first CASC, CASC-13, was held as part of CADE-13 at DIMACS in 1996. Christian Suttner died in a gliding accident on 1st February 2022. CASC-J11 is dedicated to his memory – CASC would not exist without his inspiration and insights.
CADE was a constituent of the 11th International Joint Conference on Automated Reasoning, hence “J11” for the “11th Joint” conference.
CASC is divided into divisions according to problem and system characteristics, in a coarse version of the TPTP problem library’s Specialist Problem Classes (SPCs) [48]. There are competition divisions in which the systems are explicitly ranked, and a demonstration division in which systems demonstrate their abilities without being ranked (for systems that cannot be entered into the competition divisions for any reason, e.g., the system is the previous CASC’s division winner, or the entrant has a conflict of interest). Each competition division uses problems that have certain logical, language, and syntactic characteristics, so that the systems that compete in a division are, in principle, able to attempt all the problems in the division. Some divisions are further divided into problem categories that make it possible to analyze, at a more fine-grained level, which systems work well for what types of problems. The demonstration division uses the same problems as the competition divisions, and the entry specifies which competition divisions’ problems are to be used. Table 1 catalogs the divisions and problem categories of CASC-J11.
Divisions and problem categories
Twenty-four ATP systems competed in the various divisions. The division winners from CASC-28 (the previous CASC), and the Prover9 1109a system, were automatically entered into the demonstration division to provide benchmarks against which progress can be judged. The systems, the divisions in which they were entered, and their entrants, are listed in Table 2. A division acronym in italics indicates the system was in the demonstration division. System descriptions are in the competition proceedings [44] and on the CASC-J11 web site.
The ATP systems and entrants
CASC-J11 was organized by Geoff Sutcliffe, assisted by Martin Desharnais for the SLH and LTB divisions, and overseen by a panel consisting of Pascal Fontaine, Cláudia Nalon, and Christoph Weidenbach. The competition was run on computers provided by the StarExec project [39] at the University of Miami.
This paper is organized as follows: Section 2 outlines the design and organization of the competition. Section 3 provides a commentated summary of the results. Section 4 contains short descriptions of three of the ATP systems. Section 5 describes some of the CASC activities that stimulate ATP research. Section 6 concludes and discusses plans for future CASCs. A Tense Note: Attentive readers will notice changes between the present and past tenses in this paper. Many parts of CASC are established and stable – they are described in the present tense (the rules are the rules). Aspects that were particular to CASC-J11 are described in the past tense so that they make sense when reading this after the event.
The design and organization of CASC has evolved over the years to a sophisticated state. An outline of the CASC-J11 design and organization is provided here; the details are in [45] and on the CASC-J11 web site. Important changes for CASC-J11 were (for readers already familiar with the general design of CASC):
The TFA division returned from hiatus. The LTB division did not have FOF versions of the problems, but added TXF versions.
System delivery, execution, and evaluation
The ATP systems entered into CASC are delivered to the competition organizer as StarExec installation packages, which the organizer installs and tests on StarExec. Source code is delivered separately, under the trusting assumption that the installation package does correspond to the source code. After the competition all competition division systems’ StarExec and source code packages are made available on the CASC web site. This allows anyone to use the systems on StarExec, and to examine the source code. An open source license is encouraged, to allow the systems to be freely used, modified, and shared. Many of the StarExec packages include statically linked binaries that provide further portability and longevity for the systems.
The ATP systems are required to be fully automatic. They are executed as black boxes, on one problem (in the non-LTB divisions) or one batch (in the LTB division) at a time. Any command line parameters have to be the same for all problems/batches in each division. The ATP systems are required to be sound, and are tested for soundness by submitting non-theorems to the systems in the THF, TFA, FOF, UEQ, SLH, and LTB divisions, and theorems to the systems in the FNT division. Claiming to have found a proof of a non-theorem or a disproof of a theorem indicates unsoundness. One system was found to be unsound before CASC-J11, and was repaired in time for the competition.
The systems in the competition divisions are ranked according to the number of problems solved with an acceptable proof/model output; see [45] for an explanation of what is “acceptable”. Ties are broken according to the average time taken over problems solved. Trophies are awarded to the competition divisions’ winners.
In addition to the ranking, three other performance measures are presented in the results: The state-of-the-art contribution (SotAC) quantifies the unique abilities of each system (excluding the previous year’s winners that are earlier versions of competing systems). For each problem solved by a system, its SotAC for the problem is the fraction of systems that do not solve the problem, and a system’s overall SotAC is the average SotAC over the problems it solves but that are not solved by all the systems. The efficiency balances the number of problems solved with the time taken. It is the average solution rate over the problems solved (the solution rate for one problem is the reciprocal of the time taken to solve it), multiplied by the fraction of problems solved. The core usage measures the extent to which the systems take advantage of multiple cores. It is the average of the ratios of CPU time to wall clock time used, over the problems solved. The competition ran on octa-core computers, thus the maximal core usage was 8.0.
The competition problems
Problems for the TPTP-based divisions
The problems for the THF, TFA, FOF, FNT, and UEQ divisions were taken from the Thousands of Problems for Theorem Provers (TPTP) problem library [43], v8.1.0. The TPTP version used for CASC is released only after the competition has started, so that new problems in the release have not been seen by the entrants. The problems have to meet certain criteria to be eligible for use:
The TPTP tags problems that are designed specifically to be suited or ill-suited to a particular ATP system, calculus, or control strategy as biased, and they are excluded from the competition.
The problems must be syntactically non-propositional.
The TPTP uses system performance data in the Thousands of Solutions from Theorem Provers (TSTP) solution library to compute problem difficulty ratings in the range 0.00 (easy) to 1.00 (unsolved) [48]. Problems with ratings in the range 0.21 to 0.99 are eligible. Problems of lesser and greater ratings are made eligible if there are not enough problems with ratings in that range. For CASC-J11 the TNE, FNN, and FNQ problem categories used problems with ratings in the range 0.21 to 1.00. The TNE category selected 3 problems with rating 1.00 out of the 100 problems used, the FNN category selected 28 problems with rating 1.00 out of the 100 problems used, and the FNQ category selected 39 problems with rating 1.00 out of the 150 problems used.
Systems can be submitted before the competition so that their performance data is used in computing the problem ratings – problems that are solved get a rating less than 1.00 and are thus eligible (unless the rating drops below 0.21). The rating system also uses performance data from ATP systems that are not entered into the competition, which can produce ratings that make problems eligible for selection but easy or unsolvable by the systems in the competition – see Sections 3.1 and 3.2 for examples. Using such problems that are solved by all or none of the competition systems does not affect the competition rankings, has the benefit of placing the systems’ performances in the context of the state-of-the-art in ATP, but does reduce the differentiation between the systems in the competition. The rating system is being improved to reduce this effect in future TPTP releases.
In order to ensure that no system receives an advantage or disadvantage due to the specific presentation of the problems in the TPTP, the problems are preprocessed to strip out all comment lines (in particular, the problem header), randomly reorder the formulae/clauses (
The numbers of problems used in each division and problem category are constrained by the numbers of eligible problems, the number of systems entered across the divisions, the number of CPUs available, the time limits (see Section 2.3), and the time available for running the competition live in one conference day, i.e., in about 6 hours. The numbers of problems used are set within these constraints, according to the judgement of the organizers. Table 3 gives the numbers of eligible problems, the maximal numbers that could be used after taking into account the limitation on very similar problems (see below), and the numbers of problems used, in each division and category. The problems used are randomly selected from the eligible problems based on a seed supplied by the competition panel:
The selection is constrained so that no problem category contains an excessive number of very similar problems, according to the “very similar problems” (VSP) lists distributed with the TPTP: For each problem category in each division, if the category is going to use N problems and there are L VSP lists that have an intersection of at least In order to combat excessive tuning towards problems that are already in the preceding TPTP version, the selection is biased to select problems that are new in the TPTP version used, until 50% of the problems in each problem category have been selected or there are no more new problems to select, after which random selection from old and new problems continues.
The problems are given to the ATP systems as files in TPTP format, in increasing order of TPTP difficulty rating.
Numbers of eligible and used problems
Numbers of eligible and used problems
As a “surprise” test to see if there had been progress in the last year, the same 720 SLH problems from CASC-28 were used in CASC-J11. In summary (see [47] for details), an initial set of 2040 problems was generated by Isabelle’s Sledgehammer system [29] from the Archive of Formal Proofs (AFP) [1],3
The problems for the LTB division are taken from various sources, with each problem category being based on one source. CASC-J11 had only one category, the VBT category, which used a set of 8900 problems generated by Isabelle’s Sledgehammer system from the “van Emde Boas Trees” session in the AFP.4
The problems are given to the ATP systems as files in TPTP format, in the natural order of their source, i.e., for CASC-J11 in the order of their export from the AFP. The problem batch was unordered, which meant that the ATP systems could attempt as many of the versions as they wanted, in any order including in parallel, and could attempt them multiple times. This provided opportunities for sharing logical and control results between proof attempts. A solution to any version counted as a solution to the problem.
The StarExec computers used for the competition had an octa-core Intel(R) Xeon(R) E5-2667 v4 CPU run at 2.10 GHz, 128 GiB memory, and the CentOS Linux release 7.4.1708 (Core) operating system Linux kernel 3.10.0-693.el7.x86_64. One ATP system ran on one CPU at a time, and could use all the cores on the CPU. (Each StarExec computer has two sockets, i.e., two CPUs, and 256 GiB memory. StarExec uses Linux’s
In the THF, TFA, FOF, FNT, and UEQ divisions the time limit for each problem is constrained by the same factors that constrain the numbers of problems that can used (see Section 2.2.1), and additionally the numbers of problems that are used. The time limit is set within the constraints, according to the judgement of the organizers. In CASC-J11 a 120 s wall clock time limit was imposed for each problem, and no CPU limits were imposed so that it could be advantageous to use all the cores on the CPU. In the SLH division a 30 s CPU time limit was imposed for each problem, which is the amount of CPU time that can be usefully allocated for a proof attempt in the Sledgehammer context.5
According to Jasmin Blanchette, and he should know.
Demonstration division systems can run on the competition computers, or the computers can be supplied by the entrant. The CASC-J11 demonstration division systems all used the competition computers.
For each ATP system, for each problem, four items of data were recorded: whether or not the problem was solved, the CPU and wall clock times taken (as measured by StarExec’s
The THF division
Table 4 summarizes the results of the THF division. Zipperposition dominated the division, with the new Zipperposition 2.1.999 being outperformed by the CASC-28 winner Zipperposition 2.1. The most interesting change in this division was the entry of E 3.0, which had acquired THF capability in the last year. Even more interesting is that E outperformed other systems that have had THF capability for many years, e.g., Vampire, Leo-III, Satallax, and cvc5. This new capability came from the integration of new data structures and inference rules [3,50,51,53]. A short system description of E is provided in Section 4.
Another interesting entry was Lash 1.12, which is a reimplementation [7] of Satallax 3.4 [8] in C. Lash did not perform as well as Satallax because several features in Satallax had not yet been ported: pattern clauses, higher order unification, and use of E as a backend. The performance graph for the THF division6
THF division results
In addition to being a new entrant in the THF division, E also had the lowest average time taken, and as a result the highest efficiency. It also had the second highest SoTAC. cvc5 also had a relatively high efficiency thanks to its low average time taken. As in the past, Leo-III had a lower efficiency due to its JVM startup time of 1-3 s, which means that solutions always take a few seconds even if the time used for reasoning is low. While Satallax solved more problems than Lash, the reimplementation in Lash is clearly faster, which results in higher efficiency. The core usage was rather low in THF compared to other divisions – see the summary of core usages in Section 5.1. Zipperposition and Vampire made the best use of the multiple cores available. The category rankings and numbers of new problems solved are almost completely aligned with the overall ranking, which indicates that the systems were not particularly tuned to problems with or without equality, or to problems in the TPTP. None of the three problems with rating 1.00 (see Section 2.2) were solved.
The individual problem results show that 19 problems were unsolved, 120 problems were solved by all the systems, and 24 problems were solved by only one system (13 problems were solved by only the two versions of Zipperposition, and are counted as unique solutions for Zipperposition 2.1.999). Of the 24 unique solutions, 16 were by Zipperposition 2.1.999, 2 each by E, Satallax, and Zipperposition 2.1, and 1 each by cvc5 and Lash. A portfolio approach (sharing the time between several systems) could help here. The best portfolio with 45 s allocated to each of the two Zipperpositions, E, and Leo-III, would solve 470 problems. A further 1 problem could be solved by a very precise time allocation to those four systems and additionally cvc5. The rather large number of problems solved by all the systems is an instance of the issue noted in Section 2.2.1. In this case it was due to the use of performance data from both theorem proving and model finding variants of systems e.g., those of E and Vampire, when computing the problem ratings. This particular effect will be gone from the ratings of future TPTP releases.
The TFA division was on hiatus for CASC-28 because only three systems, including the previous CASC-27’s winner, were entered into the division in CASC-J10 the year before. The division returned from hiatus with encouragement from some system developers, and it was pleasing to see five systems entered in CASC-J11. The entrants included two new systems: SnakeForV4.7 and iProver. SnakeForV4.7 was entered as a demonstration division system because the entrant felt that the system was “capitalizing on the whole Vampire team’s effort”, and he did not want to be “snatching any potential trophies from Vampire proper”.
Table 5 summarizes the results of the TFA division. The winner was cvc5, with the CASC-28 winner Vampire 4.5 close behind, and the demonstration division’s SnakeForV4.7 well out ahead. The developer of cvc5 commented: “To be honest, I’m a bit surprised since I didn’t make a big push for it. Perhaps the problem selection swung in my favor this year.”. Note that the new Vampire 4.7 solved less problems than Vampire 4.5. The performance graph for the TFA division7
The impressive performance of SnakeForV4.7 is clearly the highlight of the division, and a short system description of SnakeForV4.7 is provided in Section 4. The first entry of iProver into the TFA division is also of quite some interest, and a short system description of iProver is also provided in Section 4.
TFA division results
SnakeForV4.7 and the Vampires had high efficiency due to a combination of low times taken over high numbers of problems solved. All the systems except cvc5 had high core usage. The high core usage of iProver might seem strange alongside its relatively high average (wall clock) time taken, but is explained by its quite high average CPU time of 127.4 s.
The individual problem results show that 10 problems were unsolved, 109 problems were solved by all the systems, 20 problems were solved by only one system, and no problems were solved by only the two versions of Vampire. Of the 20 unique solutions, 12 were by cvc5, 5 were by Vampire 4.5, 2 by SnakeForV4.7, and 1 by iProver. A simple portfolio of SnakeForV4.7 and cvc5, giving each 60 s, would solve 229 problems. The reason for the rather large number of problems solved by all the systems is another instance of the issue noted in Section 2.2.1. In this case many problems were easy for the systems in the competition due to the influence of performance data from systems that were not entered, including Beagle [2], Princess [35], SPASS + T [30], and Z3 [13].
Table 6 summarizes the results of the FOF division. There were strong performances by the usual suspects: the Vampires, E, iProver, and CSE_E. The Vampire-based demonstration division system SnakeForV4.7 solved the most problems. Prover9, which is entered as a fixed point against which progress can be judged, slipped further down the ranking compared to CASC-28, where it was sixth from last. While this might be bad news for Prover9, it’s good news for ATP because it does indicate progress in the field. While Etableau found proofs for 279 problems, the CASC-J11 panel had ruled that the proofs were not acceptable. As a result Etableau received a zero score and ranked last. The developer said that he “understood completely”, and hopefully the ruling will lead to better proof output ready for CASC-29!
Goéland was one of the newcomers to CASC, and while it’s performance was relatively weak, the CASC rules explicitly state that “A system that is not entered into a division is assumed to perform worse than the entered systems”, i.e., Goéland is assumed to be better than several well-known ATP systems (that were not entered). See Section 5.1 for details of the best newcomer award received by Goéland.
FOF division results
FOF division results
The three Vampire(-based) systems all had low average times taken, high SoTACs, high efficiency, and reasonable core usage. iProver, GKC, and Drodi had the highest core usage. The performances in the two problem categories are well aligned with the overall performance, with E having its usual predilection for the FEQ category. Prover9 and SnakeForV4.7 also did better in the FEQ category.
The individual problem results show that 13 problems were unsolved, no problems were solved by all the systems, and 17 problems were solved by only one system (4 problems were solved by only the two versions of Vampire, and are counted as unique solutions for Vampire 4.7). Of the 17 unique solutions, 5 were by Vampire 4.7, 5 by SnakeForV4.7, 2 by each of E and cvc5, and 1 by each of CSE_e, iProver, and Prover9. A portfolio giving 40 s to each of SnakeForV4.7, Vampire 4.6, and E would solve 470 problems.
Table 7 summarizes the results of the FNT division. As in CASC-28 the Vampires came out ahead of the other systems, with the CASC-28 winner Vampire 4.6 solving the most problems. The demonstration division system SnakeForV4.7 also performed strongly. The performance graph for the FNT division8
FNT division results
Vampire 4.7 and SnakeForV4.7 had the highest SoTACs, and were joined by Vampire 4.6 for the highest efficiencies. iProver, Vampire 4.7, and SnakeForV4.7 had high core usage.
The individual problem results show that 77 problems were unsolved (65 of rating 1.00, as discussed above), 13 problems were solved by all the systems, and 8 problems were solved by only one system (1 problem was solved by only the two versions of Vampire, and is counted as a unique solution for Vampire 4.7). Of the 8 unique solutions, 3 were by Vampire 4.6, 2 by each of Vampire 4.7 and iProver, and 1 by SnakeForV4.7. A portfolio approach does not help much, with a combination of Vampire 4.6 using 92 s and E using 28 s solving 167 problems.
Table 8 summarizes the results of the UEQ division. The new version of Twee won, with the previous version that won CASC-28 solving one less problem (which led somebody to joke that the new version number should have been 2.4+1 instead of 2.4.1 ☺). In comparison to CASC-28. where Twee 2.4 won with 227 problems solved and E 2.6 was the runner up with 195 problems solved, the new E 3.0 did relatively better. The developer of E believes that the improved performance is due to E’s new support for multicore-scheduling, which allows it to run its search processes longer – this is described in the short description of E in Section 4. As in other divisions, the demonstration system SnakeForV4.7 outperformed the underlying Vampire system. The rather weak performance of Toma is due to it being a newly developed system. The developer explained that “many modules (e.g., rewriting, unification, proof reconstruction, etc.) are implemented too naively”, and that he is “planning to implement term indexing”. Toma will return for CASC-29 and hopefully it will perform much better – developers consistently learn and improve their systems through participation in CASC.
UEQ division results
UEQ division results
Twee, E, and to a lesser extent SnakeForV4.7, had the best SoTACs, which predicts that they would do well in a portfolio – this is confirmed below. These systems also had the best efficiencies, with Drodi also having a high efficiency thanks to its low average time taken. With the exception of Toma, all the systems had high core usage, even more so than in other divisions. See Section 5.1 for a discussion of this phenomenon.
There were 46 new problems in the UEQ division, 28 of which were created by encoding non-UEQ problems with Horn clause normal forms to UEQ form [10], 8 are relational lattice theory problems [28], 4 are problems in MV algebra [20], and 2 are in knot theory [19]. (Although these problems were contributed to the TPTP by the developer of Twee, this did not seem to give Twee an unfair advantage in the competition.) All of the new problems were solved by some system – four by only E and three by only the Twees. E solved the most new problems, which contributed to its strong performance in the division.
The individual problem results show that 7 problems were unsolved, 97 problems were solved by all the systems except Toma, and 25 problems were solved by only one system (8 problems were solved by only the two versions of Twee, and are counted as unique solutions for Twee 2.4.1). Of the 25 unique solutions, 14 were by E, 8 by Twee 2.4.1, 2 by SnakeForV4.7, and 1 by Twee 2.4. A portfolio approach works well – a simple portfolio giving 30 s to each of Twee 2.4.1, E, SnakeForV4.7, and GKC would solve 235 problems.
The left half of Table 9 summarizes the results of the SLH division in CASC-J11 (see below for a discussion of the right half of the table). E and Ehoh (the CASC-28 winner) solved the most problems, closely followed by the demonstration division system Zipperposition.
There is a developmental lineage from Zipperposition, to Ehoh, to E: Zipperposition was originally developed as a clone of E (written in OCaml) to experiment with types, sorts, and induction [12]. Over time Zipperposition was adopted as a testbed for new techniques for embedding higher-order logic into a saturation-based ATP system [3,50]. Ehoh resulted from porting some of those techniques into a separate branch of the E code base, written in C. The branch was regularly merged into E, and in CASC-28 both systems were built from nearly the same code base. For CASC-28 Ehoh was kept separate from E to ensure that the contributions of Ehoh’s developer were clearly visible, which turned out to be a wise decision – Ehoh won the CASC-28 SLH division! By CASC-J11 E had added new features not implemented in Ehoh. In particular E can now reason in full higher-order logic, which results in a potentially larger search space, but is offset by better heuristics – see the short description of E in Section 4. Ehoh can do only lambda-free higher-order logic with lambda-lifting, but that is enough for nearly all the SLH problems. Zipperposition was entered in the CASC-J11 demonstration division because the entrant was concerned about a conflict of interest – the SLH problems were generated in the same group that developed the theory and implementation of Zipperposition, and Zipperposition could have benefited from testing and training on those (and similar) problems. In contrast, Ehoh and E were not tested and trained (specifically, at least) on these problems, and the systems’ entrants did not feel a conflict of interest. Further, the historical linkage from Zipperposition to Ehoh and E would have impacted both systems similarly, due to the shared code base. Despite the shared heritage, the differences between E and Ehoh resulted in them each solving 32 problems that were not solved by the other (so they solved the same number of problems). The developer of E believes that the different unsolved problems were too hard for each system’s inference rules and chosen strategy schedule.
SLH division results
SLH division results
There is quite a range of average times taken, with cvc5 having the lowest and Zipperposition having the highest. The latter is largely due to the implementation in OCaml, vs. the other systems’ implementations in C/C++. For cvc5 the low average time taken results in an efficiency comparable to the top systems. Only Vampire had a lower efficiency, which reflects Vampire’s emphasis on multi-core performance that does not help in the SLH division where a CPU time limit is imposed.
The individual problem results show that 3 problems were unsolved, 459 problems were solved by all the systems, and 14 problems were solved by only one system. Of the 14 unique solutions, 4 were by each of E, Vampire, and cvc5, and 2 by Ehoh. A portfolio approach works well here – a portfolio of all five systems, giving each 6.0 s, would solve 708 problems.
Recall from Section 2.2.2, the same 720 SLH problems from CASC-28 were used in CASC-J11. The computing environment was also unchanged, so a comparison of the CASC-J11 and CASC-28 results is meaningful and interesting. The results in the right half of Table 9 are from CASC-28. As is expected, Ehoh solved the same number of problems. Vampire improved slightly, and cvc5, which did not change version number since CASC-28, improved quite dramatically. The cvc5 developer explained that the system had not really “improved”, but rather that in CASC-28 it was mistakenly configured for first-order logic, which was corrected in CASC-J11. In contrast, the new version of Zipperposition performed worse than the previous version that ran in CASC-28. In terms of individual results, the 459 problems solved by all the systems in CASC-J11 contrasts with the 208 problems in CASC-28. This increase is due to the improved performance of cvc5, and also by the absence of Leo-III which solved only 499 problems in CASC-28. A portfolio approach also worked well in CASC-28.
Table 10 summarizes the results of the LTB division. The Vampires solved the most problems, demonstrating the efficacy of their dynamic strategy tuning: The idea is to start by attempting all versions of all problems with a short time limit, to get an initial set of problems solved. Following that an auto-tuning mode starts, repeatedly iterating through the remaining unsolved problems, randomly selecting a version of an unsolved problem and randomly selecting a strategy for the version, weighted according to how successful the strategies have been so far. The strategy weighting is updated dynamically by giving a successful strategy a weight proportional to how many previous strategies failed to solve the problem version, and giving an unsuccessful strategy a constant penalty. Over time the successful strategies are given higher priority, and different (successful) strategies are used in subsequent attempts on unsolved problems. For CASC-J11 Vampire 4.7 added a slight variation that introduced a fresh strategy with a 1% probability at each attempt, by randomly permuting an existing strategy. If the fresh strategy solved the problem it was kept in the pool of strategies, otherwise it was discarded. The purpose was to inject some randomness into very long runs, but the developer reported that he has “no idea if it actually helped!”.
The selected problems and problem versions were evidently harder for the systems than in CASC-28, where Vampire 4.6 solved 78%, iProver 3.5 70%, and E 2.6 68% of the problems, compared to 47%, 43%, and 41% respectively in CASC-J11. Polymorphism continues to be a challenge for the ATP systems – see the CASC-28 report [47] and the discussion of Table 11 below. In CASC-28 Vampire 4.6 solved 262 problems by finding that they had contradictory axioms. This year Vampire 4.7 solved 146 by contradictory axioms. Of the 146, 41 were TF0, 30 TF1, 66 TX0, 5 TH0, and 4 TH1. Contradictory axioms is a recurring “feature” of problems generated by Sledgehammer, where contradictory axioms can appear in proofs of negation, proofs by contradiction, and proofs by case analysis where some cases do not apply.
LTB division results
LTB division results
Table 11 shows the numbers of each version of the problems solved. There were 4532 problems solved across all systems and problem versions. There is quite a high degree of specialization to specific problem versions, with Vampire 4.7 having the broadest range. The most problems were solved in the lowest logic, with iProver solving all its 3446 problems in TF0. The benefit of being able to solve problems in higher logics is clear, and there appears to be some space for improvement by adding TX1 capability.
Problems solved for each version
The Vampires’ dynamic strategy tuning approach meant that most problems and problem versions were attempted more than once; only 40 problems were attempted (and thus solved) only once, six problems were solved in the 77th attempt, and one problem was attempted (but not solved) 78 times. Figure 1 shows the accumulated number of problems solved by the number of attempts. While over half of the solutions were found by the 12 attempt, the figure shows how the number of problems solved keeps increasing, up to the six solved in the 77th attempt. The multiple attempts on problems is also reflected in Vampire 4.7’s accumulated CPU times across the solved problems, from 1155 problems accumulating less than 1 s of CPU time, up to 2 problems that accumulated just over 4000 s of CPU time – see the performance graph for the LTB division.9

Vampire 4.7 problems solved by attempts.
The individual problem results show that 3468 problems were unsolved, 2552 problems were solved by all the systems, and 919 problems were solved by only one system (263 problems were solved by only the two versions of Vampire, and are counted as unique solutions for Vampire 4.7). Of the 919 unique solutions, 425 were by Vampire 4.7, 175 were by Vampire 4.6, 169 by iProver, and 150 by E. As the systems have different approaches to solving the problems, a portfolio approach that shares the available time cannot be (easily) considered. However, Table 11 shows that giving all systems the full 172800 s solves 4532 problems. Interestingly, just Vampire 4.6 and iProver 3.6 together would solve 4141 problems, and adding Vampire 4.7 to that mix would solve 4382 problems.
As was noted in Sections 3.1, 3.2, 3.3, and 3.4, there were three systems of particular interest in CASC-J11: E 3.0 for its somewhat astonishing new THF capability and strong performance in FOF; SnakeForV4.7 1.0’s for its powerhouse performances in TFA, FOF, and FNT; iProver 3.6 for its new TFA capability and features described in the IJCAR best student paper [17]. These short descriptions of the systems were written by their entrants.
E [37] is a saturating theorem prover. It was originally built for first-order clausal logic with equality, but over time has been extended to full first order logic, many-sorted first-order logic with first-class Booleans, λ-free higher order logic [52], and most recently full higher-order logic [4,50]. E implements a pipeline of preprocessing (including clausification), the main DISCOUNT saturation loop, and post-processing to extract a TPTP-compliant proof. While the higher-order superposition calculus [4] is complete, its implementation in E is not because (i) some inference rules of the calculus that seem to be rarely useful (based on experiments with the Zipperposition system that implements the calculus) have not been implemented, and (ii) the higher-order unification procedure computes at most a limited number of unifiers, because computing and using all (potentially infinitely many) unifiers is not easily supported in the given clause loop. Thus infinitary inferences are not supported, and instead the finite set of unifiers returned by the unification procedure is used immediately, with each unifier potentially resulting in the creation of a new clause. Proof search in E is controlled primarily by a literal selection strategy, a clause selection heuristic, and a simplification ordering. The prover supports a large number of pre-programmed literal selection strategies. Clause selection heuristics can be constructed on the fly by combining various parameterized primitive evaluation functions, or can be selected from a set of predefined heuristics. Clause evaluation heuristics are based on symbol-counting, but also take other clause properties into account. Several parameterized instances of Knuth-Bendix-Ordering (KBO) and Lexicographic Path Ordering (LPO) [36] are supported for term ordering, which can be lifted in different ways to literal orderings. For CASC-J11 E used a multi-core strategy-scheduling automatic mode that supports two-tiered schedules. First, based on properties of the original input problem, a number of preprocessing strategies are selected. Each of these is scheduled onto one or more cores. After preprocessing, each process reclassifies the problem based on the resulting CNF problem, and runs a schedule of search strategies on its assigned core. E is implemented in C, based on efficient data structures and indexing techniques, and portable to most modern operating systems.
SnakeForV4.7 is a strategy discovery and schedule construction tool applied to Vampire 4.7 [27]. Snake samples the space of available prover strategies, locally optimizes the discovered solutions, builds a performance database of the optimized strategies evaluated over a problem set, and finally constructs a schedule in a greedy heuristic fashion. Snake treats Vampire as a Las Vegas randomized algorithm [40], and uses instruction limiting (as opposed to time limiting) for precise and robust resource control. Currently the strategy space sampling is uniform. When Vampire solves a problem for the first time, or solves a problem using less instructions than in any previous solution, a local optimization process takes over. This process takes the new strategy-problem pair
iProver [15,25] is a theorem prover for first-order logic, with support for arithmetic. iProver combines an instantiation calculus Inst-Gen [26] with ordered resolution and superposition calculi in an abstraction-refinement framework [22]. Groundings of clauses generated by first-order calculi are exchanged with SAT (MiniSAT [18]) and SMT (Z3 [13]) solvers; ground models returned by these solvers are used to guide instantiations. Recent additions to iProver were focused on refinements of superposition, in particular on extending methods from term rewriting and Knuth-Bendix completion from unit equalities into full first-order logic. These methods include ground joinability, connectedness, encompassment demodulation, and AC normalisation [16,17]. These techniques improved overall performance of iProver for both unit and non-unit equational problems, and thereby solving several problems previously unsolved by any prover. For problems containing arithmetic, iProver uses clausification and theory axioms provided by Vampire [27,32], and submits ground queries to Z3 for sub-tasks such as normalisation and simplification of terms containing interpreted symbols, and checking the consistency of ground clauses modulo theories. iProver runs schedules consisting of several Inst-Gen, superposition, and resolution components, which can share clauses. For example, lemmas judged useful in superposition can be passed to Inst-Gen where they can be used for simplifications and instantiation inferences. Schedules are trained using a machine learning framework HOS-ML, which significantly increases the number of problems solved and reduces the time taken, compared to a naïve schedule [23]. iProver is implemented in OCaml.
Core usages
Core usages
Scientific stimulation
One aspect of ATP that has been stimulated by recent CASCs is the use of multi-core search strategies.10
Actually, there is a historical social note here: The developer of iProver has long been a proponent of wall clock time limits and multi-core approaches. He lobbied the CASC organizer on this topic as far back as CASC-25 in 2015, in particular after the CASC dinner that year. That was one influence that led to wall clock time limits being introduced in CASC-J10 in 2020.
At the second Olympics Games session at FLoC, multiple competitions’ organizers presented their results and handed out the winners’ medals that were provided by the games chair, Dana Drachsler Cohen. For CASC-J11, medals for the seven division winners were given out. Additionally, the audience was reminded that CASC is as much about stimulation as it is about evaluation, and the Goéland team received the “best newcomer” trophy recognizing their stimulating new participation in CASC-J11, including a system description in the IJCAR conference [9].
In addition to being a scientific evaluation of the relative capabilities of ATP systems, CASC provides social stimulation for ATP research with an inspiring environment for personal interaction between ATP researchers – see Section 1. This is done in several ways, starting with the CASC dinner on the evening before the competition. The CASC dinner offers a relaxed atmosphere in which developers can discuss their research, talk about their systems, and learn from each other. As Stephan Schulz (the developer of E) once said: “You can learn more from people sitting around der Tisch at a CASC dinner than you do from reading their papers or listening to their talks”. The CASC-J11 dinner was attended by 26 entrants, associates, organizers, panel members, and spouses.
A highlight of the CASC dinner is the revelation of the CASC T-shirts that are included in the CASC registration fee. The CASC T-shirts are well known in the CADE/IJCAR community, and provide a way to identify the people involved with CASC during and after the conference, leading to stimulating interactions. The CASC-J11 T-shirt design can be seen at
To further stimulate conversations and insights into the ATP systems in the competition, before the competition starts entrants and other conference attendees are invited to buy a “CASC sweepstakes” ticket. Each ticket names a random ATP system in the FOF division, and the challenge is to predict how many problems the “horse” will solve. This activity draws in observers, thus exposing the ATP systems within and beyond the ATP community – talking to the system developer is one good way to get a good estimate of the system’s expected performance!
Conclusion
CASC-J11 was the twenty-seventh annual evaluation of fully automatic, classical logic, Automated Theorem Proving (ATP) systems. CASC-J11 fulfilled its objectives by evaluating the relative capabilities of ATP systems, and stimulating development and interest in ATP. The highlights of CASC-J11 were: good core usage by most systems, definite advancement in some systems, and another fascinating LTB division.
While the design of CASC is mature and stable, each year’s experiences lead to ideas for changes and improvements. Changes being planned for CASC-29 are:
The LTB division will go on hiatus, because the 14 LTB divisions since CASC-24 in 2008 have adequately achieved the goal of stimulating development of ATP systems that are effective in the LTB setting, e.g., their use in interactive theorem proving [6,24,49]. Of course more progress is always possible, and the LTB division is likely to return from hiatus in a future edition of CASC. Typed First-order Non-theorem (TFN) division will return from hiatus, using monomorphic typed first-order problems without arithmetic. This is motivated by the observed importance of model finding in applications, including verification [14], checking the consistency of axiomatizations [38], and finding a solution to a problem that is coded as a model finding problem [54]. Noting the observations in Section 5.1, the measurement of core usage will be modified to avoid inaccuracies introduced by problems that are solved very quickly, e.g., one option might be to measure core usage over only those problems that take at least 1 s of wall clock time.
As always, the ongoing success and utility of CASC depends on ongoing contributions of problems to the TPTP. The automated reasoning community is encouraged to continue making contributions of all types of problems.
