Abstract
The CADE ATP System Competition (CASC) is an annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems. CASC-J7 was the nineteenth competition in the CASC series. Twenty-four ATP systems and system variants competed in the various competition and demonstration divisions. An outline of the competition design, and a commentated summary of the results, are presented.
Introduction
The CADE ATP System Competition (CASC) is an annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems – the world championship for such systems. One purpose of CASC is to provide a public evaluation of the relative capabilities of ATP systems. Additionally, CASC aims to stimulate ATP research, to motivate development and implementation of robust ATP systems that are useful and easily deployed in applications, to provide an inspiring environment for personal interaction between ATP researchers, and to expose ATP systems within and beyond the ATP community. Fulfillment of these objectives provides insight and stimulus for the development of more powerful ATP systems, leading to increased and more effective use.
CASC-J7 was held on 20th July 2014 in Vienna, Austria, as part of the 7th International Joint Conference on Automated Reasoning, which was in turn part of the Federated Logic Conference (FLoC) component of the 2014 Vienna Summer of Logic. The “FLoC Olympic Games” hosted 14 competitions in various areas of logic and reasoning, which provided the CASC-J7 entrants the opportunity to observe and interact with entrants with other types of reasoning tools. CASC-J7 was the nineteenth competition in the CASC series; see [13] and citations therein for information about individual previous competitions. Twenty-four ATP systems and system variants, listed in Table 1, competed in the various divisions of CASC-J7. The division winners of CASC-24 (the previous CASC) were automatically entered to provide benchmarks against which progress can be judged.1
In the THF division (see Section 2.1 for details of the divisions) the CASC-24 runner-up, Satallax 2.7, was entered instead of the CASC-24 winner, Satallax-MaLeS 1.2, because it was too hard to install Satallax-MaLeS 1.2 on the new competition computers. This had the useful side effect of providing a comparison between the plain Satallax and the MaLeS enhanced version – see Section 3.1. As the UEQ division had been suspended since CASC-23 in 2011, the CASC-23 UEQ winner was entered.
The ATP systems and entrants
CASC-J7 was organized by Geoff Sutcliffe, and overseen by a panel consisting of Bernhard Beckert, Maria Paola Bonacina and Aart Middeldorp. The competition was run on computers provided by the StarExec project [7], at the University of Iowa.
This paper is organized as follows: Section 2 outlines the design and organization of CASC-J7. Section 3 provides a commentated summary of the results. Section 4 contains short descriptions of three of the ATP systems. Section 5 concludes and discusses plans for future CASCs.
Divisions and problem categories
Divisions and problem categories
Important changes for CASC-J7 were:2
This summary of changes is most useful for readers who are already familiar with the general design of CASC. Other readers can skip to Section 2.1.
The competition was run on StarExec. Systems had to be delivered as StarExec installation packages.
Systems had to use the SZS ontology and standards [9] for reporting their results.
The LTB division took a one year hiatus, to simplify the move to StarExec.
The UEQ division returned from its three year hiatus.
CASC is divided into divisions according to problem and system characteristics. There are competition divisions in which the systems are explicitly ranked, and a demonstration division in which systems demonstrate their abilities without being ranked. (The demonstration division is for systems that cannot be entered into the competition divisions for any reason, e.g., the system runs on specialist hardware, or the entrant is a competition organizer or panel member.) In CASC-J7 iProver 1.4 ran in the demonstration division because it was entered after the system registration deadline. Each competition division uses problems that have certain logical, language, and syntactic characteristics, so that the systems that compete in the division are, in principle, able to attempt all the problems in the division. Some divisions are further divided into problem categories that make it possible to analyze, at a more fine-grained level, which systems work well for what types of problems. The demonstration division uses the same problems as the competition divisions – the entry specifies which competition divisions’ problems are to be used. Table 2 catalogs the divisions and problem categories of CASC-J7. The example problems can be viewed online at
System execution and evaluation
The ATP systems were delivered to the competition organizer as StarExec installation packages, which the organizer installed and tested on StarExec. Source code was delivered separately for archiving on the competition web site.
The ATP systems entered into CASC are required to be sound and fully automatic. They are executed as black boxes, with one problem file name and any entrant specified switches as command line parameters. The command line switches have to be the same for all problems in each division. The systems are tested for soundness by submitting non-theorems to the systems in the THF, TFA, FOF, EPR and UEQ, divisions, and theorems to the systems in the FNT and EPR divisions. Claiming to have found a proof of a non-theorem or a disproof of a theorem indicates unsoundness. For the first time in many years, no systems failed this test in CASC-J7. While this is hopefully a sign of improved systems, more extensive soundness testing might be more revealing, and this is planned for CASC-25 in 2015.
The THF, TFA EPR and UEQ divisions were ranked according to the number of problems solved, but not necessarily accompanied by a proof or model (thus giving only an assurance of the existence of a proof or model). The FOF and FNT divisions were ranked according to the number of problems solved with an acceptable proof/model output.3
The models are “counter-models” for problems with an unprovable conjecture, and “models” for satisfiable sets of axioms.
Numbers of eligible and used problems
In addition to the ranking criteria, two other measures were made and presented in the results: The state-of-the-art (SoTA) contribution quantifies the unique abilities of each system. For each problem solved by a system, its SoTA contribution for the problem is the inverse of the number of systems that solved the problem, so that if a system is the only one to solve a problem then its SoTA contribution for the problem is 1.00, and if all the systems solve a problem their SoTA contribution for the problem is the inverse of the number of systems. A system’s overall SoTA contribution is its average SoTA contribution over the problems it solved. The efficiency measure is a combined measure that balances the time taken for each problem solved against the number of problems solved. It is the average solution rate over the problems solved (the solution rate for one problem is the inverse of the time taken to solve it), multiplied by the fraction of problems solved.
The problems were taken from the Thousands of Problems for Theorem Provers (TPTP) problem library [10], v6.1.0. The TPTP version used for CASC is not released until after the competition, so that new problems have not been seen by the entrants. The problems are given to the ATP systems in TPTP format, with no preprocessing.
TPTP problems that are tagged as “biased” are excluded from CASC, i.e., the problems cannot be designed specifically to be suited or ill-suited to some ATP system, calculus or control strategy. The problems’ TPTP difficulty ratings [14] are then used to determine the eligibility of problems, with the aim of using problems that differentiate the systems. The difficulty rating is based on ATP system performance data in the Thousands of Solutions from Theorem Provers (TSTP) solution library [11], and ranges from 0.00 (easiest) to 1.00 (unsolved). Problems with difficulty rating 0.21–0.99 are eligible.
The numbers of problems used in each division and problem category are roughly proportional to the numbers of eligible problems. These numbers are subject to a limitation on the number of very similar problems in each division and problem category [8]. In TPTP v6.1.0 there are very few TFR and TFE problems that are not “very similar”, and therefore 20 problems that were very similar to others had to be selected for use. The problems used are randomly selected from the eligible problems at the start of the competition, based on a seed provided by the competition panel. The random selection is biased to select (if possible) at least 50% new problems in each division and category. In CASC-J7 only the TFI problem category of the TFA division had sufficient new problems to reach 50%. Table 3 gives the numbers of eligible problems, the maximal numbers that could be used after taking into account the limitation on very similar problems (in the context of the numbers of problems used), and the numbers of problems used, in each division and category.
Resources
The competition computers each had two quad-core Intel® Xeon® E5-2609, 2.40 GHz CPUs, 256 GB memory, and ran the Red Hat Enterprise Linux Workstation release 6.3 (Santiago) operating system, kernel 2.6.32-431.1.2.el6.x86_64. Demonstration division systems can run on the competition computers, or the computer(s) can be supplied by the entrant. The CASC-J7 demonstration division system used the competition computers.
A 300 s CPU time limit was imposed for each problem. A wall clock time limit of 600 s was also imposed to limit very high memory usage that causes swapping.
Results
THF division results
THF division results
TFA division results
Table 4 summarizes the results of the THF division. Satallax-MaLeS is a strategy scheduling wrapper around Satallax. The strategies are generated using the MaLeS framework, which provides functions that predict how long the reasoning component might need to solve a given problem when using each strategy. The plain Satallax was not the winner at CASC-24, but came second by 3 problems out of 150, to the previous version of Satallax-MaLes. It is noteworthy that all the systems except agsyHOL and cocATP rely to some extent on translation to first-order form and calling the E prover (whose strong performance in the FOF division is noted in Section 3.3).
The SoTA contribution values are all very similar, indicating that all the systems have similar degrees of unique capability. Of the top systems, Satallax-MaLeS and Isabelle have lower micro-efficiencies, reflecting their use of strategy scheduling that leads to higher average CPU times. The systems’ abilities to solve the new problems, and their performances in the two problem categories, align with their overall performances, indicating no specific tuning to existing TPTP problems, or specialization to problems with or without equality.
FOF division results
FOF division results
Table 5 summarizes the results of the TFA division. The winner, CVC4, is a newcomer to the TFA division. It did not compete in the TFA division of CASC-24 because it did not have a TFF parser then. In the last year a TFF parser was added, and a variety of new techniques were implemented. These include a newly developed technique for finding conflicting instances of quantified formulae, and methods for recognizing quantified formulas whose relevant instantiations fall within a bounded integer range. A “weakness” of CVC4 is that it cannot solve problems that involve non-linear arithmetic. There were 13 such problems in CASC-J7, 12 in the TFI problem category and 1 in the TFE problem category. A brief system description of CVC4 is provided in Section 4.
The old version of SPASS+T, which won CASC-24, was beaten by Princess, and was differentiated from Beagle and the new version of SPASS+T only by average CPU time. In CASC-24 the old version of SPASS+T solved more problems than Princess and Beagle, indicating that there have been some improvements in those two systems in the last year. The new version of Princess has more efficient data structures, a built-in solver for non-linear integer arithmetic, better heuristics to instantiate quantified axioms, and an updated strategy portfolio. The new version of Beagle has a chaining rule, automatic selection of orderings, aggressive simplification of background terms and clauses, splitting to reduce the size of instances generated by Cooper’s algorithm, preprocessing based on Fourier–Motzkin elimination and equality rewriting, a heuristic to find unsatisfiable cores of theory formulas for backtracking, much more caching, basic clause indexing, and a new auto strategy that uses two strategies in sequence, each allotted half of the time. Brief system descriptions of Princess and Beagle are provided in Section 4.
SPASS+T relies on Yices as an underlying solver for arithmetic, and unfortunately 16 of the problems use reserved Yices keywords (
The rankings in the three problem categories are quite different from the overall ranking. Highlights include the strong performance of Zipperposition in the TFI category (compared to a complete inability to deal with rationals or reals), and Beagle’s strong performance in the TFR and TFE categories. It is also noteworthy that Zipperposition solved the most of the 50 (all TFI) new problems. Thirty-one of the new problems are about data structures that contain integers, and 18 are software verification problems exported from the Toccata project.4
FNT division results
EPR division results
Table 6 summarizes the results of the FOF division. The table is ordered by the number of proofs output, which was the ranking order for the FOF division. There was no new version of Vampire this year, but the two-year old Vampire 2.6 still won the division. The next three systems are all based on the E prover. E.T. runs preprocessing tools that are targeted mainly at problems with many redundant axioms, followed by the E prover with specific strategies. VanHElsing implements a new automatic mode for E, using machine learning techniques to predict a search strategy is most likely to lead to a proof. The gap between the Vampire/E systems and the lower ranked systems has been evident for several years now. The amount of experience and implementation effort required to develop such powerful ATP systems makes it difficult for others to reach their level of performance.
There were three problems for which Vampire did not output a complete proof in the 300 s time limit, despite having established the theoremhood well within the time limit. The problems were
The individual problem results show that there were 15 problems that were not solved, and 6 problems that were solved by all systems. There were 18 problems solved by only one system – 14 by Vampire, 2 by E.T., and 1 each by E and iProver. Vampire solved more problems than the union solved by all the other systems, and has the highest SOTA contribution and efficiency. The dominance of Vampire means that a portfolio approach does not yield much gain – a portfolio of Vampire and E given 150 s each would solve 378 problems.
The FNT division
Table 7 summarizes the results of the FNT division. The division was won by last year’s winner, with the other three systems trailing reasonably far behind. The new version of iProver in the demonstration division also outperformed the other three competition division systems, but did not improve on last year’s winner. The reason for this apparent lack of progress is discussed in the conclusion – Section 5. However, it is noteworthy that the other systems, E in particular, have significantly lower average CPU times, which in combination with the numbers of problems solved resulted in higher efficiencies for CVC4 and E. It is pleasing to note that all the systems output models.
The individual problem results show that 14 problems were solved by only one system – 8 by E, 4 by Crossbow and 2 by iProver. E’s relatively large number of unique solutions is the reason for its high SOTA contribution. A portfolio of iProver, Crossbow and E, given 100 s each, would solve 172 problems. A portfolio with unequal allocation of CPU time, giving iProver 201 s, Crossbow 74 s and E 25 s, would solve 181 problems.
The EPR division
Table 8 summarizes the results of the EPR division. iProver continued to demonstrate its strength in this division, having now won it for seven CASCs.
The individual problem results show that there were 13 problems that were not solved, and 85 problems that were solved by all three systems. There were 5 problems solved by only iProver, and 3 by only E. However, there is no gain to be had from running a portfolio of iProver and E with 150 s each, because most of their unique solutions took more than 150 s.
The UEQ division
UEQ division results
UEQ division results
Three systems had particularly salient performances in CASC-J7: CVC4 1.4, Princess 140704 and Beagle 0.9, all in the TFA division. The following brief descriptions of these systems were written by their developers.
CVC4 1.4 [1] is an SMT solver based on the DPLL(T) architecture, that includes built-in support for a variety of background theories. Like other SMT solvers, CVC4 treats quantified formulas using a two-tiered approach, where ground theory solver(s) are used in conjunction with an underlying SAT solver to determine satisfiability of the current set of ground clauses, while a quantifier instantiation module adds additional instances of quantified formulas to this set. For unsatisfiable inputs, CVC4 primarily uses heuristic approaches based on E-matching. This approach is accelerated using a newly developed technique for finding conflicting instances of quantified formulas [3], which significantly reduces the number of instantiations needed for solving unsatisfiable inputs. For satisfiable inputs, CVC4 uses a finite model finding approach, using a theory of finite cardinality constraints [4] to minimize the domain sizes of sorts in candidate models. In this approach, quantifier instantiation can be further constrained by considering only instantiations that refine the current candidate model [5]. CVC4 also uses a variety of strategies that specialize in handling inputs of a certain form, including methods for recognizing quantified formulas whose relevant instantiations fall within a bounded integer range.
Princess 140704 [6] is a theorem prover for first-order logic modulo arithmetic. Princess is designed and implemented using a combination of techniques from first-order theorem proving and SMT. The main underlying calculus is a free-variable tableau, extended with Presburger arithmetic constraints to enable backtracking-free proof expansion, and positive unit hyper-resolution for lightweight instantiation of quantified formulae. Uninterpreted functions are handled using a relational encoding as uninterpreted predicates. Princess tends to perform well for TFI problems because of its integration of built-in proof rules for integer arithmetic reasoning with free variables for handling quantifiers. Real and rational arithmetic are handled using less efficient explicit first-order axioms, leading to worse performance in the TFR and TFE problem categories. Improved APIs for integration into other systems were recently added to Princess, including a uniform interface for definition of new theory solvers. This interface was used in CASC-J7 for adding native support for non-linear integer arithmetic. Princess is written in Scala and runs on any recent Java virtual machine. In CASC, Princess used a static portfolio of configurations chosen based on training with a random sample of problems from TPTP.
Beagle 0.9 [2] is a Scala implementation of the Hierarchic Superposition calculus with Weak Abstraction. The usual superposition rules are extended with rules for splitting variable disjoint clauses and chaining for reasoning with inequalities (currently used as a backup). The first-order reasoning part uses a Discount loop for saturation and includes standard simplification rules, e.g., tautology deletion and subsumption deletion. Beagle integrates several theory solvers using the hierarchic superposition framework: an optimized version of Cooper’s quantifier elimination for linear integer arithmetic, and Fourier–Motzkin elimination for linear rational and real arithmetic theories. These are used in a black-box way to refute clause sets over the respective background theories. Beagle additionally includes a comprehensive set of arithmetic simplification rules that are aggressively applied, and a heuristic instantiation algorithm that is used to evaluate non-linear arithmetic terms. These simplification rules, as well as eager application of theory solvers on arithmetic subformulae in preprocessing, are responsible for quick solutions for many of the problems from the TPTP ARI (arithmetic) domain. Beagle performs well on data structure problems due to an informed selection of orderings, as well as the built in restrictions on inferencing that comes with the hierarchic superposition calculus. On the other hand, Beagle does not perform well on problems with large combinatorial search, for example those in the HWV (hardware verification) domain, which are easy for SMT solvers.
Conclusion
CASC-J7 was the nineteenth large scale competition for fully automatic, classical logic ATP systems. CASC-J7 fulfilled its objectives by evaluating the relative abilities of current ATP systems, and stimulating development and interest in ATP.
For the organizers and entrants, the move to StarExec provided some minor challenges in terms of building StarExec installation packages, and an increased need to comply with TPTP standards (e.g., where to look for
The highlight of CASC-J7 was the progress made in the TFA division. The comparative stability of the other divisions prompted some observers to ask whether those areas had reached a plateau of development, but this is believed to not be the case – one year of stability does not constitute a trend! Rather, the stability is believed to be an artifact of the TPTP problem rating scheme and the CASC problem selection criteria, by which there can be some advantage for systems that are used to populate the TSTP and thus affect the TPTP problem ratings, and hence the CASC problem selection. Many of the top performing systems from CASC-24 were used to populate the TSTP during the transition to StarExec, while some of the new systems were not submitted in time to be used. As a result the top performing systems from CASC-24 obtained some advantage which might have helped them do well again in CASC-J7. It is expected that by CASC-25 in 2015 all the system developers will be familiar with StarExec, and be able to submit their systems for problem rating. This will provide a leveled playing field. The new version of Vampire [15], expected to compete in CASC-25, might also shake things up a bit.
While the design of CASC is mature and stable, each year’s experiences lead to ideas for changes and improvements. Some changes that are being considered for CASC-25 are:
Requiring proof output in the TFA division.
Adding a THF with arithmetic division, or a polymorphic typed first-order form division, or a TFA non-theorem division.
Returning the LTB division from its hiatus.
More extensive soundness testing.
As always, the ongoing success and utility of CASC depends on ongoing contributions of problems to the TPTP. The automated reasoning community is encouraged to continue making contributions of all types of problems.
