Abstract
The CADE ATP System Competition (CASC) is the annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems. CASC-J9 was the twenty-third competition in the CASC series. Twenty-three ATP systems and system variants competed in the various competition divisions. This paper presents an outline of the competition design, and a commentated summary of the results.
Introduction
The CADE ATP System Competition (CASC) [19] is the annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems – the world championship for such systems. One purpose of CASC is to provide a public evaluation of the relative capabilities of ATP systems. Additionally, CASC aims to stimulate ATP research, motivate development and implementation of robust ATP systems that are useful and easily deployed in applications, provide an inspiring environment for personal interaction between ATP researchers, and expose ATP systems within and beyond the ATP community. CASC-J9 was held on 14th July 2018 in Oxford, United Kingdom, as part of the 9th International Joint Conference on Automated Reasoning, which was in turn part of the 2018 Federated Logic Conference (FLoC).1
The “FLoC Olympic Games” hosted 15 competitions in various areas of logic and reasoning, which provided the CASC-J9 entrants with opportunities to observe and interact with developers of other types of reasoning tools.
CASC is divided into divisions according to problem and system characteristics. There are competition divisions in which the systems are explicitly ranked, and a demonstration division in which systems demonstrate their abilities without being ranked (for systems that cannot be entered into the competition divisions for any reason, e.g., the system is experimental, or the entrant is a competition organizer). Each competition division uses problems that have certain logical, language, and syntactic characteristics, so that the systems that compete in the division are, in principle, able to attempt all the problems in the division. Some divisions are further divided into problem categories that make it possible to analyze, at a more fine-grained level, which systems work well for what types of problems. The demonstration division uses the same problems as the competition divisions, and the entry specifies which competition divisions’ problems are to be used. Table 1 catalogs the divisions and problem categories of CASC-J9.
Divisions and problem categories
Twenty-three ATP systems and system variants listed in Table 2 competed in the various divisions. The division winners of CASC-26 (the previous CASC) were automatically entered into the corresponding demonstration divisions, to provide benchmarks against which progress can be judged. Additionally, Prover9 1109a is entered into the FOF division each year, as a fixed point against which progress can be judged. System descriptions of the systems are in [21] and on the CASC-J9 web site.
The ATP systems and entrants
CASC-J9was organized by Geoff Sutcliffe, and overseen by a panel consisting of Martin Giese, Aart Middeldorp, and Florian Rabe. The competition was run on computers provided by the StarExec project [16] at the University of Iowa.
This paper is organized as follows: Section 2 outlines the design and organization of the competition. Section 3 provides a commentated summary of the results. Section 4 contains short descriptions of three of the ATP systems. Section 5 concludes and discusses plans for future CASCs.
The design and organization of CASC has evolved over the years to a sophisticated state. An outline of the CASC-J9 design and organization is provided here. The details are in [21] and on the CASC-J9 web site. Important changes for CASC-J9 were: (This summary of changes is most useful for readers who are already familiar with the general design of CASC. Other readers can skip to Section 2.1.)
The TFR problem category of the TFA division was put into a hiatus state due to the lack of eligible problems. The SLH division was not run, as there was no sign of a changed work profile being produced by the Sledgehammer system. The previous year’s winners were placed into the demonstration division, to ensure that one of the new systems wins. (Systems were not allowed to, and none tried to, circumvent this by making a trivial change, e.g., incrementing the system version number.)
System delivery, execution, and evaluation
The ATP systems entered into CASC are delivered to the competition organizer as StarExec installation packages, which the organizer installs and tests on StarExec. Source code is delivered separately for archiving on the competition web site.
The ATP systems are required to be fully automatic. They are executed as black boxes, on one problem (the non-LTB divisions) or one batch (the LTB division) at a time. Any command line parameters have to be the same for all problems/batches in each division. The ATP systems are required to be sound, and are tested for soundness by submitting non-theorems to the systems in the THF, TFA, FOF, EPR, and LTB divisions, and theorems to the systems in the FNT and EPR divisions. Claiming to have found a proof of a non-theorem or a disproof of a theorem indicates unsoundness. For the three CASCs from CASC-J7 to CASC-J8 no systems failed soundness testing, but last year in CASC-26 five systems were found to be unsound. For CASC-J9 things returned to the pleasing state of no systems failing the soundness testing.
The THF, TFA, FOF, FNT, and LTB divisions are ranked according to the number of problems solved with an acceptable proof/model output. The EPR division is ranked according to the number of problems solved, but not necessarily accompanied by a proof/model (but systems that do output proofs/models are highlighted in the presentation of results). Ties are broken according to the average time taken over problems solved (CPU time or wall clock time, depending on the type of limit in the division). Trophies are awarded to the division winners. Additionally in CASC-J9, Jasmin Blanchette at the Vrije Universiteit Amsterdam contributed a travel prize for the THF division, with expenses covered by the Matryoshka project [2].
In addition to the ranking criteria, three other measures are made and presented in the results: The state-of-the-art contribution (SotAC) quantifies the unique abilities of each system. For each problem solved by a system, its SotAC for the problem is the reciprocal of the number of systems that solved the problem, so that if a system is the only one to solve a problem then its SotAC for the problem is 1.00, and if all the systems solve a problem their SotAC for the problem is the inverse of the number of systems. A system’s overall SotAC is its average SotAC over the problems it solved. The core usage is the average of the ratios of CPU time to wall clock time used, over the problems solved. This measures the extent to which the systems take advantage of multiple cores. The competition ran on quad-core computers, thus the maximal core usage was 4.0. The efficiency measure combines the number of problems solved with the time taken. It is the average solution rate over the problems solved (the solution rate for one problem is the inverse of the time taken to solve it), multiplied by the fraction of problems solved. Efficiency is computed for both CPU time and wall clock time, to measure how efficiently the systems use one core and multiple cores respectively.
Numbers of eligible and used problems
Numbers of eligible and used problems
Problems for the TPTP-based divisions
The problems for the THF, TFA, FOF, FNT, and EPR divisions were taken from the Thousands of Problems for Theorem Provers (TPTP) problem library [18], v7.2.0. The TPTP version used for CASC is not released until after the competition has started, so that new problems in the release have not been seen by the entrants. The problems have to meet certain criteria to be eligible for selection:
The TPTP tags problems that are designed specifically to be suited or ill-suited to a particular ATP system, calculus, or control strategy as biased. Biased problems are excluded from the competition.
The problems have to be syntactically non-propositional.
The TPTP uses system performance data in the Thousands of Solutions from Theorem Provers (TSTP) solution library to compute problem difficulty ratings in the range 0.00 (easy) to 1.00 (unsolved) [22]. Difficult problems with ratings in the range 0.21 to 0.99 are eligible. Problems of lesser and greater ratings might also be eligible in some divisions if there are not enough problems with ratings in that range. In CASC-J9 two FNQ problems of rating 1.00 were made eligible so that 100 problems were eligible for that problem category. Systems can be submitted before the competition so that their performance data is used in computing the problem ratings.
In order to ensure that no system receives an advantage or disadvantage due to the specific presentation of the problems in the TPTP, the problems are preprocessed to strip out all comment lines (in particular, the problem header), randomly reorder the formulae/clauses (
The numbers of problems used in each division and problem category are roughly proportional to the numbers of eligible problems. The problems used are randomly selected from the eligible problems based on a seed supplied by the competition panel:
The selection is constrained so that no division or category contains an excessive number of very similar problems [17].
The selection is biased to select problems that are new in the TPTP version used, until 50% of the problems in each problem category have been selected, after which random selection (from old and new problems) continues. The number of new problems used depends on how many new problems are eligible and the limitation on very similar problems.
Table 3 gives the numbers of eligible problems, the maximal numbers that could have been used after taking into account the limitation on very similar problems (in the context of the numbers of problems used), and the numbers of problems used, in each division and category. There were very few new problems in CASC-J9, because there had been very little growth of the TPTP since v7.1.0. (The ATP community is encouraged to contribute interesting new problems to the TPTP!)
The problems are given to the ATP systems in TPTP format, in increasing order of TPTP difficulty rating.
Problems for the LTB division
The problems for the LTB division are taken from various sources, with each problem category being based on one source. CASC-J9 had one problem category:2
Usually the LTB division has multiple problem categories, but for CASC-J9 it was decided to have just one category, which allowed that category to have a very large number of problems.
Thanks to Ramana Kumar for the export.
The problems are given to the ATP systems in TPTP format, in the natural order of their source, i.e., for CASC-J9 in the order of their export from CakeML. The problem batch was unordered, which meant that ATP systems could attempt the problems in any order, and could attempt each problem multiple times. This provides increased opportunities for sharing logical and control results between proof attempts.
The computers had four Intel® Xeon® E5-2609, 2.40 GHz CPUs (a quad-core chip), 128 GB memory, and ran the Red Hat Enterprise Linux Server release 7.2 (Maipo) operating system, kernel 3.10.0-514.16.1.el7.x86_64.
In the TPTP-based divisions a 300 s CPU time limit was imposed for each problem. A wall clock time limit of 600 s was also imposed to limit very high memory usage that causes swapping.
In the LTB division a 150000 s wall clock time limit was imposed on the CML problem category, giving an average of 30 s per problem. There was no wall clock time limit for each problem, and no CPU time limits (i.e., systems could benefit from using all cores available on the CPUs). Time spent before starting the first problem in the batch, e.g., learning from the training set and pre-loading the common axioms, and time spent between ending a problem and starting the next, e.g., learning from previous proofs, is not part of the time taken on problems. However, the time taken on such tasks is part of the overall time taken for the batch.
Demonstration division systems can run on the competition computers, or the computer(s) can be supplied by the entrant. The CASC-J9 demonstration division systems all used the competition computers.
Results
For each ATP system, for each problem, four items of data were recorded: whether or not the problem was solved, the CPU time taken, the wall clock time taken, and whether or not a proof or model was output. This section summarizes the results, and provides commentary. The result tables below give the number of problems solved in the division, the average time over problems solved, whether or not proofs or models were output, the state-of-the-art contribution, the (micro-)efficiency, the core usage, the number of new problems solved, and the number of problems solved in each problem category. In each table the CASC-26 winner is emphasized. Detailed results, including the systems’ output files, are available from the CASC-J9 web site.
The THF division
Table 4 summarizes the results of the THF division. There were no great improvements to the systems since CASC-26, with Satallax 3.2 solving the most problems again, with Leo-III and LEO-II correspondingly behind. It was rather disappointing that effectively only three systems participated. As one non-entrant said when asked about interest in a THF-LTB division, “Probably won’t participate (THF without LTB is already hard enough!)”, while another long-standing THF entrant said “I don’t think I’ll be entering this year. Instead, I’d rather focus on the next generation of HO ATPs”. Hopefully these entrants and others will return for CASC-27.
THF division results
THF division results
The average CPU times were notably higher than in CASC-26, e.g., Satallax 3.2’s average CPU time in CASC-26 was 22.9 s. An examination of the problems’ difficulty ratings shows that the problems were harder than in CASC-26 – the TNE problems in CASC-26 had an average difficulty rating (at the time) of 0.49, and the TEQ problems had an average difficulty rating of 0.40. In CASC-J9 the averages were 0.53 and 0.43 respectively. This illustrates what is well known in ATP research – a small increase of difficulty can result in much higher CPU times for finding proofs. There were four problems (all from the CSR domain) that LEO-II solved without proofs, because the time limit was reached during proof generation.
The SotACs are quite even, with Leo-III’s being the highest thanks to a high number of unique solutions (as noted in the individual problem analysis below), but no system showed any outstanding distinguishing capabilities. In contrast, the efficiencies are quite varied, with LEO-II having the highest efficiency thanks to its lower CPU times, and Leo-III having lower efficiency due to its higher CPU times. Leo-III’s higher CPU times are due to (i) its implementation in Scala, which adds a fixed overhead for starting the JRE, (ii) running multiple external systems in parallel, and (iii) the Scala implementation doing garbage collection and JIT compilation in parallel with the main thread. On the upside, Leo-III is able to make good use of the cores available for the parallel processing.
The systems’ performances in the two problem categories align with their overall performances, except that Leo-III had a slightly stronger performance on problems with equality. Beyond that there is no specialization to problems with or without equality.
The individual problem results show that 47 problems were unsolved, 172 problems were solved by all the systems, and 30 problems were solved by only one system. Of the 30 unique solutions, 23 were by Leo-III, 4 by LEO-II, and 3 by Satallax 3.2. A portfolio of systems with higher SotAC values is often effective, and here a simple portfolio of Leo-III and Satallax 3.2 giving each 150 s would solve 435 problems.
Table 5 summarizes the results of the TFA division. The winner was the new Vampire 4.3, with the CASC-26 winner Vampire 4.1 close behind. Vampire 4.3 had a significantly lower average CPU time than Vampire 4.1 (as is also reflected in its higher efficiency). Vampire 4.3 improved over Vampire 4.1 thanks to improved theory instantiation [13] and extra heuristics to control the use of theory axioms in proof search [12]. The Vampire developers believe that Vampire 4.3 could have performed even better, as there was a bug in the detection of certain theory symbols in the problem that affected the addition of theory axioms. As a result Vampire 4.3 failed to solve some relatively easy problems, e.g.,
TFA division results
TFA division results
It was noted in the CASC-26 report that some of CVC4’s proofs were vacuous (the output is “
It was a busy year for the CVC4 developers, as CVC4 was entered into and did well in multiple reasoning competitions. In particular, CVC4 had outstanding performances in SMT-COMP (
The individual problem results show that only 1 problem was unsolved, 80 problems were solved by all the systems (40 in each problem category), and 23 problems were solved by only one system. Of the 23 unique solutions, 12 were by CVC4, 6 by Vampire 4.1, 3 by Vampire 4.3, and 2 by Princess. The best portfolio would be to give 100 s to each of CVC4, Vampire 4.1, and Vampire 4.3, which would solve 194 problems, i.e., almost complete coverage of the division. The high fraction of problems solved by all systems, particularly in the TFE problem category, indicates that harder problems might be chosen for the TFA division.
Table 6 summarizes the results of the FOF division. The new Vampire 4.3 beat the CASC-26 winner Vampire 4.2. Excluding Vampire 4.2, Vampire 4.3 also had the lowest average CPU time, highest SotAC, and highest efficiency. The improved performance of Vampire 4.3 is attributed to changed strategy schedules and (according to the developer) “getting lucky with problem selection”, as nothing significantly new was added to help solve FOF problems. The lefthand plot of Fig. 1 shows the ordered solution times for the FOF division. It clearly divides the systems into five groups: the two Vampires; E and CSE_E; CVC4, Leo-II and iProver; leanCoP, nanoCoP, CSC1.1, CSC1.0, and Prover9; and finally Twee and Geo-III. The same grouping is evident in the righthand plot of Fig. 1, which shows the fraction of problems solved, in the order that the problems were presented in the competition. The righthand plot additionally shows a different grouping for problems with lower difficulty ratings: the Vampires, CSE_E, E, CVC4, Leo-III, and iProver solving most of the easy problems; leanCoP, nanoCoP, and the CSEs solving less, and finally Prover9, Twee, and Geo solving the least (with gaps between those three systems). The consistent decrease in success rates after the first 200 problems also confirms that the TPTP problem ratings do reflect increasing difficulty of the problems for the systems.
FOF division results
FOF division results

Ordered solution times and success rates in the FOF division the ordered solution times are for problems solved, individually sorted for each system in ascending order. the success rates are for all problems in the order they were presented, i.e., in non-decreasing order of difficulty rating.
It is admirable that the newcomer CSE_E did so well. It is clearly the case, and it is fully acknowledged, that CSE_E stands on the shoulders of the giant E (version 2.1). CSE_E first tries to solve the problem with E, and if unsuccessful it tries with CSE 1.0. If neither solves the problem then selected clauses inferred by CSE are combined with the original problem’s clauses to form a new problem for E. The majority of CSE_E’s solutions were found by the initial run of E. Another 3 problems were solved by CSE alone, and 11 more by E with CSE clauses added. None of the 3 problems solved by CSE alone, and only one of the 11 problems solved by E with CSE clauses added, were solved by the E 2.2pre entry in the competition. Evidently the clauses inferred by CSE help E. CSE_E was awarded the “best newcomer” prize, and a brief system description of the CSE family of systems is provided in Section 4.
CSE_E, iProver, and Twee each solved a problem for which they did not output a proof. In CSE_E’s case the problem (
Twee and Geo, which were last ranked in the division, had high SotAC values. This demonstrates how quite different calculi can solve some different problems (as is noted in the individual problem analysis below). Twee converts FOF problems to CNF, then if they are Horn converts them to unit equality problems that it attempts to refute using unit equality reasoning. Geo-III is based on geometric resolution. Another system with a different calculus is nanoCoP, which is based on the non-clausal connection calculus. nanoCoP solved 16 problems not solved by its clausal predecessor leanCoP, but due to the additional overhead of dealing with the non-clausal formula structure, altogether slightly fewer problems than leanCoP. As interesting newcomers, brief system descriptions of Twee and nanoCoP are provided in Section 4.
FNT division results
There were three new problems in FOF division, all of which are encodings of simple calculations (e.g.,
Those problems were bugfixed in the TPTP after the competition, but even in their buggy form they were still theorems.
The individual problem results show that 14 problems were unsolved, 1 problem was solved by all the systems, and 17 problems were solved by only one system. Of the 17 unique solutions, 6 were by Vampire 4.3, 3 by Twee, 2 by Vampire 4.2, 2 by CSE_E, 2 by Geo-III, 1 by CVC4, and 1 by Leo-III. A simple portfolio of Vampire 4.3 and CSE_E, giving each 150 s, would solve 467 problems.
Table 7 summarizes the results of the FNT division. Vampire 4.3 came out well ahead. Vampire 4.3’s improved performance came mainly from two new features in its finite model building component: clique detection to detect a minimal model size, and, with a greater effect, blocked clause elimination [5].
As in the THF division the average CPU times were higher than in CASC-26, and again increased problem difficulty was the cause. In this case the average difficulty rating was about the same for the FNQ problem category – 0.46 in CASC-26 and 0.47 in CASC-J9, but significantly higher in the FNN problem category – 0.19 in CASC-26 and 0.40 in CASC-J9. This resulted in increased average CPU times in the FNN problem category, e.g., Vampire 4.1’s average CPU time jumped from 24.4 s to 70.9 s.
Apart from the CPU times, the results parallel those of CASC-26, with the Vampires first, followed by iProver, CVC4, and E. Similarly, the SotACs, efficiencies, and core usages are pretty much the same. Finally, the Vampires again showed stronger performance in the FNQ problem category. The reader is referred to [20] for explanations. As was noted in Section 2.2, two problems of rating 1.0 were made eligible and used in the FNQ problem category. None of the systems solved either of them.
The individual problem results show that 7 problems were unsolved, 7 problems were solved by all the systems, and 6 problems were solved by only one system. Of the 6 unique solutions, 4 were by Vampire 4.3, and 2 by Vampire 4.1. A portfolio approach does not help.
The EPR division
Table 8 summarizes the results of the EPR division. The winner, iProver 2.8, had better performance than its predecessor iProver 2.6. The main improvement in iProver 2.8 was the use of complete lookahead literal selection in the instantiation and resolution modules. If multiple literals are eligible for selection in the given clause, complete lookahead literal selection selects the literal for which the complexity of the inferred clauses, e.g., the number of inferred clauses, the number of symbols in inferred clauses etc., is minimal (similar to [4]). There is a significant gap from iProver and Vampire to the lower ranked systems, illustrating that special techniques are required for the EPR division.
EPR division results
EPR division results
LTB division results
It was pleasing to see that all the systems output proofs and models. However, both iProvers and Vampire solved some EPT problems without proofs. In all cases it was because the time limit was reached during proof production or output. For the one problem for which Vampire did not output a proof (
The individual problem results show that 6 problems were unsolved, no problems were solved by all the systems, and 15 problems were solved by only one system. Of the 15 unique solutions, 10 were by Vampire, 4 by iProver 2.8, and 1 by iProver 2.6. A simple portfolio of iProver 2.8 and Vampire, giving each 150 s, would solve 142 problems.
Table 9 summarizes the results of the LTB division. The winner, MaLARea, was the only system to take full advantage of the features of the division, viz. learning from the training data, making multiple attempts on problems, learning from proofs found, and using all cores. Before starting on the competition problems MaLARea used the training data for learning axiom selection, and augmented that with incremental learning of axiom selection after each competition problem was solved. MaLARea made two passes through the problems (in principle it can make more, but the time limit of the division prevented another pass). In the first pass it solved 828 problems, and in the second pass it solved another 48 problems by using higher time limits for each problem. MaLARea took advantage of the multiple cores by running various k-NN learners/premise-selectors and various E strategies in parallel.
Grackle was the only other system to make use of the training data. It used the training problems (but not the solutions) to develop strategies for E. It then used all the strategies to attempt each problem using E. Four problem-strategy pairs were run in parallel with a 20 s wall clock time limit per pair, to make use of the four cores available.
Vampire 4.0 had a bug that caused it to terminate after 100000 s wall clock time, at which point it had attempted only 3553 of the problems. Its success rate when it terminated was higher than that of Vampire 4.3 – 16% of attempted problems solved vs. 15%. Without the bug it might have outperformed Vampire 4.3.
It is clear that the LTB problems were hard for the systems. The wall clock efficiency values are low compared to the other divisions of the competition, due to the low numbers of problems solved. The wall clock efficiency values are also highly varied. Vampire has higher efficiency due to its low wall clock times, helped by it’s reasonable use of the cores available. Despite making maximal use of the cores available, iProver and Grackle have low efficiencies because of their higher wall clock times.
The individual problem results show that 4069 problems were unsolved, 249 problems were solved by all the systems, and 159 problems were solved by only one system. Of the 159 unique solutions, 131 were by MaLARea, 14 by Vampire 4.0, 13 by Vampire 4.2, and 1 by E. It is hard to judge whether or not a portfolio approach would help here, as the various systems impose different internal time limits on proof attempts.
System descriptions
Three new systems were entered into the FOF division of CASC-J9: the CSE family consisting of CSE 1.0, CSE 1.1, and CSE_E 1.0; nanoCoP 1.1; and Twee 2.2. The following brief descriptions of these new systems were written by their developers.
Conclusion
CASC-J9was the twenty-third large scale competition for fully automatic, classical logic ATP systems. The competition provided exposure for system builders both within and beyond the ATP community, and provided an overview of the implementation state of fully automatic classical logic ATP systems. CASC-J9 fulfilled its objectives by evaluating the relative abilities of current ATP systems, and stimulating development and interest in ATP.
The highlights of CASC-J9 were disappointing turnouts in the THF and TFA divisions (i.e., lowlights), three interesting new systems in the FOF division, and the re-emergence of MaLARea in the LTB division.
While the design of CASC is mature and stable, each year’s experiences lead to ideas for changes and improvements. Some changes that are being considered for CASC-27 are:
Two versions of the FOF and FNT divisions, one with a CPU time limit for each problem (as is currently the case), and another with a wall clock time limit for each problem, to promote the use of all cores on the CPU.
The UEQ (Unit EQuality) division will come out of its hiatus state.
A multi-format LTB category in which each problem is encoded in several of the available TPTP languages – TH1, TH0, TF1, TF0, and FOF. Systems will be able to attempt whichever versions they want, and a solution to any version will constitute a solution to the problem.
As always, the ongoing success and utility of CASC depends on ongoing contributions of problems to the TPTP. The automated reasoning community is encouraged to continue making contributions of all types of problems.
Footnotes
Acknowledgements
Development of the CSE family has been partially supported by the National Natural Science Foundation of China (NSFC) (Grant No.61673320) and the Fundamental Research Funds for the Central Universities in China (Grant No.2682018ZT10).
