The 7th IJCAR automated theorem proving system competition

Abstract

The CADE ATP System Competition (CASC) is an annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems. CASC-J7 was the nineteenth competition in the CASC series. Twenty-four ATP systems and system variants competed in the various competition and demonstration divisions. An outline of the competition design, and a commentated summary of the results, are presented.

Keywords

Automated theorem proving competition

1. Introduction

The CADE ATP System Competition (CASC) is an annual evaluation of fully automatic, classical logic Automated Theorem Proving (ATP) systems – the world championship for such systems. One purpose of CASC is to provide a public evaluation of the relative capabilities of ATP systems. Additionally, CASC aims to stimulate ATP research, to motivate development and implementation of robust ATP systems that are useful and easily deployed in applications, to provide an inspiring environment for personal interaction between ATP researchers, and to expose ATP systems within and beyond the ATP community. Fulfillment of these objectives provides insight and stimulus for the development of more powerful ATP systems, leading to increased and more effective use.

CASC-J7 was held on 20th July 2014 in Vienna, Austria, as part of the 7th International Joint Conference on Automated Reasoning, which was in turn part of the Federated Logic Conference (FLoC) component of the 2014 Vienna Summer of Logic. The “FLoC Olympic Games” hosted 14 competitions in various areas of logic and reasoning, which provided the CASC-J7 entrants the opportunity to observe and interact with entrants with other types of reasoning tools. CASC-J7 was the nineteenth competition in the CASC series; see [13] and citations therein for information about individual previous competitions. Twenty-four ATP systems and system variants, listed in Table 1, competed in the various divisions of CASC-J7. The division winners of CASC-24 (the previous CASC) were automatically entered to provide benchmarks against which progress can be judged.1

¹
In the THF division (see Section 2.1 for details of the divisions) the CASC-24 runner-up, Satallax 2.7, was entered instead of the CASC-24 winner, Satallax-MaLeS 1.2, because it was too hard to install Satallax-MaLeS 1.2 on the new competition computers. This had the useful side effect of providing a comparison between the plain Satallax and the MaLeS enhanced version – see Section 3.1. As the UEQ division had been suspended since CASC-23 in 2011, the CASC-23 UEQ winner was entered.

Additionally, Prover9 2009-11A is entered every year as a fixed point against which progress can be judged (this is the third year that Prover9 has been used as the fixed point – in prior years Otter was used). System descriptions of the entered systems are in [12] and on the CASC-J7 web site. The web site also provides access to all systems and competition resources: http://www.tptp.org/CASC/J7.

Table 1

The ATP systems and entrants

ATP system	Divisions	Entrant (associates)	Entrant’s affiliation
agsyHOL 1.0	THF	Fredrik Lindblad	University of Gothenburg
Beagle 0.9	TFA	Peter Baumgartner (Joshua Bax)	NICTA and ANU
cocATP 0.2.0	THF	Cristóbal Camarero	University of Cantabria
CVC4 1.4	TFA FOF FNT	Andrew Reynolds (Clark Barrett, Cesare Tinelli, Morgan Deters)	EPFL
E 1.9	FOF FNT EPR UEQ	Stephan Schulz	DHBW Stuttgart
E.T. 0.1	FOF	Josef Urban	Radboud University Nijmegen
HOLyHammer 140616	THF	Cezary Kaliszyk (Stephan Schulz, Josef Urban)	University of Innsbruck
iProver 0.9	EPR	CASC	CASC-24 EPR winner
iProver 1.0	FNT	CASC	CASC-24 FNT winner
iProver 1.4	FOF FNT EPR (demo)	Konstantin Korovin	University of Manchester
Isabelle 2013	THF	Jasmin Blanchette (Lawrence Paulson, Tobias Nipkow, Makarius Wenzel)	Technische Universität München
leanCoP 2.2	FOF	Jens Otten	University of Potsdam
LEO-II 1.6.2	THF	Christoph Benzmüller	Freie Universität Berlin
Muscadet 4.4	FOF	Dominique Pastre	University Paris Descartes
Princess 140704	TFA	Philipp Rümmer (Peter Backeman)	Uppsala University
Prover9 2009-11A	FOF	CASC (William McCune, Bob Veroff)	CASC fixed point
Satallax 2.7	THF	CASC	CASC-24 THF runner-up
Satallax-MaLeS 1.3	THF	Daniel Kühlwein (Sil van de Leemput, Frank Dorssers, Wouter Geraedts)	Radboud University Nijmegen
SPASS+T 2.2.19	TFA	CASC	CASC-24 TFA winner
SPASS+T 2.2.20	TFA	Uwe Waldmann	Max-Planck-Institut für Informatik
Vampire 2.6	FOF	CASC	CASC-24 FOF winner
VanHElsing 1.0	FOF	Daniel Kühlwein (Sil van de Leemput, Frank Dorssers, Wouter Geraedts)	Radboud University Nijmegen
Waldmeister 710	UEQ	CASC	CASC-23 UEQ winner
Zipperposition 0.4	TFA FOF	Simon Cruanes	INRIA

CASC-J7 was organized by Geoff Sutcliffe, and overseen by a panel consisting of Bernhard Beckert, Maria Paola Bonacina and Aart Middeldorp. The competition was run on computers provided by the StarExec project [7], at the University of Iowa.

This paper is organized as follows: Section 2 outlines the design and organization of CASC-J7. Section 3 provides a commentated summary of the results. Section 4 contains short descriptions of three of the ATP systems. Section 5 concludes and discusses plans for future CASCs.

2. Outline of design and organization

The design and organization of CASC has evolved over the years to a sophisticated state. An outline of the CASC-J7 design and organization is provided here. The details are in [12] and on the CASC-J7 web site.

Table 2
Divisions and problem categories

Division	Problems	Problem categories
THF	Typed Higher-order Form theorems (axioms with a provable conjecture).	TNE – THF with No Equality, e.g., NUM738^1.
THF		TEQ – THF with EQuality, e.g., SET171^3.
TFA	Typed First-order form theorems with Arithmetic (axioms with a provable conjecture).	TFI – TFA with only Integer arithmetic, e.g., DAT016=1.
		TFR – TFA with only Rational arithmetic, e.g., ARI621=2.
		TFE – TFA with only rEal arithmetic, e.g., MSC022=2.
FOF	First-Order Form theorems (syntactically non-propositional, axioms with a provable conjecture).	FNE – FOF with No Equality, e.g., COM003+1.
FOF		FEQ – FOF with EQuality, e.g., SEU147+3.
FNT	FOF syntactically non-propositional Non-Theorems (axioms with a counter-satisfiable conjecture, and satisfiable sets of axioms).	FNN – FNT with No equality, e.g., KRS173+1.
FNT		FNQ – FNT with eQuality, e.g., MGT033+2.
EPR	Effectively PRopositional theorems and non-theorems in clause normal form (unsatisfiable and satisfiable clause sets). Effectively propositional means syntactically non-propositional but directly reducible to a propositional form, i.e., problems with a finite Herbrand universe, and problems with no variables and no equality.	EPT – Effectively Propositional Theorems (unsatisfiable clause sets), e.g., PUZ037-3.
EPR		EPS – Effectively Propositional non-theorems (Satisfiable clause sets), e.g., GRP123-2.005.
UEQ	Clause normal form non-propositional Unit EQuality theorems (unsatisfiable clause sets).	E.g., RNG026-7.

Important changes for CASC-J7 were:2

This summary of changes is most useful for readers who are already familiar with the general design of CASC. Other readers can skip to Section 2.1.

The competition was run on StarExec. Systems had to be delivered as StarExec installation packages.

Systems had to use the SZS ontology and standards [9] for reporting their results.

The LTB division took a one year hiatus, to simplify the move to StarExec.

The UEQ division returned from its three year hiatus.

2.1. Divisions

CASC is divided into divisions according to problem and system characteristics. There are competition divisions in which the systems are explicitly ranked, and a demonstration division in which systems demonstrate their abilities without being ranked. (The demonstration division is for systems that cannot be entered into the competition divisions for any reason, e.g., the system runs on specialist hardware, or the entrant is a competition organizer or panel member.) In CASC-J7 iProver 1.4 ran in the demonstration division because it was entered after the system registration deadline. Each competition division uses problems that have certain logical, language, and syntactic characteristics, so that the systems that compete in the division are, in principle, able to attempt all the problems in the division. Some divisions are further divided into problem categories that make it possible to analyze, at a more fine-grained level, which systems work well for what types of problems. The demonstration division uses the same problems as the competition divisions – the entry specifies which competition divisions’ problems are to be used. Table 2 catalogs the divisions and problem categories of CASC-J7. The example problems can be viewed online at http://www.tptp.org/cgi-bin/SeeTPTP?Category=Problems.

2.2. System execution and evaluation

The ATP systems were delivered to the competition organizer as StarExec installation packages, which the organizer installed and tested on StarExec. Source code was delivered separately for archiving on the competition web site.

The ATP systems entered into CASC are required to be sound and fully automatic. They are executed as black boxes, with one problem file name and any entrant specified switches as command line parameters. The command line switches have to be the same for all problems in each division. The systems are tested for soundness by submitting non-theorems to the systems in the THF, TFA, FOF, EPR and UEQ, divisions, and theorems to the systems in the FNT and EPR divisions. Claiming to have found a proof of a non-theorem or a disproof of a theorem indicates unsoundness. For the first time in many years, no systems failed this test in CASC-J7. While this is hopefully a sign of improved systems, more extensive soundness testing might be more revealing, and this is planned for CASC-25 in 2015.

The THF, TFA EPR and UEQ divisions were ranked according to the number of problems solved, but not necessarily accompanied by a proof or model (thus giving only an assurance of the existence of a proof or model). The FOF and FNT divisions were ranked according to the number of problems solved with an acceptable proof/model output.3

³
The models are “counter-models” for problems with an unprovable conjecture, and “models” for satisfiable sets of axioms.

The competition panel decides whether or not a system’s proof/model format is acceptable. Ties are broken according to the average time over problems solved. Trophies are awarded to the division winners.

Table 3

Numbers of eligible and used problems

Division	THF		TFA			FOF		FNT		EPR		UEQ

Category	TNE	TEQ	TFI	TFR	TFE	FNE	FEQ	FNN	FNQ	EPT	EPS
Eligible	199	885	351	113	64	360	3391	226	146	193	81	509
Usable	199	705	189	41	39	178	716	129	146	193	77	509
New	0	8	53	0	0	0	6	0	10	10	0	0
Used	150	250	100	50	50	150	250	100	100	125	75	200
New	0	8	50	0	0	0	6	0	10	10	0	0

In addition to the ranking criteria, two other measures were made and presented in the results: The state-of-the-art (SoTA) contribution quantifies the unique abilities of each system. For each problem solved by a system, its SoTA contribution for the problem is the inverse of the number of systems that solved the problem, so that if a system is the only one to solve a problem then its SoTA contribution for the problem is 1.00, and if all the systems solve a problem their SoTA contribution for the problem is the inverse of the number of systems. A system’s overall SoTA contribution is its average SoTA contribution over the problems it solved. The efficiency measure is a combined measure that balances the time taken for each problem solved against the number of problems solved. It is the average solution rate over the problems solved (the solution rate for one problem is the inverse of the time taken to solve it), multiplied by the fraction of problems solved.

2.3. Problems

The problems were taken from the Thousands of Problems for Theorem Provers (TPTP) problem library [10], v6.1.0. The TPTP version used for CASC is not released until after the competition, so that new problems have not been seen by the entrants. The problems are given to the ATP systems in TPTP format, with no preprocessing.

TPTP problems that are tagged as “biased” are excluded from CASC, i.e., the problems cannot be designed specifically to be suited or ill-suited to some ATP system, calculus or control strategy. The problems’ TPTP difficulty ratings [14] are then used to determine the eligibility of problems, with the aim of using problems that differentiate the systems. The difficulty rating is based on ATP system performance data in the Thousands of Solutions from Theorem Provers (TSTP) solution library [11], and ranges from 0.00 (easiest) to 1.00 (unsolved). Problems with difficulty rating 0.21–0.99 are eligible.

The numbers of problems used in each division and problem category are roughly proportional to the numbers of eligible problems. These numbers are subject to a limitation on the number of very similar problems in each division and problem category [8]. In TPTP v6.1.0 there are very few TFR and TFE problems that are not “very similar”, and therefore 20 problems that were very similar to others had to be selected for use. The problems used are randomly selected from the eligible problems at the start of the competition, based on a seed provided by the competition panel. The random selection is biased to select (if possible) at least 50% new problems in each division and category. In CASC-J7 only the TFI problem category of the TFA division had sufficient new problems to reach 50%. Table 3 gives the numbers of eligible problems, the maximal numbers that could be used after taking into account the limitation on very similar problems (in the context of the numbers of problems used), and the numbers of problems used, in each division and category.

2.4. Resources

The competition computers each had two quad-core Intel^® Xeon^® E5-2609, 2.40 GHz CPUs, 256 GB memory, and ran the Red Hat Enterprise Linux Workstation release 6.3 (Santiago) operating system, kernel 2.6.32-431.1.2.el6.x86_64. Demonstration division systems can run on the competition computers, or the computer(s) can be supplied by the entrant. The CASC-J7 demonstration division system used the competition computers.

A 300 s CPU time limit was imposed for each problem. A wall clock time limit of 600 s was also imposed to limit very high memory usage that causes swapping.

3. Results

For each ATP system, for each problem, three items of data were recorded: whether or not the problem was solved, the CPU time taken, and whether or not a proof or model was output. This section summarizes the results, and provides commentary. The result tables below give the number of problems solved, the average time over problems solved, whether or not proofs or models were output, the state-of-the-art contribution, the (micro-)efficiency, and the number of new problems solved. In each of the results tables, the CASC-24 winner is emphasized. Detailed results, including the systems’ output files, are available from the CASC-J7 web site.

Table 4
THF division results

ATP system	THF/400	Avg CPU	Prfs out	SOTA Con.	μ Eff.	New/8	TNE/150	TEQ/250
Sat’x-MaLeS 1.3	333	18.9		0.24	313	8	126	207
Satallax 2.7	324	5.7		0.25	576	8	128	196
Isabelle 2013	316	38.9		0.23	113	5	126	190
LEO-II 1.6.2	269	7.2	✓	0.23	554	7	106	163
agsyHOL 1.0	236	5.8		0.24	449	5	87	149
HOLyH’r 140616	96	25.9		0.23	130	0	27	69
cocATP 0.2.0	74	4.7	✓	0.21	139	1	37	37

Table 5

TFA division results

ATP system	TFA/200	Avg CPU	Prfs out	SOTA Con.	μ Eff.	New/50	TFI/100	TFR/50	TFE/50
CVC4 1.4	179	4.8		0.22	797	33	80	50	49
Princess 140704	176	11.8		0.21	307	35	81	49	46
SPASS+T 2.2.19	173	3.4	✓	0.19	402	30	75	50	48
SPASS+T 2.2.20	173	3.6	✓	0.19	402	30	75	50	48
Beagle 0.9	173	5.5		0.20	623	28	73	50	50
Zipperp’n 0.4	80	6.6		0.27	313	44	80	0	0

3.1. The THF division

Table 4 summarizes the results of the THF division. Satallax-MaLeS is a strategy scheduling wrapper around Satallax. The strategies are generated using the MaLeS framework, which provides functions that predict how long the reasoning component might need to solve a given problem when using each strategy. The plain Satallax was not the winner at CASC-24, but came second by 3 problems out of 150, to the previous version of Satallax-MaLes. It is noteworthy that all the systems except agsyHOL and cocATP rely to some extent on translation to first-order form and calling the E prover (whose strong performance in the FOF division is noted in Section 3.3).

The SoTA contribution values are all very similar, indicating that all the systems have similar degrees of unique capability. Of the top systems, Satallax-MaLeS and Isabelle have lower micro-efficiencies, reflecting their use of strategy scheduling that leads to higher average CPU times. The systems’ abilities to solve the new problems, and their performances in the two problem categories, align with their overall performances, indicating no specific tuning to existing TPTP problems, or specialization to problems with or without equality.

The individual problem results show that 13 problems were unsolved, and 18 problems were solved by only one system. Of those 18 unique solutions, 6 were by Satallax, 5 by agsyHOL, 2 by Isabelle, 2 by Satallax-MaLeS, 2 by LEO-II and 1 by cocATP. The unique solutions suggest that a portfolio approach would be effective for these THF problems. Giving Satallax-MaLeS, Satallax, Isabelle, LEO-II and agsyHOL 60 s each would solve 381 problems.

Table 6
FOF division results

ATP system	FOF/400	Avg CPU	Prfs out	SOTA Con.	μ Eff.	New/6	FNE/150	FEQ/250
Vampire 2.6	375	13.2	372	0.22	571	5	141	234
E.T. 0.1	339	29.3	339	0.18	361	5	115	224
E 1.9	321	22.9	321	0.17	466	0	116	205
VanHElsing 1.0	310	17.3	310	0.17	168	0	111	199
CVC4 1.4	215	46.0	215	0.15	228	0	83	132
leanCoP 2.2	158	55.2	158	0.14	129	0	84	74
Prover9 1109a	95	41.5	95	0.14	119	0	21	74
Zipperp’n 0.4	73	28.8	73	0.13	75	0	26	47
Muscadet 4.4	32	19.7	30	0.12	47	0	6	26
Princess 140704	134	69.3	0	0.13	17	0	47	87
Demonstration division

iProver 1.4	216	18.1	214	0.16	216	6	116	100

3.2. The TFA division

Table 5 summarizes the results of the TFA division. The winner, CVC4, is a newcomer to the TFA division. It did not compete in the TFA division of CASC-24 because it did not have a TFF parser then. In the last year a TFF parser was added, and a variety of new techniques were implemented. These include a newly developed technique for finding conflicting instances of quantified formulae, and methods for recognizing quantified formulas whose relevant instantiations fall within a bounded integer range. A “weakness” of CVC4 is that it cannot solve problems that involve non-linear arithmetic. There were 13 such problems in CASC-J7, 12 in the TFI problem category and 1 in the TFE problem category. A brief system description of CVC4 is provided in Section 4.

The old version of SPASS+T, which won CASC-24, was beaten by Princess, and was differentiated from Beagle and the new version of SPASS+T only by average CPU time. In CASC-24 the old version of SPASS+T solved more problems than Princess and Beagle, indicating that there have been some improvements in those two systems in the last year. The new version of Princess has more efficient data structures, a built-in solver for non-linear integer arithmetic, better heuristics to instantiate quantified axioms, and an updated strategy portfolio. The new version of Beagle has a chaining rule, automatic selection of orderings, aggressive simplification of background terms and clauses, splitting to reduce the size of instances generated by Cooper’s algorithm, preprocessing based on Fourier–Motzkin elimination and equality rewriting, a heuristic to find unsatisfiable cores of theory formulas for backtracking, much more caching, basic clause indexing, and a new auto strategy that uses two strategies in sequence, each allotted half of the time. Brief system descriptions of Princess and Beagle are provided in Section 4.

SPASS+T relies on Yices as an underlying solver for arithmetic, and unfortunately 16 of the problems use reserved Yices keywords (bool and int). As a result SPASS+T crashed on these problems – a simple fix might have given SPASS+T a victory! It is noteworthy that SPASS+T was also the only system to output proofs.

The rankings in the three problem categories are quite different from the overall ranking. Highlights include the strong performance of Zipperposition in the TFI category (compared to a complete inability to deal with rationals or reals), and Beagle’s strong performance in the TFR and TFE categories. It is also noteworthy that Zipperposition solved the most of the 50 (all TFI) new problems. Thirty-one of the new problems are about data structures that contain integers, and 18 are software verification problems exported from the Toccata project.4

⁴
http://toccata.lri.fr/gallery/why3.en.html.

The individual problem results show that there were 53 problems that were solved by all systems, all in the TFI problem category. There were 12 problems solved by only one system, including 4 by CVC4 and 6 by Zipperposition. A portfolio of these two systems giving them 150 s each would solve 196 problems, and adding Beagle to the portfolio so that each system gets 100 s would solve 198 problems.

Table 7

FNT division results

ATP system	FNT/200	Avg CPU	Mdls out	SOTA Con.	μ Eff.	New/10	FNN/100	FNQ/100
iProver 1.0	150	59.1	150	0.30	146	4	79	71
Crossbow 0.1	124	21.7	124	0.32	107	9	77	47
CVC4 1.4	116	24.4	116	0.25	374	6	54	62
E 1.9	82	7.2	82	0.38	292	0	42	40
Demonstration division

iProver 1.4	133	63.4	133	0.29	186	4	77	56

Table 8

EPR division results

ATP system	EPR/200	Avg CPU	Pr out	M out	SOTA Con.	μ Eff.	EPT/125	EPS/75
iProver 0.9	183	23.0	×	✓	0.44	400	116	67
E 1.9	88	31.0	✓	✓	0.36	98	56	32
Demonstration division

iProver 1.4	180	22.8	✓	✓	0.43	405	114	66

3.3. The FOF division

Table 6 summarizes the results of the FOF division. The table is ordered by the number of proofs output, which was the ranking order for the FOF division. There was no new version of Vampire this year, but the two-year old Vampire 2.6 still won the division. The next three systems are all based on the E prover. E.T. runs preprocessing tools that are targeted mainly at problems with many redundant axioms, followed by the E prover with specific strategies. VanHElsing implements a new automatic mode for E, using machine learning techniques to predict a search strategy is most likely to lead to a proof. The gap between the Vampire/E systems and the lower ranked systems has been evident for several years now. The amount of experience and implementation effort required to develop such powerful ATP systems makes it difficult for others to reach their level of performance.

There were three problems for which Vampire did not output a complete proof in the 300 s time limit, despite having established the theoremhood well within the time limit. The problems were PUZ073+1 (theoremhood in 28.7 s), LCL648+1.010 (theoremhood in 3.9 s) and LCL684+1.010 (theoremhood in 5.1 s). For all three problems Vampire splits off propositional parts of the clause set and uses binary decision diagrams on those parts. This results in very large proofs in TPTP format, e.g., the partial proof output for PUZ073+1 is a 1.6 GB file.

The individual problem results show that there were 15 problems that were not solved, and 6 problems that were solved by all systems. There were 18 problems solved by only one system – 14 by Vampire, 2 by E.T., and 1 each by E and iProver. Vampire solved more problems than the union solved by all the other systems, and has the highest SOTA contribution and efficiency. The dominance of Vampire means that a portfolio approach does not yield much gain – a portfolio of Vampire and E given 150 s each would solve 378 problems.

3.4. The FNT division

Table 7 summarizes the results of the FNT division. The division was won by last year’s winner, with the other three systems trailing reasonably far behind. The new version of iProver in the demonstration division also outperformed the other three competition division systems, but did not improve on last year’s winner. The reason for this apparent lack of progress is discussed in the conclusion – Section 5. However, it is noteworthy that the other systems, E in particular, have significantly lower average CPU times, which in combination with the numbers of problems solved resulted in higher efficiencies for CVC4 and E. It is pleasing to note that all the systems output models.

The individual problem results show that 14 problems were solved by only one system – 8 by E, 4 by Crossbow and 2 by iProver. E’s relatively large number of unique solutions is the reason for its high SOTA contribution. A portfolio of iProver, Crossbow and E, given 100 s each, would solve 172 problems. A portfolio with unequal allocation of CPU time, giving iProver 201 s, Crossbow 74 s and E 25 s, would solve 181 problems.

3.5. The EPR division

Table 8 summarizes the results of the EPR division. iProver continued to demonstrate its strength in this division, having now won it for seven CASCs.

The individual problem results show that there were 13 problems that were not solved, and 85 problems that were solved by all three systems. There were 5 problems solved by only iProver, and 3 by only E. However, there is no gain to be had from running a portfolio of iProver and E with 150 s each, because most of their unique solutions took more than 150 s.

3.6. The UEQ division

Table 9 summarizes the results of the UEQ division. The UEQ division had been on a hiatus since CASC-23 in 2001, due to the dominance of Waldmeister since CASC-14 in 1997 – it was getting boring. It was hoped that the hiatus would give other systems a chance to improve to be competitive with Waldmeister, but there is no evidence of that.

Table 9
UEQ division results

ATP system	UEQ/200	Avg CPU	Prfs out	SOTA Con.	μ Eff.
Waldmeister 710	194	11.8	✓	0.57	746
E 1.9	168	13.1	✓	0.50	633

4. System descriptions

Three systems had particularly salient performances in CASC-J7: CVC4 1.4, Princess 140704 and Beagle 0.9, all in the TFA division. The following brief descriptions of these systems were written by their developers.

CVC4 1.4 [1] is an SMT solver based on the DPLL(T) architecture, that includes built-in support for a variety of background theories. Like other SMT solvers, CVC4 treats quantified formulas using a two-tiered approach, where ground theory solver(s) are used in conjunction with an underlying SAT solver to determine satisfiability of the current set of ground clauses, while a quantifier instantiation module adds additional instances of quantified formulas to this set. For unsatisfiable inputs, CVC4 primarily uses heuristic approaches based on E-matching. This approach is accelerated using a newly developed technique for finding conflicting instances of quantified formulas [3], which significantly reduces the number of instantiations needed for solving unsatisfiable inputs. For satisfiable inputs, CVC4 uses a finite model finding approach, using a theory of finite cardinality constraints [4] to minimize the domain sizes of sorts in candidate models. In this approach, quantifier instantiation can be further constrained by considering only instantiations that refine the current candidate model [5]. CVC4 also uses a variety of strategies that specialize in handling inputs of a certain form, including methods for recognizing quantified formulas whose relevant instantiations fall within a bounded integer range.

Princess 140704 [6] is a theorem prover for first-order logic modulo arithmetic. Princess is designed and implemented using a combination of techniques from first-order theorem proving and SMT. The main underlying calculus is a free-variable tableau, extended with Presburger arithmetic constraints to enable backtracking-free proof expansion, and positive unit hyper-resolution for lightweight instantiation of quantified formulae. Uninterpreted functions are handled using a relational encoding as uninterpreted predicates. Princess tends to perform well for TFI problems because of its integration of built-in proof rules for integer arithmetic reasoning with free variables for handling quantifiers. Real and rational arithmetic are handled using less efficient explicit first-order axioms, leading to worse performance in the TFR and TFE problem categories. Improved APIs for integration into other systems were recently added to Princess, including a uniform interface for definition of new theory solvers. This interface was used in CASC-J7 for adding native support for non-linear integer arithmetic. Princess is written in Scala and runs on any recent Java virtual machine. In CASC, Princess used a static portfolio of configurations chosen based on training with a random sample of problems from TPTP.

Beagle 0.9 [2] is a Scala implementation of the Hierarchic Superposition calculus with Weak Abstraction. The usual superposition rules are extended with rules for splitting variable disjoint clauses and chaining for reasoning with inequalities (currently used as a backup). The first-order reasoning part uses a Discount loop for saturation and includes standard simplification rules, e.g., tautology deletion and subsumption deletion. Beagle integrates several theory solvers using the hierarchic superposition framework: an optimized version of Cooper’s quantifier elimination for linear integer arithmetic, and Fourier–Motzkin elimination for linear rational and real arithmetic theories. These are used in a black-box way to refute clause sets over the respective background theories. Beagle additionally includes a comprehensive set of arithmetic simplification rules that are aggressively applied, and a heuristic instantiation algorithm that is used to evaluate non-linear arithmetic terms. These simplification rules, as well as eager application of theory solvers on arithmetic subformulae in preprocessing, are responsible for quick solutions for many of the problems from the TPTP ARI (arithmetic) domain. Beagle performs well on data structure problems due to an informed selection of orderings, as well as the built in restrictions on inferencing that comes with the hierarchic superposition calculus. On the other hand, Beagle does not perform well on problems with large combinatorial search, for example those in the HWV (hardware verification) domain, which are easy for SMT solvers.

5. Conclusion

CASC-J7 was the nineteenth large scale competition for fully automatic, classical logic ATP systems. CASC-J7 fulfilled its objectives by evaluating the relative abilities of current ATP systems, and stimulating development and interest in ATP.

For the organizers and entrants, the move to StarExec provided some minor challenges in terms of building StarExec installation packages, and an increased need to comply with TPTP standards (e.g., where to look for included files). At the same time, the availability of the large number of high-performance compute nodes made it possible to use many more problems – in the THF, TFA, FOF, FNT and EPR divisions that were common to CASC-24 and CASC-J7, the combined number of problems used rose from 800 in CASC-24 to 1400 in CASC-J7. The efforts of the StarExec development team were much appreciated during the preparation and running of CASC-J7!

The highlight of CASC-J7 was the progress made in the TFA division. The comparative stability of the other divisions prompted some observers to ask whether those areas had reached a plateau of development, but this is believed to not be the case – one year of stability does not constitute a trend! Rather, the stability is believed to be an artifact of the TPTP problem rating scheme and the CASC problem selection criteria, by which there can be some advantage for systems that are used to populate the TSTP and thus affect the TPTP problem ratings, and hence the CASC problem selection. Many of the top performing systems from CASC-24 were used to populate the TSTP during the transition to StarExec, while some of the new systems were not submitted in time to be used. As a result the top performing systems from CASC-24 obtained some advantage which might have helped them do well again in CASC-J7. It is expected that by CASC-25 in 2015 all the system developers will be familiar with StarExec, and be able to submit their systems for problem rating. This will provide a leveled playing field. The new version of Vampire [15], expected to compete in CASC-25, might also shake things up a bit.

While the design of CASC is mature and stable, each year’s experiences lead to ideas for changes and improvements. Some changes that are being considered for CASC-25 are:

Requiring proof output in the TFA division.

Adding a THF with arithmetic division, or a polymorphic typed first-order form division, or a TFA non-theorem division.

Returning the LTB division from its hiatus.

More extensive soundness testing.

As always, the ongoing success and utility of CASC depends on ongoing contributions of problems to the TPTP. The automated reasoning community is encouraged to continue making contributions of all types of problems.

References

[1]

Barrett,

Conway,

Deters,

Hadarean,

Jovanovic,

King,

Reynolds and

Tinelli, CVC4, in: Proceedings of the 23rd International Conference on Computer Aided Verification,

Gopalakrishnan and

Qadeer, eds, Lecture Notes in Computer Science, Vol. 6806, Springer-Verlag, 2011, pp. 171–177.

[2]

Baumgartner and

Waldmann, Hierarchic superposition with weak abstraction, in: Proceedings of the 24th International Conference on Automated Deduction,

M.P.

Bonacina, ed., Lecture Notes in Artificial Intelligence, Vol. 7898, Springer-Verlag, 2013, pp. 39–57.

[3]

Reynolds,

Tinelli and

de Moura, Finding conflicting instances of quantified formulas in SMT, in: Proceedings of the 14th Conference on Formal Methods in Computer-Aided Design,

Claessen and

Kuncak, eds, 2014, pp. 195–202.

[4]

Reynolds,

Tinelli,

Goel and

Krstic, Finite model finding in SMT, in: Proceedings of the 25th International Conference on Computer Aided Verification,

Sharygina and

Veith, eds, Lecture Notes in Computer Science, Vol. 8044, Springer-Verlag, 2013, pp. 640–655.

[5]

Reynolds,

Tinelli,

Goel,

Krstic,

Deters and

Barrett, Quantifier instantiation techniques for finite model finding in SMT, in: Proceedings of the 24th International Conference on Automated Deduction,

M.P.

Bonacina, ed., Lecture Notes in Artificial Intelligence, Vol. 7898, Springer-Verlag, 2013, pp. 377–391.

[6]

Rümmer, A constraint sequent calculus for first-order logic with linear integer arithmetic, in: Proceedings of the 15th International Conference on Logic for Programming Artificial Intelligence and Reasoning,

Cervesato,

Veith and

Voronkov, eds, Lecture Notes in Artificial Intelligence, Vol. 5330, Springer-Verlag, 2008, pp. 274–289.

[7]

Stump,

Sutcliffe and

Tinelli, StarExec: A cross-community infrastructure for logic solving, in: Proceedings of the 7th International Joint Conference on Automated Reasoning,

Demri,

Kapur and

Weidenbach, eds, Lecture Notes in Artificial Intelligence, Vol. 8562, 2014, pp. 367–373.

[8]

Sutcliffe, The CADE-16 ATP system competition, Journal of Automated Reasoning 24(3) (2000), 371–396.

[9]

Sutcliffe, The SZS ontologies for automated reasoning software, in: Proceedings of the LPAR Workshops, Knowledge Exchange: Automated Provers and Proof Assistants, and the 7th International Workshop on the Implementation of Logics,

Sutcliffe,

Rudnicki,

Schmidt,

Konev and

Schulz, eds, CEUR Workshop Proceedings, Vol. 418, 2008, pp. 38–49.

10.

[10]

Sutcliffe, The TPTP problem library and associated infrastructure. The FOF and CNF parts, v3.5.0, Journal of Automated Reasoning 43(4) (2009), 337–362.

11.

[11]

Sutcliffe, The TPTP world – Infrastructure for automated reasoning, in: Proceedings of the 16th International Conference on Logic for Programming Artificial Intelligence and Reasoning,

Clarke and

Voronkov, eds, Lecture Notes in Artificial Intelligence, Vol. 6355, Springer-Verlag, 2010, pp. 1–12.

12.

[12]

Sutcliffe, in: Proceedings of the 7th IJCAR ATP System Competition, Vienna, Austria, 2014.

13.

[13]

Sutcliffe, The CADE-24 automated theorem proving system competition – CASC-24, AI Communications 27(4) (2014), 405–416.

14.

[14]

Sutcliffe and

C.B.

Suttner, Evaluating general purpose automated theorem proving systems, Artificial Intelligence 131(1,2) (2001), 39–54.

15.

[15]

Voronkov, AVATAR: The new architecture for first-order theorem provers, in: Proceedings of Automated Reasoning,

Biere and

Bloem, eds, Lecture Notes in Computer Science, Vol. 8559, 2014, pp. 696–710.

The 7th IJCAR automated theorem proving system competition – CASC-J7

Abstract

Keywords

1. Introduction

Table 2 Divisions and problem categories

2.2. System execution and evaluation

3 The models are “counter-models” for problems with an unprovable conjecture, and “models” for satisfiable sets of axioms.

2.4. Resources

3. Results

Table 4 THF division results

Table 6 FOF division results

4 http://toccata.lri.fr/gallery/why3.en.html.

3.4. The FNT division

3.5. The EPR division

3.6. The UEQ division

Table 9 UEQ division results

5. Conclusion

References

Table 2
Divisions and problem categories

³
The models are “counter-models” for problems with an unprovable conjecture, and “models” for satisfiable sets of axioms.

Table 4
THF division results

Table 6
FOF division results

⁴
http://toccata.lri.fr/gallery/why3.en.html.

Table 9
UEQ division results