Hierarchical invention of theorem proving strategies

Abstract

State-of-the-art automated theorem provers (ATPs) such as E and Vampire use a large number of different strategies to traverse the search space. Inventing targeted proof search strategies for specific problem sets is a difficult task. Several machine learning methods that invent strategies automatically for ATPs have been proposed previously. One of them is the Blind Strategymaker (BliStr) system for inventing strategies of the E prover.

In this paper we describe BliStrTune – a hierarchical extension of BliStr. BliStrTune explores a much larger space of E strategies than BliStr by interleaving search for high-level parameters with their fine-tuning. We use BliStrTune to invent new strategies based also on new clause weight functions targeted at problems from large ITP libraries. We show that the new strategies significantly improve E’s performance.

Keywords

Automated theorem proving, parameter learning, proof search heuristics, clause weight functions

1. Introduction: ATP strategy invention

State-of-the-art automated theorem provers (ATPs) such as E [22,23] and Vampire [14] achieve their performance by using sophisticated proof search strategies and their combinations. Constructing good ATP search strategies is a hard task that is potentially very rewarding. Until recently, there has been, however, little research in this direction in the ATP community.

With the arrival of large ATP problem sets and benchmarks extracted from the libraries of today’s interactive theorem prover (ITP) systems [2–4,10,11], automated generation of targeted ATP strategies became an attractive topic. It seems unlikely that manual (“theory-driven”) construction of targeted strategies can scale to large numbers of ATP problems spanning many different areas of mathematics and computer science. Starting with the Blind Strategymaker (BliStr) [28] that was used to invent E’s strategies for MaLARea [13,29] on the 2012 Mizar@Turing competition problems [24], several systems have been recently developed to invent targeted ATP strategies [15,21]. The underlying methods used so far include genetic algorithms and iterated local search, as popularized by the ParamILS [7] system.

A particular problem of the methods based on iterated local search is that their performance degrades as the number of possible strategy parameters grows. This is the case for E, where a domain-specific language allows construction of astronomical numbers of strategies. This gets worse as more and more sophisticated templates for strategies are added to E, such as our recent family of conjecture-oriented weight functions implementing various notions of term-based similarity [8]. The pragmatic solution used in the original BliStr consisted of re-using manually pre-designed high-level strategy components, rather than allowing the system to explore the space of all possible strategies. This is obviously unsatisfactory.

In this article we describe BliStrTune – a hierarchical extension of BliStr. BliStrTune allows exploring a much larger space of E strategies by factoring the search into invention of good high-level strategy components and their low-level fine-tuning. The high-level invention and the low-level invention communicate their best solutions to each other, iteratively improving all parts of the strategy space. Together with our new conjecture-oriented weight functions, the hierarchical invention produces the strongest schedule of strategies so far on the small (bushy) versions of the Mizar@Turing problems. The improvement over Vampire 4.0 on the training set is nearly 10%, while the improvement on the testing (competition) set is over 5%.

The rest of the paper is organized as follows. Section 2 introduces the notion of proof search strategies, focusing on resolution/superposition ATPs and E prover. We also summarize our recent conjecture-oriented strategies that motivated the work on BliStrTune. Section 3 describes the ideas behind the original Blind Strategymaker based on the ParamILS system (see Section 3.2 for more details on ParamILS). Section 4 introduces the hierarchical invention algorithm and its implementation. The system is evaluated in several ways in Sections 5 and 6, showing significant improvements over the original BliStr and producing significantly improved ATP strategies.

This article is an extended version of our work presented at the CPP conference [9]. Several parts have been updated. The main Section 4 describing BliStrTune now explains the hierarchical invention in greater detail, Section 5 contains further analysis of the results, and Section 7 describing the practical use of the system has been added. The system has been streamlined and its distribution made publicly available.¹

¹
https://github.com/ai4reason/BliStrTune

2. Proof search strategies

In this section we briefly describe the proof search of saturation-based automated theorem provers (ATPs). Section 2.1 describes the proof search control possibilities of E prover [22,23]. Section 6.1 describes our previous development of similarity based clause selection strategies [8] which we make use of and evaluate here.

Many state-of-the-art ATPs are based on the given clause algorithm introduced by Otter [18–20]. The input problem $T \cup \{\neg C\}$ is translated into a refutationally equivalent set of clauses. Then the search for a contradiction, represented by the empty clause, is performed maintaining two sets: the set P of processed clauses and the set U of unprocessed clauses. Initially, all the input clauses are unprocessed. The algorithm repeatedly selects a given clause g from U and generates all possible inferences using g and the processed clauses from P. Then, g is moved to P, and U is extended with the newly produced clauses. This process continues until a resource limit is reached, or the empty clause is inferred, or P becomes saturated, that is, nothing new can be inferred.
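The loop described above can be sketched in a few lines. The following is only an illustration under strong simplifying assumptions: clauses are propositional (frozensets of non-zero integers, with a negative integer denoting a negated atom), binary resolution stands in for the superposition calculus used by E, and the `select` argument (here clause length) is a hypothetical stand-in for the clause selection heuristic.

```python
def resolvents(c1, c2):
    """All propositional resolvents of two clauses (frozensets of non-zero ints)."""
    out = set()
    for lit in c1:
        if -lit in c2:
            out.add(frozenset((c1 - {lit}) | (c2 - {-lit})))
    return out


def given_clause_loop(clauses, select=len, max_steps=10_000):
    """Return 'unsat' if the empty clause is derived, 'sat' if the processed
    set becomes saturated, and 'unknown' if the step limit is reached."""
    processed = set()                  # the set P
    unprocessed = set(clauses)         # the set U
    for _ in range(max_steps):
        if frozenset() in unprocessed:
            return "unsat"             # empty clause inferred
        if not unprocessed:
            return "sat"               # nothing new can be inferred
        given = min(unprocessed, key=select)   # choose the given clause g
        unprocessed.remove(given)
        new = set()
        for other in processed | {given}:
            new |= resolvents(given, other)
        processed.add(given)           # move g to P
        unprocessed |= new - processed # extend U with the new clauses
    return "unknown"                   # resource limit reached
```

For instance, the clause set {p}, {¬p ∨ q}, {¬q} encoded as `[frozenset({1}), frozenset({-1, 2}), frozenset({-2})]` yields `"unsat"`.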

2.1. Proof search strategies in E prover

E [22,23] is a state-of-the-art theorem prover which we use as a basis for implementation. The selection of a given clause in E is implemented by a combination of priority and weight functions. A priority function assigns an integer to a clause and is used to pre-order clauses for weight evaluation. A weight function takes additional specific arguments and assigns to each clause a real number called weight. A clause evaluation function ( $CEF$ ) is specified by a priority function, weight function, and its arguments. Each $CEF$ selects the clause with the smallest pair $(priority, weight)$ for inferences. Each $CEF$ is specified using the syntax $WeightFunction(PriorityFunction, …)$ with a variable number of comma separated arguments of the weight function. E allows a user to select an expert heuristic on a command line in the format $(n_{1} * {CEF}_{1}, …, n_{k} * {CEF}_{k})$ where integer $n_{i}$ indicates how often the corresponding ${CEF}_{i}$ should be used to select the given clause. E additionally supports an auto-schedule mode where several expert heuristics are tried, each for a selected time period. The heuristics and time periods are automatically chosen based on input problem properties.
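To make the role of the frequencies $n_{i}$ concrete, here is a minimal sketch of one possible reading of the heuristic format: the CEFs are visited in a repeating schedule in which ${CEF}_{i}$ occurs $n_{i}$ times in a row. The function name `make_selector` and the representation of a CEF as a pair of Python functions are assumptions made for illustration; E's internal scheduling may differ in detail.

```python
from itertools import cycle


def make_selector(cefs):
    """cefs: list of (n_i, priority_fn, weight_fn) triples.
    Returns a function picking the next given clause from a non-empty list."""
    # Expand (n_1 * CEF_1, ..., n_k * CEF_k) into a repeating schedule.
    schedule = cycle([(prio, weight)
                      for n, prio, weight in cefs
                      for _ in range(n)])

    def select(unprocessed):
        prio, weight = next(schedule)
        # The clause with the smallest (priority, weight) pair is selected.
        return min(unprocessed, key=lambda c: (prio(c), weight(c)))

    return select
```

With two CEFs, one weighting by clause size with frequency 2 and one by clause age with frequency 1, every third selection picks the oldest clause regardless of its size.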

One of the well-performing weight functions in E, which we also use as a reference for evaluation of our weight functions, is the conjecture symbol weight. This weight function counts symbol occurrences with different weights based on their appearance in the conjecture as follows. Different weights $δ_{f}$ , $δ_{c}$ , $δ_{p}$ , and $δ_{v}$ are assigned to function, constant, and predicate symbols, and to variables. The weight of a symbol which appears in the conjecture is multiplied by $γ_{conj}$ , typically $γ_{conj} < 1$ to prefer clauses with conjecture symbols. To compute a term weight, the given symbol weights are summed for all symbol occurrences. This evaluation is extended to equations and to clauses.
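The term-level part of this computation can be sketched as follows. The term representation (variables as strings, applications as tuples) and the default values of the weights are assumptions chosen for illustration, not E's defaults; the predicate weight $δ_{p}$ would be handled in the same way one level up, at the level of atoms.

```python
def term_weight(term, conj_symbols, d_f=2.0, d_c=1.0, d_v=1.0, gamma_conj=0.5):
    """Weight of a term. A variable is a plain string; a function application
    is a tuple (symbol, arg1, ..., argn); a constant is a tuple (symbol,)."""
    if isinstance(term, str):              # variable
        return d_v
    symbol, *args = term
    base = d_f if args else d_c            # function vs. constant symbol
    if symbol in conj_symbols:
        base *= gamma_conj                 # prefer symbols from the conjecture
    return base + sum(
        term_weight(a, conj_symbols, d_f, d_c, d_v, gamma_conj) for a in args
    )
```

Under these illustrative values, the term f(a, X) has weight 4.0 when f does not occur in the conjecture and 3.0 when it does.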

Fig. 1. An outline of the BliStr strategy invention loop.

Apart from clause selection, E prover introduces other parameters which influence the choice of the inference rules, term orderings, literal selection, etc. The selected values of the parameters which control the proof search are called a strategy.²

²
Also called a protocol in the literature.

Because strategy is a crucial notion in this paper, we provide a simple example for the reader’s convenience.

Example 1.

Let us consider the following simplified E strategy written in the E prover command line syntax.

-tKBO6 -WSelectComplexG -H'(13*Refinedweight(PreferGoals,1,2,2,3,2), 2*Clauseweight(ByCreationDate,-2,-1,0.5))'

This strategy selects term ordering KBO6, literal selection function SelectComplexG, and two CEFs. The first CEF has frequency 13, weight function Refinedweight, priority function PreferGoals, and weight function arguments “1,2,2,3,2”. The exact meaning of specific strategy parameters can be found in the E manual [23].
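The structure of the strategy in Example 1 can be made explicit by splitting the command line into its components. The parser below is a sketch that only handles the simplified form used in this example (one `-t`, one `-W`, and one `-H` option with flat argument lists); the function name and the dictionary keys are hypothetical.

```python
import re


def parse_strategy(cmd):
    """Split a simplified E strategy string into top-level options and CEFs."""
    ordering = re.search(r"-t(\w+)", cmd).group(1)
    selection = re.search(r"-W(\w+)", cmd).group(1)
    heuristic = re.search(r"-H'\((.*)\)'", cmd).group(1)
    cefs = []
    # Each CEF has the shape  n*WeightFunction(PriorityFunction,arg,...)
    pattern = r"(\d+)\*(\w+)\((\w+)((?:,[^,()]+)*)\)"
    for freq, wfun, prio, args in re.findall(pattern, heuristic):
        cefs.append({
            "frequency": int(freq),
            "weight_function": wfun,
            "priority_function": prio,
            "arguments": [a for a in args.split(",") if a],
        })
    return {"ordering": ordering, "literal_selection": selection, "cefs": cefs}
```

Applied to the strategy of Example 1, this yields the ordering `KBO6`, the literal selection function `SelectComplexG`, and two CEF records with frequencies 13 and 2.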

3. The Blind Strategymaker (BliStr)

In this section we describe the Blind Strategymaker (BliStr) [28], which we further extend in the following section to the BliStrTune system. Both BliStr and BliStrTune are based on the BliStr strategy invention loop [28] which improves a given ATP system with given initial strategies (Initials) on a given set of training problems (Problems). This is done by gradually evolving new strategies specialized for classes of the training problems. The specialization of a strategy on a specific subset of problems is done by the ParamILS [7] automated algorithm configuration framework. Section 3.1 describes the BliStr loop and Section 3.2 details how ParamILS is utilized for strategy improvement.

3.1. BliStr strategy invention loop

Figure 1 provides an outline of the BliStr strategy invention loop which is common both for BliStr and BliStrTune presented in this work. The loop consists of four basic steps. Because no strategy can be improved more than once on the same set of problems (see Step 3) and the set of all possible strategies is typically finite, the loop must eventually terminate.³

³
Note however that the set of all possible strategies is usually astronomically large, and relatively fast termination is due to more complicated factors influenced by the BliStr settings.

A more detailed explanation of the four basic steps follows.

Step 1: Generation evaluation. In the first phase, all strategies ( $All$ ) are evaluated on all training problems Problems. The ATP is run on each problem with time limit $β_{eval}$ yielding (1) the overall result (solved/unsolved) and (2) a number measuring the length of the proof search. In BliStr and BliStrTune, the proof search length is measured in the number of given clauses processed by E Prover during the proof search.

For each strategy S, we compute its set of best-performing problems $P_{S}$ ( $\subseteq \mathit{Problems}$ ), that is, all the problems where S outperforms all other strategies. The set $P_{S}$ is further restricted to contain only the problems provable with the proof search length ranging between $β_{min}$ and $β_{max}$ . This is to later avoid improving S on problems which are “too easy” (because there is not much to improve) and which are “too hard” (to speed up the learning process).
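The computation of the sets $P_{S}$ in Step 1 can be sketched as follows. The data layout (a nested dictionary of proof search lengths, with `None` for unsolved problems) is an assumption, and so is the treatment of ties: here a strategy must be strictly better than every other strategy that solves the problem.

```python
def best_performing(results, b_min, b_max):
    """results[strategy][problem] = proof search length, or None if unsolved.
    Returns {strategy: set of problems where it strictly beats all other
    strategies and the proof search length lies in [b_min, b_max]}."""
    best = {s: set() for s in results}
    problems = set().union(*(r.keys() for r in results.values()))
    for p in problems:
        solved = {s: r[p] for s, r in results.items() if r.get(p) is not None}
        if not solved:
            continue                              # nobody solves p
        winner = min(solved, key=solved.get)
        others = [v for s, v in solved.items() if s != winner]
        if all(solved[winner] < v for v in others) \
                and b_min <= solved[winner] <= b_max:
            best[winner].add(p)
    return best
```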

Step 2: Generation reduction. In the next step, we reduce the strategies invented so far ( $All$ ) to contain only the strategies S with at least $β_{bests}$ best-performing problems (that is, with $| P_{S} | ⩾ β_{bests}$ ). From the remaining strategies, we take only $β_{tops}$ best strategies, where the strategies are compared by the number of best-performing problems ( $| P_{S} |$ ). The first restriction keeps only the best-performing strategies (that is, the strongest individuals), while the second reduces their count, keeping the size of the resulting generation G within the selected bound ( $| G | ⩽ β_{tops}$ ). This is done to prevent overfitting on the training problems by having a large number of overspecialized strategies.
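Step 2 amounts to a filter followed by a truncation. A minimal sketch, assuming the sets $P_{S}$ are available as a dictionary (the function and argument names are hypothetical):

```python
def reduce_generation(best, b_bests, b_tops):
    """best: {strategy: set of best-performing problems}.
    Keep strategies with at least b_bests such problems, then take the
    b_tops strategies with the largest sets."""
    strong = [s for s, probs in best.items() if len(probs) >= b_bests]
    strong.sort(key=lambda s: len(best[s]), reverse=True)
    return strong[:b_tops]
```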

Step 3: Strategy selection. The next step is to select a strategy to be improved on its best-performing problems. As a rule, no strategy can be improved on the same problems more than once within one execution of the BliStrLoop function. Because the sets of best-performing problems vary in time, the same strategy can be improved more than once but only on different problems. Our selection approach is to prefer improving strategies on diverse problems. Hence we prefer to improve strategies whose best-performing problems have not been used for improving so often.⁴

⁴
In more detail, for each problem p, we keep a counter $c_{p}$ which is increased by $1 / | P_{S_{0}} |$ whenever a strategy $S_{0}$ is improved on p (that is, when $p \in P_{S_{0}}$ ). We select the strategy S with (currently) the lowest average $c_{p}$ over the best-performing problems $P_{S}$ . In the case of equal values, we prefer the strategy with higher $| P_{S} |$ .

If no strategy can be selected, the algorithm terminates with the current generation G as the result.
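The selection rule of Step 3, including the counters $c_{p}$, can be sketched as below. The bookkeeping of already improved (strategy, problem set) pairs in a Python set is an assumption about how the "no repeated improvement" rule could be recorded.

```python
def select_strategy(best, counters, already_improved):
    """best: {strategy: set of problems}; counters: {problem: c_p};
    already_improved: set of (strategy, frozenset(problems)) pairs.
    Returns the chosen strategy, or None if no strategy can be selected."""
    candidates = [s for s, probs in best.items()
                  if probs and (s, frozenset(probs)) not in already_improved]
    if not candidates:
        return None

    def key(s):
        avg = sum(counters.get(p, 0.0) for p in best[s]) / len(best[s])
        return (avg, -len(best[s]))      # lowest average c_p, then larger P_S

    chosen = min(candidates, key=key)
    for p in best[chosen]:               # c_p increases by 1 / |P_S|
        counters[p] = counters.get(p, 0.0) + 1.0 / len(best[chosen])
    already_improved.add((chosen, frozenset(best[chosen])))
    return chosen
```

Returning `None` corresponds to the termination condition of the loop.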

Step 4: Strategy improvement. The strategy improvement is done by the ParamILS [7] automated algorithm configuration framework. Given a strategy S and a set of problems P ( $\subseteq \mathit{Problems}$ ), ParamILS attempts to find an ATP strategy $S^{'}$ with the best possible performance on P. ParamILS is a generic framework which is capable of finding well performing configurations for an arbitrary algorithm. In our case, the algorithm is E with a time evaluation limit $β_{cutoff}$ . More details on ParamILS are given in Section 3.2.

We always launch ParamILS to improve a strategy S on its currently best-performing problems $P_{S}$ . Either a single run of ParamILS can be used, as in BliStr [28], or, alternatively, several ParamILS runs can be combined to improve a strategy in a hierarchical manner, as in BliStrTune (see Section 4). In both cases, ParamILS is always launched for a specific time limit, which is provided as the input parameter $β_{imp}$ .

The guiding idea for strategy improvement in the BliStr loop is to use a data-driven approach. Problems in a given mathematical field often share a lot of structure and solution methods. Mathematicians become better and better by solving problems: they become capable of taking larger and larger steps with confidence, and as a result they can gradually attack problems that were previously too hard for them. By this analogy, it is plausible to think that if the solvable problems become much easier for an ATP system, the system will be able to solve some more (harder, but related) problems. For this to work, a method that can improve an ATP on a set of solvable problems is needed. As already mentioned, the established ParamILS system can be used for this.

3.2. Using ParamILS for strategy improvement

Let A be an algorithm whose parameters come from a configuration space (product of possible values) Θ. A parameter configuration is an element $θ \in Θ$ , and $A (θ)$ denotes the algorithm A with the parameter configuration θ. Given a distribution (set) of problem instances D, the algorithm configuration problem is to find the parameter configuration $θ \in Θ$ resulting in the best performance of $A (θ)$ on the distribution D. ParamILS is an implementation of an iterated local search (ILS) algorithm for the algorithm configuration problem. In short, starting with an initial configuration $θ_{0}$ , ParamILS loops between two steps: (i) perturbing the configuration to escape from a local optimum, and (ii) iterative improvement of the perturbed configuration. The result of step (ii) is accepted if it improves the previous best configuration.
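The two-step loop can be illustrated by a generic iterated local search over a finite configuration space. This is a simplified sketch, not the ParamILS implementation: it omits features such as restarts and adaptive capping, and the names `iterated_local_search`, `rounds`, and `kicks` are hypothetical.

```python
import random


def iterated_local_search(space, theta0, cost, rounds=20, kicks=2, seed=0):
    """space: {param: list of values}; theta0: initial configuration (dict);
    cost: maps a configuration to a number (lower is better)."""
    rng = random.Random(seed)

    def improve(theta):
        # Step (ii): descend in the one-parameter-change neighbourhood.
        improved = True
        while improved:
            improved = False
            for param, values in space.items():
                for v in values:
                    cand = {**theta, param: v}
                    if cost(cand) < cost(theta):
                        theta, improved = cand, True
        return theta

    best = improve(dict(theta0))
    for _ in range(rounds):
        cand = dict(best)
        # Step (i): perturb a few randomly chosen parameters.
        for param in rng.sample(list(space), min(kicks, len(space))):
            cand[param] = rng.choice(space[param])
        cand = improve(cand)
        if cost(cand) < cost(best):      # accept only improvements
            best = cand
    return best
```

In the setting of this paper, `cost` would correspond to the proof search length of E on the problems in D, which is far more expensive to evaluate than the toy objectives such a sketch is suited for.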

To fully determine how to use ParamILS in a particular case, A, Θ, $θ_{0}$ , D, and a performance metric need to be instantiated. In our case, algorithm A is an E run with a low time limit $β_{cutoff}$ . The configuration space Θ describes the set of expressible E strategies as a finite set of parameters where each parameter is assigned a finite domain of possible values. A configuration $θ \in Θ$ is then a finite assignment of specific values to all parameters from space Θ. ParamILS additionally allows specifying conditional arguments and forbidden values (see [7] for details). As a performance metric we use the proof search length, that is, the number of given-clause loops performed by E while solving the problem. If E cannot solve a problem within the low time limit, a sufficiently high value ( $10^{6}$ ) is used.

Since it is unlikely that there is one best E strategy for all of the given benchmark problems, it would be counterproductive to use all problems as the set D for ParamILS runs. Hence we use only best-performing problems ( $P_{S}$ ) as described above in Section 3.1.

4. BliStrTune: Hierarchical invention

BliStr uses a fixed set of CEFs for inventing new strategies. The arguments of these fixed CEFs (the priority function, weight function arguments) cannot be modified during the iterative strategy improvement done by ParamILS. A straightforward way to achieve invention (fine-tuning) of CEF arguments would be to extend the ParamILS configuration space Θ. This, however, makes the configuration space grow from ca. $10^{7}$ to $10^{129}$ possible combinations. Preliminary experiments revealed that with a configuration space of this size ParamILS does not produce satisfactory results in a reasonable time.

In this section we describe our new extension of BliStr – BliStrTune – where the invention of good high-level strategy parameters (Section 4.1) is interleaved with the invention of good CEF arguments (Section 4.3). The basic idea behind BliStrTune is iterated hierarchical invention: The large space of the optimized parameters is naturally factored into two (in general several) layers, and at any time only one layer is subjected to invention, while the other layer(s) remain fixed. The results then propagate between the layers, and the layer-tuning and propagation are iterated. BliStrTune is experimentally evaluated in Section 5.

4.1. Global parameter invention

The ParamILS runs used in BliStrTune’s global-tuning phase are essentially the same as in the case of BliStr, with the following minor exceptions. BliStr uses a fixed configuration space Θ for all ParamILS runs. This is possible because a small set (currently 12) of CEFs is hard coded in BliStr’s Θ. BliStrTune uses in the global-tuning phase a parametrized configuration space $Θ_{C}$ where C is a collection of CEFs that can be different for each ParamILS run. This collection can be arbitrary but we use only the 50 best performing CEFs in order to limit the configuration space size for the global-tuning phase. The notion of “best performing CEFs” develops in time and it is discussed in detail in Section 4.5. Furthermore, BliStrTune introduces an additional argument $β_{cef}$ to limit the maximum number of CEFs which can occur in a single strategy ( $β_{cef} = 12$ for the case of BliStr).

BliStrTune’s global-tuning usage of ParamILS is otherwise the same as in BliStr, that is, given $Θ_{C}$ , the initial configuration $θ_{0} \in Θ_{C}$ , and problems D, the result of the global tuning is a configuration $θ_{1} \in Θ_{C}$ which has the best found performance on D. This configuration $θ_{1}$ then serves as an input for the next fine-tuning phase.

Example 2.
Let us consider the E strategy from Example 1. In the global-tuning phase we instruct ParamILS to modify top level arguments, that is, term ordering (“-t”), literal selection (“-W”), CEF frequencies (“13” and “2”), and also the whole CEF blocks and their count. We do not, however, allow ParamILS to change CEF arguments (priority functions and weight function arguments). The whole CEF must be changed to another CEF from collection C.

4.2. Global-tuning of the configuration space

This section describes in detail the parameter space $Θ_{C}$ for the ParamILS runs in the global-tuning phase. The space is defined by parameters and their respective domains. The parameters for the global-tuning correspond to the E Prover options that influence the proof search. Although E Prover implements a very large set of proof search options, we support the subset listed below. For details on the specific E Prover options, see the E Prover manual [23]. For each of the E Prover options below, we count the possible configurations explored by ParamILS, in order to be able to compute the size of the configuration space $Θ_{C}$ .

Given clause selection

is one of the most significant proof search settings. We instruct ParamILS to select from 2 to $β_{cef}$ distinct CEFs out of the 50 best CEFs in C. We use a fixed set of 10 different frequencies ( $1, 2, 3, 4, 5, 8, 10, 13, 21, 34$ ). We forbid the same CEF to be used more than once in the same strategy (even with a different frequency). Hence we have altogether $\sum_{n = 2}^{β_{cef}} \frac{50!}{(50 - n)!} \cdot 10^{n}$ possible settings. For example, if $β_{cef} = 6$ , then ParamILS explores around $10^{16}$ possibilities of the given clause selection mechanism.

Term ordering

significantly influences the proof search by restricting applications of the inference rules. We instruct ParamILS to try various kinds of orderings and various methods for the generation of a precedence. We support 12 possible term ordering settings.

Literal selection method

selects the literals on which superpositions can be applied. Four settings are supported.

Simultaneous paramodulation

can be used to implement superposition. Three settings are supported.

Contextual simplify-reflect

are simplification inference rules implemented in E Prover. We support two possible settings.

Clause splitting

is used to split a clause into shorter clauses by introducing a fresh predicate symbol. We support 6 settings.

SInE

is a method for axiom filtering [6] which can be optionally turned on in E Prover. We support 561 possible SInE settings.

For the two values of $β_{cef}$ used in the experiments (6 and 10), we obtain that the approximate sizes of $Θ_{C}$ are $10^{21}$ and $10^{32}$ respectively. This means that ParamILS searches a space of $10^{21}$ (or $10^{32}$ ) different E Prover proof strategies. Different proof strategies might, however, behave in the same way. In future work, we would like to identify strategies that behave similarly in order to reduce the size of BliStrTune’s search space.
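The count of given clause selection settings stated above for $β_{cef} = 6$ can be checked directly from the sum over ordered selections of distinct CEFs. The function name and its default arguments below are chosen for illustration; `math.perm` requires Python 3.8 or later.

```python
from math import perm


def clause_selection_settings(beta_cef, n_cefs=50, n_freqs=10):
    """Number of ordered choices of 2..beta_cef distinct CEFs out of n_cefs,
    each paired with one of n_freqs frequencies."""
    return sum(perm(n_cefs, n) * n_freqs ** n
               for n in range(2, beta_cef + 1))


# The largest term (n = beta_cef) dominates the sum.
settings_6 = clause_selection_settings(6)
```

For $β_{cef} = 6$ the result lies between $10^{16}$ and $2 \cdot 10^{16}$, in agreement with the estimate given for the given clause selection mechanism.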

4.3. Invention of the CEF arguments

Given the result of the global-tuning phase $θ_{1} \in Θ_{C}$ , a new configuration space for the fine-tuning phase $Θ_{θ_{1}}$ is constructed by (1) fixing the parameter values from $θ_{1}$ and by (2) introducing new parameters that allow changing the values of the arguments of the CEFs used in $θ_{1}$ . In order to do that, we need to describe the space of the possible values of the CEF arguments.

The CEF arguments (see Section 2.1) consist of the priority function and the weight function specific arguments. Because of the different number and semantics of the weight function arguments, we do not allow the CEFs’ weight functions to be changed during the fine-tuning. They are fixed to the values provided in $θ_{1}$ . For each weight function argument, we know its type (such as the symbol weight, operation cost, weight multiplier, etc.). For each type we have pre-designed the set of reasonable values. For the original E weight functions, we extract the reasonable values from the auto-schedule mode of E. For our new weight functions, we use our preliminary experiments [8] enhanced with our intuition.

Given the configuration space $Θ_{θ_{1}}$ , a configuration $θ_{1} \in Θ_{C}$ can be easily converted to an equivalent configuration $θ_{1}^{'} \in Θ_{θ_{1}}$ by setting the parameter values to those CEF arguments that were previously fixed in $θ_{1}$ and C. Then we can run ParamILS with the configuration space $Θ_{θ_{1}}$ , the initial configuration $θ_{1}^{'}$ , and with the same problem set D as in the global-tuning phase. The result is a configuration $θ_{2}^{'} \in Θ_{θ_{1}}$ providing the best found performance on D.

The global invention (global tuning) and the local invention (fine-tuning) phases can be iterated. To do that, we need to transform the result of the fine-tuning $θ_{2}^{'} \in Θ_{θ_{1}}$ to an equivalent initial configuration $θ_{2} \in Θ_{C}$ for the next global-tuning phase. In order to do that, the CEFs invented by $θ_{2}^{'}$ must be present in the CEFs collection C. If this is not the case, we simply extend C with the new CEFs. In practice, we now use two iterations of this process (that is, two phases of global-tuning and two phases of fine-tuning), which our experiments showed to provide good results.
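The alternation of the two phases can be sketched as follows. This is a strongly simplified illustration: the function `tune` is a greedy stand-in for a ParamILS run, and the fine-tuning space is fixed in advance, whereas in BliStrTune it is constructed from the result of the preceding global-tuning phase. All names are hypothetical.

```python
def tune(theta, free, space, cost):
    """Stand-in for one ParamILS run: greedy descent over the parameters in
    `free`, keeping every other parameter of `theta` fixed."""
    improved = True
    while improved:
        improved = False
        for param in free:
            for v in space[param]:
                cand = {**theta, param: v}
                if cost(cand) < cost(theta):
                    theta, improved = cand, True
    return theta


def hierarchical_invention(theta, global_params, fine_params, space, cost,
                           iterations=2):
    """Alternate global tuning and fine-tuning; the best configuration found
    by one phase is the initial configuration of the next phase."""
    for _ in range(iterations):
        theta = tune(theta, global_params, space, cost)  # high-level layer
        theta = tune(theta, fine_params, space, cost)    # CEF arguments
    return theta
```

The point of the factoring is that each call to `tune` searches only one layer, so the space seen by a single run stays much smaller than the product of both layers.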

Example 3.
Recall the strategy from Example 1 and Example 2. In the fine-tuning phase we would fix all the top level arguments modified in the global-tuning phase (“-t”, and so on, as described in Example 2) and we would instruct ParamILS to change the individual CEF arguments. That is, the values

PreferGoals,1,2,2,3,2
ByCreationDate,-2,-1,0.5

might be changed to different values while the rest of the strategy stays untouched.

4.4. Fine-tuning of the configuration space

The size of the configuration space $Θ_{θ}$ depends on θ, in particular, on the CEF clause selection scheme $(n_{1} * {CEF}_{1}, …, n_{k} * {CEF}_{k})$ which is induced by θ. For every ${CEF}_{i}$ , different values of arguments are considered by ParamILS. Two CEFs with different weight functions have generally different types of arguments, except for the priority function which is set by every $CEF$ . We use 10 different priority functions.

In order to simplify the construction of the configuration space $Θ_{θ}$ , we identify the argument types that are used frequently. Some are shared across different weight functions (e.g., weights, multipliers, costs), and some are specific for a single weight function. For each weight function argument, we identify its type τ. For each argument type τ, we select the domain $⟦ τ ⟧$ of possible values. We use 11 argument types and the domain sizes vary from 2 to 18.

For example, a typical weight function ConjectureRelativeSymbolWeight has 8 weight function arguments with approximately $10^{18}$ possible values in total. On average, a weight function has $10^{12}$ possible values in total. A typical θ has 4 CEFs, thus a typical size of $Θ_{θ}$ is around $10^{49}$ , that is, comparable to the global-tuning phase. The space size can, however, grow up to $10^{108}$ with 6 CEFs.

4.5. Maintaining collections of CEFs

The global-tuning phase of BliStrTune requires the collection C of CEFs as an input. It is desirable that this collection C is limited in size (currently we use max. 50 CEFs) and that it contains the best performing CEFs.

Initially, for each weight function w defined in E, we have extracted the CEF that is most often used in the E auto-schedule mode. We have added a CEF for each of our new weight functions. This gave us the initial collection of 21 CEFs. Then we use a global database (shared by different BliStrTune runs) in which we store all CEFs together with a usage counter. This counter records how often each CEF was used in a strategy invented by BliStrTune. Recall that in one BliStrTune iteration, ParamILS is run four times (two phases of global-tuning and two phases of fine-tuning). Whenever a CEF is contained in a strategy invented by any BliStrTune iteration (i.e., after the four ParamILS runs), we increase the CEF usage counter, adding a new CEF to the database when it is used for the first time.

To select the 50 best performing CEFs we start with $C = \emptyset$ . We extract all the weight functions W that are used in the global CEF database. The set W stays constant, because the global database already contains all possible weight functions from the very beginning. For each $w \in W$ , we compute the list $C_{w}$ of all CEFs from the database which use w and we sort $C_{w}$ by the usage counter. Then we iterate over W and for each w we move the most frequently used CEF from $C_{w}$ to C. We repeat this until C has the desired size (or we run out of CEFs). This ensures that C contains at least one CEF for each weight function.
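The selection procedure just described is a round-robin over the weight functions. A minimal sketch, assuming the database is a dictionary mapping each CEF (as a string) to its weight function name and usage counter; the layout and names are hypothetical:

```python
def select_cefs(database, size=50):
    """database: {cef: (weight_function, usage_count)}.
    Returns up to `size` CEFs, cycling over the weight functions and taking
    the most frequently used remaining CEF of each."""
    by_weight = {}                              # the lists C_w
    for cef, (w, count) in database.items():
        by_weight.setdefault(w, []).append((count, cef))
    for cefs in by_weight.values():
        cefs.sort(reverse=True)                 # most used first
    chosen = []                                 # the collection C
    while len(chosen) < size and any(by_weight.values()):
        for cefs in by_weight.values():
            if cefs and len(chosen) < size:
                chosen.append(cefs.pop(0)[1])
    return chosen
```

The first pass over the weight functions takes one CEF from each, which gives the guarantee of at least one CEF per weight function whenever `size` is not smaller than the number of weight functions.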

5. Experimental evaluation

This section provides an experimental evaluation⁵ of BliStrTune. In Section 5.1 we compare our improved BliStrTune with the original BliStr. In Section 5.2 we evaluate several BliStrTune runs with different parameters. In Section 5.3 we discuss and compare several methods to construct a strategy scheduler that tries several strategies to solve a problem. Section 5.4 then compares the best strategy scheduler with state-of-the-art ATPs, namely, with E 1.9 using its auto-schedule mode and with Vampire 4.0.

⁵
All the experiments were run on $2 \times 16$ cores Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30 GHz with 128 GB memory. Each prover run was, however, limited to 1 GB of memory.

For the evaluation we use problems from division Mizar@Turing of the CASC 2012 (Turing100) competition mentioned in Section 1. These problems come from the MPTP translation [1,26,27] of the Mizar Mathematical Library [5]. The problems are divided into 1000 training and 400 testing problems. The training problems were published before the competition, while the testing problems were used in the competition. This fits our evaluation setting: we can use BliStrTune to invent targeted strategies for the training problems and then evaluate them on the testing problems.

5.1. Hierarchical invention and weight functions

To evaluate the hierarchical invention we ran BliStr and BliStrTune with equivalent arguments. Furthermore, we ran two instances of BliStrTune to evaluate the performance added by the new weight functions from Section 6.1. The first instance was allowed to use only the original E 1.9 weight functions, while the second additionally used our new weight functions.

BliStr and BliStrTune used the same input arguments. The first argument is the set of the training problems. We use the 1000 training problems from the Mizar@Turing competition in all experiments. Other arguments are: $β_{imp}$ , the time limit (seconds) for one ParamILS run; $β_{cutoff}$ , the time limit for E prover runs within ParamILS; and $β_{eval}$ , the time limit for the strategy evaluation in BliStr/Tune.

In BliStrTune, ParamILS is run four times in each iteration, hence we set $β_{imp} = 100$ in BliStrTune and $β_{imp} = 400$ in BliStr.⁶ We set $β_{cutoff} = 1$ and $β_{eval} = 5$ and additionally, in the case of BliStrTune, $β_{cef} = 6$ .

⁶
So that the times used to improve a strategy are equal.

Fig. 2. Value added by parameter fine-tuning and by new weight functions (Section 5.1).

The results are shown in Fig. 2. In each iteration (x-axis, logarithmic scale) we count the total number of the training problems solved (y-axis) by all the strategies invented so far, provided each strategy is given the time limit $β_{eval}$ . This metric gives us a relatively good idea of the BliStr/Tune progress.

The original BliStr solved 673 problems. BliStrTune without the new weights solved 702 problems, while BliStrTune with the new weights solved 711 problems. From this and from the figure we can see that the greatest improvement is thanks to the hierarchical parameter invention. However, the new weight functions still provide 9 more solved problems which is a useful additional improvement.

5.2. Influence of the BliStrTune input arguments

Table 1
Evaluation of different BliStrTune training runs on Mizar@Turing problems (Section 5.2)

$β_{imp}$	$β_{cutoff}$	$β_{eval}$	$β_{cef}$	iters	protos	run time	best proto	solved	useful
100	1	5	6	115	116	1d0h	572	711	28%
100	1	5	10	111	115	1d3h	594	715	14%
300	1	5	6	83	87	1d13h	596	698	4%
300	1	5	10	82	85	1d22h	611	711	11%
100	2	10	6	152	148	1d20h	579	720	27%
100	2	10	10	88	88	1d4h	567	698	1%
300	2	10	6	153	153	3d18h	583	727	19%
300	2	10	10	139	139	3d9h	587	719	15%

In this section we evaluate several BliStrTune runs with different input arguments. We run all the combinations of $β_{imp} \in \{100, 300\}$, $β_{cef} \in \{6, 10\}$, and $β_{cutoff} \in \{1, 2\}$. This gives us 8 different BliStrTune runs. We always set $β_{eval} = 5 \cdot β_{cutoff}$.
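As a quick check of the experiment grid, the argument combinations can be enumerated mechanically. This is only an illustrative sketch; the dictionary keys are names chosen here and are not BliStrTune option names.

```python
from itertools import product

beta_cutoff_values = (1, 2)
beta_imp_values = (100, 300)
beta_cef_values = (6, 10)

# One dictionary per BliStrTune run; beta_eval is always derived from beta_cutoff.
runs = [
    {"beta_imp": imp, "beta_cutoff": cutoff, "beta_eval": 5 * cutoff, "beta_cef": cef}
    for cutoff, imp, cef in product(beta_cutoff_values, beta_imp_values, beta_cef_values)
]

for run in runs:
    print(run)
```

The enumeration yields 2 · 2 · 2 = 8 settings, in the same order as the rows of Table 1.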

The results are summarized in Table 1. Column iters contains the number of iterations executed by the corresponding BliStrTune run, protos is the total number of strategies generated, run time is the total run time of the given BliStrTune run, best proto is the number of training problems solved by the best strategy within the $β_{eval}$ time limit, and solved is the total number of training problems solved by all the generated strategies, provided each strategy is given the time limit $β_{eval}$. We can see that a large number of strategies were generated. Only a few of them were used for the final evaluation as described in Section 5.3. Those used for the final evaluation are considered “useful”, and the column useful states what percentage of the useful strategies come from the corresponding BliStrTune run.

We can see that the most useful runs are the basic runs with smaller $β_{imp}$, which also have lower run times. Higher $β_{imp}$ leads to higher run times, but it produces better strategies in the sense that a smaller number of strategies can solve an equal number of problems. From the table we can see that when $β_{cutoff}$ and $β_{cef}$ are increased, $β_{imp}$ should be increased as well to provide ParamILS with enough time for strategy improvement.

5.3. Selecting the best protocol scheduler

The 8 runs of BliStrTune described above in Section 5.2 generated more than 900 different strategies. In this section we try to select the best subset of the strategies and construct a strategy scheduler which sequentially tries several strategies to solve a problem. We only experiment with the simplest schedulers where the time limit for solving a problem is equally distributed among all the strategies within a scheduler.7

⁷
We only use integer time limits for running the strategies, because E currently only supports integer time limits. If there are n seconds remaining after the integer division of the global time limit, they are equally distributed among the first n strategies in the scheduler.

Hence the problem of scheduler construction is reduced to the selection of the right strategies.
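The equal distribution of the time limit, including the handling of the remainder described in the footnote, can be sketched as follows. The function name is hypothetical; only the integer division and the rule for the leftover seconds are taken from the text.

```python
from typing import List


def equal_time_slices(total_seconds: int, n_strategies: int) -> List[int]:
    """Split an integer time limit equally; leftover seconds go to the first strategies."""
    if n_strategies <= 0:
        raise ValueError("a scheduler needs at least one strategy")
    base, remainder = divmod(total_seconds, n_strategies)
    return [base + 1 if i < remainder else base for i in range(n_strategies)]


# Example: 33 strategies sharing a 60-second limit.
slices = equal_time_slices(60, 33)
print(slices)
```

With 60 seconds and 33 strategies, the integer division gives 1 second each with 27 seconds left over, so the first 27 strategies run for 2 seconds and the remaining 6 for 1 second.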

We use three different ways to select the scheduler strategies. First, we use a greedy approach as follows. We evaluate all the strategies on all the training problems with a fixed time limit t. Then we construct a greedy covering sequence which starts with the best strategy, and each next strategy in the sequence is the strategy that adds the most solutions to the union of the problems solved by all the previous strategies in the sequence. The resulting scheduler is denoted ${greedy}_{t}$.
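A minimal sketch of the greedy covering sequence is given below. It assumes the evaluation results are available as a mapping from strategy names to sets of solved problems. Two details are assumptions made here rather than taken from the text: ties are broken by strategy name, and the sequence stops once no remaining strategy adds a new solution.

```python
from typing import Dict, List, Set


def greedy_cover(solved: Dict[str, Set[str]]) -> List[str]:
    """Order strategies so that each one adds the most not-yet-covered problems."""
    remaining = dict(solved)
    covered: Set[str] = set()
    order: List[str] = []
    while remaining:
        # max() over the sorted names returns the alphabetically first best strategy.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # assumption: stop when nothing new can be added
        order.append(best)
        covered |= remaining.pop(best)
    return order


results = {
    "A": {"p1", "p2", "p3", "p4"},
    "B": {"p3", "p4", "p5"},
    "C": {"p5", "p6"},
    "D": {"p1", "p2"},
}
sequence = greedy_cover(results)
print(sequence)
```

In the toy data, strategy B solves more problems than C in isolation, but C is selected second because it adds two new problems on top of A while B adds only one.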

The second way to construct a scheduler is based on the state-of-the-art contribution (SOTAC) used by CASC. The SOTAC for a problem is the inverse of the number of strategies that solved the problem. A strategy SOTAC is the average SOTAC over the problems solved by the strategy. We can sort the strategies by their SOTAC and select the first n strategies from this sequence. The resulting scheduler is denoted ${SOTAC}_{n}$ .

The SOTAC of a strategy will be high even if the strategy solves only one problem which no other strategy can solve. That is why the $Σ\text{-}SOTAC$ metric [10] is also used: this is the sum of the problem SOTACs over all the problems solved by the strategy. This gives us schedulers denoted $Σ\text{-}{SOTAC}_{n}$.
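Both rankings can be computed from the same data. The sketch below follows the definitions above (problem SOTAC as the inverse of the number of solving strategies, strategy SOTAC as the average, Σ-SOTAC as the sum); the function name and the toy data are illustrative only.

```python
from collections import Counter
from typing import Dict, Set, Tuple


def sotac_scores(solved: Dict[str, Set[str]]) -> Dict[str, Tuple[float, float]]:
    """Return (SOTAC, sigma-SOTAC) for every strategy."""
    solver_count = Counter(p for problems in solved.values() for p in problems)
    scores = {}
    for strategy, problems in solved.items():
        total = sum(1.0 / solver_count[p] for p in problems)    # sigma-SOTAC
        average = total / len(problems) if problems else 0.0    # SOTAC
        scores[strategy] = (average, total)
    return scores


results = {
    "A": {"p1", "p2", "p3", "p4"},
    "B": {"p1", "p2"},
    "C": {"p5"},
}
scores = sotac_scores(results)
for name, (sotac, sigma_sotac) in sorted(scores.items()):
    print(name, sotac, sigma_sotac)
```

Strategy C solves a single problem that nobody else solves and therefore gets the highest SOTAC, while the Σ-SOTAC ranking prefers strategy A, which solves four problems, two of them uniquely.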

Table 2

Evaluation of the BliStrTune schedulers (Section 5.3)

		training		testing
scheduler	protos	solved	V+	solved	V+
${greedy}_{1}$	33	744	+9.8%	280	+5.2%
${greedy}_{2}$	27	742	+9.6%	279	+4.8%
${greedy}_{5}$	28	734	+8.4%	280	+5.2%
${greedy}_{10}$	22	719	+6.2%	276	+3.8%
${SOTAC}_{15}$	15	663	−2.0%	261	−1.8%
${SOTAC}_{30}$	30	693	+2.3%	266	+0%
${SOTAC}_{45}$	45	698	+3.1%	270	+1.5%
${SOTAC}_{60}$	60	699	+3.2%	270	+1.5%
$Σ\text{-}{SOTAC}_{15}$	15	692	+2.2%	268	+0.7%
$Σ\text{-}{SOTAC}_{30}$	30	711	+5.0%	273	+2.6%
$Σ\text{-}{SOTAC}_{45}$	45	712	+5.1%	276	+3.8%
$Σ\text{-}{SOTAC}_{60}$	60	707	+4.4%	275	+3.4%

The evaluation of 12 different schedulers with a 60-second time limit on the training problems is provided in Table 2. The protos column specifies the number of strategies within the scheduler. We shall use this evaluation to select the best scheduler, hence the results on the 400 testing problems are provided for reference only. The solved column is the number of problems solved in 60 s. The V+ column is the percentage gain/loss compared to the state-of-the-art theorem prover Vampire 4.0, which solves 677 of the 1000 training problems and 266 of the 400 testing problems.
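The V+ column is a relative difference with respect to Vampire. A small sketch of the computation, using the Vampire baselines of 677 (training) and 266 (testing) solved problems listed in Table 3; the function name is hypothetical.

```python
def v_plus(solved: int, vampire_solved: int) -> float:
    """Percentage gain (positive) or loss (negative) relative to Vampire."""
    return 100.0 * (solved - vampire_solved) / vampire_solved


training_gain = v_plus(744, 677)  # greedy_1 on the training problems
testing_gain = v_plus(280, 266)   # greedy_1 on the testing problems
print(f"{training_gain:.2f}% {testing_gain:.2f}%")
```

The exact values are slightly above the +9.8% and +5.2% printed in the table, so the published entries appear to be cut to one decimal place rather than computed differently.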

We can see that the best results are achieved by the scheduler ${greedy}_{1}$, which also gives the best results on the testing problems. Generally, it is better to run a larger number of strategies with a lower individual time limit.

5.4. Best protocol scheduler evaluation

Table 3
Evaluation of the best BliStrTune scheduler on the 400 Mizar@Turing testing problems with 60 seconds time limit

	training		testing
prover	solved	V+	solved	V+
E (BliStrTune)	744	+9.8%	280	+5.2%
Vampire 4.0	677	+0%	266	+0%
E (auto-schedule)	605	−10.6%	231	−13.1%

Fig. 3.

Progress of ATPs on the 400 Mizar@Turing testing problems with 60 seconds time limit.

In this section we evaluate the best strategy scheduler ${greedy}_{1}$ selected in the previous Section 5.3 on the 400 Mizar@Turing testing problems with 60 seconds time limit. We compare ${greedy}_{1}$ with two state-of-the-art ATPs: (1) E prover 1.9 run in the auto-schedule mode and (2) Vampire 4.0 in the CASC mode.

The results are summarized in Table 3. We can see that E with scheduler ${greedy}_{1}$ invented by BliStrTune outperforms Vampire by 5.2% and the improvement over E using its auto-schedule mode is even more significant. Figure 3 provides a graphical representation of the progress of the ATPs. For each second (x-axis, logarithmic scale) we count the number of problems solved so far (y-axis). We can see that ${greedy}_{1}$ was outperforming Vampire during the whole evaluation.

6. Using tuning to evaluate weight functions

We can use the BliStrTune architecture to evaluate the usefulness of various clause weight functions implemented in E. This evaluation can be done by (1) an analysis of the final best schedulers constructed previously in Section 5.3, and by (2) an analysis of all the invented strategies. Previously we proposed [8] several new E weight functions based on conjecture similarity. Section 6.1 briefly summarizes the new weight functions, while Section 6.2 proceeds with the evaluation.

6.1. Similarity based clause selection functions

Many of the best-performing weight functions in E are based on the similarity of a clause to the conjecture, for example, the conjecture symbol weight from the previous section. A natural question is whether it makes sense to extend the symbol-based similarity to more complex term-based similarities. Previously we proposed [8], implemented, and evaluated several weight functions which utilize conjecture similarity in different ways. Typically they extend the symbol-based similarity by similarity on terms. Using finer formula features improves the high-level premise selection task [12], which motivated us to steer the internal clause selection in E in the same way. The following paragraphs summarize the new weight functions, which are evaluated in Section 5.1 and Section 5.3.

Conjecture subterm weight (Term). The first of our weight functions is similar to the standard conjecture symbol weight, but instead of symbols it counts the subterms a term shares with the conjecture. The clause weight function Term takes five specific arguments $γ_{conj}$, $δ_{f}$, $δ_{c}$, $δ_{p}$ and $δ_{v}$. The weight of a term t equals $δ_{f}$ for functional terms, $δ_{c}$ for constants, $δ_{p}$ for predicates, and $δ_{v}$ for variables, possibly multiplied by $γ_{conj}$ when t appears in the conjecture. To compute a clause weight, the term weights are summed over all subterms of the clause.
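The Term weight can be illustrated with a small sketch. The term encoding (nested tuples, capitalised strings as variables) and all function names are assumptions made for this example and do not reflect E's internal data structures. The sketch also assumes that variables never count as appearing in the conjecture and it ignores literal polarity.

```python
from typing import Iterator, Tuple, Union

# Hypothetical encoding: "X" is a variable, "a" a constant, and
# ("f", t1, ..., tn) an application; an atom is a top-level application.
Term = Union[str, Tuple]


def is_var(t: Term) -> bool:
    return isinstance(t, str) and t[:1].isupper()


def subterms(t: Term) -> Iterator[Term]:
    """Yield t first and then all of its proper subterms, with repetitions."""
    yield t
    if isinstance(t, tuple):
        for arg in t[1:]:
            yield from subterms(arg)


def term_clause_weight(clause, conjecture, gamma_conj, d_f, d_c, d_p, d_v) -> float:
    # Assumption: only non-variable subterms of the conjecture can be matched.
    conj_terms = {s for atom in conjecture for s in subterms(atom) if not is_var(s)}
    total = 0.0
    for atom in clause:
        for position, t in enumerate(subterms(atom)):
            if position == 0:
                base = d_p          # the atom itself is a predicate term
            elif is_var(t):
                base = d_v
            elif isinstance(t, tuple):
                base = d_f
            else:
                base = d_c
            total += base * gamma_conj if t in conj_terms else base
    return total


conjecture = [("p", ("f", "a"))]
clause = [("q", ("f", "a"), "X")]
w = term_clause_weight(clause, conjecture, gamma_conj=0.5, d_f=2, d_c=1, d_p=1, d_v=1)
print(w)
```

In the example, the subterms f(a) and a occur in the conjecture and are discounted by $γ_{conj} = 0.5$, while the atom q(f(a), X) and the variable X keep their full weights, giving 1 + 1 + 0.5 + 1 = 3.5.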

Conjecture frequency weight (TfIdf). Term frequency – inverse document frequency (Tf-Idf) is a numerical statistic intended to reflect how important a word is to a document in a corpus [16]. A term frequency is the number of occurrences of the term in a given document. A document frequency is the number of documents in a corpus which contain the term. The term frequency is typically multiplied by the logarithm of the inverse of the document frequency to reduce the weight of terms which appear often. We define $tf(t)$ as the number of occurrences of t in a conjecture. We consider a fixed set of clauses denoted $Docs$. We define $df(t)$ as the number of clauses from $Docs$ which contain t. Our weight function TfIdf takes one specific argument $δ_{doc}$ to select documents, either (1) $ax$ for the axioms (including the conjecture) or (2) $pro$ for all the processed clauses. First we define the value $tfidf(t)$ of term t as follows. $tfidf(t) = tf(t) \cdot \log \frac{1 + |Docs|}{1 + df(t)}$ The weight of term t is computed as $\frac{1}{1 + tfidf(t)}$ and extended to clauses.
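The following sketch evaluates the term weight $\frac{1}{1 + tfidf(t)}$ on toy data. The nested-tuple term encoding, the function names, and the use of the natural logarithm are assumptions of this example; the text does not fix the base of the logarithm or the exact way the weight is extended to clauses.

```python
import math
from collections import Counter


def subterms(t):
    """Yield t and all of its proper subterms, with repetitions."""
    yield t
    if isinstance(t, tuple):
        for arg in t[1:]:
            yield from subterms(arg)


def make_tfidf_weight(conjecture, docs):
    """Return a function mapping a term t to 1 / (1 + tfidf(t))."""
    tf = Counter(s for atom in conjecture for s in subterms(atom))
    doc_terms = [{s for atom in clause for s in subterms(atom)} for clause in docs]

    def weight(t):
        df = sum(1 for terms in doc_terms if t in terms)
        tfidf = tf[t] * math.log((1 + len(doc_terms)) / (1 + df))
        return 1.0 / (1.0 + tfidf)

    return weight


# Each clause is a list of atoms; the conjecture is part of Docs (the "ax" setting).
conjecture = [("p", ("f", "a"), ("f", "a"))]
docs = [conjecture, [("q", "a")], [("r", "b")]]
weight = make_tfidf_weight(conjecture, docs)
print(weight(("f", "a")), weight("a"), weight("b"))
```

The term f(a) occurs twice in the conjecture and in no other clause, so it receives the lowest weight; the constant b does not occur in the conjecture at all and keeps the maximal weight 1.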

Conjecture term prefix weight (Pref). The previous weight functions rely on an exact match of a term with a conjecture-related term. The following weight function loosens this restriction and also considers partial matches. We consider terms as symbol sequences. Let $max\text{-}pref(t)$ be the longest prefix t shares with a conjecture term. A term prefix weight (Pref) counts the length of $max\text{-}pref(t)$ using weight arguments $δ_{match}$ and $δ_{miss}$. These are used to define the weight of term t as follows. $δ_{match} \cdot |max\text{-}pref(t)| + δ_{miss} \cdot (|t| - |max\text{-}pref(t)|)$
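A minimal sketch of the Pref term weight follows. Terms are flattened to symbol sequences in prefix order; the nested-tuple encoding and the function names are assumptions of this example, and the caller decides which conjecture terms are supplied.

```python
def symbols(t):
    """Flatten a term into its sequence of symbols in prefix order."""
    if isinstance(t, tuple):
        out = [t[0]]
        for arg in t[1:]:
            out.extend(symbols(arg))
        return out
    return [t]


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def pref_weight(t, conj_terms, d_match, d_miss):
    """d_match * |max-pref(t)| + d_miss * (|t| - |max-pref(t)|)."""
    seq = symbols(t)
    longest = max((common_prefix_len(seq, symbols(c)) for c in conj_terms), default=0)
    return d_match * longest + d_miss * (len(seq) - longest)


conj_terms = [("f", ("g", "a"), "b")]            # symbol sequence: f g a b
partial = pref_weight(("f", ("g", "c"), "b"), conj_terms, d_match=1, d_miss=3)
exact = pref_weight(("f", ("g", "a"), "b"), conj_terms, d_match=1, d_miss=3)
print(partial, exact)
```

The term f(g(c), b) shares the prefix f g with the conjecture term, so two symbols are charged $δ_{match}$ and two are charged $δ_{miss}$; note that the trailing b is counted as a miss even though it matches, because only the prefix is considered.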

Conjecture Levenshtein distance weight (Lev). A straightforward extension of Pref is to employ the Levenshtein distance [17] which measures a distance of two strings as the minimum number of edit operations (character insertion, deletion, or change) required to change one word into the other. Our weight function Lev defines the weight of term t as the minimal Levenshtein distance from t to some conjecture term. It takes additional arguments $δ_{ins}$ , $δ_{del}$ , $δ_{ch}$ to assign different costs for edit operations.
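The weighted edit distance underlying Lev can be sketched with the standard dynamic-programming scheme. The sketch operates on arbitrary sequences, under the assumption that terms are flattened to symbol sequences as for Pref; the function names are illustrative only.

```python
def weighted_levenshtein(src, dst, d_ins, d_del, d_ch):
    """Minimal cost of turning src into dst by insertions, deletions, and changes."""
    prev = [j * d_ins for j in range(len(dst) + 1)]
    for i, s in enumerate(src, start=1):
        cur = [i * d_del]
        for j, d in enumerate(dst, start=1):
            cur.append(min(
                prev[j] + d_del,                         # delete s
                cur[j - 1] + d_ins,                      # insert d
                prev[j - 1] + (0 if s == d else d_ch),   # keep or change
            ))
        prev = cur
    return prev[-1]


def lev_weight(t_symbols, conj_symbol_seqs, d_ins, d_del, d_ch):
    """Weight of a term: minimal distance to some conjecture term."""
    return min(weighted_levenshtein(t_symbols, c, d_ins, d_del, d_ch)
               for c in conj_symbol_seqs)


unit = weighted_levenshtein(list("fgcb"), list("fgab"), 1, 1, 1)
costly_change = weighted_levenshtein(list("fgcb"), list("fgab"), 1, 1, 5)
print(unit, costly_change)
```

With unit costs one change suffices. When a change costs 5, it becomes cheaper to delete c and insert a, which illustrates how the arguments $δ_{ins}$, $δ_{del}$, and $δ_{ch}$ interact.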

Conjecture tree distance weight (Ted). The Levenshtein distance does not respect a tree structure of terms. To achieve that, we implement the Tree edit distance [31] which is similar to Levenshtein but uses tree editing operations (inserting a node into a tree, deleting a node while reconnecting its child nodes to the deleted position, and renaming a node label). Our weight function Ted takes the same arguments as Lev above and term weight is defined similarly.

Conjecture structural distance weight (Struc). With Ted, a tree produced by the edit operations does not need to represent a valid term, as the operations can change the number of child nodes. To avoid this, we define a simple structural distance which measures the distance between two terms by the number of generalization and instantiation operations. Generalization transforms an arbitrary term to a variable while instantiation does the reverse. Our weight function Struc takes additional arguments $δ_{miss}$, $γ_{inst}$, and $γ_{gen}$ as penalties for variable mismatch and operation costs. The distance of a variable x to a term t is the cost of instantiating x by t, computed as $Δ_{Struc}(x, t) = γ_{inst} \cdot |t|$. The distance of t to x is defined similarly but with $γ_{gen}$. The distance of non-variable terms t and s which share the top-level symbol is the sum of the distances of the corresponding arguments. Otherwise, a generic formula $Δ_{Struc}(t, x_{0}) + Δ_{Struc}(x_{0}, s)$ is used. The term weight is as for Lev but using $Δ_{Struc}$.
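The recursive definition of $Δ_{Struc}$ can be sketched as follows. Several details are assumptions of this example and are not stated in the text: terms are nested tuples with capitalised strings as variables, $|t|$ is the number of symbols in t, and two distinct variables are at distance $δ_{miss}$ while identical variables are at distance 0.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].isupper()


def size(t):
    """Number of symbols in a term (assumed meaning of |t|)."""
    if isinstance(t, tuple):
        return 1 + sum(size(arg) for arg in t[1:])
    return 1


def struc_distance(t, s, d_miss, g_inst, g_gen):
    if is_var(t) and is_var(s):
        return 0 if t == s else d_miss        # assumption: variable mismatch penalty
    if is_var(t):
        return g_inst * size(s)               # instantiate the variable t by s
    if is_var(s):
        return g_gen * size(t)                # generalize t to the variable s
    t_head, t_args = (t[0], t[1:]) if isinstance(t, tuple) else (t, ())
    s_head, s_args = (s[0], s[1:]) if isinstance(s, tuple) else (s, ())
    if t_head == s_head and len(t_args) == len(s_args):
        return sum(struc_distance(a, b, d_miss, g_inst, g_gen)
                   for a, b in zip(t_args, s_args))
    # Different top-level symbols: generalize t to a fresh variable, then instantiate by s.
    return g_gen * size(t) + g_inst * size(s)


t = ("f", "a", "X")
s = ("f", ("g", "b"), "c")
d = struc_distance(t, s, d_miss=3, g_inst=1, g_gen=2)
print(d)
```

In the example, a is generalized (cost 2) and the result instantiated by g(b) (cost 2), and X is instantiated by c (cost 1), giving a distance of 5. Unlike Ted, every intermediate object is a valid term.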

6.2. Usefulness of weight functions

We can use our experiments to evaluate the usefulness of different E Prover weight functions, in particular of our new weight functions (Section 6.1). The first evaluation method is to inspect the best schedulers constructed in Section 5.3 and check which weight functions are used. The second method is to extract the weight function frequencies from the CEF database described in Section 4.5. Information about the usefulness of the different weight functions could be used in the future to restrict the configuration space of the ParamILS runs.

Table 4
Usage of the weight functions in the best schedulers

Weight function	count	freq
Term	111	782
RelevanceLevelWeight2	104	353
Pref	45	297
ConjectureGeneralSymbolWeight	43	235
FIFOWeight	39	174
StaggeredWeight	36	199
SymbolTypeweight	34	96
ConjectureRelativeSymbolWeight	28	110
ConjectureSymbolWeight	23	44
RelevanceLevelWeight	21	40
OrientLMaxWeight	13	24
Refinedweight	12	88
Defaultweight	10	131
Clauseweight	9	64
ClauseWeightAge	6	10
PNRefinedweight	5	21
Struc	5	14
Lev	3	37
Uniqweight	2	2
Ted	1	13
TfIdf	1	1
total	551	2735

First, we use the schedulers from Section 5.3 to evaluate the contribution of our new weight functions. The more a weight function is represented in the final best schedulers, the more it contributes to the achieved results, and hence the more useful it can be considered. Table 4 summarizes the usage of the different weight functions in the final schedulers. Our weight functions are referred to by their names from Section 6.1 while the original weights are called by their E prover names. The count column states how many times the corresponding weight function was used in some selected scheduler strategy, while the freq column sums the frequencies of the occurrences of CEFs which use the given weight function. We can see that our new weight function Term was the most frequently used weight function. Four of our weight functions were, however, not used very often, which we attribute to their higher time complexity. We can conclude that two of our new weight functions, namely Term and Pref, contributed greatly to the achieved results.

Table 5

Weight functions in the invented strategies

Weight function	use
ConjectureGeneralSymbolWeight	12.73%
ConjectureRelativeSymbolWeight	10.82%
Pref	10.24%
ConjectureSymbolWeight	8.96%
Term	8.87%
RelevanceLevelWeight2	7.38%
FIFOWeight	6.34%
Refinedweight	5.14%
StaggeredWeight	5.05%
RelevanceLevelWeight	3.15%
OrientLMaxWeight	3.04%
SymbolTypeweight	2.78%
Clauseweight	2.78%
ClauseWeightAge	2.74%
PNRefinedweight	2.47%
Lev	1.81%
Struc	1.70%
Uniqweight	1.25%
Defaultweight	1.23%
TfIdf	1.07%
Ted	0.46%

In Section 4.5, we have described a global database which stores all the CEFs invented by different BliStrTune iterations. This database stores for each CEF the usage counter, describing how often the CEF was used in a strategy invented by BliStrTune. For each weight function, we can count how many times it was used in an invented CEF. This gives us a slightly different notion of weight function usefulness than the previous approach, because not every invented CEF needs to be used in the best schedulers. However, when a CEF is invented in some particular BliStrTune iteration, we can conclude that there is a set of problems for which the invented CEF is useful.

The results are provided in Table 5. The use column describes how often the given weight function was used in a CEF generated in a BliStrTune iteration. This gives slightly different results; however, two of our weight functions (Term and Pref) are again considerably more useful than the other four (Lev, Struc, TfIdf, Ted).

Another approach to evaluate the usefulness of some weight function w would be to run BliStrTune with and without a possibility to use function w. We could then construct best schedulers as in Section 5.3 and compare their performance. This could be done for every weight function w but it would be very time consuming. Instead, we resorted to a simplified evaluation and launched two instances of BliStrTune, with and without our new weight functions (Section 6.1). The result is that our new weight functions provided a valuable improvement of 9 newly solved (training) problems. Our new weight functions were used in the experiments from Section 5.

7. Brief overview of the software implementation

Similarly to BliStr, BliStrTune is open-source software and its implementation is publicly available.8

⁸
http://github.com/ai4reason/BliStrTune

We hope that this will allow interested users to improve the performance of E on their own problems. It should also make it easy for other researchers to re-use and modify the system in various ways (e.g., for other ATPs), and to compare BliStrTune’s performance with their own strategy invention methods.9

⁹

Given the performance improvement obtained here over E’s auto mode, the strategy invention field might see more development in near future.

In this section we provide a brief manual for the interested users.

The software package consists of (1) the BliStrTune system proper, (2) a database for the E Prover results, and (3) utilities to create and execute schedulers.

The package assumes a standard Linux machine; no installation is required.10

¹⁰

Common software packages like Perl, Python, and Ruby need to be installed. Other software, namely ParamILS, E Prover, and GNU Parallel [25], is provided in our distribution.

The system can be executed directly from the downloaded directory.

7.1. Preparing for the strategy invention

To start the strategy invention, the user has to provide (1) the training problems in the TPTP format, and (2) an initial set of E’s strategies. The initial strategies might be collected from previous BliStrTune runs, or selected from the example strategies. We provide the scripts that import the benchmark problems and register the initial strategies.11

¹¹
import-benchmark.sh and import-inits.sh

7.2. Selecting the tuning options

Before running BliStrTune, the following options can be adjusted by editing the main BliStrTune script:12

¹²
BliStrTune-run.sh

CORES

sets the number of cores used for the ParamILS runs and for the evaluation phase. More cores should result in better ParamILS results and faster evaluation. We recommend at least 8 cores.

TIMEOUT_GLOBAL

sets the time limit in seconds for one global tuning phase. This corresponds to $β_{imp}$ from Section 5. The default is 100.

TIMEOUT_FINETUNES

sets the time limit for one fine tuning phase (again $β_{imp}$ from Section 5, but different values can be used for global and fine tuning). The default is 100.

CUTOFF

sets the E Prover time limit $β_{cutoff}$ in seconds for the ParamILS runs.

EVAL_LIMIT

sets the E Prover evaluation time limit $β_{eval}$ in seconds.

MIN_CEFS

sets the lower bound on the count of CEFs within a strategy. Use at least 2.

MAX_CEFS

sets the upper bound on the count of CEFs in a strategy, that is, the value $β_{cef}$ from Section 4.1.

MIN_PROC

sets a lower bound on the number of processed clauses in a proof. Problems with fewer processed clauses are considered “too easy” and never used for the ParamILS runs. The default is 500. This parameter should be decreased if BliStrTune terminates too fast without producing enough new strategies.

MAX_PROC

sets the upper bound on the number of processed clauses. Problems with more processed clauses are considered “too hard” and never used for the ParamILS runs. The default is 50000. This parameter should be increased if there is enough CPU time or if all the problems are hard.

VERS

sets versatility, the minimal number of problems on which a strategy must be best performing (better than all other strategies) in order to be considered by BliStrTune for improvement. Strategies (protocols) which are not best performing on at least this many problems are forgotten and never improved. The default is 10. Decrease this value if BliStrTune terminates too fast without generating satisfactory results.

TOPS

sets the number of top strategies that are kept for improving. The default is 20. Decrease the number to make BliStrTune finish faster.
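The interplay of MIN_PROC, MAX_PROC, VERS, and TOPS can be illustrated by a small sketch. It is not the BliStrTune implementation: the data layout, the function names, the use of processed-clause counts to decide which strategy is best on a problem, and the ranking of strategies by the number of problems on which they are best are all assumptions made for this example. Only the default values are taken from the option descriptions above.

```python
from typing import Dict, List, Optional

# Hypothetical layout: strategy -> problem -> processed clauses (None if unsolved).
Results = Dict[str, Dict[str, Optional[int]]]


def training_problems(proc_counts: Dict[str, Optional[int]],
                      min_proc: int = 500, max_proc: int = 50000) -> List[str]:
    """Keep problems that are neither 'too easy' nor 'too hard' for a strategy."""
    return [p for p, n in proc_counts.items()
            if n is not None and min_proc <= n <= max_proc]


def select_for_improvement(results: Results, vers: int = 10, tops: int = 20) -> List[str]:
    """Keep strategies that are strictly best on at least `vers` problems, top `tops` only."""
    problems = {p for per_problem in results.values() for p in per_problem}
    best_count = {s: 0 for s in results}
    for p in problems:
        solved = sorted((per_problem[p], s) for s, per_problem in results.items()
                        if per_problem.get(p) is not None)
        # A strategy is best only if it is strictly better than all the others.
        if solved and (len(solved) == 1 or solved[0][0] < solved[1][0]):
            best_count[solved[0][1]] += 1
    versatile = [s for s in results if best_count[s] >= vers]
    versatile.sort(key=lambda s: (-best_count[s], s))
    return versatile[:tops]


results = {
    "s1": {"p1": 600, "p2": 900, "p3": None},
    "s2": {"p1": 700, "p2": 800, "p3": 1000},
    "s3": {"p1": 700, "p2": 950, "p3": None},
}
kept = select_for_improvement(results, vers=1, tops=20)
usable = training_problems({"easy": 40, "ok": 1200, "hard": 90000, "unsolved": None})
print(kept, usable)
```

In the toy data, s3 is never the best strategy and is dropped, which mirrors the advice above: lowering VERS keeps more strategies alive, and lowering MIN_PROC or raising MAX_PROC enlarges the set of problems available to ParamILS.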

7.3. Generating and evaluating the schedulers

Our distribution additionally includes utilities to process the BliStrTune results. There are scripts which can import generated strategies into the result database and evaluate strategies on the provided evaluation problems. Other tools are provided to process the result database and to construct the strategy schedulers as described in Section 5.3. Finally, schedulers can be evaluated on the provided testing problems.

8. Conclusions and future work

In this paper we have described BliStrTune, an extension of the Blind Strategymaker (BliStr) system. BliStrTune can be used for hierarchical invention of strategies that are targeting a given set of ATP problems. The system is publicly available and can be used by interested users on any problem set.

The main contribution of BliStrTune is that it considers a much bigger space of strategies by interleaving the global-tuning phase with the argument fine-tuning. We have evaluated the original BliStr and BliStrTune on the same input data and showed that BliStrTune significantly outperforms BliStr on them. We have evaluated several ways of creating strategy schedulers and showed that E 1.9 with the best strategy scheduler constructed from the BliStrTune strategies invented for the training problems outperforms the state-of-the-art Vampire 4.0 ATP on the independent testing problems by more than 5%.

Furthermore, we have used BliStrTune to evaluate the contribution of our previously designed conjecture-oriented weight functions for E prover. We have shown that these new weight functions allow us to solve more problems and that at least two of them were often used in the best scheduler strategies. Interestingly, more complex structural weights (like Lev, Ted) were not used very often in the schedulers even though our previous experiments suggested that they might be very useful. We attribute this to their higher time complexity and we would like to investigate this in our future research.

Several topics are suggested for future work. We have shown that new weight functions can enhance E’s performance, hence more weight functions which consider the term structure could be implemented. It seems that it might be better to design weight functions with lower time complexity, perhaps even providing approximate results. For example, we could use fast approximations of the Levenshtein distance.

Another direction for future research is to design more complex strategy schedulers. We have achieved good results with the simplest strategy schedulers where each strategy is given an equal amount of time when solving a problem. It would be interesting to design “smarter” schedulers and to see how many more problems can be solved.

A further direction for future research is the enhancement of the BliStr/Tune main loop. We could experiment with the addition of further layers to handle even more parameters in an efficient way, and with automated factoring of the parameters into the layers based on their joint behavior. There are also many settings of the various BliStrTune parameters to be experimented with, as well as the selection of the training problems, etc. We could also try other underlying parameter improvement methods than ParamILS [30].

Acknowledgements

We thank the CPP and AICOMM reviewers for their useful comments on the first version of this paper. This work was supported by the AI4REASON ERC Consolidator grant 649043, and by the Czech project AI & Reasoning CZ.02.1.01/0.0/0.0/15_003/0000466 and the European Regional Development Fund.

References

1. Alama, Heskes, Kühlwein, Tsivtsivadze and Urban, Premise selection for mathematics by corpus analysis and kernel methods, J. Autom. Reasoning 52(2) (2014), 191–213. doi:10.1007/s10817-013-9286-5.

2. Blanchette, Kaliszyk, Paulson and Urban, Hammering towards QED, Journal of Formalized Reasoning 9(1) (2016), 101–148.

3. J.C. Blanchette, Greenaway, Kaliszyk, Kühlwein and Urban, A learning-based fact selector for Isabelle/HOL, J. Autom. Reasoning 57(3) (2016), 219–244. doi:10.1007/s10817-016-9362-8.

4. Gauthier and Kaliszyk, Premise selection and external provers for HOL4, in: Certified Programs and Proofs (CPP’15), LNCS, Springer, 2015. doi:10.1145/2676724.2693173.

5. Grabowski, Korniłowicz and Naumowicz, Mizar in a nutshell, J. Formalized Reasoning 3(2) (2010), 153–245.

6. Hoder and Voronkov, Sine qua non for large theory reasoning, in: CADE, Bjørner and Sofronie-Stokkermans, eds, LNCS, Vol. 6803, Springer, 2011, pp. 299–314.

7. Hutter, H.H. Hoos, Leyton-Brown and Stützle, ParamILS: An automatic algorithm configuration framework, J. Artificial Intelligence Research 36 (2009), 267–306.

8. Jakubův and Urban, Extending E prover with similarity based clause selection strategies, in: Intelligent Computer Mathematics – 9th International Conference, CICM 2016, Proceedings, Bialystok, Poland, July 25–29, 2016, 2016, pp. 151–156. doi:10.1007/978-3-319-42547-4_11.

9. Jakubův and Urban, BliStrTune: Hierarchical invention of theorem proving strategies, in: Proceedings of the 6th ACM SIGPLAN Conference on Certified Programs and Proofs, CPP 2017, Paris, France, January 16–17, 2017, Bertot and Vafeiadis, eds, ACM, 2017, pp. 43–52. doi:10.1145/3018610.3018619.

10. Kaliszyk and Urban, Learning-assisted automated reasoning with Flyspeck, J. Autom. Reasoning 53(2) (2014), 173–213. doi:10.1007/s10817-014-9303-3.

11. Kaliszyk and Urban, MizAR 40 for Mizar 40, J. Autom. Reasoning 55(3) (2015), 245–256. doi:10.1007/s10817-015-9330-8.

12. Kaliszyk, Urban and Vyskocil, Efficient semantic features for automated reasoning over large theories, in: IJCAI’15, Yang and Wooldridge, eds, AAAI Press, 2015, pp. 3084–3090.

13. Kaliszyk, Urban and Vyskocil, Machine learner for automated reasoning 0.4 and 0.5, in: PAAR-2014. 4th Workshop on Practical Aspects of Automated Reasoning, Schulz, L.D. Moura and Konev, eds, EPiC Series in Computing, Vol. 31, EasyChair, 2015, pp. 60–66.

14. Kovács and Voronkov, First-order theorem proving and Vampire, in: CAV, Sharygina and Veith, eds, LNCS, Vol. 8044, Springer, 2013, pp. 1–35.

15. Kühlwein and Urban, MaLeS: A framework for automatic tuning of automated theorem provers, J. Autom. Reasoning 55(2) (2015), 91–116. doi:10.1007/s10817-015-9329-1.

16. Leskovec, Rajaraman and J.D. Ullman, Mining of Massive Datasets, 2nd edn, Cambridge University Press, 2014.

17. Levenshtein, Binary codes capable of correcting deletions, insertions and reversals, Soviet Physics Doklady 10 (1966), 707.

18. McCune, Otter 2.0, in: International Conference on Automated Deduction, Springer, 1990, pp. 663–664. doi:10.1007/3-540-52885-7_131.

19. W.W. McCune, Otter 1.0 users’ guide, Technical report, Argonne National Lab., IL (USA), 1989.

20. W.W. McCune, Otter 3.0 Reference Manual and Guide, Vol. 9700, Argonne National Laboratory, Argonne, IL, 1994.

21. Schäfer and Schulz, Breeding theorem proving heuristics with genetic algorithms, in: Global Conference on Artificial Intelligence, GCAI, 2015, Tbilisi, Georgia, October 16–19, 2015, Gottlob, Sutcliffe and Voronkov, eds, EPiC Series in Computing, Vol. 36, EasyChair, 2015, pp. 263–274.

22. Schulz, E – A brainiac theorem prover, AI Communications 15(2) (2002), 111–126.

23. Schulz, System description: E 1.8, in: LPAR, K.L. McMillan, Middeldorp and Voronkov, eds, LNCS, Vol. 8312, Springer, 2013, pp. 735–743.

24. Sutcliffe, The 6th IJCAR automated theorem proving system competition – CASC-J6, AI Commun. 26(2) (2013), 211–223.

25. Tange, GNU Parallel – the command-line power tool, ;login: The USENIX Magazine 36(1) (2011), 42–47.

26. Urban, MPTP – Motivation, implementation, first experiments, J. Autom. Reasoning 33(3–4) (2004), 319–339. doi:10.1007/s10817-004-6245-1.

27. Urban, MPTP 0.2: Design, implementation, and initial experiments, J. Autom. Reasoning 37(1–2) (2006), 21–43. doi:10.1007/s10817-006-9032-3.

28. Urban, BliStr: The blind strategymaker, in: GCAI 2015. Global Conference on Artificial Intelligence, Gottlob, Sutcliffe and Voronkov, eds, EPiC Series in Computing, Vol. 36, EasyChair, 2015, pp. 312–319.

29. Urban, Sutcliffe, Pudlák and Vyskočil, MaLARea SG1 – Machine learner for automated reasoning with semantic guidance, in: IJCAR, Armando, Baumgartner and Dowek, eds, LNCS, Vol. 5195, Springer, 2008, pp. 441–456.

30. Wang, Hutter, Zoghi, Matheson and de Freitas, Bayesian optimization in a billion dimensions via random embeddings, Journal of Artificial Intelligence Research 55 (2016), 361–387.

31. Zhang and Shasha, Simple fast algorithms for the editing distance between trees and related problems, SIAM J. Comput. 18(6) (1989), 1245–1262. doi:10.1137/0218082.


¹
https://github.com/ai4reason/BliStrTune

³
Note however that the set of all possible strategies is usually astronomically large, and relatively fast termination is due to more complicated factors influenced by the BliStr settings.

⁵
All the experiments were run on $2 \times 16$ cores Intel(R) Xeon(R) CPU E5-2698 v3 @ 2.30 GHz with 128 GB memory. One prover run was however limited to 1 GB memory limit.
