A statistical anytime algorithm for the Halting Problem

Abstract

In a previous paper we used computer running times to define a class of computable probability distributions on the set of halting programs and developed a probabilistic anytime algorithm for the Halting Problem. The choice of a computable probability distribution – essential for the algorithm – can be rather subjective and hard to substantiate.

In this paper we propose and study an efficient statistical anytime algorithm for the Halting Problem. The main advantage of the statistical algorithm is that it can be implemented without any prior information about the running times on the specific model of computation and the cut-off temporal bound is reasonably small. The algorithm has two parts: the pre-processing which is done only once (when the parameters of the quality of solutions are fixed) and the main part which is run for any input program. With a confidence level as large as required, the algorithm produces correct decisions with a probability as large as required. Three implementations of the algorithm are presented and numerically illustrated.

Keywords

Halting Problem; anytime algorithm; running time; order statistics

1. Introduction

The Halting Problem asks to decide, from a description of an arbitrary program and an input, whether the computation of the program on that input will eventually stop or continue forever. In 1936 A. Church, and independently A. Turing, proved that there is no algorithm solving the Halting Problem for all possible program-input pairs. The Halting Problem has many applications in logic and theoretical as well as applied computer science, mathematics, physics, biology, etc. Due to its practical importance approximate solutions for this problem have been proposed for quite a long time, see [2,6–10,14,17,20,22,25].

Anytime algorithms trade execution time for quality of results [13]. An anytime algorithm returns a result together with a “quality measure” which evaluates how close the obtained result is to the result that would be returned if the algorithm ran until completion (which may be prohibitively long). To improve the quality of the solution, anytime algorithms can be continued after they have halted if the output is not considered acceptable.

Here we use a more general form of anytime algorithm as an approximation for a computation which may never stop (see Manin [22]). Running times play an important role in this problem because halting programs are not uniformly distributed, see [16,27,29,30] for experimental work and [7,9,14,15] for theoretical results. Furthermore, every program either stops “quickly” or never stops [9]. This result was used in [8] to design an anytime probabilistic algorithm which simulates the program to be tested up to a threshold stopping time (this cut-off temporal bound is computed from the a priori accepted decision error and the probability distribution of stopping times of halting programs): if the computation still has not terminated by then, the algorithm reports (possibly wrongly) ‘The program does not halt!’. This anytime algorithm relies essentially on a computable probability distribution on the set of stopping times of halting programs “reflecting” the halting behaviour of the chosen universal machine. The quantile of this probability distribution is used to compute the stopping threshold time, hence the name “anytime probabilistic algorithm”. The probability of a wrong decision is no larger than the accepted error.

In this paper we propose a statistical anytime algorithm for the Halting Problem which improves the anytime probabilistic algorithm developed in [8] using a different strategy. In a pre-processing stage we sample sufficiently many terminating programs in an independent way (see [3,18]), determine their running times and consider the induced empirical cumulative distribution as an approximation to the true, but unknown, cumulative distribution. Using an appropriate statistical framework we construct an anytime statistical algorithm which uses three parameters: the probability of an erroneous decision, the precision of the estimation and the confidence level in it. The input program – tested for termination – is run for up to the largest number of steps made by any of the sampled programs. If the computation does not terminate by this time threshold (cut-off), then the (possibly wrong) decision is that the program will never stop. With a confidence level as large as required, the anytime statistical algorithm produces correct decisions with a probability as large as required. Three implementations of the anytime algorithm are presented; numerical illustrations show that their time complexities are reasonably small.

The paper is organised as follows. Section 2 covers the computability and complexity background for the Halting Problem; Section 3 presents the probability framework and the probabilistic anytime algorithm for the Halting Problem; Section 4 presents the statistical framework; Section 5 presents the statistical anytime algorithm for the Halting Problem and the proof of its main properties. Finally, in Section 6 we discuss three possible implementations of the statistical algorithm and present numerical illustrations; the last section is devoted to conclusions and possible extensions.

2. The Halting Problem

We denote by $Z^{+}$ the set of positive integers ${1, 2, \dots}$ ; $\overline{Z^{+}} = Z^{+} \cup {\infty}$ and $R$ is the set of reals. For $α \in R$ , $⌈ α ⌉$ is the ceiling function that maps α to the least integer greater than or equal to α. The domain of a partial function $F : Z^{+} ⟶ \overline{Z^{+}}$ is denoted by $dom (F)$ : $dom (F) = {x \in Z^{+} : F (x) < \infty}$ . We denote by $# S$ the cardinality of the set S and by $P (X)$ the power set of X. The indicator (or characteristic) function of a set M is denoted by $1_{M}$ .

We assume familiarity with elementary computability theory and algorithmic information theory [5,12,21]. For a partially computable function $F : Z^{+} ⟶ \overline{Z^{+}}$ we denote by $F (x) [t] < \infty$ the statement “the algorithm computing F has stopped on x exactly in time t”. For $t \in Z^{+}$ we consider the computable set $Stop (F, t) = {x \in Z^{+} : F (x) [t] < \infty}$ , and note that $\begin{matrix} (1) & dom (F) = ⋃_{t \in Z^{+}} Stop (F, t) . \end{matrix}$
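
The computability of the sets $Stop (F, t)$ can be illustrated by bounded simulation. The following Python sketch is only a toy model and not a construction taken from this paper: a program is assumed to be a function returning an iterator that yields once per computation step, and the names `running_time`, `in_stop`, `countdown` and `loop_forever` are hypothetical.

```python
from typing import Callable, Iterator, Optional

# Assumption: a "program" is a zero-argument function returning an iterator
# that yields once per computation step and is exhausted when it halts.
Program = Callable[[], Iterator[None]]


def running_time(program: Program, budget: int) -> Optional[int]:
    """Return the exact number of steps if the program halts within
    `budget` steps; return None if it is still running after `budget` steps."""
    steps = 0
    for _ in program():
        steps += 1
        if steps > budget:
            return None  # undetermined: the program may or may not halt later
    return steps


def in_stop(program: Program, t: int) -> bool:
    """Decide membership in the toy analogue of Stop(F, t)."""
    return running_time(program, t) == t


def countdown(n: int) -> Program:
    """Toy program that halts after exactly n steps."""
    def run() -> Iterator[None]:
        for _ in range(n):
            yield
    return run


def loop_forever() -> Iterator[None]:
    """Toy program that never halts."""
    while True:
        yield
```

In this model `in_stop` always terminates because the simulation is cut off after `t` steps, whereas no budget is large enough to certify that `loop_forever` never halts.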

The algorithmic complexity relative to a partially computable function $F : Z^{+} ⟶ \overline{Z^{+}}$ is the partial function $\nabla_{F} : Z^{+} ⟶ \overline{Z^{+}}$ defined by $\nabla_{F} (x) = inf {y \in Z^{+} : F (y) = x}$ . If $F (y) \neq x$ for every $y ⩾ 1$ , then $\nabla_{F} (x) = \infty$ . A partially computable function U is universal if for every partially computable function $F : Z^{+} ⟶ \overline{Z^{+}}$ there exists a constant $c_{U, F}$ such that for every $x \in dom (F)$ we have $\nabla_{U} (F (x)) ⩽ c_{U, F} \cdot x$ , see [7].

The set $dom (U)$ (see (1) for $F = U$ ) is computably enumerable, but not computable (the undecidability of the Halting Problem); its complement $\overline{dom (U)}$ is not computably enumerable, but the sets ${(Stop (U, t))}_{t ⩾ 1}$ are computable. To solve the Halting Problem means to determine for an arbitrary pair $(F, x)$ , where F is a partially computable function and $x \in Z^{+}$ , whether $F (x)$ stops or not, or equivalently, whether $x \in dom (F)$ , that is, $x \in Stop (F, t)$ , for some $t \in Z^{+}$ . Solving the Halting Problem for a fixed universal U is enough to solve the Halting Problem. From now on we fix a universal function U and study the Halting Problem “For every $x \in Z^{+}$ , does $U (x) < \infty$ ?”.

3. The probabilistic anytime algorithm for the Halting Problem

A probability space is a triple $(Ω, B (Ω), Pr)$ , where $(Ω, B (Ω))$ is a measurable space and $Pr : B (Ω) ⟶ [0, 1]$ is a probability measure, see [11,24]. A random variable X is a measurable function defined on Ω with values in a set A of real numbers; its probability distribution is denoted $P_{X}$ . The random variable X has a discrete probability distribution if A is at most countable. A computable probability distribution $P_{X}$ is a discrete probability distribution such that the function $x \in A ↦ P_{X} ({x})$ is computable (in particular, $P_{X} ({x})$ is a computable real for each $x \in A$ , see [23,28]). The mean (or expected value) of the discrete random variable $X : Ω ⟶ A$ is defined by $E (X) = \sum_{x \in A} x \cdot Pr ({ω \in Ω : X (ω) = x})$ , if the series converges.

The Cumulative Distribution Function of a random variable X is the function ${CDF}_{X} : R ⟶ [0, 1]$ defined by ${CDF}_{X} (y) = Pr (X ⩽ y)$ , $y \in R$ . The Inverse Cumulative Distribution Function or Quantile function of the random variable X with a discrete distribution is the function $q_{X} : [0, 1] ⟶ A$ defined by $q_{X} (p) = inf {y \in A : p ⩽ {CDF}_{X} (y)}$ . For more details see [1].
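
As a small illustration of these two definitions, the sketch below evaluates the CDF and the quantile function of a discrete distribution with finite support, using exact rational arithmetic. The probability mass function `pmf` and the function names are hypothetical choices made only for this example.

```python
from fractions import Fraction
from typing import Dict


def cdf(pmf: Dict[int, Fraction], y: int) -> Fraction:
    """CDF_X(y) = Pr(X <= y) for a finite-support distribution."""
    return sum((p for x, p in pmf.items() if x <= y), Fraction(0))


def quantile(pmf: Dict[int, Fraction], p: Fraction) -> int:
    """q_X(p) = inf{y in A : p <= CDF_X(y)}, scanning the support in order."""
    cumulative = Fraction(0)
    for y in sorted(pmf):
        cumulative += pmf[y]
        if p <= cumulative:
            return y
    raise ValueError("p exceeds the total probability mass")


# Illustrative distribution on A = {1, 2, 3, 4} (an assumption of this example).
pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
```

For this `pmf` the cumulative probabilities are 1/2, 3/4, 7/8 and 1, so `quantile(pmf, Fraction(9, 10))` is 4 while `quantile(pmf, Fraction(3, 4))` is 2.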

Next we present the probability framework introduced in [8]. The finite running times of the computations $U (x)$ are the set of exact stopping times for the halting programs of U: $T_{U} = {t \in Z^{+} : there exists x \in Z^{+} such that x \in Stop (U, t)}$ . The family of (finite and countable) unions of sets $Stop (U, t), t \in Z^{+}$ , generates the Borel field $B (dom (U))$ . A computable running time probability space $(dom (U), B (dom (U)), \Pr_{ρ_{U}})$ is defined from a computable probability distribution ρ on $T_{U}$ by setting $Pr = {Pr}_{ρ} : B (dom (U)) ⟶ [0, 1]$ , $Pr (Stop (U, t)) = ρ (t)$ , $t \in T_{U}$ .

We introduce a probability structure on the set $T_{U}$ via a random variable. Let $B (T_{U})$ be the family of all subsets of $T_{U}$ . The function $RT = {RT}_{U} : dom (U) ⟶ T_{U}$ , $RT (x) = min {t > 0 : x \in Stop (U, t)}$ has the property that for every $t \in T_{U}$ , ${RT}^{- 1} ({t}) = Stop (U, t) \in B (dom (U))$ . The random variable RT – called the running time associated with U – induces the probability space $(T_{U}, B (T_{U}), P_{RT})$ on $T_{U}$ in which the probability is defined by $P_{RT} ({t}) = Pr ({RT}^{- 1} ({t}))$ , $t \in T_{U}$ . For every $t \in T_{U}$ we have: $P_{RT} ({t}) = Pr (Stop (U, t)) = ρ (t)$ . For more details see [8].

Inference-based decisions are made using statistical procedures based on sets of observations. An inference-based decision of a hypothesis results in one of two outcomes: the hypothesis is accepted or rejected. The outcome can be correct or erroneous. The set of observations leading to the decision “reject the hypothesis” is called the critical region and its complement is called the acceptance region.

Consider a probability space $(A, B (A), P_{X})$ induced by a random variable X. Consider an acceptance region $D \subset A$ with $D \in B (A)$ . For every observed value $z \in A$ , a hypothesis $H_{z}$ is a predicate such that the sets ${z \in A : H_{z} is true}$ and ${z \in A : H_{z} is false}$ are in $B (A)$ .

An inference-based decision has the following form: $\begin{array}{l} If the observed value z \in A belongs to A ∖ D, then decide to reject the hypothesis H_{z} . \end{array}$

An error occurs if we reject $H_{x}$ on the basis of $A ∖ D$ when $H_{x}$ is true. The probability of error, that is, the probability of an erroneous decision, is $P_{X} ({x \in A ∖ D} ∣ {H_{x} is true})$ .

We can reformulate the Halting Problem as an inference-based decision in which we test, for an arbitrary $z \in Z^{+}$ , the hypothesis $H_{z} = ‘ U (z) < \infty^{'}$ (remember $H_{z}$ is a predicate) against the alternative $H_{z}^{'} = ‘ U (z) = \infty^{'}$ . An “erroneous decision” means rejecting $H_{z}$ when $U (z) < \infty$ .

The construction of the anytime algorithm for the Halting Problem is based on a computable acceptance region $D \subset dom (U)$ . Accordingly, the algorithm rejects the hypothesis $H_{z}$ if $z \notin D$ . This decision is correct if $z \in Z^{+} ∖ dom (U)$ and it is wrong if $z \in dom (U) ∖ D$ . In detail, for an arbitrary program $z \in Z^{+}$ there are three possibilities: a) $z \in D$ , b) $z \in dom (U) ∖ D$ , c) $z \notin dom (U)$ . Condition a) is decidable, but conditions b) and c) are undecidable. If $z \in D$ the anytime algorithm gives the correct decision $U (z) < \infty$ ; otherwise $z \in (dom (U) ∖ D) \cup (Z^{+} ∖ dom (U))$ , so the anytime algorithm decides – rightly or wrongly – that $U (z) = \infty$ . Furthermore, for every $z \in Z^{+}$ , $\begin{matrix} (2) & rejecting H_{z} on the basis of D is an erroneous decision if and only if z \in dom (U) ∖ D . \end{matrix}$

As the right-hand side of (2) is a subset of $dom (U)$ we don’t need to work in $Z^{+}$ but in $dom (U)$ , that is, in the probability space $(dom (U), B (dom (U)), \Pr)$ or, equivalently, in $(T_{U}, B (T_{U}), P_{RT})$ . The goal to minimise the probability of an erroneous decision can be achieved in this space as long as D and $dom (U) ∖ D$ are measurable sets. This condition is satisfied by taking the decidable set of the form $D = ⋃_{t = 1}^{T} Stop (U, t)$ , for some appropriate T.

The anytime probabilistic algorithm proposed in [8] is:

The acceptance program region of the algorithm is $\begin{matrix} (3) & D_{U} (P_{RT}, ε) = {x \in dom (U) : RT (x) ⩽ q_{RT} (1 - ε)}, \end{matrix}$ where $ε \in (0, 1)$ is the decision error and the threshold $q_{RT} (1 - ε)$ is the $(1 - ε)$ -quantile of the probability distribution $P_{RT}$ . The correctness of the algorithm comes from the inequality on the critical program region $C (P_{RT}, ε) = dom (U) ∖ D_{U} (P_{RT}, ε)$ : $Pr (C (P_{RT}, ε)) ⩽ ε$ .

Condition (3) can be equivalently stated in terms of running times as: $\begin{matrix} (4) & D_{T_{U}} (P_{RT}, ε) = {t \in T_{U} : t ⩽ q_{RT} (1 - ε)} . \end{matrix}$ The critical time region is $B (P_{RT}, ε) = {t \in T_{U} : t > q_{RT} (1 - ε)}$ and the correctness comes from $P_{RT} (B (P_{RT}, ε)) ⩽ ε$ .
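
The decision rule behind (3) and (4) can be sketched as follows: simulate the program for at most $q_{RT} (1 - ε)$ steps and report non-termination if it is still running. This is a minimal sketch, not the algorithm of [8] verbatim. It assumes, purely for illustration, the distribution $ρ (t) = 2^{- t}$ on $t ⩾ 1$ and a toy model in which a program is an iterator yielding once per step; all function names are hypothetical.

```python
from fractions import Fraction
from typing import Callable, Iterator

# Assumption: a program is a function returning an iterator, one yield per step.
Program = Callable[[], Iterator[None]]


def quantile_threshold(eps: Fraction) -> int:
    """q_RT(1 - eps) for the assumed distribution rho(t) = 2**(-t), t >= 1,
    whose CDF is 1 - 2**(-t): the least t with 2**(-t) <= eps."""
    t = 1
    while Fraction(1, 2 ** t) > eps:
        t += 1
    return t


def probabilistic_anytime(program: Program, eps: Fraction) -> bool:
    """Return True for 'halts' (always correct) and False for
    'does not halt' (possibly wrong, for programs halting after the cut-off)."""
    threshold = quantile_threshold(eps)
    steps = 0
    for _ in program():
        steps += 1
        if steps > threshold:
            return False
    return True


def countdown(n: int) -> Program:
    """Toy program that halts after exactly n steps."""
    def run() -> Iterator[None]:
        for _ in range(n):
            yield
    return run


def loop_forever() -> Iterator[None]:
    """Toy program that never halts."""
    while True:
        yield
```

With $ε = 1 / 100$ the assumed distribution gives the threshold 7, so `countdown(5)` is accepted, while `countdown(50)` is wrongly declared non-halting; under the assumed ρ such programs have total probability at most ε.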

4. Statistical framework

The notions and results discussed in Section 3 are based on the assumption that the computable probability distribution on the set of finite running times of programs of U and the random variable RT are known. In case this assumption is not satisfied, can an inferential approach be used to extract information about the true but unknown probability distribution of the random variable RT from observations of the phenomenon described by RT? More precisely, instead of working with the theoretical ${CDF}_{RT}$ , can we approximate the true, unknown probability distribution of the random variable RT by means of a (long-enough) sequence $X_{1}, \dots, X_{N}$ of independent, identically distributed random variables with the same distribution as RT? The answer is affirmative.

The following form of Hoeffding’s inequality (see [26]) is essential in what follows:

Theorem 4.1.
Let $N > 0$ be an integer and $a, b \in R$ , $a < b$ . For every $X_{1}, \dots, X_{N}$ independent random variables defined on $(Ω, B (Ω), Pr)$ with values in $[a, b]$ we have: $\begin{matrix} Pr ({ω \in Ω : \frac{1}{N} \sum_{i = 1}^{N} X_{i} (ω) - E (\frac{1}{N} \sum_{i = 1}^{N} X_{i}) ⩽ λ}) ⩾ 1 - exp (- \frac{2 N λ^{2}}{{(b - a)}^{2}}) . \end{matrix}$
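
A quick numerical sanity check of Theorem 4.1 is possible in the special case of indicator variables, where the left-hand side is an exact binomial probability. The sketch below assumes independent Bernoulli(p) variables with values in $[0, 1]$ ; the parameter values and function names are illustrative only.

```python
import math
from fractions import Fraction


def hoeffding_lower_bound(n: int, lam: float, a: float = 0.0, b: float = 1.0) -> float:
    """Right-hand side of Theorem 4.1: 1 - exp(-2 N lam^2 / (b - a)^2)."""
    return 1.0 - math.exp(-2.0 * n * lam ** 2 / (b - a) ** 2)


def exact_probability(n: int, p: Fraction, lam: Fraction) -> float:
    """Pr(mean - p <= lam) for n independent Bernoulli(p) variables,
    computed exactly from the binomial distribution."""
    max_successes = min(n, math.floor(n * (p + lam)))
    total = sum(
        math.comb(n, s) * p ** s * (1 - p) ** (n - s)
        for s in range(max_successes + 1)
    )
    return float(total)
```

For example, with `n = 50`, `p = Fraction(3, 10)` and `lam = Fraction(1, 10)` the exact probability is well above the bound $1 - exp (- 1)$ , which illustrates that the inequality is typically conservative.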

Consider the probability space $(Ω, B (Ω), Pr)$ , the random variable $X : Ω ⟶ A$ and N replicates $X_{1}, \dots, X_{N}$ of X. In what follows $(x_{1}, \dots, x_{N}) \in A^{N}$ will denote the observed values of a sample of size N corresponding to the random variables $X_{1}, \dots, X_{N}$ : $(x_{1}, \dots, x_{N}) = (X_{1} (ω), \dots, X_{N} (ω)) \in A^{N}$ . The vector $(x_{1}, \dots, x_{N})$ will be called an N-dimensional sample and its values $x_{1}, \dots, x_{N}$ data points. The Empirical Cumulative Distribution Function is defined by $\begin{matrix} (5) & {ECDF}_{X, N} (y) = \frac{# {1 ⩽ i ⩽ N : x_{i} ⩽ y}}{N}, y \in R . \end{matrix}$

Suppose that we order increasingly the observed data points and denote the sequence by $\begin{matrix} (6) & x_{(1)} ⩽ x_{(2)} ⩽ \dots ⩽ x_{(N - 1)} ⩽ x_{(N)} . \end{matrix}$

The order statistic of rank k is the kth smallest value in (6): $X_{(k)} (ω) = x_{(k)}$ . See more in [11, Ch. 6].
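
The empirical CDF (5) and the order statistics (6) are straightforward to compute from an observed sample. The following sketch uses exact fractions; the sample values and function names are hypothetical.

```python
from fractions import Fraction
from typing import Sequence


def ecdf(sample: Sequence[int], y: int) -> Fraction:
    """ECDF_{X,N}(y): fraction of data points that are <= y, as in (5)."""
    return Fraction(sum(1 for x in sample if x <= y), len(sample))


def order_statistic(sample: Sequence[int], k: int) -> int:
    """x_(k): the k-th smallest data point (1-indexed), as in (6)."""
    if not 1 <= k <= len(sample):
        raise ValueError("rank k must satisfy 1 <= k <= N")
    return sorted(sample)[k - 1]


# Illustrative 5-dimensional sample (an assumption of this example).
sample = [7, 3, 9, 3, 12]
```

Here `ecdf(sample, 7)` equals 3/5 and `order_statistic(sample, 4)` equals 9. When the sample contains ties, the empirical CDF at $x_{(k)}$ can exceed $k / N$ : in this sample `ecdf(sample, order_statistic(sample, 1))` is 2/5.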

The statistical anytime algorithm assumes that the probability distribution of RT is unknown. Therefore, the cumulative distribution function of RT, ${CDF}_{RT} (t) = Pr ({x \in dom (U) : RT (x) ⩽ t})$ , is also unknown and has to be estimated. For evaluating the quality of the approximation we fix a positive integer N and consider the true, unknown N-dimensional program sampling space $(dom {(U)}^{N}, B (dom {(U)}^{N}))$ .

The elements of $dom {(U)}^{N}$ will be denoted by $x = (x_{1}, \dots, x_{N})$ . The projections ${{pr}_{1}, \dots, {pr}_{N}}$ , ${pr}_{i} : dom {(U)}^{N} ⟶ dom (U)$ , ${pr}_{i} (x) = x_{i}$ , $i = 1, \dots, N$ , are independent random variables. If we put ${RT}_{i} = RT \circ {pr}_{i} : dom {(U)}^{N} ⟶ T_{U}$ , ${RT}_{i} (x) = RT (x_{i})$ , $i = 1, \dots, N$ , then ${{RT}_{1}, \dots, {RT}_{N}}$ are independent, identically distributed random variables. Furthermore, for every $1 ⩽ i ⩽ N$ , we have: $\begin{array}{l} {CDF}_{{RT}_{i}} (t) & = {Pr}^{N} ({x \in {(dom (U))}^{N} : RT (x_{i}) ⩽ t, 1 ⩽ i ⩽ N}) \\ = {Pr}^{N} (dom (U) \times \dots \times {x_{i} \in dom (U) : RT (x_{i}) ⩽ t} \times dom (U) \times \dots \times dom (U)) \\ = Pr (dom (U)) \dots Pr ({x_{i} \in dom (U) : RT (x_{i}) ⩽ t}) \cdot Pr (dom (U)) \dots Pr (dom (U)) \\ = {CDF}_{RT} (t) . \end{array}$

For every $x \in dom {(U)}^{N}$ we put ${RT}_{i} (x) = t_{i} (x)$ , $1 ⩽ i ⩽ N$ and denote the N-dimensional time sampling space by $(T_{U}^{N}, B (T_{U}^{N}), P_{RT}^{N})$ .

In the following Lemma 4.2, ${CDF}_{RT} (t)$ is estimated by the Empirical Cumulative Distribution Function (5) $\begin{array}{l} {ECDF}_{RT, N} (({RT}_{1} (x), \dots, {RT}_{N} (x)); t) & = \frac{# {1 ⩽ i ⩽ N : {RT}_{i} (x) ⩽ t}}{N} \\ (7) & = \frac{# {1 ⩽ i ⩽ N : t_{i} (x) ⩽ t}}{N} . \end{array}$
Lemma 4.2.
For every positive integer N, $t \in T_{U}$ and $λ \in (0, 1)$ , we have: $\begin{array}{l} {Pr}^{N} ({x \in dom {(U)}^{N} : {ECDF}_{RT, N} (({RT}_{1} (x), \dots, {RT}_{N} (x)); t) - {CDF}_{RT} (t) ⩽ λ}) \\ (8) & ⩾ 1 - exp (- 2 N \cdot λ^{2}) . \end{array}$
Proof.
On the one hand, from (7) we have: $\begin{matrix} {ECDF}_{RT, N} (({RT}_{1} (x), \dots, {RT}_{N} (x)); t) = \frac{1}{N} \sum_{i = 1}^{N} 1_{{x \in dom {(U)}^{N} : {RT}_{i} (x) ⩽ t}} . \end{matrix}$

On the other hand, using the linearity of the operator E we have: $\begin{array}{l} E (\frac{1}{N} \sum_{i = 1}^{N} 1_{{x \in dom {(U)}^{N} : {RT}_{i} (x) ⩽ t}}) & = \frac{1}{N} \sum_{i = 1}^{N} E (1_{{x \in dom {(U)}^{N} : {RT}_{i} (x) ⩽ t}}) \\ = \frac{1}{N} \sum_{i = 1}^{N} {Pr}^{N} ({x \in dom {(U)}^{N} : {RT}_{i} (x) ⩽ t}) \\ = {CDF}_{RT} (t) . \end{array}$

As ${RT}_{i} : dom {(U)}^{N} ⟶ T_{U}$ , ${RT}_{i} (x) = RT (x_{i})$ , $i = 1, \dots, N$ are independent random variables, for every $t \in T_{U}$ , $1_{{x \in dom {(U)}^{N} : {RT}_{i} (x) ⩽ t}}$ : $dom {(U)}^{N} ⟶ [0, 1]$ , $i = 1, \dots, N$ are also independent random variables. Consequently, the inequality (8) follows from Theorem 4.1 applied to the random variables ${1_{{x \in dom {(U)}^{N} : {RT}_{i} (x) ⩽ t}}, i = 1, \dots, N}$ . □

If we define the set of “good program samples” by $\begin{matrix} G_{N, λ, t} = {x \in dom {(U)}^{N} : {ECDF}_{RT, N} (({RT}_{1} (x), \dots, {RT}_{N} (x)); t) - λ ⩽ {CDF}_{RT} (t)}, \end{matrix}$ then by Lemma 4.2 we have $\begin{matrix} {Pr}^{N} (G_{N, λ, t}) ⩾ 1 - exp (- 2 N λ^{2}), \end{matrix}$ where λ is the precision parameter and $(1 - exp (- 2 N λ^{2}))$ can be interpreted as the confidence level that a sample x is in $G_{N, λ, t}$ , i.e. that it is a good program sample.

With this interpretation, Lemma 4.2 says that the probability ${Pr}^{N}$ of the set of programs $x \in dom {(U)}^{N}$ on which ${ECDF}_{RT, N} (({RT}_{1} (x), \dots, {RT}_{N} (x)); t)$ estimates ${CDF}_{RT} (t)$ with precision at least λ can be made as “large” as one wishes. To measure the size of this set (according to ${Pr}^{N}$ ) we introduce the confidence level $(1 - δ)$ by the condition $\begin{matrix} (9) & (1 - exp (- 2 N \cdot λ^{2})) ⩾ (1 - δ), \end{matrix}$ which is equivalent to $\begin{matrix} (10) & N ⩾ N (λ, δ) = ⌈ \frac{1}{2 λ^{2}} \cdot ln \frac{1}{δ} ⌉ . \end{matrix}$

The following result shows that for every $N ⩾ N (λ, δ)$ the set of good program samples $G_{N, λ, t}$ can be made as “large” as required in probability ${Pr}^{N}$ :

Corollary 4.3.
For every $t \in T_{U}$ , $λ \in (0, 1)$ , $δ \in (0, 1)$ and $N ⩾ ⌈ \frac{1}{2 λ^{2}} \cdot ln \frac{1}{δ} ⌉$ we have $\begin{matrix} (11) & {Pr}^{N} (G_{N, λ, t}) ⩾ 1 - δ . \end{matrix}$
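
The sample size (10) is easy to evaluate numerically. The following sketch uses floating-point arithmetic, which is an approximation but is adequate away from exact integer boundaries; the function names are hypothetical.

```python
import math


def sample_size(lam: float, delta: float) -> int:
    """N(lam, delta) = ceil( ln(1/delta) / (2 lam^2) ), as in (10)."""
    if not (0.0 < lam < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("lam and delta must lie in (0, 1)")
    return math.ceil(math.log(1.0 / delta) / (2.0 * lam ** 2))


def confidence(n: int, lam: float) -> float:
    """Lower bound 1 - exp(-2 N lam^2) on Pr^N(G_{N, lam, t}), as in (9)."""
    return 1.0 - math.exp(-2.0 * n * lam ** 2)
```

For instance, `sample_size(0.001, 0.01)` returns 2302586, and by construction `confidence(sample_size(lam, delta), lam)` is at least `1 - delta`.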

5. A statistical anytime algorithm for the Halting Problem

For a fixed confidence level $(1 - δ)$ , precision parameter λ and good program sample x (which produces a “reliable” estimate of ${CDF}_{RT}$ ) we use the critical time region (see (4)) to reject the hypothesis $H_{z}$ and to measure the probability of an erroneous decision of the anytime algorithm. Accordingly, for $ε, λ \in (0, 1)$ , the critical time region should satisfy the following two conditions: $\begin{matrix} (12) & B (RT, x; ε, λ) = {t \in T_{U} : t > threshold (x, ε, λ)}, P_{RT} (B (RT, x; ε, λ)) ⩽ ε . \end{matrix}$

For a sample of programs x we use the notation $t (x) = (t_{1} (x), \dots, t_{N} (x))$ , where $t_{i} = {RT}_{i} (x)$ , $1 ⩽ i ⩽ N$ . We increasingly order the observed running times $t_{i}$ and get the values of the corresponding order statistics $t_{(1)} (x) ⩽ \dots ⩽ t_{(N)} (x)$ . As one of these order statistics will be the choice for the threshold, $threshold (x, ε, λ)$ , we must find the smallest number $1 ⩽ K ⩽ N$ such that $x \in G_{N, λ, t_{(K)} (x)}$ . In terms of order statistics, $t_{(K)} (x)$ generates a statistical $threshold (x, ε, λ)$ which must satisfy (12). Explicitly these two requirements are: $\begin{array}{l} (13) & {ECDF}_{RT, N} ((t_{1} (x), \dots, t_{N} (x)); t_{(K)} (x)) - λ ⩽ {CDF}_{RT} (t_{(K)} (x)), \\ (14) & P_{RT} ({t \in T_{U} : t > t_{(K)} (x)}) ⩽ ε . \end{array}$

As from (7), $\begin{matrix} {ECDF}_{RT, N} ((t_{1} (x), \dots, t_{N} (x)); t_{(K)} (x)) = \frac{K}{N}, \end{matrix}$ both conditions are satisfied if $\begin{matrix} (15) & 1 - ε ⩽ \frac{K}{N} - λ ⩽ {CDF}_{RT} (t_{(K)} (x)) . \end{matrix}$

Indeed, from the definition of ${CDF}_{RT}$ , if $x \in G_{N, λ, t_{(K)}}$ , then $\begin{matrix} \frac{K}{N} - λ ⩽ {CDF}_{RT} (t_{(K)} (x)), \end{matrix}$ so (13) is satisfied. Furthermore, as $\begin{matrix} P_{RT} ({t \in T_{U} : t > t_{(K)} (x)}) = 1 - {CDF}_{RT} (t_{(K)} (x)), \end{matrix}$ if $x \in G_{N, λ, t_{(K)} (x)}$ and $1 - ε ⩽ \frac{K}{N} - λ$ , then (14) is satisfied.

From the first inequality in (15) we get $K ⩾ N (1 - ε + λ)$ . As we must have $0 < 1 - ε + λ < 1$ , we get $λ < ε$ . For $N = N (λ, δ)$ as in (10) we can take $K = K (ε, λ, δ) = ⌈ N (1 - ε + λ) ⌉$ – the minimum integer $1 ⩽ K ⩽ N$ satisfying (15) – hence $\begin{matrix} (16) & threshold (x, ε, λ) = t_{(K (ε, λ, δ))} (x) = t_{(⌈ N (1 - ε + λ) ⌉)} (x) . \end{matrix}$

From (15) we have $\begin{matrix} (17) & 1 - ε ⩽ {CDF}_{RT} (t_{(⌈ N (1 - ε + λ) ⌉)} (x)) . \end{matrix}$
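
The rank $K = ⌈ N (1 - ε + λ) ⌉$ and the resulting threshold can be computed directly from a list of observed running times. The sketch below uses exact rational arithmetic so that the ceiling is not affected by floating-point rounding; the function names are hypothetical.

```python
import math
from fractions import Fraction
from typing import Sequence


def threshold_rank(n: int, eps: Fraction, lam: Fraction) -> int:
    """K = ceil(N (1 - eps + lam)), which requires 0 < lam < eps < 1."""
    if not 0 < lam < eps < 1:
        raise ValueError("parameters must satisfy 0 < lam < eps < 1")
    return math.ceil(n * (1 - eps + lam))


def statistical_threshold(times: Sequence[int], eps: Fraction, lam: Fraction) -> int:
    """Order statistic t_(K) of the observed running times."""
    k = threshold_rank(len(times), eps, lam)
    return sorted(times)[k - 1]
```

For example, `threshold_rank(2302586, Fraction(5, 1000), Fraction(1, 1000))` returns 2293376. Since $1 - ε + λ < 1$ , the rank never exceeds N, so the index into the sorted list is always valid.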

The statistical anytime algorithm for the Halting Problem operates with three parameters: a) a bound $ε \in (0, 1)$ for the decision error, b) a precision parameter $0 < λ < ε$ which is a bound on the approximation of ${CDF}_{RT}$ with ${ECDF}_{RT, N}$ , and c) a confidence parameter $1 - δ \in (0, 1)$ which is a probabilistic bound on the confidence in the precision parameter. In detail, the precision parameter and confidence level control the quality of the approximation of ${CDF}_{RT}$ by ${ECDF}_{RT, N}$ : a) the precision parameter λ bounds the numerical difference between the values of the two functions at a given point, b) the confidence level is the probability that the N sampled programs produce an approximation of ${CDF}_{RT}$ with the requested precision, c) the decision error is the probability that the decision $‘ U (z) = \infty^{'}$ is returned when, in reality, $U (z) < \infty$ .

We sample N independent halting programs $x_{1}, \dots, x_{N} \in dom (U)$ (see [3,18]) and, by running them till they stop, calculate their respective running times $t_{1} (x), \dots, t_{N} (x) \in T_{U}$ . Let $x = (x_{1}, \dots, x_{N})$ and $t (x) = (t_{1} (x), \dots, t_{N} (x))$ . Randomisation is done according to the probability distribution induced by an injective computable enumeration of the halting programs.

The statistical anytime algorithm is:

We now evaluate the error the statistical anytime algorithm can make by deciding that $U (z)$ does not stop when in fact it stops. To this aim we use the threshold $t_{(⌈ N (1 - ε + λ) ⌉)}$ and the critical regions $\begin{array}{l} B (RT, x; ε, λ) = {t \in T_{U} : t > t_{(⌈ N (1 - ε + λ) ⌉)} (x)}, \\ C (RT, x; ε, λ) = {y \in dom (U) : RT (y) > t_{(⌈ N (1 - ε + λ) ⌉)} (x)} . \end{array}$

Lemma 5.1.
For every $x \in dom {(U)}^{N}$ , $ε, λ \in (0, 1)$ with $λ < ε$ , we have: $\begin{matrix} (18) & Pr (C (RT, x; ε, λ)) = P_{RT} (B (RT, x; ε, λ)) . \end{matrix}$
Proof.
We have: $\begin{array}{l} Pr (C (RT, x; ε, λ)) & = P_{RT} ({t \in T_{U} : t > t_{(⌈ N (1 - ε + λ) ⌉)} (x)}) = P_{RT} (B (RT, x; ε, λ)) . \end{array}$ □
Lemma 5.2.
For every integer $N > 0$ , $ε, λ \in (0, 1)$ with $λ < ε$ , we have: $\begin{array}{l} {Pr}^{N} ({x \in dom {(U)}^{N} : P_{RT} (B (RT, x; ε, λ)) ⩽ ε}) \\ (19) & ⩾ {Pr}^{N} ({x \in dom {(U)}^{N} : {CDF}_{RT} (t_{(⌈ N (1 - ε + λ) ⌉)} (x)) ⩾ 1 - ε}) . \end{array}$
Proof.
We only need to prove the implication: $\begin{matrix} {CDF}_{RT} (t_{(⌈ N (1 - ε + λ) ⌉)} (x)) ⩾ 1 - ε ⟹ P_{RT} (B (RT, x; ε, λ)) ⩽ ε . \end{matrix}$

If $T_{1} = {k \in T_{U} : 1 ⩽ k ⩽ t_{(⌈ N (1 - ε + λ) ⌉)} (x)}$ and $T_{2} = {j \in T_{U} : j > t_{(⌈ N (1 - ε + λ) ⌉)} (x)}$ , then $T_{1} \cap T_{2} = \emptyset$ , ${CDF}_{RT} (t_{(⌈ N (1 - ε + λ) ⌉)} (x)) = P_{RT} (T_{1})$ and $P_{RT} (B (RT, x; ε, λ)) = P_{RT} (T_{2})$ .

Consequently, if $P_{RT} (T_{1}) ⩾ 1 - ε$ , then $\begin{matrix} 1 - ε + P_{RT} (T_{2}) ⩽ P_{RT} (T_{1}) + P_{RT} (T_{2}) ⩽ P_{RT} (T_{1} \cup T_{2}) ⩽ 1, \end{matrix}$ so $P_{RT} (B (RT, x; ε, λ)) = P_{RT} (T_{2}) ⩽ ε$ . □
Theorem 5.3.
For every $ε, λ, δ \in (0, 1)$ with $λ < ε$ and $N = N (λ, δ)$ we have: $\begin{matrix} (20) & {Pr}^{N} ({x \in dom {(U)}^{N} : Pr (C (RT, x; ε, λ)) ⩽ ε}) ⩾ 1 - δ . \end{matrix}$
Proof.
From the definition (7) and the choice of the statistical threshold (16) we have $\begin{matrix} {ECDF}_{RT, N} (({RT}_{1} (x), \dots, {RT}_{N} (x)); t_{(⌈ N (1 - ε + λ) ⌉)} (x)) = \frac{⌈ N (1 - ε + λ) ⌉}{N} . \end{matrix}$ In view of (17), (18), (19) and (11) we have $\begin{array}{l} {Pr}^{N} ({x \in dom {(U)}^{N} : Pr (C (RT, x; ε, λ)) ⩽ ε}) \\ = {Pr}^{N} ({x \in dom {(U)}^{N} : P_{RT} (B (RT, x; ε, λ)) ⩽ ε}) \\ ⩾ {Pr}^{N} ({x \in dom {(U)}^{N} : {CDF}_{RT} (t_{(⌈ N (1 - ε + λ) ⌉)} (x)) ⩾ 1 - ε}) \\ ⩾ {Pr}^{N} ({x \in dom {(U)}^{N} : {CDF}_{RT} (t_{(⌈ N (1 - ε + λ) ⌉)} (x)) ⩾ \frac{⌈ N (1 - ε + λ) ⌉}{N} - λ}) \\ ⩾ 1 - δ . \end{array}$ □

According to (20), with probability ${Pr}^{N}$ at least $1 - δ$ over the sample x, the probability that the statistical anytime algorithm gives a wrong decision, that is, that it declares $U (z) = \infty$ when there exists $t > t_{(⌈ N (1 - ε + λ) ⌉)} (x)$ such that $U (z)$ stops in time t, is smaller than or equal to ε. In other words, the probability of error is at most ε with confidence at least $1 - δ$ .
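
The guarantee (20) can be checked empirically when the running-time distribution is known. The Monte Carlo sketch below assumes, purely for illustration, that running times are geometric with tail $P_{RT} (t > T) = {(1 - p)}^{T}$ ; it repeats the pre-processing many times and counts how often the resulting threshold has true error probability at most ε. The distribution, the parameter values and the function names are all assumptions of this example.

```python
import math
import random
from fractions import Fraction


def geometric_time(rng: random.Random, p: float) -> int:
    """Draw a running time t >= 1 with Pr(t > k) = (1 - p)**k."""
    u = 1.0 - rng.random()  # uniform on (0, 1]
    return 1 + int(math.log(u) / math.log(1.0 - p))


def coverage(eps: Fraction, lam: Fraction, delta: Fraction,
             p: float = 0.2, trials: int = 200, seed: int = 0) -> float:
    """Fraction of simulated samples x for which P_RT(B(RT, x; eps, lam)) <= eps."""
    rng = random.Random(seed)
    n = math.ceil(math.log(1.0 / float(delta)) / (2.0 * float(lam) ** 2))
    k = math.ceil(n * (1 - eps + lam))  # exact, since eps and lam are Fractions
    good = 0
    for _ in range(trials):
        times = sorted(geometric_time(rng, p) for _ in range(n))
        threshold = times[k - 1]
        # True tail probability beyond the threshold under the assumed model.
        if (1.0 - p) ** threshold <= float(eps):
            good += 1
    return good / trials
```

With $ε = 1 / 10$ , $λ = 1 / 20$ and $δ = 1 / 10$ the theorem guarantees coverage of at least 0.9; since Hoeffding's inequality is conservative, the observed coverage under this model is expected to be considerably higher.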

6. Implementations of the statistical anytime algorithm

In this section we present three implementations of the anytime algorithm; numerical illustrations show that their time complexities are reasonably small.

The standard implementation of the statistical anytime algorithm is as follows. Given three rational numbers $ε, λ, δ \in (0, 1)$ with $λ < ε$ , first compute the sample size from (10), $N = ⌈ \frac{1}{2 λ^{2}} \cdot ln \frac{1}{δ} ⌉$ ; this positive integer is fixed as long as λ, δ are fixed. Then use an algorithm to generate a random injective computable enumeration of $dom (U)$ till N programs $x_{1}, \dots, x_{N}$ and their running times $t_{1}, \dots, t_{N}$ are obtained; again, these programs are fixed with ε, λ, δ. Then, for every program $z \in Z^{+}$ , if the computation $U (z)$ does not stop in time $t_{(⌈ N (1 - ε + λ) ⌉)} (x)$ , then declare that $U (z) = \infty$ . In the latter case the probability of error is smaller than or equal to ε with confidence larger than $1 - δ$ .
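
The two stages of the standard implementation can be sketched as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the running times of the sampled halting programs are taken as given (how they are sampled is outside the scope of the example), a program is modelled as an iterator that yields once per step, and the names `preprocess`, `decide`, `countdown` and `loop_forever` are hypothetical.

```python
import math
from fractions import Fraction
from typing import Callable, Iterator, Sequence

# Assumption: a program is a function returning an iterator, one yield per step.
Program = Callable[[], Iterator[None]]


def preprocess(times: Sequence[int], eps: Fraction, lam: Fraction,
               delta: Fraction) -> int:
    """Pre-processing, done once per (eps, lam, delta): return the cut-off t_(K)."""
    if not (0 < lam < eps < 1 and 0 < delta < 1):
        raise ValueError("need 0 < lam < eps < 1 and 0 < delta < 1")
    n = math.ceil(math.log(1.0 / float(delta)) / (2.0 * float(lam) ** 2))
    if len(times) < n:
        raise ValueError(f"need at least {n} sampled running times")
    sample = sorted(times[:n])
    k = math.ceil(n * (1 - eps + lam))  # exact rank K = ceil(N (1 - eps + lam))
    return sample[k - 1]


def decide(program: Program, threshold: int) -> bool:
    """Main part, run per input: True means 'halts' (always correct),
    False means 'declared non-halting' (possibly wrong)."""
    steps = 0
    for _ in program():
        steps += 1
        if steps > threshold:
            return False
    return True


def countdown(n: int) -> Program:
    """Toy program that halts after exactly n steps."""
    def run() -> Iterator[None]:
        for _ in range(n):
            yield
    return run


def loop_forever() -> Iterator[None]:
    """Toy program that never halts."""
    while True:
        yield
```

The cut-off returned by `preprocess` depends only on the sample and on ε, λ, δ, so it can be stored and reused for every input program passed to `decide`.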

In Table 1 we illustrate numerically $N (λ, δ)$ and $⌈ N (λ, δ) \cdot (1 - ε + λ) ⌉$ for the first implementation with fixed parameters ε, λ, δ having a few statistically standard values.

Table 1
Numerical illustration of the first implementation of the anytime algorithm

δ	ε	λ ( $< ε$ )	$N (λ, δ)$	$⌈ N (λ, δ) \cdot (1 - ε + λ) ⌉$
$\frac{1}{100}$	$\frac{5}{1000}$	$\frac{1}{1000}$	$2.3026 \times 10^{6}$	$2.2934 \times 10^{6}$
$\frac{1}{100}$	$\frac{1}{1000}$	$\frac{5}{10000}$	$9.2103 \times 10^{6}$	$9.2057 \times 10^{6}$
$\frac{5}{1000}$	$\frac{5}{1000}$	$\frac{1}{1000}$	$2.6492 \times 10^{6}$	$2.6386 \times 10^{6}$
$\frac{5}{1000}$	$\frac{1}{1000}$	$\frac{5}{10000}$	$1.0597 \times 10^{7}$	$1.0592 \times 10^{7}$
$\frac{1}{1000}$	$\frac{5}{1000}$	$\frac{1}{1000}$	$3.4539 \times 10^{6}$	$3.4401 \times 10^{6}$
$\frac{1}{1000}$	$\frac{1}{1000}$	$\frac{5}{10000}$	$1.3816 \times 10^{7}$	$1.3809 \times 10^{7}$

In a second implementation we start with two rational numbers $ε, λ \in (0, 1)$ with $λ < ε$ and an “affordable” size $\tilde{N}$ of samples (programs and running times), then compute a rational $δ (\tilde{N}, λ) \in (0, 1)$ satisfying the inequality (9). We continue with the standard implementation of the statistical anytime algorithm with parameters ε, λ ( $λ < ε$ ) and $\begin{matrix} (21) & δ (\tilde{N}, λ) = exp (- 2 \tilde{N} \cdot λ^{2}) \end{matrix}$ to calculate the sample size $N (λ, δ (\tilde{N}, λ)) = \tilde{N}$ . However, the value $δ (\tilde{N}, λ)$ in (21) is not rational, so to preserve the inequality (9) we need to calculate a rational approximation from above, $\begin{matrix} δ (\tilde{N}, λ) ⩾ exp (- 2 \tilde{N} \cdot λ^{2}), \end{matrix}$ which implies the inequality $N (λ, δ (\tilde{N}, λ)) < \tilde{N} + 1$ .

The “price” paid for working with an affordable, smaller sample size $\tilde{N}$ is a (possibly sharp) decrease in the confidence level. Table 2 gives a numerical illustration of the second implementation with fixed parameters ε, λ, $\tilde{N}$ .

Table 2

Numerical illustration of the second implementation of the anytime algorithm

ε	λ ( $< ε$ )	$\tilde{N}$	$δ (\tilde{N}, λ)$	$⌈ \tilde{N} (1 - ε + λ) ⌉$
$\frac{1}{100}$	$\frac{5}{1000}$	$10^{5}$	$6.7379 \times 10^{- 3}$ (very good)	$9.95 \times 10^{4}$
$\frac{1}{100}$	$\frac{4}{1000}$	$10^{5}$	$4.0762 \times 10^{- 2}$ (good)	$9.94 \times 10^{4}$
$\frac{1}{100}$	$\frac{5}{1000}$	$2 \cdot 10^{5}$	$4.54 \times 10^{- 5}$ (excellent)	$1.99 \times 10^{5}$
$\frac{1}{100}$	$\frac{4}{1000}$	$2 \cdot 10^{5}$	$1.6616 \times 10^{- 3}$ (very good)	$1.988 \times 10^{5}$
$\frac{1}{100}$	$\frac{1}{1000}$	$10^{6}$	$1.3534 \times 10^{- 1}$ (hardly acceptable)	$9.91 \times 10^{5}$
$\frac{1}{1000}$	$\frac{5}{10000}$	$10^{6}$	$6.0653 \times 10^{- 1}$ (unacceptable)	$9.995 \times 10^{5}$
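
The computation of a rational $δ (\tilde{N}, λ)$ that approximates (21) from above can be sketched as follows. Rounding up to a fixed number of decimal digits is one possible choice, not necessarily the one used for Table 2; the exponential is evaluated in floating point, so the result is only as reliable as that evaluation. The function names and the `digits` parameter are hypothetical.

```python
import math
from fractions import Fraction


def rational_delta(n_tilde: int, lam: Fraction, digits: int = 6) -> Fraction:
    """Rational delta >= exp(-2 N~ lam^2), rounded up to `digits` decimals."""
    exact = math.exp(-float(2 * n_tilde * lam ** 2))
    scale = 10 ** digits
    delta = Fraction(math.ceil(exact * scale), scale)
    if delta >= 1:
        raise ValueError("sample size too small: delta would not lie in (0, 1)")
    return delta


def sample_size(lam: Fraction, delta: Fraction) -> int:
    """N(lam, delta) = ceil( ln(1/delta) / (2 lam^2) ), as in (10)."""
    return math.ceil(math.log(1.0 / float(delta)) / (2.0 * float(lam) ** 2))
```

For $\tilde{N} = 10^{5}$ and $λ = 5 / 1000$ this gives $δ = 6738 / 10^{6}$ , slightly above $exp (- 5)$ , and the sample size recomputed from this δ does not exceed $\tilde{N}$ .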

In a third implementation we start with two rational numbers $ε, λ \in (0, 1)$ with $λ < ε$ and an “affordable” upper bound T on the running time of the computation $U (z)$ in the statistical anytime algorithm. We then use an injective dovetailing algorithm to generate as many elements of $dom (U)$ as possible in time T. In this way we obtain a sample of $N (T)$ programs $x_{1}, \dots, x_{N (T)}$ and their respective running times $t_{1}, \dots, t_{N (T)}$ such that each program stops in time at most (in fact, much smaller than) T: $t_{i} ⩽ T$ , for all $1 ⩽ i ⩽ N (T)$ . We then continue with the second implementation with parameters $ε, λ, \tilde{N} = N (T)$ .
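
The dovetailing step can be sketched on a toy family of programs. In each stage one new program is started and every active program is advanced by one step, until a total budget of simulation steps is used up; each program is started once, so no halting program is listed twice. The program family `toy_program` (odd indices halt, even indices loop) and the function names are assumptions made only for this example.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def toy_program(index: int) -> Iterator[None]:
    """Hypothetical program family: odd indices halt after `index` steps,
    even indices never halt. One yield corresponds to one step."""
    if index % 2 == 1:
        for _ in range(index):
            yield
    else:
        while True:
            yield


def dovetail(make_program: Callable[[int], Iterator[None]],
             total_steps: int) -> List[Tuple[int, int]]:
    """Return (index, running time) for every program seen to halt while
    spending at most `total_steps` simulation steps overall."""
    running: Dict[int, list] = {}          # index -> [iterator, steps so far]
    halted: List[Tuple[int, int]] = []
    used = 0
    next_index = 1
    while used < total_steps:
        running[next_index] = [make_program(next_index), 0]
        next_index += 1
        for index in list(running):
            if used >= total_steps:
                break
            iterator, steps = running[index]
            used += 1
            try:
                next(iterator)
                running[index][1] = steps + 1
            except StopIteration:
                halted.append((index, steps))  # halting detected on this probe
                del running[index]
    return halted
```

With a budget of 50 steps, `dovetail(toy_program, 50)` starts with `(1, 1)` and `(3, 3)`, and every recorded running time is bounded by the budget. The number of pairs returned plays the role of $N (T)$ .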

Both the second and the third implementations can be improved by increasing the sample size or the time bound, respectively.

Approximations in algorithmic information theory [4], for example, the approximations of the Solomonoff universal distribution¹ [19], involve constants; in contrast, all implementations of the statistical anytime algorithm are free from such uncertainty.

¹ The Solomonoff universal distribution [19] is an a priori incomputable probability distribution over the set of finite binary strings, which is different from the computable probability distribution on the set of stopping times discussed in Section 3.

7. Final comments

The anytime probabilistic algorithm for the Halting Problem proposed in [8] uses essentially a computable probability distribution on the set of stopping times of halting programs which reflects the halting behaviour of the chosen universal machine. The quantile of this probability distribution is used to compute the stopping threshold time. The probability of a wrong decision is no larger than the accepted error.

The statistical anytime algorithm for the Halting Problem, which is inspired by the probabilistic one, does not make any assumption on the probability distribution on the halting programs and uses an order statistic to compute the stopping threshold time (the cut-off temporal bound). In a nutshell, this anytime algorithm works on an arbitrary program as follows:

“Sample” sufficiently many halting programs independently at random.

Determine their running times and consider the induced empirical distribution as an approximation to the true but unknown distribution.

Simulate the given program for the largest number of steps made by any of the sampled programs: if it still has not terminated by then, report (possibly wrongly) ‘The program does not halt!’.

In detail, the statistical algorithm uses three parameters for evaluating the quality of solutions, namely the probability of an erroneous decision ε, the precision λ and the confidence level δ of the statistical approximation. The sample size and critical regions are constructed based on these parameters. The main advantage of the statistical algorithm is that it can be implemented without any prior information about the distribution of running times. Another advantage is that the threshold (cut-off) temporal bound $⌈ N (λ, δ) \cdot (1 - ε + λ) ⌉$ is calculated only once (when $ε, λ, δ \in (0, 1)$ with $λ < ε$ are fixed) and then used for running the algorithm on any input. We proved that with a confidence level as large as required, the algorithm produces correct decisions with a probability as large as required.
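The decision step can be sketched in code. The sketch assumes, as a reading of the description above rather than a statement from the paper, that the cut-off is the running time at rank $⌈ N \cdot (1 - ε + λ) ⌉$ in the sorted sample; the nutshell version corresponds to taking the largest sampled running time instead. The names `cutoff_time` and `decide`, and the modelling of a program as an iterator that yields once per step, are illustrative assumptions.

```python
import math
from fractions import Fraction
from itertools import count


def cutoff_time(running_times, eps, lam):
    """Running time at rank ceil(N * (1 - eps + lam)) of the sorted sample.

    Passing eps and lam as Fractions keeps the ceiling exact.
    """
    n = len(running_times)
    rank = math.ceil(n * (1 - eps + lam))   # 1-based rank, at most n as lam < eps
    return sorted(running_times)[rank - 1]


def decide(program, cutoff):
    """Simulate `program` (an iterator yielding once per step) up to `cutoff`."""
    steps = 0
    for _ in program:
        steps += 1
        if steps > cutoff:
            # The program ran longer than the cut-off: stop simulating.
            return "does not halt (possibly wrong)"
    return "halts"


if __name__ == "__main__":
    times = list(range(1, 11))                        # toy sample of size 10
    cutoff = cutoff_time(times, Fraction(1, 5), Fraction(1, 10))
    print(cutoff)
    print(decide(iter(range(5)), cutoff))             # halts after 5 steps
    print(decide(count(), cutoff))                    # never halts
```

The cut-off is computed once from the sample and then reused for every input program, which matches the split into a pre-processing part and a main part.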

The main advantage of the anytime statistical algorithm is that it can be implemented without any prior information about the running times on the specific model of computation – the choice of a computable probability distribution for the anytime probabilistic algorithm can be rather subjective and hard to substantiate; also, the cut-off temporal bound is reasonably small.

Finally, three implementations of the algorithm have been presented and numerically illustrated. Recent experimental work with Turing machines in [16] shows that “the halting probability of a TM decreases with time and will have a smaller chance of halting at every step it progresses”, reflecting the behaviour discovered in [9] which is at the core of both the anytime probabilistic and the statistical algorithms for the Halting Problem. It will be interesting to experiment with these algorithms, particularly the statistical one, on different models of computation in order to understand their practical utility.

Acknowledgements

We thank Dr. Ned Allen for discussions and comments and for motivating one author (CC) to study practical approximate solutions to the Halting Problem. We thank the anonymous referees for useful comments that improved the presentation; we distinctly thank the referee who made many excellent proposals including the description of the statistical anytime algorithm presented in Section . This work was supported in part by the Quantum Computing Research Initiatives at Lockheed Martin.

References

1. B.C. Arnold, Balakrishnan and H.N. Nagaraja, A First Course in Order Statistics, John Wiley, New York, 2008.

2. Bienvenu, Desfontaines and Shen, What percentage of programs halt?, in: Automata, Languages, and Programming I, M.M. Halldórsson, Iwama, Kobayashi and Speckmann, eds, LNCS, Vol. 9134, Springer, 2015, pp. 219–230. doi:10.1007/978-3-662-47672-7_18.

3. Bringmann and Panagiotou, Efficient sampling methods for discrete distributions, Algorithmica (2016), 1–25.

4. Calude, Theories of Computational Complexity, North Holland, Amsterdam, 1988.

5. C.S. Calude, Information and Randomness: An Algorithmic Perspective, 2nd edn, Springer, Berlin, 2002.

6. C.S. Calude and Desfontaines, Universality and almost decidability, Fundamenta Informaticae 138(1–2) (2015), 77–84. doi:10.3233/FI-2015-1199.

7. C.S. Calude and Desfontaines, Anytime algorithms for non-ending computations, International Journal of Foundations of Computer Science 26(4) (2015), 465–475. doi:10.1142/S0129054115500252.

8. C.S. Calude and Dumitrescu, A probabilistic anytime algorithm for the Halting Problem, Computability 7 (2018), 259–271. doi:10.3233/COM-170073.

9. C.S. Calude and M.A. Stay, Most programs stop quickly or never halt, Advances in Applied Mathematics 40 (2008), 295–308. doi:10.1016/j.aam.2007.01.001.

10. Cook, Podelski and Rybalchenko, Proving program termination, Communications ACM 54(5) (2011), 88–98. doi:10.1145/1941487.1941509.

11. DasGupta, Probability for Statistics and Machine Learning, Springer, New York, 2011.

12. Downey and Hirschfeldt, Algorithmic Randomness and Complexity, Springer, Heidelberg, 2010.

13. Grass, Reasoning about computational resource allocation. An introduction to anytime algorithms, Magazine Crossroads 3(1) (1996), 16–20. doi:10.1145/332148.332154.

14. J.D. Hamkins and Miasnikov, The halting problem is decidable on a set of asymptotic probability one, Notre Dame Journal of Formal Logic 47(4) (2006), 515–524. doi:10.1305/ndjfl/1168352664.

15. Köhler, Schindelhauer and Ziegler, On approximating real-world halting problems, in: Fundamentals of Computation Theory 2005, Liskiewicz and Reischuk, eds, LNCS, Vol. 3623, Springer, 2005, pp. 454–466. doi:10.1007/11537311_40.

16. Krzyzanska, Exploring halting times for unconventional halting schemes, Complex Systems 27(1) (2018), 85–99. doi:10.25088/ComplexSystems.27.1.85.

17. R.H. Lathrop, On the learnability of the uncomputable, in: Proceedings International Conference on Machine Learning, Saitta, ed., Morgan Kaufmann, 1996, pp. 302–309.

18. P.S. Levy and Lemeshow, Sampling of Populations. Methods and Applications, 3rd edn, John Wiley, 1999.

19. Li and P.M.B. Vitányi, An Introduction to Kolmogorov Complexity and Its Applications, 3rd edn, Springer Verlag, New York, 2008.

20. Lynch, Approximations to the Halting Problem, Journal of Computer and System Sciences 9 (1974), 143–150. doi:10.1016/S0022-0000(74)80003-6.

21. Yu.I. Manin, A Course in Mathematical Logic for Mathematicians, 2nd edn, Springer, Berlin, 2010.

22. Yu.I. Manin, Renormalisation and computation II: Time cut-off and the Halting Problem, Mathematical Structures in Computer Science 22 (2012), 729–751. doi:10.1017/S0960129511000508.

23. Mori, Tsujii and Yasugi, Computability of probability distributions and distribution functions, in: 6th International Conference on Computability and Complexity in Analysis, Schloss Dagstuhl–Leibniz-Zentrum Für Informatik, Bauer, Hertling and K.-I. Ko, eds, Dagstuhl, 2009, pp. 185–196.

24. Olofsson, Probability, Statistics, and Stochastic Processes, Wiley-Interscience, New York, 2005.

25. Rybalov, On the generic undecidability of the halting problem for normalized Turing machines, Theory of Computing Systems (2016), 1–6.

26. Scott, Statistical learning theory, topic 3: Hoeffding’s inequality, University of Toronto, 2014, https://www.coursehero.com/file/18068309/03-hoeffding, retrieved 4 June 2019.

27. Soler-Toscano, Zenil, J.-P. Delahaye and Gauvrit, Calculating Kolmogorov complexity from the output frequency distributions of small Turing machines, PLoS ONE 9(5) (2014), e96223.

28. Weihrauch, Computable Analysis. An Introduction, Springer, Berlin, 2000.

29. Zenil, Computer runtimes and the length of proofs, in: Computation, Physics and Beyond, M.J. Dinneen, Khoussainov and Nies, eds, LNCS, Vol. 7160, Springer, 2012, pp. 224–240. doi:10.1007/978-3-642-27654-5_17.

30. Zenil and J.-P. Delahaye, On the algorithmic nature of the world, in: Information and Computation. Essays on Scientific and Philosophical Understanding of Foundations of Information and Computation, Dodig-Crnkovic and Burgin, eds, World Scientific, Singapore, 2010, pp. 477–499.