This paper considers a family of so-called 2-partitions of some finite set. Each of them divides the set under study into two disjoint parts. Under the assumption that two such partitions are chosen randomly, the exact probability distribution of the special cluster metric on this family is found. On this basis, a new statistical test for checking the significance of differences between 2-partitions is proposed. In addition, the distribution of the values of this metric is found for the case when both partitions are of the ledge type in ordering the set of objects in ascending order of values of some numerical indicator. This means that one of the parts of each partition, which in some sense is the main one, is a segment. The boundaries of such a segment are called normative. By comparing various estimates of the normative boundaries based on sample data, it is introduced the concept of indicative certainty of the numerical indicator. It can be regarded as the degree of confidence in this indicator as a basis for decision whether an object belongs to the main set of the ledge partition. Some application of the results to medical data processing is considered.
One of the urgent issues in structuring a set of statistical data is the problem of comparing different ways of dividing this set into disjoint parts. Such a problem appears, for example, in medicine when trying to make differential diagnostics of incoming patients by different methods. A relatively small difference in the partitions, one of which was obtained by traditional, generally accepted method, and the other by a new, for example, less expensive one, can contribute to the decision to introduce a new method into medical practice. Another task requiring similar comparison is the estimation of the labor costs required to translate actually existing partition into another one. Such an assessment is carried out by calculating the so-called editorial metrics of Levenshtein or Jaro (see, for example, Cohen, 2003), and the like.
Obviously, a correct estimate of the degree of difference between two partitions is possible only when there is a metric on the family of partitions of the main set of objects. Some ways to define such a metric are discussed in Inal, 2016. See also the bibliography there. Sometimes it turns out to be convenient to introduce some function of two arguments on a family of partitions, which is not a metric, but it is easy to study and interpret within the framework of the class of issues to be solved. Probably the best known functions of this kind are the Kullback-Leibler divergence and the mutual information of two partitions (see Kullback & Leibler, 1951; Archer et al., 2013, respectively).
The conclusion about the degree of difference between the partitions in the simplest way can be made by comparing the value of the metric on the actual pair of partitions with its maximum possible value. For a special cluster metric on , the calculation method of which will be presented a little below, this step was taken by the autor in Dronov, 2021. It turned out that the maximum value of the metric when fixing the partition A, which is referred to as the main one, strongly depends on the specific form of A. Therefore the use of a unified coefficient of the relative difference of any pairs of partitions, as was originally proposed in Dronov and Dementjeva, 2012, apparently turns out to be unreasonably rude.
For a more accurate and reasoned conclusion regarding the significance of the difference of partitions, one should, as it is common in data processing, compare the probability of an actually existing or greater difference between randomly selected partitions with a certain threshold value. To determine such a threshold, it is necessary to find the distribution of a discrete random variable when choosing partitions A, B from the family at random. Unfortunately, this problem in the most general setting turned out to be too difficult. We need an easier start.
So, mostly keeping in mind medical applications, it was decided to restrict ourselves to the case when each of the partitions A, B is composed of two subsets only. For example, if each of the partitions is a result of a different diagnostic method, then it is composed of a set of patients for whom a certain diagnosis was accepted and a set of those for whom it was not confirmed.
Another plus of this restriction is that for an arbitrary partition A of the chosen structure the maximum of the metric value on the partitions B of the same type does not depend on A.
Further it will be considered two types of 2-partitions. In the first, no additional assumptions are made. Having a partition of a second type, it is assumed that all objects are pre-ordered according to the value of some numerical indicator . Then any admissible partition is constructed as follows: its main set includes those objects in which the indicator lies within some segment while all other objects belong to the second set. For different admissible partitions, the segments may differ. Let’s call the boundaries of this segment (and the segment itself) normative.
Such 2-partitions often appear in the processing of medical data. For example, the presence of an inflammatory process is related to whether the level of blood leukocytes goes beyond the boundaries of 4 to 9 billion per liter. The ability of an athlete to perform a certain difficult element often depends on his age or weight category. Each of these categories is determined by setting the boundaries of age or weight and these boundaries are normative. There are many such examples from very different areas. This type of dependence is studied, for example, in Li et al., 2018.
The situation described was studied partially in Dronov and Boyko, 2015, and there such partitions were called ledge partitions. At the same time, it was demonstrated there that the normative boundaries for , as a rule, according to sample data are ambiguous. Consequently, comparing the differences between the two ledge partitions arising at different choices of normative boundaries with the same sample data, one can assess the degree of competence of the decision made on the basis of the values of the indicator . Or, in medical language, the less admissible partitions show differences in the choice of normative boundaries, the more reliable the diagnosis can be made using solely.
The main task of the work is to find the exact probability distribution of the metric under the assumption that each of the partitions A, B divides the main finite set of objects into two disjoint parts, and these partitions are chosen randomly and independently. As an immediate consequence, a statistical test establishing the significance of the differences between such partitions arrives.
Main theorem
Any set further is assumed to be finite. Denote by the number of elements of a set . An arbitrary partition of a set into two parts that have no common elements, will be called a 2-partition. Each of the two parts of the main set that make up the partition will be considered as a cluster.
You can imagine the construction of 2-partition as follows. Suppose that for each of the objects of the main set , the values of some binary characteristic are given. In other words, there is a sample of size consisting only of zeros and ones. Then put
so and 2-partition is constructed.
Such partition will be called induced by the values of the binary characteristic . It is probably worth emphasizing that when passing from a family of all possible 2-partitions to a family of partitions induced like that, the number of elements of the family doubles. Indeed, replacing all values 1 by 0 and vice versa, does not change the partition, but the binary characteristic is formally different.
Any 2-partiton consisting of clusters will be referred to as .
To estimate the degree of difference between two partitions with some numerical value, the cluster metric on the family of all partitions of will be used. Its rigorous definition will not be important to us, but, according to the results of Dronov and Eudokimov, 2018 (where the definition of is given in the most general form), for , it can be calculated using the formula
that will be used in what follows instead of the definition.
Let’s move on to listing all possible values of the metric on the family of 2-partitions of the set . For the clusters of partitions A, B the notation will be retained. Let [] denote an integer part of a number and .
Lemma 1. Equation (1) can be converted to any of the types
On the integer values of their arguments both have maxima [] which are both reached when the value of the argument is [].
It is clear that if is even, there is exactly one maximum point on the set of integer arguments, for example, for a function . For an odd , there are two maxima of the same magnitude attained at .
Everything concerning this maximum, as well as the maximum of , is a trivial consequence of the properties of a quadratic function of one variable.
Now turn to the substantiation of Eqs (2) and (3).
This completes the proof of Eq. (2). And for Eq. (3) it remains only to note that
An immediate consequence of this proof is
Lemma 2. All possible values of the metric d on a family of 2-partitions are contained in the set of integers and can be specified by the sequential listing of values for integers . The value is certainly attained on some pair of 2-partitions.
Each of its values, except, possibly, the maximum, takes at two different , but only one of them is such that .
Lemma 3. Let integer be such that , . Then if then the number of ordered pairs of 2-partitions of the main set such that is regardless of the parity of . If is even, then for the maximum value there are required pairs.
Proof. Obviously, to obtain the value of the metric equal to , it is necessary and sufficient to construct partitions such that, in the notation introduced in the proof of Lemma 1, . Let’s explore over all such possibilities as follows: first, choose some consisting of elements (in ways), and then proceed to construct four disjoint sets so that
For any specific choice of the introduced sets, all the needed partitions have the form
Note that in this process each ordered pair of partitions is formed exactly once, if . For even and each pair of partitions appears twice in the described construction. This will happen due to the possibility of choosing a set as , because it also consists of elements.
In total, the set can be divided into two non-intersecting parts in ways by encoding each of the parts 0 or 1. But at the same time, the reversal of these symbols will lead to the same partition; therefore there are only different ways. Similarly, you can build in ways. Then the total number of ways to obtain the required metric value on an ordered pair of partitions is
If, however, then this amount should be divided by two due to the repetition in choice described earlier. Lemma is proved.
Lemma 3 immediately leads to the exact distribution of . Here is the result.
Theorem 1. Let 2-partitions A, B be chosen randomly and independently of each other. Then all values of may be written as where z is an integer between 0 and .
Suppose . Then
If n is even, and thus the value is admissible, then
A case of ledge partitions
Let us turn to the study of a more special class of 2-partitions, which are important, in particular, in the processing of medical data. The general formulas obtained in the previous section can be applied to these partitions, but the distribution of the metric values on the family of just such partitions can be found more accurately. As above, for each of the subsets of that make up its partition the term cluster will be used, although in this case it is especially clear that one of them is very different in structure from what is usually called so.
Dronov and Boyko, 2015 deals with 2-partitions of the desired type. To build such a partition, the elements of are ordered and numbered in accordance with the increase in some numerical indicator . In this ordering, several successive objects were distinguished to form the first cluster (the “middle part” of the main set), and the rest form a zero cluster. The binary characteristic associated with the partition, is assumed to be equal to 0 on the objects of the zero cluster and 1 on the objects of the first cluster. The division into clusters, which is constructed in the described manner, will be called ledge partition.
The degree of conformity of the 2-partition, which is generated by the observed values of the characteristic to the most suitable ledge partition, was estimated in the cited article using so-called ledge coefficient. It was proposed as one of the possible alternatives to conventional correlation coefficients between the binary characteristic and the numerical indicator .
Let us call the sequentially located objects a segment, and the numbers of the first and last objects of the segment its beginning and end. Note that for a ledge partition, the first cluster is always a segment, and the zero cluster is the union of two segments, one of which has the beginning 1, and the second – the end . Note that the case when one of these two segments is empty is not excluded.
Consider two ledge partitions. By analogy with the previous section, introduce the notation when considering partitions . Then Eq. (3) implies
Here, as it is easy to understand, the quantity is , the number of elements of the symmetric difference of the first clusters. Since both clusters participating in this formula are segments, their symmetric difference is most often consists of two segments with lengths (possibly in a different order). A continuous segment of length may occur when one of these numbers is equal to 0, which means that our symmetric difference is either the initial or the final segment of the larger of the clusters. There is one more possibility in the case . This is a situation when the two segments do not intersect, but their union is a segment of the required length. This can be organized in ways.
According to Lemma 1, the distance between two 2-partitions reaches its maximum value at in Eq. (4). It turns out that restricting the class of the considered partitions to only the ledge ones does not diminish this maximum.
Lemma 4. For any ledge partition A some ledge partition B exists such that .
Proof. Let’s show that is achievable. It is clear that one of the clusters A has at least elements. Let it be the first one. Consider a segment inside having length and the beginning of which coincides with the first element of , as . It remains to choose
and .
If then within this zero cluster the part is selected, which has length and is adjacent to. This part can be a segment attached to either side of the first cluster, or it can consist of two segments adjacent to the left and right sides of respectively. Let , and the zero cluster of the second partition, as above, be composed of the rest of the elements of the main set. It is clear that the symmetric difference of the first clusters of two partitions will coincide with the part . The lemma is proved.
In fact, it will be shown below that, no matter what an arbitrary integer between 0 and , for any ledge partition A there is always a ledge partition B such that Eq. (4) holds. Since the value of the metric on any pair of ledge partitions fits this formula, then it is enough to iterate over all and determine the number of ways in which each of them can be represented as a sum of non-negative numbers to calculate the number of pairs of partitions of the type. It is also necessary to understand in how many ways one can construct a pair of ledge partitions so that for the fixed selected values it holds .
Note that if such partitions are constructed, and none of the numbers equals 0, then the entire segment is divided into 5 segments, the outermost of which may be empty, and the second and the penultimate ones have lengths (not necessarily in this order). Let us assume that the partition A is induced by the values of the binary characteristic , and the partition B is induced by the values of the characteristic of the same type. All possibilities for unordered pairs of partitions are listed in Table 1. The rows of Table 1 are formed in the following sequence: at first, the clusters have no common objects, then, in the second row, they run over each other with a common segment c, and, finally, the larger first cluster of the first partition absorbs the first cluster of the second.
Three variants of forming pairs of partitions after choosing (length of segment b), if
a
b
c
d
e
Case 1
0
1
0
0
0
0
0
0
1
0
Case 2
0
1
1
0
0
0
0
1
1
0
Case 3
0
1
1
1
0
0
0
1
0
0
Since at least one object must be placed between the segments b and d, which have the total length , let’s form two artificial objects. The first will include subsequent elements, and one more additional element adjacent to the right of the selected. The second is made up of the other consecutive elements. Taking into account the number of used objects, now there are objects. Adding two formed artificial ones, it is clear that the choice of the segments b and d is possible in ways for each .
So, if the symmetric difference of the first clusters of the considered partitions is a union of two segments of total length , which is not a segment, then such an unordered pair of step partitions can be constructed in ways.
If is equal to 0 or , then the symmetric difference is a segment, and the length of this segment is . It can be placed in ways by choosing a specific place for its beginning. As already noted, here exist two ways to construct the required pair of partitions. If our segment is the union of two first clusters of the partitions, then the number of possibilities is obviously equals . If it is the left or right end of the larger of the first clusters, then, considering it as a new artificial object, two of the resulting number of the new objects can be chosen, declaring the second of the selected is a common end of both clusters. Of course, here you need to take into account the order of choice. Therefore, such a choice can be made in ways. Thus, in a whole, there are variants with a continuous segment of symmetric difference.
The total number of ways to construct a pair of partitions for fixed turns out to be
It remains to consider the extremes – the smallest and largest possible values of .
Lemma 5. The total number of ledge partitions of a set of n elements is .
Indeed, the process of constructing such a partition is identical to choosing the ends of its first cluster. If this cluster consists of a single element, then there are possibilities, otherwise .
Let’s return to our calculations. For , we obtain that the first clusters of partitions A, B coincide, and therefore, the partitions also coincide. According to Lemma 5, the choice of two coinciding partitions can be done in ways. If , then, the first clusters A and B should be formed by dividing the entire set by choosing one of its objects. This object will stand for the end of , and the next one for the beginning of . Therefore, there exists ways of construction. Let’s sum it up.
Lemma 6. Let be one of the possible values of the metric between two step partitions of a set of elements, and a non-negative integer is chosen so that . Then the number of such pairs of ledge partitions A, B for which , is given by the formula
So the exact probability distribution of the metric on a family of ledge partitions is found. The result may be written down in as a theorem.
Theorem 2. Let two ledge partitions A, B of a set U of n elements be chosen randomly and independently of each other. Then all possible values of are limited to integers of the form where s runs through all integer values between 0 and ,
where is given by Eq. (5) and . If is even and, thus, the value is admissible, then
It is clear that when trying to apply Theorem 2 in practice, we will immediately encounter the fact that real two-dimensional values obtained by observing a pair of characteristics, in which the first is numeric and the second binary, do not correspond to an exact ledge partition. This is usually the case, even if the presence of a ledge type connection between binary and numerical is obvious. In fact, the partition induced by the values of in this case will, apparently, be close to some ledge one. So the sample data give a possibility to construct a segment, the boundaries of which are most similar to the normative ones. This means that the number of errors, such a cases that 1 outside the segment or 0 inside it, is the minimum possible. Any such segment was called optimal in Dronov and Boyko, 2015. So there are often several optimal segments constructed from the same data set. Among other things, the cited article proposes algorithms for finding the shortest and longest optimal segment.
Let’s construct two ledge partitions of our main set of objects, the first clusters of which are the longest and shortest of the optimal segments correspondently. The probability that the metric on a pair of ledge partitions takes on a value not less than the value of the metric between the constructed two partitions will be denoted . It will be called a coefficient of the indicative certainty of the numerical indicator in relation to the binary characteristic .
This coefficient can be interpreted as the degree of clarity of the boundaries of values on those objects for which . If it is small, then for the first cluster of our partition it is impossible to establish unambiguous boundaries for the values of and, therefore, the indicator is rather weakly related to .
Thus, an increase in the value corresponds to rising a power of ledge type connection between a numerical indicator and a binary characteristic. The introduced coefficient thus can serve as another alternative to common correlation coefficients between characteristics in the ledge type connections along with the ledge coefficient.
Statistical test for the difference between 2-partitions
Now the standard statistical procedure can be used to justify the hypothesis that the two partitions, say, C and D observed in the experiment differ insignificantly. To do this, let’s compute and, using whichever of the two theorems proved above is more appropriate for the situation, calculate
If the obtained value is small enough, then the partitions C and D should be referred to as significantly different. As always, the choice of a sufficiently small threshold is within the competence of the researcher. But, in any case, should be interpreted as the probability that the partitions C, D do not differ.
Note that only the test using Theorem 1 is of practical importance. Really. as already noted, it is never possible to observe in practice the distribution of a binary characteristic, fully corresponding to one of the ledge partitions. The assertion of Theorem 2 nevertheless turns out to be useful for estimating an indicative certainty of numerical indicator in relation to the binary one. A method for it was proposed at the end of the previous section.
Let’s look at two examples. The first uses the data obtained at a study of dependence of deep vein thromboembolism of the lower extremities on the human genotype. The study was carried out in 2013 on the basis of the Altai Regional Diagnostic Center. It included both patients with thrombosis (main group, that is coded 1 in Table 2) and patients in the control group (coded by 0) who did not have the disease. Table 2 shows the data for each of 24 patients including the presence or absence of a pathological variant of the prothrombin F2 gene (coded by 1 or 0 correspondently).
Association of a pathology F2 prothrombin gene with thromboembolism
Group
0
0
0
0
0
0
0
0
0
0
0
0
0
0
F2
0
0
1
0
1
0
0
0
0
0
0
0
0
0
Group
1
1
1
1
1
1
1
1
1
1
F2
0
1
1
1
1
0
0
0
0
1
According to the data in Table 2 let’s construct 2-partitions A and B of the set of 24 patients in a way described in Section 2. The intersection of these partitions has 4 clusters, which contain 12; 2; 5; 5 objects respectively. Calculations by Eq. (1) give . The use of Eq. (3) is also acceptable here: take , this is the number of objects for which the elements of the two rows of Table 2 are not equal. Table 3 contains a fragment of the distribution of for , constructed according to the formulas of Theorem 1. Summing up the values of the third row of this table and using the result obtained, one can see that these partitions do not have significant differences with the probability . Now it can be argued that the presence of the F2 gene pathology is almost certainly associated with the occurrence of thrombosis.
A fragment of the distribution of the metric
7
8
9
10
11
12
238
256
270
280
286
288
0.041
0.088
0.156
0.234
0.298
0.161
Two indicators – numerical and binary
1
2
3
4
5
6
7
8
9
10
11
0
0
1
0
1
1
1
0
1
1
0
12
13
14
15
16
17
18
19
20
21
22
0
1
1
0
0
0
0
1
1
1
1
Completing the example, note that the value of the Pearson’s correlation coefficient between the characteristics given in Table 2 is 0.387. This value indicates that there exists a connection between them, but apparently underestimates its strength.
The data for the second example are borrowed from Dronov and Boyko, 2015. Consider the numerical indicator , which in our example is equal to the number of the corresponding object in the set of 22 elements. The elements of the main set are ordered in ascending order of the values of this indicator. Each object is also assigned one of the values of the binary characteristic .
In the cited article the shortest and longest optimal segments for this data were found. The first of them has boundaries 19 and 22, the second 3 and 22. If any of these variants is accepted for the normative boundaries, the data in Table 4 will give the same number of errors, equal to 8. This number of errors is the minimum possible for the data under consideration.
Taking the two found optimal segments as the first clusters of two ledge partitions A and B, let’s calculate . Since the intersection of the partitions consists of the segments [1; 2], [3; 18] and [19; 22], then by Eq. (1) the result is . Table 5 shows a fragment of the distribution of on the class of ledge partitions for . This series is constructed in accordance with the assertion of Theorem 2.
A fragment of the distribution for ledge partitions ()
192
210
224
234
240
242
0.109
0.120
0.128
0.134
0.138
0.069
The first row of Table 5 contains all consecutive values of the metric not less than 192, the second the probabilities of these values. The sum of the elements in the second row is 0.697. This numerical characteristic in the previous section was called the coefficient of indicative certainty . Thus, the chances of a successful attempt to use to predict the values of a binary characteristic can be estimated precisely by this number, and so the forecast is unlikely to be accurate.
It is interesting to note that when using the Pearson’s correlation coefficient between and , which is equal to 0.158, due to the standard Student’s test, we obtain an estimate of the likelihood of a relationship between and of 0.428, which, in principle, is close to the conclusion made. The numerical differences in the estimates, apparently, are explained by the fact that Pearson’s coefficient is well sensitive only to connections with a significant linear component. In addition, all calculations related to the level of significance of this coefficient, by and large, ignore the discreteness of one of the characteristics.
Conclusion
The main theoretical result of the work is the enumeration of all possible values of the metric on a pairs of random 2-partitions and calculation of the exact probabilities of each of these values under the assumption that the partitions are chosen randomly an independently of each other.
Let the choice of partitions be carried out from the family of the whole 2-distributions of a set of elements. Then it turned out that the distribution is obtained from the binomial distribution with parameters , by, say, bending its distribution series in half and gluing it together. More precisely, here occurs summation of the probabilities corresponding to the values symmetric to the center of the series of all these values. The exact formulations are given in Theorem 1.
To continue, define an order relation on the main set. Suppose only ledge partitions are considered admissible, one of the clusters of which is some interval, and the second fringes it. Then it is proved that the set of all values of and the principle of constructing the probabilities of these values does not change from the general case. But the resulting distribution turns out to be different from the glued one, and even lacks symmetry. This situation is described in more detail by Theorem 2.
On the basis of each of Theorems 1 and 2, a statistical test for distinguishing the partitions is constructed in a standard way. In addition, if, as for example, in medical applications, the order on the main set is associated with an increase in some numerical indicator, then using Theorem 2 we propose a method for estimating the degree of similarity of the resulting relationship between the numerical and binary characteristics with a ledge type connection. To do this, a new coefficient is proposed. It is called the coefficient of indicative certainty. Numerical examples are given.
The technique described in the work can also be applied to assess the strength of the connection between two binary characteristics. For this, construct a 2-partition induced by its values for each of them. The resulting partitions are compared, for example, using Theorem 1 and the statistical test proposed on the basis of it.
Various practical tasks in which the studied set of objects is divided into parts by different methods are quite common in practice. It appears useful to be able to compare the results of two such partitions. If, as a result of a comparison, an understanding arises that the two partitions differ insignificantly, then it turns out to be admissible to leave only one of the applied methods of the division to be applied in future.
The exact solution of problems associated with 2-partitions and binary characteristics is in great demand in medicine, where most of the non-numerical data are of this form. In addition, the asymptotic approaches common in classical mathematics are simply inapplicable for such issues. And new methods of rigorous justification for practical conclusions in medical sciences are especially in demand because of today’s general acceptance of the evidence-based medicine principles (see, for example, Masic et al., 2008; Szajewska, 2018).
References
1.
ArcherE.ParkI.M., & PillowJ. (2013). Bayesian and quasi-bayesianestimators for mutual information from discrete data.Entropy, 15(12), 1738-1755.
2.
CohenW.W. (2003). A comparison of string distance metrics forname-matching tasks.Knowledge Discovery in Databases Workshop on Data Cleaning and Object Consolidation, 3, 73-78.
3.
DronovS.V. (2021). On the greatest distance between two partitions of thefinite set. Journal of Physics: Conference Series. 14. “XIV International Scientific and Technical Conference “Applied Mechanics and SystemsDynamics”, AMSD 2020”, C. 012071.
4.
DronovS.V., & BoykoI.Y. (2015). The method for estimating theconnection power of binary and nominal variables.Prikladnaya Diskretnaya Matematika, 4, 109-119.
5.
DronovS.V., & DementjevaE.A. (2012). A new approach to post-hoc problemin cluster analysis.Model Assisted Statistics and Applications, 7(1), 49-55.
6.
DronovS.V., & EudokimovE.A. (2018). Post-hoc cluster analysis ofconnection between forming characteristics.Model Assisted Statistics and Applications, 13(2), 183-192.
7.
InalI. (2016). A metric for partitions.Economics Bulletin, 36(1), 588-594.
8.
KullbackS., & LeiblerR.A. (1951). On information and sufficiency.Ann. Math. Statistics, 22(1), 79-86.