Abstract
2 × 2 tables are encountered in various scientific disciplines, including biomedical, social and behavioral sciences, economics and ecology. In the literature many different similarity measures have been proposed that can be used to further summarize the information in a 2 × 2 table. Many of these measures are just functions of the four cells. In this paper the important ones are reviewed. Furthermore, it is shown how various similarity measures are related to one another by considering certain general similarity measures that have various important similarity measures as special cases. The presented overview may provide insights that may be helpful to researchers from various scientific disciplines in deciding what similarity measure to use in applications or for studying theoretical properties.
Introduction
In various research situations the data can be summarized in a 2 × 2 table. For example, in psychology and biometrics it may be the result of a reliability study where two observers classify a sample of objects using a dichotomous response [33, 71]. In epidemiology, a 2 × 2 table may be the result of a randomized clinical trial with a binary outcome of success [62]. Furthermore, in ecology it may be the cross-classification of the presence/absence codings of two species types in a number of locations [56, 99]. Finally, in cluster analysis a 2 × 2 table may be the cross-classification of two different partitions of the same object set [1, 94].
Because binary data occur so frequently, many similarity measures have been proposed to further summarize the numbers of a 2 × 2 table [1, 105]. Well-known examples are the simple matching coefficient [82, 91], the phi coefficient [112] and Cohen’s kappa [18]. Sometimes reporting a single similarity measure concludes the data-analytic part of a research study. In other cases, multiple measures or matrices of similarity measures are used as input in techniques in data mining and cluster analysis.
In this paper an overview of the various similarity measures for 2 × 2 tables is presented. There is no space to consider all similarity measures that have been proposed, but the most important ones are reviewed. The review is an extended version of Heiser and Warrens [45], and contains material from [98–100, 108]. In choosing a particular similarity measure, a measure has to be considered in the context of the data analysis of which it is a part [40]. The overview provides insights that may be beneficial both to practitioners deciding what measures to use and to theorists studying the properties of this type of similarity measure.
The paper is organized as follows. In section 2, definitions are presented. In section 3, examples of similarity measures for 2 × 2 tables are presented and various application domains are discussed. In sections 4 to 6 several general similarity measures are discussed. The aim of these sections is to show how the different measures can be classified and how they are related to one another. Rational functions are considered in section 4, chance-corrected measures in section 5, and power means in section 6.
Break-down of relative frequencies for binary variables X and Y
Similarity measures for 2 × 2 tables have been classified in a number of different ways [5, 92]. Furthermore, similarity measures may have different names depending on the field of science or the analytic context. Examples are association coefficients, agreement indices, reliability statistics and presence/absence coefficients.
In this review, three general types of measures are distinguished, called type A, type B and type C. Of course, many other classifications are possible. Before considering the three types, some preliminaries are discussed.
Preliminaries
In general, a 2 × 2 table is obtained if two objects are compared on the presence/absence of a set of attributes, or if a set of objects is cross-classified by two binary variables. To simplify the presentation, it is presupposed that the 2 × 2 table is a cross-classification of two binary variables X and Y.
Table 1 is an example of a 2 × 2 table. The four relative frequencies a, b, c and d characterize the joint distribution of the variables X and Y. Quantities a and d are often called, respectively, the positive and negative matches, whereas b and c are the mismatches. The row and column totals of Table 1 are the marginal totals that result from summing the joint proportions. Instead of relative frequencies, Table 1 may also be defined on counts or frequencies; relative frequencies are used here for notational convenience.
Similarity measures for 2 × 2 tables are functions that quantify the extent to which two binary variables are associated or the extent to which two objects resemble one another. The measures are functions that take as arguments the relative frequencies a, b, c and d and return numerical values that are higher if the variables are more associated [6]. In general, the symbol S is used to denote a similarity measure, but sometimes other symbols are used as well.
Symmetry
A measure is called symmetric if the values of b and c can be interchanged without changing the value of the similarity measure. Although the majority of measures discussed in this paper are symmetric, similarity measures are not required to be symmetric [65]. Asymmetric measures have natural interpretations if, for example, the variable X is a criterion against which variable Y is evaluated.
The simple matching coefficient [91] is given by
The Dice indices [25, 97] given by
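As a minimal numerical sketch of the symmetry property, assume the usual definitions in terms of relative frequencies: the simple matching coefficient a + d, and the two Dice indices a/(a + b) and a/(a + c). The function names are illustrative only. Interchanging b and c leaves the simple matching coefficient unchanged but swaps the two Dice indices, so each Dice index on its own is asymmetric.

```python
def simple_matching(a, b, c, d):
    """Simple matching coefficient, assuming the usual form a + d."""
    return a + d


def dice_indices(a, b, c, d):
    """The two Dice indices, assuming the usual forms a/(a+b) and a/(a+c)."""
    return a / (a + b), a / (a + c)


# Illustrative relative frequencies (they sum to 1).
a, b, c, d = 0.4, 0.1, 0.2, 0.3

sm_original = simple_matching(a, b, c, d)
sm_swapped = simple_matching(a, c, b, d)      # b and c interchanged

dice_original = dice_indices(a, b, c, d)
dice_swapped = dice_indices(a, c, b, d)       # b and c interchanged

print(sm_original, sm_swapped)
print(dice_original, dice_swapped)
```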
Positive and negative matches
Sokal and Sneath [92] make the classical distinction between measures that include the positive matches a only and functions that include both the positive and negative matches a and d [5, 99]. A binary variable can be an ordinal or a nominal variable. If X is an ordinal variable, then X = 1 is more in some sense than X = 0.
For example, if a binary variable is a coding of the presence or absence of a list of attributes or features, then d reflects the number of negative matches. In the field of numerical taxonomy the quantity d is generally felt not to contribute to similarity, and hence should not be included in the definition of a similarity measure.
Type A similarity measures
Type A measures satisfy the two requirements
Symmetric examples of type A measures are the Jaccard [52] index
Type A measures can be functions that are increasing in d. An example is the measure
Type B similarity measures
Type B measures satisfy the two requirements
The similarity measures
Type C similarity measures
Type C measures satisfy the three conditions
Examples of type C similarity measures are Cohen’s kappa [9, 61]
Some type C similarity measures satisfy the conditions
Application domains
In this section, several important similarity measures for 2 × 2 tables and their application domains are reviewed.
Tetrachoric correlation and odds ratio
The tetrachoric correlation is a traditional measure for assessing association in a 2 × 2 table [27, 76]. It is an important statistic because the tetrachoric correlation is an estimate of the Pearson product-moment correlation coefficient between hypothetical row and column variables with normal distributions, that would reproduce the observed contingency table if they were divided into two categories in the appropriate proportions.
Because an approximate estimate of the Pearson correlation may well be as adequate in many applications, particularly in small samples, various authors have introduced approximations to the tetrachoric correlation [26, 76]. The tetrachoric correlation is an example of a measure for 2 × 2 tables that cannot be expressed in terms of the relative frequencies a, b, c and d.
Another classic measure is the odds ratio or cross-product
An odds ratio of 1 indicates that the condition or event under study is equally likely in both groups. An odds ratio greater than 1 indicates that the event is more likely in the first group. Probability theory tells us that two binary variables are statistically independent if the odds ratio is equal to unity, i.e.
The three similarity measures are special cases of the general measure
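The independence property of the odds ratio can be checked numerically. The sketch below assumes the standard cross-product form ad/(bc); the helper `independent_table` is a hypothetical name that builds the joint proportions of two independent binary variables from their marginal probabilities.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio, assuming the cross-product form (a*d)/(b*c).

    Undefined (division by zero) when b or c is zero.
    """
    return (a * d) / (b * c)


def independent_table(p, q):
    """Joint proportions (a, b, c, d) when X and Y are independent,
    with P(X = 1) = p and P(Y = 1) = q."""
    return p * q, p * (1 - q), (1 - p) * q, (1 - p) * (1 - q)


# Under independence the odds ratio is (up to rounding) equal to unity.
print(odds_ratio(*independent_table(0.3, 0.6)))

# A table with positive association gives an odds ratio above 1.
print(odds_ratio(0.4, 0.1, 0.2, 0.3))
```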
Epidemiological studies
Although the odds ratio is probably the most widely used measure in epidemiology, a variety of other similarity measures are used as well. In general, two cases can be distinguished in epidemiology.
In the first case the variable X is a criterion against which variable Y is evaluated [62]. Examples are the evaluation of a new medical test against a gold standard diagnosis, or a risk factor against a disorder, or assessing the validity of a binary measure against a binary criterion. In these cases a and d are the proportions of true positives and true negatives, whereas b and c are the proportions of false positives and false negatives. In this case researchers are interested in measures like
Another important measure is the weighted kappa coefficient [9, 62]
Measure
In the second case the variables X and Y are equally important, for example, in studies of inter-rater reliability or test-retest reliability. Suppose the variables are observers and that Table 1 is the cross-classification of the judgments by the two raters on the presence or absence of a trait. An obvious measure of agreement that has been proposed independently for this situation by various authors is the proportion of all objects on whom the two raters agree [33, 36]. This proportion of observed agreement is given by
In reliability studies it is considered a necessity that a similarity measure assesses agreement over and above chance agreement [98, 99]. Measures that control for chance agreement are Cohen’s kappa and the phi coefficient [112, 114]. Although Cohen’s kappa and the phi coefficient have a correlation-like range [-1, 1] (type C similarity measures), the measures are commonly used to distinguish between positive agreement and no agreement. For recommendations and guidelines on what statistics to use under what circumstances in epidemiological studies, we refer to Kraemer [62].
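To illustrate the two chance-controlling measures, the sketch below uses the standard closed forms of Cohen's kappa and the phi coefficient for a 2 × 2 table of relative frequencies. Both share the numerator ad - bc, so both vanish under statistical independence. The marginal names p1, p2, q1 and q2 are notation assumed for this example.

```python
import math


def cohen_kappa(a, b, c, d):
    """Cohen's kappa for a 2 x 2 table, standard closed form."""
    p1, p2 = a + b, a + c          # marginals of X = 1 and Y = 1
    q1, q2 = c + d, b + d          # marginals of X = 0 and Y = 0
    return 2 * (a * d - b * c) / (p1 * q2 + p2 * q1)


def phi_coefficient(a, b, c, d):
    """Phi coefficient for a 2 x 2 table, standard closed form."""
    p1, p2 = a + b, a + c
    q1, q2 = c + d, b + d
    return (a * d - b * c) / math.sqrt(p1 * p2 * q1 * q2)


a, b, c, d = 0.4, 0.1, 0.2, 0.3
kappa = cohen_kappa(a, b, c, d)
phi = phi_coefficient(a, b, c, d)
print(kappa, phi)
```

For this table the observed agreement a + d is 0.7, while kappa is about 0.4, which shows how much of the raw agreement is attributed to chance.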
Ecological association
In ecological biology, one may distinguish several contexts where similarity measures for 2 × 2 tables can be used [56, 92]. One such case deals with measuring the degree of coexistence between two species types over different locations. A second situation is measuring association between two locations over different species types. In the first situation a binary variable is a coding of the presence or absence of a species type in a number of locations. The joint proportion a then equals the proportion of locations where both species types are found.
Dice [25] discusses the two asymmetric measures (see also [81, 97])
Popular similarity measures for ecological association are the Jaccard [52] index
The Jaccard index can be interpreted as the number of 1s shared by X and Y in the same positions, divided by the total number of positions where 1s occur. The Dice-Sørenson measure is a special case of measures considered in Czekanowski [21, 22] and Gleason [35]. Compared with the Jaccard measure, the Dice-Sørenson index gives twice as much weight to the relative frequency a. The Dice-Sørenson measure is regularly used with presence/absence data when there are only a few positive matches relative to the number of mismatches.
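The interpretation above can be made concrete with two presence/absence vectors. The following is a sketch under the usual definitions a/(a + b + c) for the Jaccard index and 2a/(2a + b + c) for the Dice-Sørenson measure; the vectors and function names are made up for illustration.

```python
def two_by_two(x, y):
    """Relative frequencies (a, b, c, d) of two equal-length 0/1 vectors."""
    n = len(x)
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1) / n
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0) / n
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1) / n
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0) / n
    return a, b, c, d


def jaccard(a, b, c, d):
    """Shared 1s divided by positions where at least one 1 occurs."""
    return a / (a + b + c)


def dice_sorenson(a, b, c, d):
    """Like Jaccard, but with double weight on the positive matches a."""
    return 2 * a / (2 * a + b + c)


# Presence (1) / absence (0) of two species types over eight locations.
x = [1, 1, 1, 0, 0, 1, 0, 0]
y = [1, 1, 0, 1, 0, 1, 0, 0]

table = two_by_two(x, y)
print(jaccard(*table))        # 3 shared 1s out of 5 positions with a 1
print(dice_sorenson(*table))
```

Neither measure uses d, so adding locations where both species types are absent does not change the values.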
The Jaccard index, Dice-Sørenson measure and Driver-Kroeber-Ochiai index are popular measures of ecological association, and they have been empirically compared to other measures for 2 × 2 tables in numerous studies. For example, Duarte, Santos and Melo [29] evaluated association measures in clustering and ordination of common bean cultivars analyzed by RAPD type molecular markers. The genetic distance measures obtained by taking the complement of the Dice-Sørenson index were considered the most adequate.
Boyce and Ellison [10] studied similarity measures for 2 × 2 tables in the context of fuzzy set ordination, and concluded that the Jaccard index, Dice-Sørenson measure and Driver-Kroeber-Ochiai index are the preferred similarity measures.
Comparing two partitions
Different clustering methods perform well in different situations, and no clustering method has been shown to dominate other methods across all application domains [46, 53]. To be able to choose a clustering method that is suitable for the task at hand, it is required that the characteristics of the method are well understood. An important and fundamental topic in cluster analysis research is therefore the validation of the cluster results [46].
To evaluate the performance of clustering methods researchers typically assess the agreement between a reference standard partition that purports to represent the true cluster structure of the objects, and a trial partition produced by the method that is being evaluated [46, 97]. So-called external validity indices can be used to assess the agreement between two partitions [1, 94]. High agreement between the two partitions then indicates good recovery of the true cluster structure [2, 97].
A related problem in social and behavioral sciences is that of measuring agreement among judges in classifying answers to open-ended questions, or psychologists rating people on categories not defined in advance [12, 80]. The classifications can be seen as partitions and agreement between judges can be assessed by quantifying the similarity between two partitions.
Many different similarity measures have been proposed for quantifying the agreement between two partitions. These so-called external validity indices can be divided into three different approaches, namely 1) counting pairs of objects, 2) information theory based, and 3) matching sets based [78]. Validity indices that are based on counting pairs of objects can be defined using a 2 × 2 table with quantities a, b, c, and d, by counting the number of object pairs that were placed in the same cluster in both partitions (a), in the same cluster in one partition but in different clusters in the other partition (b and c), and in different clusters in both (d).
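The pair-counting construction described above can be sketched as follows. Partitions are represented as lists of cluster labels, which is an assumption of this example, and the function name is illustrative. The quantities are returned as counts of object pairs rather than relative frequencies.

```python
from itertools import combinations


def pair_counts(labels_1, labels_2):
    """Count object pairs by co-membership in two partitions.

    a: same cluster in both partitions
    b: same cluster in the first partition only
    c: same cluster in the second partition only
    d: different clusters in both partitions
    """
    a = b = c = d = 0
    for i, j in combinations(range(len(labels_1)), 2):
        same_1 = labels_1[i] == labels_1[j]
        same_2 = labels_2[i] == labels_2[j]
        if same_1 and same_2:
            a += 1
        elif same_1:
            b += 1
        elif same_2:
            c += 1
        else:
            d += 1
    return a, b, c, d


# Two partitions of the same six objects.
reference = [0, 0, 0, 1, 1, 1]
trial = [0, 0, 1, 1, 2, 2]
print(pair_counts(reference, trial))
```

The loop visits all n(n - 1)/2 pairs, which is adequate for small object sets; for large n the counts are usually derived from the contingency table of the two partitions instead.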
Two popular measures for comparing partitions that are based on the pair counting approach are the Rand index [82]
Test homogeneity
The similarity measure [7, 68]
Although the Benini index has a correlation-like range [-1, 1] (type C similarity measure), it is usual to assume that two items are at least positively dependent. The Benini measure satisfies requirement (A3). It is equal to unity if the binary variables form a so-called Guttman pair. In this case, all subjects that pass the first item also pass the second item, or vice versa. The Benini index can become unity with different marginal distributions, that is, the item popularities or difficulties a + b and a + c may be different.
Cole [19] introduced a similarity measure which is equivalent to the Benini measure if there is positive covariance between the binary variables (ad > bc). In the case of negative covariance (ad < bc), Cole’s similarity measure is given by
The Cole measure is one of several type C similarity measures that were introduced in the context of ecological association. Several authors proposed coefficients of ecological association that measure the degree to which the observed proportion of joint occurrences of two species types exceeds or falls short of the proportion of joint occurrences expected on the basis of chance alone [19]. In contrast, the measures discussed in section 3.3 are typically type A similarity measures.
The Cole index has been used in various applications by animal and plant ecologists [51, 83]. A variant of the Cole measure proposed in Hurlbert [51] is less influenced by the species’ frequencies. Hurlbert [51] examined both the Cole measure and the variant as approximations to the tetrachoric correlation (section 3.1).
Rational functions
In the following sections various general similarity measures from the literature are considered. Many similarity measures for 2 × 2 tables are special cases of a certain general similarity measure. The formulation of general similarity measures reveals and specifies how the various similarity measures may be related to one another and provides ways of interpreting them (see, e.g., the end of subsection 3.1). In this section rational functions are discussed.
Gower and Legendre [40] consider the general similarity measure
Measure S (θ) is a type A similarity measure and is also studied in Fichet [32], Gower [39] and Heiser and Bennani [44]. Similarity measure S (1) is the Jaccard index and
Janson and Vegelius [56] present an interesting relationship between the special cases of S (θ) that can sometimes be useful when comparing two of them (see also [2, 90]). The Jaccard index and the Dice-Sørenson measure are related by J = D/(2 - D). In general it holds that
Similarity measure S (θ) is a special case of the ratio model (Tversky [96])
A second general similarity measure considered in Gower and Legendre [40] is
Similarity measure T (θ) is a special case of the complement of the dissimilarity measure
Warrens [108] considers another type of family of rational functions, namely
The formulation of S (θ) and T (θ) is closely related to the concept of order equivalence [4, 87]. If two similarity measures are order equivalent, they are interchangeable with respect to an analysis method that is invariant under ordinal transformations. The relevant information for these analysis methods is in the ranking induced by the similarity measures, not in the values themselves. Application examples are in image retrieval [65] and monotone equivariant cluster analysis [58].
Any two special cases of S (θ) are order equivalent, and any two special cases of T (θ) are order equivalent. Omhover, Rifqi and Detyniecki [74] showed that two special cases of S (θ, δ) with parameters (θ, δ) and (θ′, δ′) are order equivalent if θδ′ = θ′δ. Similarly, Baulieu [5] showed that two special cases of T (θ, δ) with parameters (θ, δ) and (θ′, δ′) are order equivalent if θδ′ = θ′δ.
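Order equivalence of two special cases of S (θ) can be illustrated numerically. The sketch assumes the commonly cited Gower-Legendre form S (θ) = a/(a + θ(b + c)), under which θ = 1 gives the Jaccard index and θ = 1/2 the Dice-Sørenson measure; the tables and helper names are invented for the example.

```python
def s_theta(a, b, c, theta):
    """Assumed Gower-Legendre family: a / (a + theta * (b + c))."""
    return a / (a + theta * (b + c))


def ranking(values):
    """Indices of the tables, ordered from lowest to highest value."""
    return sorted(range(len(values)), key=lambda i: values[i])


# (a, b, c) for four illustrative tables; d is not used by S(theta).
tables = [(0.4, 0.1, 0.2), (0.2, 0.3, 0.1), (0.1, 0.3, 0.2), (0.5, 0.05, 0.05)]

jaccard_values = [s_theta(a, b, c, 1.0) for a, b, c in tables]
dice_values = [s_theta(a, b, c, 0.5) for a, b, c in tables]

# Both measures induce the same ordering of the tables.
print(ranking(jaccard_values), ranking(dice_values))

# The two are linked by the monotone map J = D / (2 - D).
print([dv / (2 - dv) for dv in dice_values])
print(jaccard_values)
```

Because J is a strictly increasing function of D, any analysis method that depends only on the ranking of similarities gives the same result with either measure.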
Warrens [99, 110] presented various inequalities between similarity measures. Several insights can be obtained from studying inequalities between similarity measures. For example, if several similarity measures defined on the same quantities have unconditional inequalities between them it is likely that these similarity measures reflect the association or agreement between the binary variables X and Y in a similar way, but to a different extent (some have lower/higher values than others).
The similarity measures S (θ) and T (θ) are strictly decreasing in θ. Hence, the inequalities S (θ) > S (θ′) if θ < θ′, and T (θ) > T (θ′) if θ < θ′ hold. For example, if agreement is not perfect, the Jaccard index S (1) always produces a lower value than the Dice-Sørenson measure
Correction for chance agreement
In section 2.6 several similarity measures were presented that satisfy requirement (A5), i.e., that have zero value if binary variables X and Y are statistically independent. In several domains of data analysis this requirement is a natural desideratum. For example, in reliability studies and when comparing partitions in cluster analysis, property (A5) is considered a necessity. However, requirement (A5) is less important for measures of ecological association (section 3.3), although some authors have argued to look at agreement beyond chance (see the Cole measure in section 3.5).
If a similarity measure does not satisfy desideratum (A5), it may be corrected for agreement due to chance [1, 113]. After correction for chance agreement a similarity coefficient S has a form
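A sketch of this correction, assuming the usual form CS = (S - E(S))/(1 - E(S)) with maximum value 1, applied to the observed agreement a + d. Two expectations are shown: the one under statistical independence, which yields Cohen's kappa, and the one based on a single underlying distribution (the average of the two marginals), which yields Scott's pi. Function names are illustrative.

```python
def corrected(s, expected):
    """Chance-corrected value, assuming the form (S - E(S)) / (1 - E(S))."""
    return (s - expected) / (1 - expected)


def expectation_independence(a, b, c, d):
    """E(a + d) when X and Y are independent with their own marginals."""
    p1, p2 = a + b, a + c
    return p1 * p2 + (1 - p1) * (1 - p2)


def expectation_common_distribution(a, b, c, d):
    """E(a + d) when both variables share one distribution (averaged marginals)."""
    p = ((a + b) + (a + c)) / 2
    return p * p + (1 - p) * (1 - p)


a, b, c, d = 0.4, 0.1, 0.2, 0.3
observed = a + d

kappa = corrected(observed, expectation_independence(a, b, c, d))
pi = corrected(observed, expectation_common_distribution(a, b, c, d))
print(kappa, pi)
```

Since the corrected value decreases as the expectation grows, the smaller expectation under independence gives the larger coefficient in this example.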
Expectations for Table 1 under statistical independence.
Expectations for Table 1 with one underlying frequency distribution.
Possible expectations for Table 1 when there is no underlying frequency distribution.
from section 3.2. For example, after correction for chance agreement both the sensitivity and the negative predictive value become κ (1). Furthermore, both the specificity and the positive predictive value become κ (0) after correction for chance agreement.
Warrens [99, 107] presented various inequalities between similarity measures. The similarity measure CS is strictly decreasing in E (a + d) [99, 100]. This property can be used to derive various inequalities between the similarity measures considered in this section. Since the E (a + d) associated with Cohen’s kappa is smaller than the E (a + d) associated with Scott’s pi, which in turn is smaller than the E (a + d) associated with Goodman and Kruskal’s lambda, the double inequality Cohen’s kappa ≥ Scott’s pi ≥ Goodman and Kruskal’s lambda holds. Furthermore, since the E (a + d) associated with Scott’s pi is never smaller than
Power means
There are several functions that may reflect the mean value of two real non-negative numbers. Examples are the harmonic, geometric and arithmetic means, also known as the Pythagorean means. Various similarity measures can be expressed as a mean function of certain basic building blocks. Examples of these building blocks are the Dice measures [25, 97]
The Dice-Sørenson measure [25, 93], Driver-Kroeber-Ochiai measure [28, 75] and the Kulczyński measure [64] are, respectively, the harmonic, geometric and arithmetic means of the Dice measures [56, 99]. Furthermore, the Braun-Blanquet [11] and Simpson [89] measures
Other examples of building blocks are the weighted kappas [9, 77]
One so-called generalized mean is the power mean, sometimes referred to as the Hölder mean [14]. The minimum, maximum and the Pythagorean means are special cases of this generalized mean. Let p be a real number. The power mean of the Dice measures
The power mean of weighted kappas κ (1) and κ (0)
Warrens [99, 107] presented various inequalities between similarity measures. Since M (p) is increasing in p, the quadruple inequality Braun-Blanquet measure ≤ Dice-Sørenson measure ≤ Driver-Kroeber-Ochiai measure ≤ Kulczyński measure ≤ Simpson measure holds. Furthermore, since |N (p)| is increasing in p, the double inequality |Cohen’s kappa| ≤ |phi coefficient| ≤ |Benini measure| holds.
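The quadruple inequality can be illustrated with a small sketch of the power mean of the two Dice measures, assuming the standard definition ((x^p + y^p)/2)^(1/p) with the geometric mean as the limit p → 0 and the minimum and maximum as the limits p → -∞ and p → +∞. The function name and example table are illustrative.

```python
import math


def power_mean(x, y, p):
    """Power (Hoelder) mean of two positive numbers x and y."""
    if p == -math.inf:
        return min(x, y)
    if p == math.inf:
        return max(x, y)
    if p == 0:
        return math.sqrt(x * y)          # limiting case: geometric mean
    return ((x ** p + y ** p) / 2) ** (1 / p)


a, b, c = 0.4, 0.1, 0.2
dice_1, dice_2 = a / (a + b), a / (a + c)   # the two Dice measures

chain = [
    power_mean(dice_1, dice_2, -math.inf),  # Braun-Blanquet (minimum)
    power_mean(dice_1, dice_2, -1),         # Dice-Sorenson (harmonic mean)
    power_mean(dice_1, dice_2, 0),          # Driver-Kroeber-Ochiai (geometric)
    power_mean(dice_1, dice_2, 1),          # Kulczynski (arithmetic mean)
    power_mean(dice_1, dice_2, math.inf),   # Simpson (maximum)
]
print(chain)
```

The harmonic mean of the Dice measures reproduces 2a/(2a + b + c), which is the familiar form of the Dice-Sørenson measure.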
For many similarity measures for 2 × 2 tables the maximal attainable value depends on the marginal distributions. For example, the relative frequency a in Table 1 cannot exceed its marginal probabilities a + b and a + c. The Jaccard index and Dice-Sørenson measure (section 3.3), for example, can therefore only attain the maximum value of unity if b = c, that is, in the case of marginal symmetry.
The maximum value of a is given by
In general, this transformation can be applied to any similarity measure that has a maximum value that is restricted by the marginal totals. After correction for maximum value a similarity measure S has a form
Warrens [103] showed that all special cases of M (p) coincide after correction for maximum value. This similarity measure happens to be the Simpson measure [89]. Furthermore, various authors have observed that phi/phimax is equal to kappa/kappamax [33]. Warrens [103] showed that all special cases of N (p) become the Benini measure [7] (section 3.5) after the linear transformation S/Smax.
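The first of these results can be checked numerically for the three Pythagorean special cases of M (p). The sketch assumes that, with the marginals a + b and a + c held fixed, each measure attains its maximum at a = min(a + b, a + c), and it writes the measures as functions of a and the two marginals p1 and p2 (notation assumed here).

```python
import math


def dice_sorenson(a, p1, p2):
    return 2 * a / (p1 + p2)              # harmonic mean of a/p1 and a/p2


def ochiai(a, p1, p2):
    return a / math.sqrt(p1 * p2)         # geometric mean of a/p1 and a/p2


def kulczynski(a, p1, p2):
    return (a / p1 + a / p2) / 2          # arithmetic mean of a/p1 and a/p2


def simpson(a, p1, p2):
    return a / min(p1, p2)                # maximum of a/p1 and a/p2


a, p1, p2 = 0.4, 0.5, 0.6
a_max = min(p1, p2)                       # largest a the marginals allow

ratios = [
    measure(a, p1, p2) / measure(a_max, p1, p2)
    for measure in (dice_sorenson, ochiai, kulczynski)
]
print(ratios, simpson(a, p1, p2))
```

All three ratios S/Smax equal a/min(p1, p2), the Simpson measure, because each of these measures is proportional to a once the marginals are fixed.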
