Key variables come first! How best to design a correlation table when there is one key variable

Abstract

In many sciences, the relationships among the variables in multivariate data sets are often explored by means of correlation tables. In such studies there is often only one key variable of interest, for example, grain yield for cereals in agriculture, and class-size and academic performance in education. However, in such instances, there are no particular rules concerning how best to present the order of the variables in the resulting correlation matrices, even though it might be sensible to order them so that the focus is on the key variable. In this note, we show how this can be done.

Keywords

co-relationships; multivariate data; order of variables; table layout

1. Introduction

Pearson’s product moment correlation coefficient is one of the most widely used statistical tools in the sciences. Representing a linear co-relationship between two bi-normally distributed variables, the coefficient indicates the strength and direction of this relationship. To show how several variables correlate, such coefficients are often presented in a correlation table (or matrix), which gives the correlation between each pair of variables. Such matrices can be found in numerous papers across the sciences.
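As an illustration only, the sketch below computes Pearson’s coefficient for two samples and assembles the lower triangle of a correlation table. The function names `pearson` and `correlation_table` and the toy data are assumptions made for this example; they are not taken from any of the cited works.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product moment correlation between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_table(data):
    """Return {(row, col): r} for the lower triangle, given {name: values}."""
    names = list(data)
    table = {}
    for i, row in enumerate(names):
        for col in names[:i]:
            table[(row, col)] = pearson(data[row], data[col])
    return table


# Hypothetical toy data: three variables measured on five units.
toy = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],  # increases exactly linearly with A
    "C": [5.0, 4.0, 3.0, 2.0, 1.0],   # decreases exactly linearly with A
}
for (row, col), r in correlation_table(toy).items():
    print(f"{row} vs {col}: {r:+.2f}")
```

Each pair of variables appears once, which is why only one triangle of the matrix needs to be printed.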

The order of the variables in a correlation matrix may be important, especially when there are more than three or four of them. Ehrenberg [1] and Ryder [2], for example, both suggest that an appropriate ordering of the rows and columns can be a powerful aid to interpreting a matrix. Sometimes the variables follow a particular sequence, such as the order in which biological traits develop during ontogenesis in agricultural examples [3]; sometimes the order rests on another rationale, which is usually not stated [4]. If the order of the variables has some meaning, then the authors should explain this in their text, and possibly in their table headings [5], so that the matrix is easier to read. Friendly [6] showed how an appropriate ordering can help readers to probe the structures present in correlation matrices. He proposed a method for ordering the variables so that those that are highly correlated are adjacent in the matrix. Consider first Table 1(a) and then look at Table 1(b). Table 1(b) rearranges the correlation table in Table 1(a) following Friendly’s suggestions. We believe, however, that when there is only one key variable in the data set, there are better ways of ordering the variables than that suggested by Friendly [6].

Table 1(a).

A typical layout of a correlation table for a fictitious data set with seven variables. X₅ is considered the key variable, and the corresponding coefficients are emboldened.

	X₁	X₂	X₃	X₄	X₅	X₆
X₂	−0.33
X₃	0.12	0.45
X₄	0.12	−0.09	0.50
X₅	−0.40	0.10	−0.74	−0.47
X₆	−0.25	0.06	−0.30	−0.02	0.58
X₇	−0.15	0.68	0.21	0.14	0.21	0.51

Table 1(b).

A revised correlation table from Table 1(a) where the order of variables is determined by the method of angular order of the eigenvectors [6]. The associations between the key variable (not emboldened) and the other variables are not emphasized.

	X₁	X₄	X₃	X₂	X₇	X₆
X₄	0.12
X₃	0.12	0.50
X₂	−0.33	−0.09	0.45
X₇	−0.15	0.14	0.21	0.68
X₆	−0.25	−0.02	−0.30	0.06	0.51
X₅	−0.40	−0.47	−0.74	0.10	0.21	0.58

In many situations we do want to present results for one key variable. Take, for example, a common type of study in agriculture in which plant yield and yield-contributing characteristics are examined. In such a situation the focus is normally on the yield, and we can construct the correlation matrix so that, when reading it, the focus is indeed on the yield. Similarly, in the field of education, we might be interested in how a key variable, such as season of birth, affects academic learning over the life-span.

In this paper we aim to show that, when there is one key variable, appropriately ordering the variables in the matrix can facilitate understanding. Although the technique is simple, to the best of our knowledge it has rarely been used, if at all.

2. The technique and examples

Table 1(a) shows an example of a correlation table based on an artificial data set. Here the coefficients for all the pairs of variables have been estimated and a normal correlation matrix constructed. Now let us assume again that X₅ is the key variable. The data in Table 1(a) can be re-arranged to show this more clearly in two stages as follows:

First, place the key variable in the second column (see Table 1(c)); the first column holds the variables’ names.

Then order the remaining variables in rows by decreasing correlation with the key variable, taking the sign into account. Thus the first row holds the variable whose correlation with the key variable is the highest, and the last row holds the variable whose correlation with the key variable is the lowest.
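The two stages can be sketched in a few lines of code. The following is a minimal illustration, not a definitive implementation: the helper names `as_lookup`, `key_first_order` and `print_lower_triangle` are hypothetical, and the coefficients are those of Table 1(a) typed in by hand.

```python
# Lower triangle of Table 1(a): each row lists correlations with X1, X2, ...
names = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]
lower = {
    "X2": [-0.33],
    "X3": [0.12, 0.45],
    "X4": [0.12, -0.09, 0.50],
    "X5": [-0.40, 0.10, -0.74, -0.47],
    "X6": [-0.25, 0.06, -0.30, -0.02, 0.58],
    "X7": [-0.15, 0.68, 0.21, 0.14, 0.21, 0.51],
}


def as_lookup(names, lower):
    """Turn the lower triangle into a symmetric lookup {(a, b): r}."""
    r = {}
    for row, values in lower.items():
        for col, value in zip(names, values):
            r[(row, col)] = value
            r[(col, row)] = value
    return r


def key_first_order(names, r, key):
    """Key variable first, then the others by decreasing correlation with it."""
    others = [v for v in names if v != key]
    # sorted() is stable, so tied variables keep their original relative order.
    others = sorted(others, key=lambda v: r[(key, v)], reverse=True)
    return [key] + others


def print_lower_triangle(order, r):
    """Print a tab-separated lower-triangular table in the given order."""
    print("\t" + "\t".join(order[:-1]))
    for i, row in enumerate(order[1:], start=1):
        cells = [f"{r[(row, col)]:.2f}" for col in order[:i]]
        print(row + "\t" + "\t".join(cells))


r = as_lookup(names, lower)
order = key_first_order(names, r, "X5")
print_lower_triangle(order, r)
```

Sorting on the signed coefficient, rather than its absolute value, is a choice made here to match the description above; a reader who cares only about strength of association could sort on `abs(r[(key, v)])` instead.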

Table 1(c).

A revised correlation table from Table 1(a) where the focus is on the key variable. Emboldening is added to show how much easier this table is to read than Table 1(a).

	X₅	X₆	X₇	X₂	X₁	X₄
X₆	0.58
X₇	0.21	0.51
X₂	0.10	0.06	0.68
X₁	−0.40	−0.25	−0.15	−0.33
X₄	−0.47	−0.02	0.14	−0.09	0.12
X₃	−0.74	−0.30	0.21	0.45	0.12	0.50

Now look back at Table 1(a). Here understanding the correlations between the key variable X₅ and the other variables demands a careful examination of the table (despite the emboldening of the coefficients). The same can be said about Table 1(b). But now look again at Table 1(c), which shows a re-arranged matrix based on Table 1(a). Here, as discussed above, the correlations between the key variable X₅ and the remaining variables are given in the second column, and the rows are sorted by decreasing correlation with the key variable. In this way, the sizes of the correlations between the key variable and the others are now much clearer.

Table 1(d) shows a different way of presenting the same data as shown in Table 1(c). Some authors like to present the results horizontally rather than vertically as in Table 1(c). This is to some extent a matter of preference. Some people like to read from left to right, rather than top to bottom. A key factor here, too, is the length of the headings for each column/row. Nonetheless, we need to remember that, according to Ehrenberg [1], it is easier to make multiple comparisons when the data are listed in columns rather than in rows, so readers might prefer the layout of Table 1(c) to that of Table 1(d).

Table 1(d).

A revised version of Table 1(c) where the key variable is presented at the top of the table.

	X₆	X₇	X₂	X₁	X₄	X₃
X₅	0.58	0.21	0.10	−0.40	−0.47	−0.74
X₆		0.51	0.06	−0.25	−0.02	−0.30
X₇			0.68	−0.15	0.14	0.21
X₂				−0.33	−0.09	0.45
X₁					0.12	0.12
X₄						0.50
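The horizontal layout of Table 1(d) is the upper triangle of the same re-ordered matrix. As a rough sketch, assuming the variables have already been put in key-first order, the rows can be built as follows; the function name `upper_triangle_rows` is hypothetical, and only the first four variables of Table 1(c) are used to keep the example short.

```python
def upper_triangle_rows(order, r):
    """Rows of the horizontal layout: key variable in the first data row,
    empty cells below the diagonal. `r` maps frozenset({a, b}) -> coefficient."""
    rows = [[""] + order[1:]]  # header: every variable except the key
    for i, row in enumerate(order[:-1]):
        blanks = [""] * i
        cells = [f"{r[frozenset((row, col))]:.2f}" for col in order[i + 1:]]
        rows.append([row] + blanks + cells)
    return rows


# Key-first order and coefficients for a four-variable subset of Table 1(c).
order = ["X5", "X6", "X7", "X2"]
r = {
    frozenset(("X5", "X6")): 0.58,
    frozenset(("X5", "X7")): 0.21,
    frozenset(("X5", "X2")): 0.10,
    frozenset(("X6", "X7")): 0.51,
    frozenset(("X6", "X2")): 0.06,
    frozenset(("X7", "X2")): 0.68,
}

for cells in upper_triangle_rows(order, r):
    print("\t".join(cells))
```

The correlations with the key variable now run along the first data row instead of down the first data column; nothing else about the ordering changes.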

Tables 2(a) and 2(b) present a second example based on a correlation table initially provided by Ball et al. [7]. These investigators studied the effects of population density on short-season soybean yield and its components. Here we assume that the focus is on yield, so this becomes the key variable.

Table 2(a).

Example based on a table from Ball et al. [7]. The original table presents Pearson’s correlation coefficients for yield and yield components from irrigated and non-irrigated systems over a range of population densities. Here we present only the former results and treat yield as the trait of key interest.

	Population (plants m⁻²)	Pods plant⁻¹ (pods plant⁻¹)	Seeds pod⁻¹ (seeds pod⁻¹)	Mass seed⁻¹ (g seed⁻¹)
Pods plant⁻¹ (pods plant⁻¹)	−0.82
Seeds pod⁻¹ (seeds pod⁻¹)	−0.36	−0.17
Mass seed⁻¹ (g seed⁻¹)	−0.25	0.11	0.11
Yield (g m⁻²)	0.64	−0.47	−0.06	−0.34

Table 2(b).

A revised version of Table 2(a), with the focus on yield. The patterns of correlations of yield with the other variables can be grasped immediately.

	Yield (g m⁻²)	Population (plants m⁻²)	Seeds pod⁻¹ (seeds pod⁻¹)	Mass seed⁻¹ (g seed⁻¹)
Population (plants m⁻²)	0.64
Seeds pod⁻¹ (seeds pod⁻¹)	−0.06	−0.36
Mass seed⁻¹ (g seed⁻¹)	−0.34	−0.25	0.11
Pods plant⁻¹ (pods plant⁻¹)	−0.47	−0.82	−0.17	0.11

As before, to understand the correlations between yield and the other variables in Table 2(a) one must examine the table very carefully: the pattern of the correlations with yield is not immediately obvious. In Table 2(b), however, the data have been re-arranged so that yield now appears in the first column of coefficients and the rows are sorted by decreasing correlation with yield. Again, we suggest, it is much easier to interpret the magnitude of the correlations between yield and the other traits in Table 2(b) than in Table 2(a).

Finally, Tables 3(a) and 3(b) present yet another example, this time from the field of publishing [8]. Table 3(a) shows part of the data as originally presented. Ask yourself: which is the key variable in this study? Now re-arrange the table following our procedure to produce Table 3(b).

Table 3(a).

A fragment of a correlation table based on one from Hegarty and Walton [8]. Impact = 5-year citation count to a paper; JIF = journal impact factor; refs = no. of references; pages = length of articles; participants = no. of co-authors.

	JIF	Refs	Pages	Gender	Participants	Graphs	Tables
Impact	0.27	0.41	0.31	0.01	0.01	−0.14	0.13
JIF		0.35	0.31	0.08	−0.39	0.04	−0.14
Refs			0.63	−0.03	−0.19	−0.04	0.11
Pages				−0.06	−0.22	0.13	0.29
Gender					0.13	0.05	−0.11
Participants						−0.12	0.21
Graphs							−0.21

Table 3(b).

Revised version of Table 3(a). Here “impact” is treated as the key variable, as in Hegarty and Walton [8]. Variable names are as in Table 3(a).

	Impact	Refs	Pages	JIF	Tables	Gender	Participants
Refs	0.41
Pages	0.31	0.63
JIF	0.27	0.35	0.31
Tables	0.13	0.11	0.29	−0.14
Gender	0.01	−0.03	−0.06	0.08	−0.11
Participants	0.01	−0.19	−0.22	−0.39	0.21	0.13
Graphs	−0.14	−0.04	0.13	0.04	−0.21	0.05	−0.12

3. Conclusion

The way of presenting correlation tables described in this paper facilitates the reader’s understanding of the bi-variate relationships between one key variable and the other variables. Of course, this technique cannot replace multivariate techniques to explore correlations, but it can facilitate comparing simple correlations among the variables, and especially the reading of large correlation tables when there is one key variable.

Because correlation matrices are reported so often in the sciences, we believe that this technique will be found useful in various branches of the physical and social sciences. However, although correlation matrices are used so frequently, we have to remember that Pearson’s correlations often fail to discover relationships in multivariate data sets. This is for various reasons, the most important being that they assume that all the pairs of variables are linearly associated, and that there are no outliers that would affect this linearity. In addition, these approaches discount the multidimensionality of a data set much more severely than, for example, a scatterplot does. Finally, it is also worth noting that such ordering of variables can be applied equally well to heat maps representing correlation tables: heat maps have a long tradition in graphing data matrices to reveal row and column hierarchical cluster structures [9], although they are used much less often in the scientific literature. When used for correlations, they are termed corrgrams [6].
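The caveat about linearity and outliers can be made concrete with a small sketch. The data below are invented for illustration, and the helper `pearson` is a plain textbook implementation written for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's product moment correlation between two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# A perfect but non-linear relationship: y = x^2 on values symmetric about 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]
r_quadratic = pearson(x, y)  # zero, although y is fully determined by x

# Five hypothetical points with no linear association ...
x_base = [1, 2, 3, 4, 5]
y_base = [3, 1, 2, 1, 3]
r_base = pearson(x_base, y_base)  # zero

# ... and the same points plus one extreme observation.
r_outlier = pearson(x_base + [20], y_base + [20])  # about 0.97

print(f"quadratic: {r_quadratic:.2f}")
print(f"without outlier: {r_base:.2f}")
print(f"with outlier: {r_outlier:.2f}")
```

In both cases a table of coefficients alone would mislead, however its variables are ordered, whereas a scatterplot would show the curve and the outlying point at once.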

References

[1] Ehrenberg ASC. Rudiments of numeracy. Journal of the Royal Statistical Society A 1977, 140(3), 277–297.

[2] Ryder. Guidelines for the presentation of numerical tables. Research in Veterinary Science 1995, 58, 1–4.

[3] Mądry, Kozak, Pluta, Żurawicz. A new approach to sequential yield component analysis (SYCA): Application to fruit yield in blackcurrant (Ribes nigrum L.). Journal of New Seeds 2005, 7(1), 85–107.

[4] Bin, Richardson. An ergonomics study of a semiconductors factory in an IDC for improvement in occupational Health and Safety. International Journal of Occupational Safety and Ergonomics 2010, 16(3), 345–356.

[5] Hartley. New ways of making academic articles easier to read. International Journal of Clinical and Health Psychology 2012, 12(1), 143–160.

[6] Friendly. Corrgrams: Exploratory displays for correlation matrices. The American Statistician 2002, 56(4), 316–324.

[7] Ball, McNew, Vories, Keisling, Purcell. Path analyses of population density effects on short-season soybean yield. Agronomy Journal 2001, 93, 187–195.

[8] Hegarty, Walton. The consequences of predicting scientific impact in psychology using journal impact factors. Perspectives on Psychological Science 2012, 7(1), 72–78.

[9] Wilkinson, Friendly. The history of the cluster heat map. The American Statistician 2009, 63(2), 179–184.