Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. A regularized modification of canonical correlation analysis (RCCA) which imposes an penalty on the CCA coefficients is widely used in applications with high-dimensional data. One limitation of such regularization is that it ignores any data structure, treating all the features equally, which can be ill-suited for some applications. In this article we introduce several approaches to regularizing CCA that take the underlying data structure into account. In particular, the proposed group regularized canonical correlation analysis (GRCCA) is useful when the variables are correlated in groups. We illustrate some computational strategies to avoid excessive computations with regularized CCA in high dimensions. We demonstrate the application of these methods in our motivating application from neuroscience, as well as in a small simulation example.
Canonical correlation analysis (CCA) is a classic method commonly used in statistics for studying complex multivariate data. CCA was first introduced by (Hotelling (1936)) as a tool for finding relationships between two sets of variables. It remains relevant in many domains including, but not limited to, genetics (see, for example, Waaijenborg etal. 2008; Parkhomenko et al., 2009; Cao et al., 2009) and neuroscience (see, for example, Wang et al., 2020; Zhuang et al., 2020). In many applications the number of available observations is significantly smaller than the number of features under consideration, so some form of regularization is essential. Numerous regularized CCA extensions have been proposed (see, for example, Lykou and Whittaker 2010; Hardoon and Shawe-Taylor 2011; Witten and Tibshirani 2009). The most popular existing approach, called Regularized CCA (RCCA), imposes an -penalty on the canonical coefficients (see, for example, Vinod, 1976; Leurgans et al., 1993). Like any other standard regularization method based on the penalty, it has the property of treating all the coefficients equally and shrinking them towards zero. Although RCCA is well suited to data with general structure, in some applications the structure of the data can play an important role when investigating the association between variables. In this article, we develop several regularized extensions of CCA. These extensions were originally motivated by brain imaging applications, but the scope of applications can readily be extended to other fields.
The article is organized as follows. In Section 2, we introduce the necessary background for both CCA and RCCA methods. In Section 3 we propose several approaches to regularization that account for the underlying structure of the data. In particular, in Section 4, we introduce partially regularized canonical correlation analysis (PRCCA) that imposes an penalty only on a subset of the features. Although both RCCA and PRCCA problems have simple explicit solutions that can be computed via singular value decomposition, they require us to work in terms of sample covariance matrices. This can be infeasible when the number of features is very large. In Sections 2.4 and 4.2, we cover the ‘kernel’ trick that allows to escape excessive computations while conducting CCA with regularization. In Section 5, we introduce group regularized CCA (GRCCA), a novel method that exploits group structure in the data.
Since all the methods under consideration have similar structure, they can be considered as special cases of CCA with a general regularization penalty discussed in Sections 6. All the technical details and proofs for these methods are covered in the supplemental material.
We illustrate the proposed methods on our motivating study example involving functional brain imaging data in Section 3 as well as on a small simulation example in Section 7. We conclude with a discussion, including some ideas for future work.
Canonical correlation analysis with regularization
Canonical correlation analysis
Consider two random vectors and . The goal of CCA is to find a linear combination of variables and a linear combination of variables with the maximum possible correlation. Typically, we find a sequence of such linear combinations. Namely, for define a sequence of pairs of random variables as follows (see, for example, Härdle and Simar, 2007)
Random variables and are linear combinations of and , respectively,that is,
and
Coefficient vectors and maximize the correlation
Pair is uncorrelated with previous pairs, i.e
for
The pair is called -th pair of canonical variates; the corresponding optimal correlation value is called -th canonical correlation.
Note that correlation coefficient can be rewritten as
where , and refer to the covariance matrices , and respectively. It is easy to restate maximization of w.r.t. and in terms of a constrained optimization problem
Thus, finding the -th canonical pair is equivalent to solving the problem:
One can show that the canonical variates can be found via a singular value decomposition of the matrix , and that the canonical correlations coincide with the singular values of this matrix (see, for example, Mardia et al., 1979).
Dealing with high dimensions
In practice, we replace covariance matrices , and by the sample covariance matrices , and . Specifically, suppose and refer to matrices of observations for random vectors and respectively. Without loss of generality, assume that the columns of and are centred (mean 0), then
, and
If the number of observations is smaller than and/or , the corresponding sample covariance matrices are singular and the inverses and/or do not exist. Regularized canonical correlation analysis (RCCA) resolves this problem by adding diagonal matrices to the sample covariance matrices of and (see, for example, Leurgans et al., 1993; Gonzalez et al., 2008):
Here refers to the identity matrix. The modified correlation coefficient that is maximized while seeking pairs of canonical variates is, hence,
By analogy with CCA, it is easy to show that the RCCA variates can be found via the singular value decomposition of the matrix and that RCCA modified correlations are equal to the singular values of this matrix.
Shrinkage property
Similar to ridge regression, regularization shrinks the CCA coefficients and towards zero, where the penalty parameters and control the strength of the shrinkage of and , respectively. This can be supported by the following reasoning. As in the case of CCA, maximization of the modified correlation w.r.t. and can be restated as a constrained optimization problem
maximize w.r.t. and subject to and
Note that the constraints can be rewritten as
and ,
where refers to the vector Euclidean norm. Finally, one can interpret and as Lagrangian dual variables for constraints and which brings us to the optimization problem
One can show that for some appropriately chosen and this problem is equivalent to maximizing objective (2.4). Moreover, increasing and is equivalent to decreasing thresholds and which leads us to the shrinkage property. Finally, increasing and increases the denominator of (2.4) thereby shrinking the modified correlation coefficient to zero as well.
RCCA kernel trick
In some applications we need to deal with a very high-dimensional feature space. For instance, analysing functional magnetic resonance imaging (fMRI) data, where the dimension refers to the number of brain regions (or voxels), the number of features can reach hundreds of thousands. If one of and is very large it can be problematic to store matrices and . In this section we illustrate a simple trick based on the invariance of the RCCA problem under orthogonal transformations, that allows one to handle high-dimensional data when computing the RCCA solution. The idea to reduce a high dimensional CCA problem to a low-dimensional one via the kernel trick was previously introduced by Kuss (2003) and Hardoon et al. (2005). Below we demonstrate the practical application of this idea to the RCCA problem that we subsequently use in the implementation.
For simplicity, we assume that regularization is imposed on the part only, that is, we assume and set . The same reasoning applies if we regularize part as well. First we use the fact that any matrix with can be decomposed (e.g., via SVD) into a product , where is a square matrix, and is a matrix with orthonormal columns, that is, .
Lemma 2.1. [RCCA kernel trick] The original RCCA problem stated for and can be reduced to solving the RCCA problem for and . The resulting canonical correlations and variates for these two problems coincide. The canonical coefficients for the original problem can be recovered via the linear transformation
See Supplement Section 1 for the proof. Note that for the above trick allows us to avoid manipulating large and covariance matrices and and to operate in terms of smaller and matrices and . Of course, exactly the same trick can be applied to the part if
Hyperparameter tuning
Before proceeding to our first example, let us discuss how one can tune the hyperparemeters. There are two hyperparameters for RCCA, that is, and . Let us denote the vector of hyperparameters by . The values for these hyperparameters can be chosen via cross-validation. Below we present the outline for hold-out cross-validation; however, it can be naturally extended to the case of -fold cross-validation.
First we split all available observations into train and validation sets. We use the former set to fit the model and compute canonical coefficients and . Further, we use the latter set to estimate the model performance, that is, we calculate Note that here we utilize simple correlation instead of the modified correlation as a measure of performance. We pick the values of the hyperparameters maximizing the validation correlation, that is, which can be done by means of grid search.
Example. Human Connectome data study
In this section we present an application of regularized CCA to data from a neuroscience study: the Human Connectome Project for Disordered Emotional States (HCP-DES) Tozzi et al. (2020). One aim of HCP-DES is to link the function of macroscopic human brain circuits to self-reports of emotional well-being using magnetic resonance imaging. Here, we focused on brain activations during a Gambling task designed to probe the brain circuits underlying reward (described in detail in Barch et al., 2013; Tozzi et al., 2020). We linked this neuroimaging data with self-reports assessing various aspects of reward-related behaviours (Behavioural Approach System/Behavioural Inhibition Scale (BIS/BAS), Carver and White, 1994), depression symptoms (Mood and Anxiety Symptom Questionnaire (MASQ), Wardenaar et al., 2010) and positive as well as negative affective states (Positive and Negative Affect Schedule (PANAS), Watson et al., 1988). We selected participants who had complete self-report and imaging data as well as no quality control issues, for a total of 153 participants (94 females, 59 males, mean age 25.91, sd 4.85). For details on the preprocessing and subject-level modelling used to derive brain activations in response to the Gambling task, see Section 6 of the Supplement. We used for our analysis the activations for the monetary reward compared to monetary loss during the task. For each subject, the activation at each greyordinate (grey matter coordinate) in the brain was extracted, yielding a matrix of rows (subjects) and columns (greyordinates). The self-report data consisted of 9 variables: drive, fun seeking, reward responsiveness (from the BAS), total behavioural inhibition (from the BIS), distress, anhedonia, anxious arousal (from the MASQ), positive and negative affective states (from the PANAS). These were entered in a matrix of rows (subjects) and columns.
To test our structured methods, the greyordinates were grouped into 229 regions depending on their functional and anatomical properties, corresponding to the 210 regions of the Brainnetome atlas (Fan et al., 2016) with the addition of the 19 subcortical regions of the Desikan-Killiany Atlas (Desikan et al., 2006). The resulting brain regions are presented in Figure 1.
229 brain regions: 210 cortical regions of the Brainnetome atlas and 19 subcortical regions of the Desikan-Killiany Atlas
RCCA results
We start with a relatively simple model averaging activation inside each brain region and using the averaged values as features. Thus becomes of dimension . To remove the effect of sex on the resulting correlation we adjusted both activation and behavioural data for the binary sex variable by means of simple linear regression (mean adjustment). Since the matrix is relatively small we imposed no penalty on (). To pick the hyperparameter we ran ten fold cross-validation on the adjusted data pairs with the penalty factor varying over the grid .
The resulting cross-validation curves represent the unpenalized correlation between canonical variates (computed on the 9 folds of train set and one fold of validation set) averaged across 10 folds (see Figure 2a). Note that although larger shrinks the modified correlation towards zero, the unpenalized correlation is not guaranteed to be monotonically decreasing in . According to the plot, the highest score is achieved for with the corresponding test correlation equal to . Using the kernel trick we now run RCCA for the original activation data ( features adjusted for the sex effect). According to Figure 2b, the maximum score is equal to (). In general, to compare two models and check that the cross-validation score does not reflect spurious findings we should validate the performance of the models on an independent test set. However, the small sample size of the data (only 153 observations) makes the test correlation estimates unreliable. We use nested cross-validation (NCV) to overcome the problem of overfitting to the dataset that we use for tuning. Specifically, we split the data in 11 folds. Each of the 11 folds is given an opportunity to be used as an independent test set, while all other 10 folds folds are used to tune the hyperparameters via ten fold cross-validation. Therefore, we report 11 cross-validation scores along with 11 test scores, and we present the average for both as well as 1SE confidence intervals (see Figure 7). According to the NCV scores, the cross-validation procedure for the full RCCA model discovered significant correlation (independent test set correlation averaged across 11 folds is 0.105). However, the correlation value obtained by the smaller mean RCCA model was way too optimistic (average test score is 0.044 with wide error bands).
The cross-validation curves obtained via RCCA with ten fold cross-validation for the Human Connectome dataset. Left panel: (unpenalized) correlation between train canonical variates. Right panel: (unpenalized) correlation between validation canonical variates
It is worth noting the computational speed of the proposed method. Unlike the rcc() function from the popular CCA R package (see (Gonzalez et al., 2008)), which is not able to handle such a large number of features (K for the side), our implementation of RCCA with the kernel trick completes the calculations in 20 seconds.
Interpretability of canonical coefficients
In this section we visualize the RCCA coefficients corresponding to the hyperparameters chosen by cross-validation. Recall that the original features represent the brain activation detected at each brain greyordinate, so we can map the resulting RCCA coefficients back to the brain surface (see Figure 3). There is an apparent trade-off between the interpretability and flexibility of the model. Namely, although full data RCCA is more flexibile, there is quite a lot of variation in the resulting canonical coefficients. This makes the corresponding brain image harder to interpret. On the other hand, the reduced model allows us to identify the brain regions that have the highest impact on the resulting correlation. However, it loses in terms of flexibility (and, potentially, performance). In what follows, we aim to develop the model that links these two extremes enabling us to control the interpretability vs. flexibility trade-off.
RCCA coefficients computed for the averaged data (229 features) with hyperparameter chosen by cross-validation and for the original data (90 368 features) with hyperparameter chosen by cross-validation
CCA with partial regularization
Penalizing a subset of canonical coefficients
Suppose you are interested in the influence of a specific brain region on the resulting CCA correlation, however, you do not want to completely eliminate the remaining brain regions from the data. Recall that the inequality constraints in the RCCA optimization problem (2.5) control the deviation of all canonical coefficients from zero. PRCCA is a modification of RCCA that allows one to shrink only a subset of the CCA coefficients leaving the complement unpenalized. The proposed PRCCA method is a key building block for our final group RCCA approach and also admits some independent interesting applications.
Suppose that both and are split into two parts
where and with where and with
Replacing the constraints and in the optimization problem (2.5) by and , respectively, we get the PRCCA optimization problem
Re-expressing the constraints for and in terms of dual variables, we get
This leads us to the PRCCA modification of the correlation coefficient as follows:
As usual, PRCCA variates and coefficients obtained by maximizing (4.2) can be found by means of SVD of the matrix
PRCCA kernel trick
In this section we extend the kernel trick to the PRCCA problem set up. Note that, because was replaced by block matrix in the denominator of the modified correlation coefficient (2.4), the PRCCA problem does not preserve the property of invariance under orthogonal transformations. Thus, the mathematics used in Section 2.4 does not work anymore. There are two main ingredients for the PRCCA kernel trick. First, if the feature matrix consists of two orthogonal blocks, then the kernel trick can be applied to each block independently. Second, there exists a non-orthogonal transformation of the feature matrix making the two blocks orthogonal to each other, while resulting in an equivalent PRCCA problem.
We again assume for simplicity that the regularization penalty is imposed on the part only. Suppose where random vectors and correspond to penalized coefficients part and unpenalized part , respectively. Let and represent the corresponding matrices of observations, so . To make the PRCCA solution identifiable we require to be tall and full rank, that is, and We can also assume that
As the first step we find a linear transformation such that matrix has orthogonal blocks and , that is, and that preserves the second block, that is, . This can be easily done by linear regression (see Supplement for details). This linear transformation maps the original PRCCA problem to an equivalent one (equivariant in terms of the coefficients and invariant in terms of the objective).
Note that the above transformation forces the sample covariance matrix to be block-diagonal with blocks and , which enables us to apply the kernel trick to the first and the second block of independently. Specifically, consider the decomposition of , as in the RCCA kernel lemma, that is, where is a matrix with orthonormal columns and is some square matrix. Then the following lemma holds.
Lemma 4.2. [PRCCA kernel trick] The original PRCCA problem stated for and can be reduced to solving the smaller PRCCA problem for and . The resulting canonical correlations and variates for these two problems coincide. The canonical coefficients for the original problem can be recovered via the linear transformation
See Supplement Section 2 for the proof. Note that, according to the lemma, instead of working with large matrices and we can operate in terms of smaller matrices and thereby avoiding excessive computations.
Testing PRCCA on the Human Connectome data
First, we chose the brain region of interest that we aim to release from the regularization penalty. To do so for each column of averaged activation data we compute Cohen's d, which measures the effect size for a one-sample t-test comparing the population mean to zero, and pick the regions with at least a medium effect (). The resulting regions demonstrate the largest activation during the Gambling task. Then we run PRCCA on the averaged data imposing the penalty on all but these regions. We consider the same grid of values and choose the hyperparameter according to the maximum cross-validation score (see Section 2.5 for the details). The highest validation correlation is equal to and is achieved when (see Figure 4). Figure 5 represents the canonical coefficients computed for chosen by cross-validation. As expected, the coefficients for most regions were shrunk to zero leaving only a few standing out. Again, it is the kernel trick which enables running cross-validation for the extremely high-dimensional feature matrix in just a few minutes. Although PRCCA does not perform well in this application, it will play an important role in developing subsequent methods.
The cross-validation curves obtained via PRCCA with ten fold cross-validation for the Human Connectome dataset. Left panel: (unpenalized) correlation between train canonical variates. Right panel: (unpenalized) correlation between validation canonical variates
PRCCA coefficients computed for the averaged data ( features) with hyperparameter chosen by cross-validation
Canonical correlation analysis for grouped data
Handling data with a group structure
The main critique of applying standard RCCA approach to the fMRI data is that, in fact, RCCA completely ignores the brain geometry treating all the features equally. Recall that in the Human Connectome data the features representing particular greyordinates are grouped into macro regions according to the function and anatomy. The goal of the group regularized canonical correlation analysis (GRCCA) is to incorporate this underlying data structure into the regularization penalty.
There are some group extensions of CCA based on elastic net and group lasso penalizations suggested in the literature (see, for example, Chen et al., 2012; Lin et al., 2013). Unlike the existing methods the proposed GRCCA approach does not require an iterative algorithm and has a simple explicit solution. Equipped with the kernel trick it also allows working with data in a very high-dimensional feature space.
GRCCA solves the CCA problem under the following two natural assumptions. First, we assume homogeneity of groups and expect that the features within each group have approximately equal contribution to the canonical variates. In other words, the corresponding CCA coefficients do not vary significantly inside each group. Second, we assume the differentiating sparsity on a group level and expect that the coefficients will be shrunk towards zero all together for some groups. In terms of brain imaging applications these two assumptions mean that greyordinates ‘act in concert’ within each macro region and that some regions have a weaker effect on the studied phenomenon.
To state the GRCCA optimization problem rigorously we need to introduce some further notation. Suppose that the elements of random vectors and are known to come in and groups, respectively. For each let be the number of elements from that belong to group , so , and let be the random vector that consists of these elements. Without loss of generality, we can assume that . Similarly, one can divide the CCA coefficients into blocks corresponding to different groups, that is, . If denotes the mean of the CCA coefficients in group then the group homogeneity assumption implies that all values of do not deviate significantly from the mean value , whereas, the differentiating sparsity on a group level implies that the average deviation of from zero is small. This can be characterized by the following two constraints: and Note that these two equations can be interpreted as bounds on within- and between- group variation,respectively. One can derive similar constraints for coefficients as well. Adding all the constraints to the CCA optimization problem we end up with the GRCCA optimization problem
Next, denote . Let refer to the block diagonal matrix with blocks . Thus, the constraints on can be rewritten in terms of a regularization penalty as
Here, is the penalty matrix. The constraints for can be combined in a similar way leading to the GRCCA modified correlation coefficient
Note that this correlation coefficient has similar structure to the RCCA coefficient (2.4) and PRCCA coefficient (4.2), but now the covariance matrices in the denominator are adjusted by block diagonal matrices and .
Similar to RCCA and PRCCA, the explicit solution to the GRCCA problem can be found via the SVD of matrix which can be problematic in high dimensions. It turns out that there is a simple linear transformation that converts the GRCCA problem to an equivalent RCCA/PRCCA problem. In Supplement Sections 4 and 5 we give two ways: via the SVD of the penalty matrix and via feature matrix extension. This link can be subsequently used to establish the kernel trick for group-structured data thereby reducing computations in high dimensions.
Link to the flexibility vs. performance trade-off
There are several important properties of the proposed penalty matrix explaining the motivation for the GRCCA method. First, one can show that for , so RCCA is the special case of GRCCA. Second, increasing restrains the variability of coefficients within each brain region and, in limit, makes all the coefficients that belong to the same brain region equal to each other (and equal to the region mean). This is essentially equivalent to replacing features in each brain region by the average. Therefore, when the GRCCA problem becomes equivalent to the RCCA problem solved for the reduced data (see Section 3.1 for the details). To sum up, varying and allows us to approach the RCCA method conducted for either full or reduced data thereby controlling the flexibility vs interpretability trade-off described in Section 3.2.
GRCCA for the Human Connectome study
In this section, we apply the GRCCA method to the Human Connectome study data grouping activation features according to the brain regions. We again adjusted and for the sex effect and ran ten fold cross-validation on the adjusted data pair with the penalty factors varying in the grid (see Section 2.5 for the details). The resulting cross-validation curves are presented in Figure 6. According to the plot, the highest cross-validation score is attained for and leading to the correlation of . To validate these significant findings, we again run nested cross-validation (see Section 3.1). According to Figure 7, the NCV average test set score is equal to . Thus, although slightly optimistic (by a modest value of ), the GRCCA cross-validation correlation is not a spurious finding and is a significant improvement comparing to the RCCA method.
Note: For colour figure, please refer to the online version.
Nested cross-validation scores computed for three models: RCCA fitted on the averaged data (red, 229 features), RCCA fitted on the original data (green, features), GRCCA (blue). Two scores are reported: cv.validation = maximum score obtained via ten fold cross-validation, averaged across 11 NCV folds; test = score computed on independent test set, averaged across 11 NCV folds
In addition to better performance, the GRCCA method allows us to track the effect of the variation inside each brain region (controlled by the penalty factor) on the resulting canonical correlation separately from the effect of the variation across the regions (controlled by penalty factor). For example, the spikes for small values suggest that it is more beneficial to reduce within group variation than between group one. In other words, shrinking all brain region coefficients towards the group means improves the performance more than shrinking group means towards zero. Moreover, for small it is the ratio of hyperparameters that plays the key role: the highest score is always achieved when For large this pattern disappears as we start to over-penalize both between and within group variations.
Using GRCCA for visualization
In this section we demonstrate another advantage of GRCCA in the context of visualization and interpretability. In Figure 8, we present the coefficient paths ( vs. ) produced by the RCCA method as well as the group modification. Here different colours represent different brain regions. According to the plot, we observe the following behaviour of the coefficients. For the RCCA method the canonical coefficients are shrunk towards zero all together with the growth of . On the contrary, for large all the GRCCA coefficient paths become horizontal, which implies the convergence of the coefficients to the group means. Finally, increasing shrinks all the group means towards zero encouraging differentiating sparsity on a group level.
Coefficient paths for RCCA and GRCCA with colours representing the regions. The RCCA method shrinks canonical coefficients towards zero all together with the growth of . The GRCCA method shrinks them towards the group means with the growth of , whereas increasing shrinks all the group means towards zero
In Figure 9, we present the brain images computed for and . Note that larger makes the brain region pattern more obvious (similar to Figures 3a–3b). Moreover, for we get the plot corresponding to the best GRCCA model from Section 5.3. Thus, the GRCCA model chosen by cross-validation has not only better performance on the validation set than the RCCA model, but it is also more interpretable in the context of the importance of each brain region. Specifically, the canonical component had especially high positive loadings in subcortical regions involved in reward processing, such as the striatum (nucleus accumbens, putamen) and thalamus (Haber, 2017). It also loaded positively on a cortical network encompassing the temporal lobe, dorsolateral prefrontal, dorsomedial prefrontal, posterior cingulate and precentral cortices (see Figure 9 for an annotated visualization of these results). Most of these regions have been shown to be connected to the striatum and to be part of key reward-processing pathways as well (Haber, 2017).
From top to bottom: GRCCA coefficients for and . The third row represents the solution produced by the cross-validation procedure. Annotation of brain regions: [1] nucleus accumbens, [2] putamen, [3] thalamus, [4] temporal lobe, [5] dorsolateral prefrontal cortex, [6] dorsomedial prefrontal cortex, [7] posterior cingulate cortex, [8] precentral cortex
General approach to regularization
It turns out that all RCCA, PRCCA and GRCCA methods are similar in nature: they perform regularization by means of adjusting covariance matrices and/or in the denominator of the modified correlation coefficient. In this section, we consider the class of CCA problems with general weighted regularization.
If are some positive semi-definite penalty matrices, then the general modified correlation coefficient can be written as
The accompanying general RCCA optimization problem is therefore
Note that the inequality constraints in (6.2) can be rewritten as and , where is weighted Euclidean norm defined as The resulting canonical variates and coefficients can be found via the singular value decomposition of the matrix . To handle General RCCA in high dimensions, one can link it to the two methods for which we already established the kernel trick. In the Supplement Section 3 we provide the proof of the following lemma.
Lemma 6.3. [General RCCA to RCCA/PRCCA] If both and are positive definite then, by some proper change of basis, the general RCCA problem can be reduced to the RCCA one. Alternatively, if one of and has zero eigenvalues then general RCCA boils down to solving the PRCCA problem with number of unpenalized coefficients equal to the multiplicity of the zero eigenvalue.
Simulation study
Generating data with a group structure
In this section, we set up a small simulation experiment where we compare performance of all the above methods on the data with group structure. We generate the data as follows. For random vector we assume that it is grouped into groups of equal size, thereby having variables in group . Each group of is generated by one of centroid random variables and is obtained by adding some Gaussian noise to the centroid. Moreover, we assume the presence of some correlation between the centroids and . To be precise, to generate the data we exploit the multivariate normal distribution as a joint distribution of random vector and random vector of centroids :
with
Next, we generate random vector corresponding to groups from the distribution where is the th component of the centroid vector. Finally, we obtain matrices and by drawing samples from the above distributions. In our experiments we use , and , the number of groups is . We set and test two settings: for correlated data and for independent data.
As the next step, we run RCCA, PRCCA and GRCCA on the generated data imposing the regularization on the part only and using the following hyperparameters. For all methods the penalty factor is chosen to be . For PRCCA we penalize variables only leaving variables untouched; these five unpenalized variables correspond to the first features in each group. For GRCCA we again try two versions. First, we run GRCCA with , that is, without differentiating sparsity on a group level assumption. Next, we add differentiating sparsity to GRCCA and vary the penalty factor in the range .
We compare all methods in terms of resulting correlations. For this purpose, we generate train and test sets, fit models on train and evaluate canonical correlation value on test. We plot average train and test correlations as well as their one standard error intervals vs. penalty factor (see Figure 10). According to the plot, for the correlated data the best test score is achieved by sparse version of GRCCA, which significantly outperforms RCCA. Better performance can be explained by the presence of the groups structure in the data. Note that non-sparse GRCCA also looses in terms of the test score. The possible reason is that number of observations () is only twice as large as the number of groups (), so regularization on a group level helps to prevent overfitting to train data. In the case of independent data, all the competitors perform in a similar way: The average test correlation is very close to zero regardless the hyperparameter value; the test correlation curves are almost flat.
Train and test curves computed via simulation. Four models presented: RCCA, PRCCA and GRCCA with zero () and non-zero sparsity (). First row: train and test correlation obtained for data with correlation (). Second row: train and test correlation obtained for uncorrelated data ()
Finally, for each model we pick the value of according to the maximum test score and compare the CCA coefficients for the chosen models. Figure 11 displays the main difference between the RCCA and GRCCA methods. Although both techniques aim to reduce the data dimensionality, the reduction is achieved in a different way. Unlike RCCA, which treats all the coefficients equally and confines their deviation from zero, GRCCA carries out the reduction treating equally the coefficients inside each group and removing the within group noise. To sum up, in the presence of a group structure the group modification of RCCA allows for dimensionality reduction in a more efficient and interpretable way.
Note: For colour figure, please refer to the online version.
Discussion
In this article, we proposed several approaches to the CCA regularization. The introduced PRCCA technique has a similar flavour as RCCA, but it penalizes only a subset of canonical coefficients. Both of these methods combined with the proposed kernel trick allows us to find the CCA solution even in case of extremely high data dimensionality. We further present the GRCCA method, which is based on the underlying group structure of the data and which, therefore, can be useful in some applications, and extend regularization to the case of a more general regularization penalty thereby proposing General RCCA. The close connection between the latter techniques with RCCA and PRCCA methods enables to utilize the kernel trick in the general case thus providing a powerful tool for regularizing CCA in the high-dimensional framework.
There is still much scope for future work. One interesting direction for further research is to consider other applications of the proposed group RCCA technique. For example, there are many problems in genetics where genes are grouped by functional similarity. Further, in this article we cover only two types of penalties: partial and group; although the proposed kernel trick can handle any -type penalty (see Section 6 for general RCCA). Thus we can study other structured modifications of RCCA that can be beneficial for applications. As an example, it may be interesting to explore hierarchical group structure, where not only brain loci are combined in some regions, but also regions are combined in some groups (e.g., we have cortical and subcortical groups in the HCP study).
From the computational point of view, it would be useful to investigate how one can optimize the choice of the hyperparameters. The following idea is inspired by ridge regression, which also uses the penalty. Note that currently it is not necessary to apply any data normalization before running regularized CCA (all the fMRI features have the same scale). However, the overall scale of influences the choice of the hyperparameters, that is, multiplying by some number implies increasing the penalty factors by times as well. Therefore, it would be beneficial to develop some recommendations for the grid of hyperparameters the user should search through. For instance, we can use the ridge regression heuristic and introduce a concept of degrees-of-freedom (aka ) for CCA with regularization. Then, we can base the hyperparameter recommendation on the value.
Software
Proposed methods are implemented in the R package RCCA; the software is available from Github (https://github.com/ElenaTuzhilina/RCCA).
Supplementary material
Supplementary material is available online http://www.statmod.org/smij/archive.html.
Footnotes
Acknowledgments
The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: Leonardo Tozzi was supported by grant U01MH109985 under PAR-14-281 from the National Institutes of Health. Trevor Hastie was partially supported by grants DMS-2013736 and IIS 1837931 from the National Science Foundation, and grant 5R01 EB 001988-21 from the National Institutes of Health.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: Leonardo Tozzi was supported by grant U01MH109985 under PAR-14-281 from the National Institutes of Health. Trevor Hastie was partially supported by grants DMS-2013736 and IIS 1837931 from the National Science Foundation, and grant 5R01 EB 001988-21 from the National Institutes of Health.
References
1.
BarchDM, BurgessGC, HarmsMP, PetersenSE, SchlaggarBL, CorbettaM, GlasserMF, CurtissS, DixitS, FeldtC, NolanD, BryantE, HartleyT, FooterO, BjorkJM, PoldrackR, SmithS, Johansen-BergH, SnyderAZ and Van EssenDC (2013) Function in the Human Connectome: Task-fMRI and individual differences in behavior. NeuroImage, 80, 169–89.
2.
CaoK-AL, MartinP, Robert-GranieC and BesseP (2009) Sparse canonical methods for biological data integration: Application to a cross-platform study. BMC Bioinformatics, 10, 1–17.
3.
CarverCS and WhiteTL (1994) Behavioral inhibition, behavioral activation, and affective responses to impending reward and punishment: The BIS/BAS scales. Journal of Personality and Social Psychology, 67, 319–33.
4.
ChenX, HanL and CarbonellJ (2012) Structured sparse canonical correlation analysis. Proceedings of Machine Learning Research, 22, 199–207.
5.
DesikanRS, SegonneF, FischlB, QuinnBT, DickersonBC, BlackerD, BucknerRL, DaleAM, MaguireRP, HymanBT, AlbertMS and KillianyRJ (2006) An automated labeling system for subdividing the human cerebral cortex on MRI scans into gyral based regions of interest. NeuroImage, 31, 968–80.
6.
FanL, LiH, ZhuoJ, ZhangY, WangJ, ChenL, YangZ, ChuC, XieS, LairdAR, FoxPT, EickhoSB, YuC and JiangT (2016) The Human Brainnetome Atlas: A New Brain Atlas Based on Connectional Architecture. Cerebral Cortex, 26, 3508–26.
7.
GonzalezI, DéjeanS, MartinP and BacciniA (2008) CCA: An R Package to extend canonical correlation analysis. Journal of Statistical Software, 23, 1–14.
8.
HaberSN (2017) Chapter 1: Anatomy and connectivity of the reward Circuit. In Decision Neuroscience, edited by DreherJC and TremblayL, pages 3–19. San Diego, CA: Academic Press.
HardoonD, SzedmakS and Shawe-TaylorJ (2005) Canonical correlation analysis: An overview with application to learning me- thods. Neural Computation, 16, 2639–64.
12.
HotellingH (1936) Relations between two sets of variables. Biometrika, 28, 321–77.
13.
KussM (2003) The geometry of kernel canonical correlation analysis (Technical report).
14.
LeurgansS, MoyeedR and SilvermanB (1993) Canonical correlation analysis when the data are curves. Journal of the Royal Statistical Society, 55, 725–40.
15.
LinD, ZhangJ, LiJ, CalhounV, DengH-W and WangY-P (2013) Group sparse canonical correlation analysis for genomic data integration. BMC Bioinformatics, 14, Article 245.
16.
LykouA and WhittakerJ (2010) Sparse CCA using a Lasso with positivity constraints. Computational Statistics and Data Analysis, 54, 3144–57.
17.
MardiaK, KentJT and BibbyJM (1979) Multivariate Analysis. New York, NY: Academic Press.
18.
ParkhomenkoE, TritchleraD and BeyeneJ (2009) Sparse canonical correlation analysis with application to genomic data integration. Statistical Applications in Genetics and Molecular Biology, 8, 1–34.
19.
TozziL, StavelandB, Holt-GosselinB, ChesnutM, ChangSE, ChoiD, ShinerM, WuH, Lerma-UsabiagaG, SpornsO, BarchDM, GotlibIH, HastieTJ, KerrAB, PoldrackRA, WandellBA, WintermarkM and WilliamsLM (2020) The Human Connectome project for disordered emotional states: Protocol and rationale for a research domain criteria study of brain connectivity in young adult anxiety and depression. NeuroImage, 214, 116715.
20.
VinodH (1976) Canonical ridge and econometrics of joint production. Journal of Econometrics, 4, 147–166 of Econometrics, 4(2), 147–66.
21.
WaaijenborgS, de Witt HamerPCV and ZwindermanA (2008) Quantifying the association between gene expressions and DNA-markers by penalized canonical correlation analysis. Statistical Applications in Genetics and Molecular Biology, 7, Article 3.
22.
WangH-T, SmallwoodJ and Mourao-MirandaJ (2020) Finding the needle in a high-dimensional haystack: Canonical correlation analysis for neuroscientists. Neuroimage, 216, 116745.
23.
WardenaarKJ, van VeenT, GiltayEJ, de BeursE, PenninxBWJH and ZitmanFG (2010) Development and validation of a 30-item short adaptation of the Mood and Anxiety Symptoms Questionnaire (MASQ). Psychiatry Research, 179, 101–6.
24.
WatsonD, ClarkLA and TellegenA (1988) Development and validation of brief measures of positive and negative affect: The PANAS scales. Journal of Personality and Social Psychology, 54, 1063–70.
25.
WittenD and TibshiraniR (2009) Extensions of sparse canonical correlation analysis with applications to genomic data. Statistical pplications in Genetics and Molecular Biology, 8, Article 28.
26.
ZhuangX, YangZ and CordesD (2020) A technical review of canonical correlation analysis for neuroscience applications. Human Brain Mapping, 41, 3807–33.