Abstract
Background:
Functional connectomes (FCs) have been shown to provide a reproducible individual fingerprint, which has opened the possibility of personalized medicine for neuro/psychiatric disorders. Thus, developing accurate ways to compare FCs is essential to establish associations with behavior and/or cognition at the individual level.
Methods:
Canonically, FCs are compared using Pearson's correlation coefficient of the entire functional connectivity profiles. Recently, it has been proposed that the use of geodesic distance is a more accurate way of comparing FCs, one which reflects the underlying non-Euclidean geometry of the data. Computing geodesic distance requires FCs to be positive-definite and hence invertible matrices. As this requirement depends on the functional magnetic resonance imaging scanning length and the parcellation used, it is not always attainable and sometimes a regularization procedure is required.
Results:
In the present work, we show that regularization is not only an algebraic operation for making FCs invertible, but also that an optimal magnitude of regularization leads to systematically higher fingerprints. We also show evidence that optimal regularization is data set-dependent and varies as a function of condition, parcellation, scanning length, and the number of frames used to compute the FCs.
Discussion:
We demonstrate that a universally fixed regularization does not fully uncover the potential of geodesic distance on individual fingerprinting and indeed could severely diminish it. Thus, an optimal regularization must be estimated on each data set to uncover the most differentiable across-subject and reproducible within-subject geodesic distances between FCs. The resulting pairwise geodesic distances at the optimal regularization level constitute a very reliable quantification of differences between subjects.
Impact statement
Functional connectomes (FCs) have a reproducible individual fingerprint, making it possible to study neurological and psychiatric phenomena at an individual level. But this requires an accurate way to compare FCs to establish individual-level associations with behavior and/or cognition. Although the canonical methods of comparing FCs (e.g., correlation, Euclidean) are adequate, geodesic distance provides a more principled and accurate way of comparing FCs by utilizing the underlying non-Euclidean geometry of correlation matrices. We demonstrate that by combining geodesic distance with an optimal amount of regularization, we can get substantially more reliable estimates of relative distances between FCs and thus uncover individual-level differences.
Introduction
Brain activity can be estimated, indirectly, by measuring the blood oxygenation level dependent (BOLD) signal using magnetic resonance imaging (MRI) (Bandettini et al., 1992; Frahm et al., 1992; Kwong et al., 1992; Ogawa et al., 1990, 1992). This is the standard technique to generate brain images in functional MRI (fMRI) studies. Functional connectivity between two distinct brain regions is then defined as the statistical dependence between the corresponding BOLD signals, canonically estimated with Pearson's correlation coefficient (Bravais, 1846; Galton, 1886). A whole-brain functional connectivity pattern can be represented as a full symmetric correlation matrix denominated functional connectome (FC) (Fornito et al., 2016; Sporns, 2018). FCs have been used to study the changes in brain connectivity with aging (Zuo et al., 2017), cognitive abilities (Shen et al., 2017; Svaldi et al., 2019), and across a wide range of brain disorders (Fornito and Bullmore, 2015; Fornito et al., 2015; van den Heuvel and Sporns, 2019). Recently, it has also been shown that FCs have a recurrent and reproducible individual fingerprint (Abbas et al., 2020; Amico and Goñi, 2018; Finn et al., 2015; Gratton et al., 2018; Mars et al., 2018; Pallarés et al., 2018; Rajapandian et al., 2020; Satterthwaite et al., 2018; Seitzman et al., 2019; Venkatesh et al., 2020), which has opened the possibility of personalized medicine for neuro/psychiatric disorders (Satterthwaite et al., 2018), aided by improved acquisition parameters and the availability of large data sets with open data policy (Allen et al., 2014b; Amunts et al., 2016; Miller et al., 2016; Okano et al., 2015; Poo et al., 2016; Van Essen et al., 2012, 2013).
A clinically useful individual-level biomarker must have high interindividual differentiability, which in turn requires an accurate way of comparing individual FCs. FCs are traditionally compared by computing Pearson's correlation coefficient between their upper-triangular vectorized versions (Amico and Goñi, 2018; Bari et al., 2019; Finn et al., 2015). This approach enables us to assess to what extent it is possible to identify a participant from a large population of participants, a process known as fingerprinting or subject identification. The success rate of subject identification is known as the identification rate (Finn et al., 2015) and has also been referred to as participant identification (Venkatesh et al., 2020). Although comparing FCs using Pearson's correlation coefficient is intuitive and computationally simple, it ignores the underlying geometry of the correlation-based FCs (Venkatesh et al., 2020) and hence has had only limited success in terms of identification rates (Finn et al., 2015).
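The conventional comparison described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the authors' code: the function names and the toy data (10 regions, 200 frames, a shared signal plus independent noise) are assumptions made only for the example.

```python
import numpy as np

def upper_triangle(fc):
    """Return the strictly upper-triangular entries of a square matrix as a vector."""
    rows, cols = np.triu_indices(fc.shape[0], k=1)
    return fc[rows, cols]

def fc_correlation(fc_a, fc_b):
    """Pearson correlation between the vectorized upper triangles of two FCs."""
    return float(np.corrcoef(upper_triangle(fc_a), upper_triangle(fc_b))[0, 1])

# Hypothetical toy data: two "sessions" sharing a signal, each with its own noise.
rng = np.random.default_rng(0)
shared = rng.standard_normal((200, 10))  # 200 frames, 10 regions
fc_test = np.corrcoef((shared + 0.5 * rng.standard_normal((200, 10))).T)
fc_retest = np.corrcoef((shared + 0.5 * rng.standard_normal((200, 10))).T)

similarity = fc_correlation(fc_test, fc_retest)
```

Note that the vectorization treats every matrix entry as an independent feature, which is the simplification that the geometry-aware approach avoids.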
A geometry-aware approach (Venkatesh et al., 2020) has recently been introduced to establish a more accurate way of measuring distance between any two FCs. FCs computed using Pearson's correlation coefficient between BOLD signals of all brain regions are objects that lie on or inside a nonlinear surface or manifold called the positive semidefinite cone (Fig. 1). This non-Euclidean geometry of FCs suggests that the distances between FCs are better measured along a geodesic of the cone. This contrasts with using correlation which is equivalent to the cosine of the angle between demeaned and normalized FCs, or the Euclidean distance which is equivalent to the straight-line distance between FCs. Venkatesh and colleagues (2020) applied the geodesic approach of comparison to the problem of individual fingerprinting and showed that it improves identification rates robustly compared with a dissimilarity measure based on Pearson's correlation coefficient. The improvement was observed across most conditions (resting-state [REST] and seven fMRI tasks) from the Human Connectome Project (HCP) data set.

Incremental regularization of FCs and its effect on the estimates of geodesic distance. We illustrate the geodesic distance between two FCs of size
The non-optimality of conventional metrics to compare FCs can be shown in another way. When comparing FCs using the conventional Pearson or Spearman-based correlations, the FCs are vectorized and then correlated. Implicit in this process is the assumption that all the elements of FCs are uncorrelated features. This is not the case. Since FCs are correlation matrices (Q), they live on or inside a positive semidefinite cone, that is,
The definition of geodesic distance between two positive definite matrices of the same size (say $Q_1$ and $Q_2$) requires that at least one of the matrices being compared is invertible (Pennec et al., 2006). When this is not the case (rank-deficient matrices with at least one eigenvalue equal to 0), both $Q_1$ and $Q_2$ can be regularized by adding a scaled identity matrix, $\tau I$.
Using a regularization of
In this article, we explore the effect of the magnitude of the regularization parameter (τ) on the geodesic distance between FCs and its impact on identification rates. We assess this effect for different scanning lengths, number of frames for a fixed scanning length, parcellations, and fMRI tasks and evaluate which levels of regularization maximize identification rates. In this manner, we aim to develop a procedure to uncover individual fingerprints by shifting FC data to an optimal location of the semidefinite cone where test/retest FCs are more differentiable across subjects. The existence of an optimal regularization would be critical to obtain the corresponding geodesic distances between FCs. An optimal amount of regularization should lead to higher identification rates in FCs (i.e., higher individual fingerprint), and hence, these optimally regularized FCs and particularly their corresponding pairwise distances would be better suited for establishing associations between functional connectivity and cognition, behavior, and neurological diseases at the individual level.
Methods
Data set
We included the
For each condition, subjects underwent two sessions corresponding to two different acquisitions (left to right or LR, and right to left or RL). The resting-state fMRI scans were acquired on two different days with a total of four sessions (coded as REST1 and REST2). The two sessions from REST1 were used for most of the analyses in this study. REST2 sessions were only used in the generalizability analysis (see the Subject Identification section). The HCP scanning protocol was approved by the institutional review board at Washington University in St. Louis. Full details on the HCP data set have been published previously (Glasser et al., 2013; Smith et al., 2013; Van Essen et al., 2012).
Brain parcellations
Two gray matter parcellations were used in this study: (i) the Destrieux atlas (Destrieux et al., 2010), or “aparc.2009s” in FreeSurfer nomenclature, defined using “Rules and algorithm that produced labels consistent with anatomical rules as well as automated computational parcellation,” featuring 75 regions in each hemisphere (74+Medial Wall), with the particularity of separating gyral and sulcal areas (a total of 150 brain regions); and (ii) the MMP1.0 atlas (Glasser et al., 2016), a multimodal parcellation of the human cerebral cortex, with 180 brain regions in each hemisphere (a total of 360 brain regions).
For completeness, 14 subcortical regions were added to each parcellation, as provided by the HCP release (filename
Preprocessing
The data processed using the “minimal” preprocessing pipeline from the HCP were employed in this work (Glasser et al., 2013). This pipeline included artifact removal, motion correction, and registration to standard template. Full details on this pipeline can be found in earlier publications (Glasser et al., 2013; Smith et al., 2013).
We added the following steps to the “minimal” processing pipeline. For resting-state fMRI data: (i) we regressed out the global gray matter signal from the voxel time courses (Power et al., 2014), (ii) we applied a bandpass first-order Butterworth filter in the forward and reverse directions [0.001–0.08 Hz (Power et al., 2014); MATLAB functions butter and filtfilt], and (iii) the voxel time courses were z-scored and then averaged per brain region, excluding any outlier time points that were more than three standard deviations from the mean (workbench software, command -cifti-parcellate). For task fMRI data, we applied the same steps, but a more liberal frequency range was adopted for the bandpass filter (0.001–0.25 Hz) (Amico et al., 2019), since the relationship between different tasks and optimal frequency ranges is still unclear (Cole et al., 2014).
Table 1 shows the number of frames per run and the scanning length for all fMRI conditions. It also shows the number of participants for whom this number of frames per run was available after the preprocessing. Runs that could not be fully processed, or that were left with fewer frames, were excluded from the analyses.
Summary of the Number of Unrelated Participants Available (of a Total of 426) for Each Parcellation and Condition After Complete Preprocessing of the Functional Magnetic Resonance Imaging Data with Corresponding Number of Frames per Run
EM, emotion processing; GAM, gambling; LAN, language; MOT, motor; REL, relational processing; REST, resting-state; SOC, social; WM, working memory.
Whole-brain FCs
As described in the Preprocessing section, for a given brain parcellation, time series data for each voxel were z-scored and averaged within each brain region. Pearson's correlation coefficient (MATLAB command corr) was used to estimate the functional connectivity between all pairs of brain regions, resulting in a symmetric correlation matrix of size
As mentioned above, FCs are correlation matrices and it is well known that correlation matrices are symmetric positive semidefinite (SPSD), which means that their eigenvalues are greater than or equal to zero (Bhatia, 2009). If all the eigenvalues of an FC are strictly greater than zero, then it is a symmetric positive definite (SPD) FC matrix. The rank and invertibility of an FC are also directly related to its eigenvalues: if one or more eigenvalues are zero, then that FC is rank-deficient and not invertible. When all the eigenvalues are greater than zero for an FC, it is full-rank and hence invertible (Bhatia, 2009). The rank of an FC depends on the number of brain regions in the parcellation (m) and the number of samples in the time series (T) such that:
$$\operatorname{rank}(\mathrm{FC}) \leq \min(m, T - 1)$$
For all the conditions, the FCs generated using Destrieux parcellation were full-rank if the number of samples (frames) in the time series used was
Geometry of FCs
FCs estimated using Pearson's correlation coefficient are objects that lie on or inside a nonlinear surface, or manifold, called the positive semidefinite cone. Although a three-dimensional visualization of this manifold is only possible for
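Let $S_m^{+}$ be the set of all symmetric positive semidefinite matrices of dimension $m$, which lie on or inside an SPSD cone of dimension $m$. The positive-definite matrices comprise the interior of the cone, while all the rank-deficient semidefinite matrices reside on the cone boundary. Now assume that $Q_1$ and $Q_2$ are two SPD matrices of size $m \times m$. The matrix $Q = Q_1^{-1/2} Q_2 Q_1^{-1/2}$ is then also SPD, and its corresponding $m$ eigenvalues satisfy $\lambda_i > 0$ for $i = 1, \ldots, m$. The geodesic distance between $Q_1$ and $Q_2$ is
$$d_G(Q_1, Q_2) = \sqrt{\operatorname{trace}\left(\log^2 Q\right)} = \sqrt{\sum_{i=1}^{m} \log^2 \lambda_i}, \qquad (1)$$
where $\log$ applied to a matrix denotes the matrix logarithm.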
Venkatesh and colleagues (2020) used
Subject identification
Subject identification is the process of identifying an individual's FC from a population of FCs, given another FC of that individual. All conditions (resting-state and seven tasks) in our data set contain two runs (LR and RL acquisition orientation), which we denominate here Test and Retest. To avoid any bias due to the acquisition orientation, runs were randomly assigned to either Test or Retest for each subject. This process was repeated for each condition separately.
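Each FC in the Retest data was labeled with the identity of the participant whose FC in the Test data was closest to it. We repeated this process for all the FCs in the Retest data and defined the identification rate as:
$$\text{identification rate} = \frac{\text{number of correctly labeled subjects}}{\text{total number of subjects}} \times 100$$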
This process was repeated by reversing the roles of test and retest sessions, as introduced by Finn and colleagues (2015). The final identification rate was obtained by averaging the two values.
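The two-way identification procedure described above can be sketched in a few lines. This is a minimal illustration, assuming the pairwise distances are already available as a square matrix whose rows index Retest FCs and whose columns index Test FCs, in the same subject order; the function name and the small distance matrix are hypothetical.

```python
import numpy as np

def identification_rate(dist):
    """Percentage of correctly identified subjects, averaged over both directions.

    dist[i, j] is the distance between the Retest FC of subject i and the
    Test FC of subject j (assumed layout; subjects share the same ordering).
    """
    truth = np.arange(dist.shape[0])
    # Label each Retest FC with the closest Test FC (row-wise minimum).
    retest_to_test = np.mean(np.argmin(dist, axis=1) == truth)
    # Reverse the roles: label each Test FC with the closest Retest FC.
    test_to_retest = np.mean(np.argmin(dist, axis=0) == truth)
    return 100.0 * 0.5 * (retest_to_test + test_to_retest)

# Hypothetical distances for three subjects.
dist = np.array([[0.10, 0.50, 0.90],
                 [0.40, 0.20, 0.80],
                 [0.30, 0.25, 0.70]])
rate = identification_rate(dist)
```

In this toy matrix the third Retest FC is closest to the second Test FC, so one direction identifies two of three subjects and the other identifies all three, giving an average of about 83.3%.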
The identification rates were computed for each condition separately. To study the effects of regularization on the identification rates, this process was repeated for a wide range of regularization parameter values, τ, in particular:
Different values of τ for the two parcellations were chosen based on preliminary exploration of the change in identification rates with τ.
To understand the effect of scanning length, for each value of τ, the identification process was repeated using consecutive frames from the total time series, from 50 frames up to the maximum number of frames, in steps of 50 (see Table 1 for the maximum number of available frames and the corresponding scanning length for all eight fMRI conditions).
To understand the effect of number of frames when the scanning length is fixed, the identification process was repeated for each value of τ using the maximum scanning length. The number of frames was adjusted by choosing alternating frames from the time series, that is, by picking every second, third, fourth … frame. Note that this process is equivalent to assessing identification rates for longer repetition times (TRs). The maximum gap between chosen frames was decided for each condition to keep at least 50 frames in the final time series.
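The two frame-selection schemes (consecutive frames for a shorter scan, and every k-th frame for a fixed scanning length) can be sketched as below. This is an illustration under stated assumptions: frames are taken from the beginning of the run, the function names are hypothetical, and the 1200-frame, 5-region time series is a toy stand-in rather than a value taken from Table 1.

```python
import numpy as np

def sequential_frames(ts, n_frames):
    """Keep the first n_frames rows (frames) of a frames-by-regions array."""
    return ts[:n_frames]

def strided_frames(ts, step):
    """Keep every step-th frame, which preserves the total scanning length."""
    return ts[::step]

def sequential_lengths(total, start=50, step=50):
    """Frame counts evaluated when varying the scanning length."""
    return list(range(start, total + 1, step))

def max_step(total, min_frames=50):
    """Largest gap between chosen frames that still leaves min_frames frames."""
    step = 1
    while len(range(0, total, step + 1)) >= min_frames:
        step += 1
    return step

# Hypothetical run with 1200 frames and 5 regions.
ts = np.arange(1200 * 5, dtype=float).reshape(1200, 5)
short_scan = sequential_frames(ts, 300)  # shorter scan, original sampling
sparse_scan = strided_frames(ts, 3)      # full scan, every third frame
```

Both calls reduce the number of frames, but only the first one shortens the scanning length; the second is what makes the comparison equivalent to a longer TR.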
To assess variability in identification performance due to differences in samples, we used sampling without replacement. For every run, we randomly selected 80% of the participants and performed subject identification process. This procedure was repeated 100 times for each value of τ and for each number of frames evaluated.
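The resampling scheme can be sketched as follows. This is a hedged illustration, not the authors' implementation: it assumes a precomputed Retest-by-Test distance matrix with subjects in the same order, and the function name, the seed, and the rounding of the 80% subset size are assumptions.

```python
import numpy as np

def subsampled_identification(dist, n_repeats=100, fraction=0.8, seed=0):
    """Mean and standard error of the identification rate over random subsets.

    Each repeat draws a fraction of the subjects without replacement and
    recomputes the two-way identification rate on that subset.
    """
    rng = np.random.default_rng(seed)
    n_subjects = dist.shape[0]
    subset_size = int(round(fraction * n_subjects))
    truth = np.arange(subset_size)
    rates = np.empty(n_repeats)
    for rep in range(n_repeats):
        idx = rng.choice(n_subjects, size=subset_size, replace=False)
        sub = dist[np.ix_(idx, idx)]  # keep only the sampled subjects
        rates[rep] = 100.0 * 0.5 * (np.mean(sub.argmin(axis=1) == truth)
                                    + np.mean(sub.argmin(axis=0) == truth))
    sem = rates.std(ddof=1) / np.sqrt(n_repeats)
    return float(rates.mean()), float(sem)

# Hypothetical perfectly identifiable data set of 20 subjects.
mean_rate, sem_rate = subsampled_identification(1.0 - np.eye(20))
```

Restricting both rows and columns to the same sampled subjects keeps the candidate pool consistent with the subset being identified.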
The above-mentioned “sampling without replacement” process also serves as a proxy exploration of the generalizability of the optimal regularization magnitude to external data sets with the same or similar acquisition parameters as the ones used in this study. To explore generalizability of the optimal regularization magnitude across different sessions of the same subjects, two sessions from REST2 were used to compare the identification rates for varying values of τ with REST1, using the entire scanning length.
Results
We explored the effect of using different values of the regularization parameter (τ) on the geodesic distance, and on the uncovering of individual fingerprints in FCs. Identification rate (Finn et al., 2015; Venkatesh et al., 2020) was used as a metric to quantify the individual fingerprint. Identification rate was computed by the Subject Identification process, which is the process of identifying an individual's FC from a population of FCs, given another FC of that individual. Identification rate is simply the percentage of accurately identified individuals. Through a small example, we show evidence of regularization affecting not only the global geodesic distance but also the relative distance between FCs, which ultimately may affect identification rates. Then, we systematically studied how regularization affects identification rates for FCs, with different fMRI conditions (resting-state and seven fMRI tasks), parcellations, varying scanning lengths, and finally, varying number of frames for a fixed scanning length. The generalizability of the optimal regularization magnitude for different sessions of the same subjects and for different subjects for whom fMRI data were acquired with exactly the same acquisition parameters was also investigated.
We first provided an example to develop an intuitive understanding of how regularization affects geodesic distances between FCs. To do so, we assessed the effect of regularization on geodesic distances among FCs when subjects are performing the emotion processing task. Figure 2A shows that as regularization (τ) increases, average geodesic distance across all subjects and sessions (global geodesic distance), exponentially decreases. We then assessed the effect of regularization on the relative geodesic distances between FCs. Figure 2B shows the proximity-rank in terms of distance. Briefly, the proximity-rank of an FC B with respect to an FC A quantifies how many FCs in that data set are closer to FC A than FC B. Taking as reference subject Atest, we tracked the proximity-rank of the subjects Aretest, Bretest, and Cretest at different levels of τ. At

Effect of regularization (τ) on global and relative geodesic distances. We have chosen the emotion processing FCs to illustrate how geodesic distances across subjects and/or sessions change with regularization magnitude.
Results above show that regularization not only affects the global geodesic distance among FCs but also the relative distance, which may ultimately affect identification rates. Figure 2C and D shows the identifiability matrices for 25 subjects chosen arbitrarily (for ease of visualization) performing the emotion processing task at a low (
Intuitively, these results tell us that asymptotically, geodesic distances between FCs approach zero as τ tends to infinity. In addition to affecting the absolute magnitude of the distances, τ also affects the relative distances between FCs and we have preliminary evidence that there is an optimal value/range of τ, which would affect relative distances in such a way that FCs from the two sessions of the same subject are closer to each other than any other FCs. These findings motivate us to assess changes in subject identification rates with varying magnitudes of τ.
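The proximity-rank used in this example (how many FCs in the data set are closer to a reference FC A than a given FC B) can be computed directly from a vector of distances. The sketch below follows that verbal definition, so a rank of 0 means that B is the closest FC to A; whether the figure counts ranks from 0 or 1 is not stated here, and the function name and distance values are hypothetical.

```python
import numpy as np

def proximity_rank(dist_from_a, index_b):
    """Number of FCs strictly closer to the reference FC A than FC B is.

    dist_from_a holds the distances from A to every other FC in the data set
    (A itself excluded); index_b is the position of FC B in that vector.
    """
    dist_from_a = np.asarray(dist_from_a, dtype=float)
    return int(np.sum(dist_from_a < dist_from_a[index_b]))

# Hypothetical distances from a reference FC to four other FCs.
distances = [0.9, 0.3, 0.5, 0.7]
rank_of_third = proximity_rank(distances, 2)    # only 0.3 is closer than 0.5
rank_of_closest = proximity_rank(distances, 1)  # nothing is closer than 0.3
```

Tracking this rank for the same pair of FCs across values of τ shows whether regularization reorders neighbors, independently of the overall shrinkage of distances.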
Figure 3 shows the effect of τ on identification rates for all fMRI conditions (using the entire scanning length) and for both the Destrieux and MMP1.0 parcellations. Identification rates for all conditions and different parcellations appeared to be highly sensitive and roughly concave functions of τ. In most cases, we observed the presence of a clearly identifiable optimal τ (from now on denominated τ*) value for which the identification rate is maximized. For a few cases for the MMP1.0 parcellation, it seems that there was a wide range of optimal τ that produced very similar identification rates (e.g., resting-state, emotion).

Effect of regularization (τ) on identification rates. Identification rates for all eight conditions (utilizing maximum available scanning length) with variable magnitudes of τ, using Destrieux (left; 164 ROIs) and MMP1.0 (right; 374 ROIs) parcellations. Filled circles indicate the mean identification rate, whereas error bars indicate the standard error of the mean across samplings with replacement (error bars are small enough that they are hidden behind the circles). Legend indicates the eight conditions along with maximum available number of frames. Along each curve, the circle not filled indicates the optimal value of τ, which maximizes the identification rate. Color images are available online.
Using the entire scanning length, τ* depended not only on the condition but also on the parcellation (Fig. 3). The τ* values were smaller for the Destrieux parcellation than for the MMP1.0 parcellation for any given condition. Resting-state, language, and working memory had the highest, whereas the emotion task had the lowest identification rates at τ* for both parcellations. At τ*, the identification rates were either approximately equal (for resting-state) or higher when using MMP1.0 parcellation, compared with Destrieux, except for working memory and social tasks. For both parcellations, resting-state condition reached greater than 99% identification rate at τ*.
We then assessed the effect of scanning length on identification rate and how it interacts with τ. Results are shown in Figure 4 (Destrieux) and Figure 5 (MMP1.0). With the Destrieux parcellation (164 brain regions), in general, τ* was particularly small (

Identification rates as a function of regularization (τ) and scanning length used to compute FCs using Destrieux parcellation. The panel shows identification rates, averaged across samplings without replacement, for all eight fMRI conditions. For any given condition, the scanning length was adjusted by selecting frames sequentially of the total time series ranging from 50 to maximum number of frames available, in steps of 50. fMRI, functional magnetic resonance imaging. Color images are available online.

Identification rates as a function of regularization (τ) and scanning length used to compute FCs using MMP1.0 parcellation. The panel shows identification rates, averaged across samplings without replacement, for all eight fMRI conditions. For any given condition, the scanning length was adjusted by selecting frames sequentially of the total time series ranging from 50 to maximum number of frames available, in steps of 50. Color images are available online.
Optimal Identification Rates for All Eight Functional Magnetic Resonance Imaging Conditions Using Destrieux Parcellation, and the Corresponding Values of the Optimal Scan Length, Percentage of Maximum Available Frames, and the Optimal Regularization Magnitude (τ* )
ID, identification.
With MMP1.0 parcellation (374 brain regions), we observed similar results. Just as with the Destrieux parcellation, resting-state behaved differently than tasks. First, for any given scanning length, τ* values were much smaller for resting-state than for tasks (Fig. 5). Second, the identification rates for resting-state were more dependent on the scanning length than on the regularization. For a given τ, identification rates tended to increase with increasing scanning length for all conditions, with maximal identification rates achieved with entire scanning length (Table 3). Finally, the optimal ranges of τ were broader with shorter scanning length and more specific with increasing scanning length. In comparison to the Destrieux parcellation, the narrowing of the optimal τ range required longer scanning length for MMP1.0 for any given condition.
Optimal Identification Rates for All Eight Functional Magnetic Resonance Imaging Conditions Using MMP1.0 Parcellation, and the Corresponding Values of the Optimal Scan Length, Percentage of Maximum Available Frames, and the Optimal Regularization Magnitude (τ* )
We also assessed the effect of number of frames on the identification rates, when maintaining the entire scanning length. Overall, for a given condition, the identification rate was not severely affected by decreasing the number of frames (Fig. 6). When the number of frames became too small (different for each condition), identification rates dropped more drastically for the Destrieux parcellation than for the MMP1.0. It is interesting to note that with ∼170 or more frames, identification rates reach a plateau for all fMRI conditions and parcellations.

Effect of number of frames on identification rates using the entire scanning length. Identification rates for all eight fMRI conditions (utilizing optimal regularization magnitude [τ*]—see Table 1) with variable number of frames, using Destrieux (left; 164 ROIs) and MMP1.0 (right; 374 ROIs) parcellations. Maximum scanning length was always maintained for each condition by choosing alternate points from BOLD time series. For instance, 397 frames were obtained for resting-state by choosing every third time point. Filled circles indicate the mean identification rate, whereas error bars indicate the standard error of the mean across samplings with replacement (error bars are small enough that they are hidden behind the dots). Legend indicates the eight fMRI conditions along with the maximum number of frames available. BOLD, blood oxygenation level dependent. Color images are available online.
A very low standard error of mean was observed for all the analyses discussed above (Figs. 3–6), highlighting the generalizability of the optimal regularization magnitude to FCs from different subjects. Optimal regularization magnitude and the corresponding identification rates for REST2 were found to be similar to REST1 (Fig. 7) highlighting the generalizability across different sessions of the same subjects. It should be noted that for both REST1 (Fig. 3) and REST2 (Fig. 7), there is a range of τ where the corresponding identification rates are approximately equal to the optimal identification rate. In addition, the scatter plots between identification rates of REST1 and REST2 show how similarly the two samples behave with respect to τ (Fig. 7; insets).

Generalizability: effect of regularization (τ) on identification rates for REST2. Identification rates for the two sessions (LR and RL) from REST2 (utilizing maximum available scanning length) with variable magnitudes of τ, using Destrieux (left; 164 ROIs) and MMP1.0 (right; 374 ROIs) parcellations. Filled circles indicate the mean identification rate, whereas error bars indicate the standard error of the mean across samplings with replacement (error bars are small enough that they are hidden behind the circles). Legend indicates the REST2 condition along with maximum available number of frames. Along each curve, the circle not filled indicates the optimal value of τ, which maximizes the identification rate. The insets in both plots are the scatter plots between REST1 and REST2 of the mean identification rates (across samplings) for the entire range of τ. Both x- and y-axes indicate identification rates and the dotted line is identity line. LR, left to right; REST, resting-state; RL, right to left. Color images are available online.
Discussion
In this work, we explored the effects of different magnitudes of regularization on geodesic distance and subsequently its impact on subject identification rates in FCs. We explored these effects for eight fMRI conditions from the HCP data: resting-state, emotion, gambling, language, motor, relational, social, and working memory. We found that the optimal value of the regularization parameter, which maximized the identification rates, is dependent on the condition, parcellation, scanning length, and the number of frames used to compute the FCs. In addition, the deviation from the optimal point could affect the identification rates drastically depending on the condition, scanning length, and/or the number of frames used. We also found that the magnitude of optimal regularization is generalizable across different subjects and different sessions of the same subjects, when the acquisition parameters are the same. In short, we found that geodesic distance, which has been shown to be a more accurate way of comparing FCs than canonical methods (Venkatesh et al., 2020), can be further refined by choosing an optimal regularization magnitude for each data set and fMRI condition.
Increased regularization reduces geodesic distance globally and alters relative distances between FCs
Geodesic distance is largely determined by the eigenvalues of the FCs being compared [Eq. (1)]. When those FCs are regularized by adding a constant value to their main diagonal, their eigenvalues increase by the same amount, thus affecting the geodesic distance between them. As the regularization magnitude increases, the eigenvalues of the FCs, and hence the geodesic distance between them, become dominated by it. Since the regularization value added to both FCs is always equal, for a large enough regularization magnitude, their eigenvalues also become approximately numerically equal, leading to a decreased geodesic distance. Intuitively, increasing main diagonal regularization is equivalent to shifting and shrinking the space occupied by the matrices within the manifold. Thus, as the regularization magnitude increases, it was expected that the geodesic distance between FCs would decrease, as observed in Figure 2A.
It was less intuitive that the relative magnitude of the distances would also change with regularization. As the regularization magnitude increased, the relative distance between FCs changed in different directions, as shown in Figure 2B with FC Bretest and Cretest. Furthermore, for an optimal value of regularization, the distances between sessions of the same subjects became smaller than between subjects, which led to better identification of the subjects when comparing the test and the retest sessions, as shown in the identifiability matrices of Figure 2C and D.
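The shrinking effect described above can be reproduced numerically. The sketch below assumes the standard affine-invariant geodesic distance between SPD matrices (the square root of the sum of squared log-eigenvalues of $Q_1^{-1} Q_2$), which we take to be the distance referred to as Eq. (1); the function names, the Cholesky-based computation, and the toy data (40 regions, 30 frames, so the unregularized FCs are rank-deficient) are assumptions for illustration only.

```python
import numpy as np

def regularize(fc, tau):
    """Add tau to the main diagonal, which shifts every eigenvalue up by tau."""
    return fc + tau * np.eye(fc.shape[0])

def geodesic_distance(q1, q2):
    """Affine-invariant geodesic distance between two SPD matrices."""
    inv_chol = np.linalg.inv(np.linalg.cholesky(q1))  # q1 = L @ L.T
    whitened = inv_chol @ q2 @ inv_chol.T             # same eigenvalues as inv(q1) @ q2
    whitened = 0.5 * (whitened + whitened.T)          # remove rounding asymmetry
    eigvals = np.linalg.eigvalsh(whitened)
    return float(np.sqrt(np.sum(np.log(eigvals) ** 2)))

# Hypothetical FCs from fewer frames than regions: rank-deficient before regularization.
rng = np.random.default_rng(0)
n_frames, n_regions = 30, 40
fc_a = np.corrcoef(rng.standard_normal((n_frames, n_regions)).T)
fc_b = np.corrcoef(rng.standard_normal((n_frames, n_regions)).T)
rank_a = np.linalg.matrix_rank(fc_a)  # at most n_frames - 1

taus = [0.01, 0.1, 1.0, 10.0]
distances = [geodesic_distance(regularize(fc_a, t), regularize(fc_b, t)) for t in taus]
```

In this toy setting the distance shrinks as τ grows, in line with the behavior reported for Figure 2A; the example says nothing about how the ordering of distances among many FCs changes, which is the less intuitive effect.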
Overall, we can think of increasing regularization as a nonlinear shrinking procedure, which does not preserve relative distances between FCs. By tracking the effects of regularization on three subjects, we demonstrated that the local distance information is not preserved for different magnitudes of regularization (Fig. 2B). This result must be taken into account when using geodesic distance to compare FCs. The question, then, is how to choose the magnitude of regularization. The answer lies in the implicit hypothesis that the FCs from two sessions of the same subject should be closer to each other than FCs from any session of any other subject. If we can find a regularization magnitude at which this statement holds for most subjects, then that is the spot where the distances between FCs are the most meaningful, if not accurate. This optimal spot can be discovered by tracking identification rates as they change with regularization, as was done in this study.
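The selection of the regularization magnitude by tracking identification rates can be sketched as a simple grid search. This is an illustrative outline rather than the procedure used in the study: the geodesic distance is the standard affine-invariant one, the τ grid and the synthetic subjects (a common mixing matrix plus a subject-specific perturbation) are hypothetical, and ties are resolved by taking the smallest τ that reaches the maximum.

```python
import numpy as np

def geodesic_distance(q1, q2):
    """Affine-invariant geodesic distance between two SPD matrices."""
    inv_chol = np.linalg.inv(np.linalg.cholesky(q1))
    whitened = inv_chol @ q2 @ inv_chol.T
    eigvals = np.linalg.eigvalsh(0.5 * (whitened + whitened.T))
    return float(np.sqrt(np.sum(np.log(eigvals) ** 2)))

def identification_rate(dist):
    """Two-way identification rate (percent) from a Retest-by-Test distance matrix."""
    truth = np.arange(dist.shape[0])
    return 100.0 * 0.5 * (np.mean(dist.argmin(axis=1) == truth)
                          + np.mean(dist.argmin(axis=0) == truth))

def rate_for_tau(test_fcs, retest_fcs, tau):
    """Identification rate after adding tau to the diagonal of every FC."""
    eye = np.eye(test_fcs[0].shape[0])
    dist = np.array([[geodesic_distance(r + tau * eye, t + tau * eye)
                      for t in test_fcs] for r in retest_fcs])
    return identification_rate(dist)

def select_tau(test_fcs, retest_fcs, taus):
    """Return the tau with the highest identification rate, plus all rates."""
    rates = [rate_for_tau(test_fcs, retest_fcs, tau) for tau in taus]
    return taus[int(np.argmax(rates))], rates

# Hypothetical data: 8 subjects, 20 regions, 60 frames per session.
rng = np.random.default_rng(0)
n_subjects, n_regions, n_frames = 8, 20, 60
common = rng.standard_normal((n_regions, n_regions))
test_fcs, retest_fcs = [], []
for _ in range(n_subjects):
    mixing = common + 0.5 * rng.standard_normal((n_regions, n_regions))
    for store in (test_fcs, retest_fcs):
        ts = rng.standard_normal((n_frames, n_regions)) @ mixing.T
        store.append(np.corrcoef(ts.T))

taus = [0.01, 0.1, 1.0, 10.0]
best_tau, rates = select_tau(test_fcs, retest_fcs, taus)
```

Because the criterion uses test/retest labels from the same data, the selected value is specific to that data set and condition, which is consistent with the recommendation to estimate it per data set.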
Identification rate is a concave function of the regularization parameter
We observed that for any condition and parcellation, there was a specific value or a range of values for the regularization parameter where identification rate peaked (Fig. 3). In other words, identification rate was a concave function of the regularization parameter for all fMRI conditions and parcellations tested here. We should emphasize that only a limited range of the regularization parameter was tested in this study, for specific conditions and parcellations, and thus, we cannot theoretically guarantee that the optimal levels of regularization found here could be trivially extrapolated to other data sets with different acquisition parameters. But, considering the breadth of the fMRI conditions and the size of the data set used in this study, we are confident that this concave behavior would be replicable in other fMRI data sets as well.
Optimal regularization parameter depends on the specific data set
We observed that the optimal value of the regularization parameter, which maximizes the identification rates, depends on the condition, parcellation, scanning length, and number of frames used to compute the FCs (Figs. 4 and 5 and Tables 2 and 3). Venkatesh and colleagues (2020) used a fixed regularization magnitude (
Longer scanning length leads to more specific values of optimal regularization and to higher identification rates
As the number of samples (or frames chosen sequentially), and hence the scanning length, increases in the time series data, the resultant correlations become more reliable (Bonett and Wright, 2000), and thus, we get better estimates of FCs in the “static” sense of functional connectivity. For all the tasks, we observed that as the scanning length increased, the range of values of τ which resulted in maximized identification rates narrowed down (Figs. 4 and 5). This effect was not as prominent in resting-state, where for most of the scanning lengths evaluated, there was a wide range of values of τ, which resulted in maximum identification rates. This suggests that resting-state FCs, in comparison to tasks, may reside in an intrinsically different region of the semidefinite cone where reallocation of FCs through regularization does not have a sizeable influence on their differentiability.
It should also be pointed out that with optimal values of τ, the optimal identification rates were almost always obtained when using the entire scanning length (two exceptions: resting-state and language using Destrieux parcellation; Tables 2 and 3). Even in the two cases where it was not, the optimal scanning length was marginally smaller than the entire scanning length and the optimal identification rate was approximately equal to the identification rate obtained with maximum scanning length (within margin of error). Intuitively, we can say that the longer the scan acquired, the more information we have about the condition and the subject, which results in higher identification rates.
Number of frames and TR length are not as influential as scanning length
For all conditions, across the two parcellations, when the scanning length was decreased, the identification rates dropped, sometimes drastically (Figs. 4 and 5). Ostensibly, it might seem that this does not hold for the resting-state condition, but it is worth noting that the resting-state scan is a considerably longer acquisition than all the tasks (14 min and 47 sec compared with the second longest, working memory, which is 4 min and 44 sec), and the effect of a shorter scanning length comes into play only when the reduced scanning length becomes comparable to that of the tasks (around 6–7 min). The decrease in identification rate with decreasing scanning length raises a natural question: what would happen if the scanning length is maintained but the number of frames is reduced?
The answer is that identification rates are considerably less sensitive to the number of frames than to the scanning length, provided the number of frames is not too small (Fig. 6). To obtain fewer frames while maintaining the scanning length, we chose alternate time points, with varying gaps, which introduced another variable into the mix: TR. For instance, by choosing every fourth sample from a time series, we are effectively increasing the TR fourfold. So, another conclusion that we could draw from this result is that identification rates are considerably less sensitive to TR length than to scanning length. This effect has been observed before by Horien and colleagues (2018), but using Pearson's correlation coefficient as the metric to compare FCs. This knowledge could be helpful in designing scanning protocols where often one has to “sacrifice” spatial resolution for temporal resolution or vice versa: as long as the scan is long enough, a relatively longer TR could perhaps be accepted in favor of improved spatial resolution, without any detrimental effects on the FC fingerprint.
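The two ways of reducing the number of frames can be contrasted in a short sketch. The TR value, the array sizes, and the function names are assumptions chosen for illustration, and the time series is random noise standing in for ROI data.

```python
import numpy as np


def subsample_frames(ts, step):
    """Keep every `step`-th frame: same scanning length, effective TR * step."""
    return ts[::step]


def truncate_frames(ts, n_frames):
    """Keep the first `n_frames` frames: same TR, shorter scanning length."""
    return ts[:n_frames]


tr = 0.72  # seconds; assumed value used only for this illustration
rng = np.random.default_rng(0)
ts = rng.standard_normal((400, 10))  # frames x ROIs, synthetic

step = 4
sub = subsample_frames(ts, step)           # 100 frames spread over the full scan
trunc = truncate_frames(ts, sub.shape[0])  # 100 frames from the start of the scan

sub_length = sub.shape[0] * tr * step      # scanning length covered, in seconds
trunc_length = trunc.shape[0] * tr

# Both FCs are computed from the same number of frames.
fc_sub = np.corrcoef(sub, rowvar=False)
fc_trunc = np.corrcoef(trunc, rowvar=False)

print(f"subsampled: {sub.shape[0]} frames, TR={tr * step:.2f}s, length={sub_length:.0f}s")
print(f"truncated:  {trunc.shape[0]} frames, TR={tr:.2f}s, length={trunc_length:.0f}s")
```

Both FCs use 100 frames, but only the subsampled one spans the full acquisition, which is the situation compared in Figure 6.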
Regularization counteracts the effect of a coarser grain parcellation on individual fingerprint
Using Pearson's correlation as a similarity metric to compare FCs, Finn and colleagues (2015) showed that a parcellation with more ROIs resulted in higher subject identification rates than a parcellation with fewer ROIs. Venkatesh and colleagues (2020) observed the same trend with both geodesic distance and Pearson's correlation-based dissimilarity. This suggested that finer parcellations lead to more uniqueness or fingerprint, at least up to a certain resolution. In this work, we found that a coarser resolution parcellation can achieve identification rates similar to those of a finer resolution parcellation when geodesic distance is applied with the optimal regularization magnitude.
When computing FCs, an ROI time series is computed by averaging voxel-level time series for all the voxels contained within the ROI. One of the main reasons this is done is to increase the signal-to-noise ratio of the time series under consideration, as the voxel-level time series would be much noisier than an averaged ROI time series. By choosing a finer resolution parcellation, we choose smaller ROIs and hence compromise on the signal-to-noise ratio of the time series in favor of spatial resolution, compared with a coarse resolution parcellation, where an ROI time series would be computed by averaging over a larger number of voxels. Since geodesic distance with optimal regularization can overcome the downside of a coarse resolution parcellation in terms of fingerprint, perhaps we can favor a relatively coarser parcellation for an improved signal-to-noise ratio while maintaining the individual fingerprint.
Generalizability of the optimal regularization magnitude
Very small differences were observed in the optimal identification rates and the optimal magnitudes of the regularization parameter when different subsamples of the data set were used for subject identification (Figs. 3–6). This highlights that the results for optimal regularization in this study are generalizable to other data sets, given that the scans are acquired with the same or similar parameters. If one is to change the acquisition parameters though, the optimal regularization magnitudes might be different. Using the two sessions from REST2 (not used in any of the former analyses), we were also able to show that the optimal regularization magnitudes and the corresponding identification rates are generalizable to different sessions of the same subjects, even when acquired on different days with the same parameters (Fig. 7). In addition, we observed that optimal identification rates are maintained for the same amount of regularization when the TR length is increased (to a certain extent), and the number of frames is decreased while maintaining the scanning length (Fig. 6). Overall, these findings suggest a generalization of these results to a considerable range of temporal resolution in the BOLD fMRI data.
Comparison with canonical metrics used to compare FCs
With all the canonical methods of comparing FCs (e.g., Pearson's correlation coefficient, Euclidean distance), only the elements in the upper or lower triangular part of the FC are selected and vectorized. This means that regularization has no effect on those metrics since the regularization magnitude is added to the main diagonal, which is ignored by all those metrics. It has already been shown by Venkatesh and colleagues (2020) that geodesic distance outperforms those metrics in uncovering individual fingerprint in FCs. They achieved this using a fixed nonoptimal regularization magnitude (
How to estimate the optimal regularization parameter and the resulting geodesic distances in a specific study
We have observed that the optimal regularization that leads to maximum identification rates is dependent on the fMRI condition, brain parcellation, scanning length, and the number of frames. There might be other aspects of the data that influence this optimal value as well, such as voxel size. Hence, our results suggest that when using geodesic distance to compare FCs, the regularization parameter must be estimated from the FC data of that study. Also, one should utilize sampling techniques to estimate a mean or median magnitude of regularization, along with the corresponding error. Once an appropriate regularization magnitude has been identified, one should regularize all FCs in the data set by that amount and then use geodesic distance to compare FCs. These steps have been tabulated for the benefit of the user of this framework (Table 4). It is important to remark that these resulting pairwise distances are better suited for establishing associations between functional connectivity and cognition, behavior, and neurological diseases at the individual level.
Table 4. A Step-by-Step Outline of How to Estimate and Apply an Optimal Regularization Magnitude (τ*) to a Functional Connectome Data Set, Such That Individual Fingerprint Is Maximized When Using Geodesic Distances to Compare Functional Connectomes
FC, functional connectome; fMRI, functional magnetic resonance imaging.
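The workflow described in the text (estimate the regularization magnitude on resampled subsets of subjects, summarize it with a median and an error, then regularize all FCs by that amount and compute geodesic distances) could be sketched as below. The resampling scheme (80% of subjects drawn without replacement), the grid of τ values, the simulated FCs, and every function name are assumptions made for illustration, not the exact procedure of this study.

```python
import numpy as np


def geodesic_distance(q1, q2):
    """Affine-invariant distance between two positive-definite matrices."""
    chol = np.linalg.cholesky(q1)
    whitened = np.linalg.solve(chol, np.linalg.solve(chol, q2).T)
    eigvals = np.linalg.eigvalsh((whitened + whitened.T) / 2)
    return float(np.sqrt(np.sum(np.log(eigvals) ** 2)))


def distance_matrix(test_fcs, retest_fcs, tau):
    """Pairwise geodesic distances after adding tau to every main diagonal."""
    eye = np.eye(test_fcs[0].shape[0])
    return np.array([[geodesic_distance(t + tau * eye, r + tau * eye)
                      for r in retest_fcs] for t in test_fcs])


def id_rate(dist):
    """Identification rate averaged over both matching directions."""
    idx = np.arange(dist.shape[0])
    return 0.5 * (np.mean(dist.argmin(axis=1) == idx)
                  + np.mean(dist.argmin(axis=0) == idx))


def estimate_tau_star(test_fcs, retest_fcs, taus, n_resamples=50, frac=0.8, seed=0):
    """Median and standard deviation of the best tau across subject subsamples."""
    rng = np.random.default_rng(seed)
    n = len(test_fcs)
    # Distances are computed once per tau; subsamples only index into them.
    dists = [distance_matrix(test_fcs, retest_fcs, tau) for tau in taus]
    size = max(2, int(round(frac * n)))
    picks = []
    for _ in range(n_resamples):
        sub = rng.choice(n, size=size, replace=False)
        rates = [id_rate(d[np.ix_(sub, sub)]) for d in dists]
        picks.append(taus[int(np.argmax(rates))])  # ties go to the smallest tau
    picks = np.asarray(picks)
    return float(np.median(picks)), float(np.std(picks))


# Simulated subjects (not real data): a shared mixing matrix per subject.
rng = np.random.default_rng(1)
n_subjects, n_rois, n_frames = 10, 12, 60


def simulate_fc(mixing):
    latent = rng.standard_normal((n_frames, mixing.shape[0]))
    ts = latent @ mixing + 0.5 * rng.standard_normal((n_frames, n_rois))
    return np.corrcoef(ts, rowvar=False)


mixings = [rng.standard_normal((4, n_rois)) for _ in range(n_subjects)]
test_fcs = [simulate_fc(m) for m in mixings]
retest_fcs = [simulate_fc(m) for m in mixings]

taus = [0.01, 0.1, 1.0, 10.0]  # hypothetical grid
tau_star, spread = estimate_tau_star(test_fcs, retest_fcs, taus)
final_dist = distance_matrix(test_fcs, retest_fcs, tau_star)  # distances at tau*
print(f"tau* = {tau_star} (std across resamples: {spread:.3f})")
```

Computing the distance matrices once per τ and indexing into them for each subsample keeps the resampling step inexpensive.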
This process of estimating an optimal regularization from the data themselves, and then applying it back to the same data might seem biased, but we should emphasize that the optimal regularization is estimated to maximize individual fingerprint in the data and nothing else. It is not optimized for any group differences or for any neuro/psychiatric or behavioral score. The only desired output is maximal interindividual differentiability so that the desired effects could be accurately captured at the individual level.
It might also seem desirable to have a constant value of regularization (say 0.1 or 1) that is applicable to all data sets, without any considerable negative effects. But as we have observed, deviations from optimal regularization magnitudes could have detrimental effects on the measured individual fingerprint depending on a variety of factors. Hence, it is always recommended to estimate an optimal regularization magnitude from the data themselves, especially considering that it is extremely easy and computationally efficient to estimate.
Limitations and future work
One limitation of the geodesic distance, whether applied to regularized or unregularized FCs, is that it only provides a single numeric distance estimate between FCs and hence does not allow element-wise (or edgewise) analyses of the FCs (i.e., analyses focused on a particular brain region or a specific functional coupling between two brain regions). This limitation can, however, be addressed by projecting FCs from the SPD manifold onto a tangent space of symmetric matrices, which is Euclidean and allows the use of Euclidean algebra and calculus (Pervaiz et al., 2020; You and Park, 2021). Future work should explore these projections and how they interact with regularization magnitude.
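One common form of such a projection is the whitened logarithmic map at a reference matrix, sketched below. The choice of reference point (here the Euclidean mean of the regularized FCs), the regularization value, and the function names are assumptions for illustration. The cited works discuss several alternatives, and this sketch is not drawn from them.

```python
import numpy as np


def _sym_apply(mat, func):
    """Apply func to the eigenvalues of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * func(vals)) @ vecs.T


def tangent_project(fc, ref):
    """Whitened log-map of fc at ref: log(ref^{-1/2} fc ref^{-1/2})."""
    ref_inv_sqrt = _sym_apply(ref, lambda v: 1.0 / np.sqrt(v))
    whitened = ref_inv_sqrt @ fc @ ref_inv_sqrt
    return _sym_apply((whitened + whitened.T) / 2, np.log)


# Synthetic FCs, regularized by a hypothetical tau so that they are invertible.
rng = np.random.default_rng(0)
tau = 0.1
fcs = [np.corrcoef(rng.standard_normal((100, 8)), rowvar=False) + tau * np.eye(8)
       for _ in range(5)]

ref = np.mean(fcs, axis=0)  # assumed reference point (Euclidean mean)
tangents = [tangent_project(fc, ref) for fc in fcs]

# Each tangent matrix is symmetric, so its upper triangle can be vectorized
# and analyzed edge by edge with ordinary Euclidean tools.
rows, cols = np.triu_indices(8)
edge_vectors = np.array([t[rows, cols] for t in tangents])
print(edge_vectors.shape)
```

The Frobenius norm of each projected matrix equals the geodesic distance between that FC and the reference, so the projection preserves distances to the reference point while exposing individual edges.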
One could also explore the effects of regularization on the identification rates when the test and retest sessions belong to different fMRI conditions (e.g., working memory vs. resting-state), analogous to Finn et al. (2015) and Venkatesh et al. (2020). To estimate the optimal amount of regularization based on functional connectivity fingerprint, one could go beyond test/retest of the same individual and assess identification rates when the test and retest sessions belong to twin pairs (monozygotic or dizygotic). Finally, we can compare this straightforward main diagonal regularization with other kinds of regularization techniques that include off-diagonal elements or add a variable amount to the elements of the main diagonal.
Conclusions
The use of the geodesic distance on full-rank or regularized rank-deficient FCs has been shown to be a more principled and accurate method to compare FCs than canonical methods, ultimately leading to improved subject fingerprinting, as measured by identification rates. Here, we combine geodesic distance with optimal regularization to uncover brain connectivity fingerprints by means of an incremental assessment of the magnitude of the regularization parameter. We show that the optimal regularization that maximizes subject identification rates is highly data set-dependent: it depends on the fMRI condition, the brain parcellation used, the scanning length, and the number of frames used to compute the FCs.
Footnotes
Acknowledgments
Data were provided [in part] by the Human Connectome Project, WU-Minn Consortium (Principal Investigators: David Van Essen and Kamil Ugurbil; 1U54MH091657) funded by the 16 NIH Institutes and Centers that support the NIH Blueprint for Neuroscience Research; and by the McDonnell Center for Systems Neuroscience at Washington University.
Previous versions of this article have appeared in a preprint posting on the Cornell University's
Author Disclosure Statement
No competing financial interests exist.
Funding Information
Joaquín Goñi acknowledges financial support from NIH R01EB022574, NIH R01MH108467, Indiana Alcohol Research Center P60AA07611, and Purdue Discovery Park Data Science Award “Fingerprints of the Human Brain: A Data Science Perspective.” Jaroslaw Harezlak has been partially supported by the grant NIH R01MH108467. Enrico Amico acknowledges financial support from the SNSF Ambizione project “Fingerprinting the brain: network science to extract features of cognition, behavior and dysfunction” (Grant No. PZ00P2_185716). Luiz Pessoa's research has been supported by the National Institute of Mental Health (R01 MH071589 and R01 MH112517).
