Abstract
Compositional data analysis deals with situations where the relevant information is contained only in the ratios between the measured variables, and not in the reported values. This article focuses on high-dimensional compositional data (in the sense of hundreds or even thousands of variables), as they appear in chemometrics (e.g., mass spectral data), proteomics or genomics. The goal of this contribution is to perform a dimension reduction of such data, where the new directions should allow for interpretability. An approach named principal balances turned out to be successful for low dimensions. Here, the concept of sparse principal component analysis is proposed for constructing principal directions, the so-called sparse principal balances. They are sparse (contain many zeros), build an orthonormal basis in the sample space of the compositional data, are efficient for dimension reduction and are applicable to high-dimensional data.
Introduction
Dimension reduction is often the first step for analyzing multivariate data. The representation of the data in a lower dimensional space can be used in an exploratory way, e.g., to get an impression of the data structure, also as a precursor to other statistical methods, like regression or discriminant analysis. The latter is important in particular for high-dimensional data, where the number of variables is higher than the number of observations. The most meaningful method in terms of dimension reduction is principal component analysis (PCA), since it is the goal of PCA to maximize the explained variance. However, PCA cannot directly be applied to compositional data (CoDa) because they live in a different geometry.
A composition x(x1, …, xD)
t
consists of D > 1 strictly positive components, the so-called compositional parts, and it is represented in the D-part simplex
The constant l can be chosen arbitrarily (e.g., 100 in case of percentages), without changing the multivariate information (Aitchison, 1986; Egozcue, 2009; Pawlowsky-Glahn and Buccianti, 2011). The simplex sample space is a (D – 1)-dimensional subset of the D-dimensional real Euclidean space. Typical examples for compositional data are income components, chemical element concentrations in some material or spectral data where only the intensity is informative and not the absolute number of the intensity (e.g., data from mass spectrometry). The latter example is a case for a high-dimensional composition, i.e., where D is in the order of hundreds.
PCA for compositional data has been studied in many papers, starting with Aitchison (1983, 1984). The main idea is to transform the compositional data appropriately to the real Euclidean space ℝ
D
, before PCA is carried out. For this purpose, the centred logratio (clr) transformation turned out to be useful (Aitchison and Greenacre, 2002). The clr transformation is a transformation from
D
to ℝ
D
, and the results for an observation x ∈
D
are the transformed data y ∈ ℝ
D
with
This transformation results, however, in singularity, because
The coordinates of ilr-transformed data, or, equivalently, the PCA scores of clr-transformed data are in general difficult to interpret. PCs are linear combinations of ilr- (or clr-)variables, but in general the information of the original compositional parts is contained in not one, but several of the transformed variables. For example, each clr-variable contains information of all compositional parts due to the geometric mean in the denominator (see Equation (1.1)). This problem also remains when using the ilr transformation proposed in Egozcue et al. (2003). It can be circumvented by the approach of Egozcue and Pawlowsky-Glahn (2005), where the so-called balances are constructed, representing coordinates of an orthonormal basis in the simplex, which describe all the relative information of groups of compositional parts. If prior knowledge is available, balances can summarize the information of groups of meaningfully related components. Otherwise, orthonormal balances can be easily constructed by a procedure called sequential binary partitioning (Egozcue and Pawlowsky-Glahn, 2005).
Unlike PCs, balances are not constructed in order to maximize the explained variance. The idea of principal balances (PBs) (Pawlowsky-Glahn et al., 2011) is an attempt to satisfy both needs: to form an orthonormal basis on the simplex, and to maximize the explained variance with the constraint of allowing for a simpler interpretability. In other words, PBs are balances, describing all the relative information of groups of parts, but they should be close to the PCs of clr-transformed data in order to express much of the data information.
Pawlowsky-Glahn et al. (2011) suggested three algorithms to construct PBs. Two of them are computationally intensive and are thus limited to low dimension. The third is based on hierarchical clustering (HC); it is fast to compute also for high-dimensional data, but the explained variance of the first few PBs can be quite low. In this article, we propose a fourth algorithm based on sparse PCA (Witten et al., 2009) for computing interpretable orthonormal directions with high explained variance. The resulting balances are called sparse principal balances (SPBs).
The article is organized as follows: Section 2 reviews the algorithms of Pawlowsky-Glahn et al. (2011) to construct PBs. Section 3 introduces an algorithm for con-structing sparse principal balances. Simulation studies in Section 4 compare the performance of the four different algorithms, and Section 5 presents an application with metabolomics data. The final Section 6 concludes with a discussion about the main features of the proposed algorithm and an outlook for future work.
Denote by x+ and x– subcompositions of two disjoint groups of parts from a D-part composition x(x1, …., xD)
t
, where the number of parts in x+ and x– is r and s, respectively, with r + s ≤ D. Further, denote by gm(.) the geometric mean of the arguments. Then a balance is defined as
(Egozcue and Pawlowsky-Glahn, 2005), and represents a coordinate of the composition with respect to a (compositional) basis vector e of unit length, formed in the clr-space. This basis vector e, also called balancing element, is defined in the clr-space as clr(e)(a1, …, aD)
t
, with components
for i1, …, D. The possibility of including just disjoint subgroups of compositional parts into the construction of balances will be useful later on for achieving sparsity.
Orthonormal balances can be defined using a sequential binary partition (Egozcue and Pawlowsky-Glahn, 2005). The basic idea is to recursively partition each of the groups x+ and x– in two further disjoint groups and to compute the balance and the corresponding balancing element according to Equations (2.1) and (2.2), until a complete set of D – 1 orthonormal basis vectors has been formed. In each step of the procedure, there are many possible disjoint partitions: with increasing dimension D, the number of binary partitions grows exponentially (Pawlowsky-Glahn et al., 2011). The resulting orthonormal bases are just rotations of each other.
With PBs, one wants to identify a set of D – 1 orthonormal balances which successively maximize the explained variance. An exhaustive search for PBs is only feasible if D is very small. Pawlowsky-Glahn et al. (2011), thus, introduced the following three approximate algorithms to compute PBs. Some of these algorithms use the resulting PCs of the clr-transformed data; these PCs will be called CoDa-PCs (CoDa stands for Compositional Data):
(angular proximity to principal components): In the first step of this recursive algorithm, all possible binary partitions of the full D-part composition are created. The balancing element with the smallest geometric angle with one of the CoDa-PCs is stored and removed from the set of possible directions. The procedure is then applied to each group of the previously identified balance separately, where the geometric angle to one of the remaining CoDa-PCs is minimized. This is repeated step-by-step, until D – 1 balances are extracted, i.e., until a complete sequential binary partition is achieved.
(hierarchical clustering of components): The set of orthonormal balances is constructed by hierarchical cluster analysis. The variance of the balance between two groups (2.1) is used as a criterion to link two clusters. It turns out that this corresponds to a Ward clustering (Everitt, 1993) based on the variation matrix tijVar[log(xi/xj)], i, j1, …, D, where ‘Var’ denotes the variance (Aitchison, 1986).
(maximum explained variance hierarchical balances): This sequential algorithm starts with the first CoDa-PC, and uses two groups with the signs of the loadings for constructing a balance. Let r denote the number of positive signs and s the number of negative signs. Then it is checked whether a change of one positive sign to the other group increases the explained variance. This check is also carried out for all combinations of 2, …, r – 1 positive signs. The balance with the maximum explained variance is stored, a new CoDa-PCA is performed with the larger group and so on.
For more details on the algorithms, we refer readers to Pawlowsky-Glahn et al. (2011). The computation time of AP and MV explodes quickly with increasing dimension because of the exponentially growing number of possible combinations that are used as candidates. Creating all possible combinations also leads to memory allocation problems for larger D. The HC algorithm is just based on a D D distance matrix, which is unproblematic even for larger dimension.
Sparse principal balances (SPBs) can be defined as balances, according to equation (2.1), that make a trade-off between maximizing explained variance and the number of involved components
Suppose we have given n compositions collected into the data matrix X of dimension n D. The clr-transformed matrix is denoted by Y, and it has the same dimension. The rank-k approximation of Y is
where dl (scalar), ul (vector of length n) and vl (vector of length D) minimize the squared Frobenius norm of Y –
where
For k > 1, Witten et al. (2009) propose the following approximation procedure:
Let Y1 ∈ Y. For Find ul, vl, and dl by applying the above rank-1 approximation to the data Yl.
Without the L1 constraints, this procedure leads to a rank-k singular value decomposition of Y with orthogonal vectors, i.e.,
SPBs can now be constructed as follows. As with the algorithms in Section 2, we start with a sparse PCA of the clr-transformed data matrix Y, by extracting a fixed number k (1≤k < D) of components. The resulting sparse loading matrix is denoted by the D k matrix V = [vij]. Usually, k will be small (e.g., k≤5), since we are only interested in a few balances that explain an essential part of the variability. The resulting loadings, however, need to be modified for SPBs, because for a simplification of the interpretation we want to ensure that different balances relate to different parts. In other words, we want to force a disjoint pattern of non-zero entries in different balances. The proposed algorithm is as follows:
Force a disjoint non-zero pattern: For Find the smallest j for which Denote by While Identify the entry (i, j) of V with the smallest positive absolute value, and set this vij to zero. Update As a result, we have at least as many zero-lines in V as ‘problematic’ columns, i.e., columns without positive and negative sign. Guarantee positive and negative sign in each column of V:
Repeat for all For any Repeat for all For any The resulting columns of the (modified) matrix V are in general not formed by vectors in the clr-space, i.e., their elements do not sum up to zero. In order not to change the zero elements of V, we project the dl≤D non-zero elements of each column vl, denoted as As a final modification of V, the nearest balances (with respect to the Euclidean distance) to the resulting clr vectors are constructed (Egozcue and Pawlowsky-Glahn, 2005). For this purpose, we simply compute the arithmetic mean from the positive/negative entries of wl, and by rescaling to unit norm the balancing element is obtained, see Equation (2.1). The set of extracted balances is orthogonal by construction.
Execution time comparisons on a standard computer show immediately, which of the four algorithms is feasible for the computation with high-dimensional data. Figure 1 shows the average computation time (average over 10 trials) for computing the first balance with the different algorithms. The number of observations is always n100, but the number of parts D varies from 4 to 50 (left plot). The computation times for the algorithms AP and MV explode for quite moderate values of D, and thus these algorithms will not be considered later on for the simulations in high-dimensional settings. On the other hand, the time for HC and SPB is still very low. In the right plot, the same setting is used, but D is taken much higher (100, 500, 1000, 2000). Here we only compare HC and SPB which are still feasible to compute. Although the computation time for HC is still quite moderate, the one for an SPB is far below in all runs.

In this section, we compare the methods HC and SPB which are still feasible to compute for high-dimensional compositional data. By simulating data of varying dimension, we focus on the explained variance of the resulting balances, and compare with the explained variance achieved by CoDa-PCA. Also, the resulting sparseness will be of interest in the evaluation.
The simulation setting is as follows:
A matrix Lilr of size (D – 1) (D – 1) is generated with uniformly distributed values in [–1, 1]. The columns are then standardized to unit length vectors (loading matrix in ilr-space). A matrix Zilr of size n (D – 1) is generated according to a (D – 1)-dimensional normal distribution with mean vector (0, …, 0)
t
and a diagonal covariance matrix CZdiag(0.91, 0.92, …, 0.9
k
, 0.01, …, 0.01). So, the PCA scores are uncorrelated and their variances are decreasing exponentially. The data matrix is reconstructed in the ilr-space by Xilr is transformed to the simplex by the inverse ilr transformation
Here, The clr transformation (Equation (1.1)) is applied to Xs, resulting in Xclr.
CoDa-PCA is applied to Xclr. Note that the total variance corresponds to the sum of the diagonal elements of CZ. HC is applied to Xs and SPB is applied to Xclr. In all simulation runs, we fix the number of observations with n100 and the number of replications with 1000.
Simulations extracting k5 components
The number of extracted PCs is k5, and the data dimension is taken as

Figure 2 compares the results for the cumulative explained variances of the three methods. Each boxplot summarizes the results of the 1000 simulations. In lower dimension (D50 and D100) the methods SPB and HC result in a quite similar explained variance, but with increasing dimension, SPB clearly gets better. There is still a clear gap to the explained variance of CoDa-PCA, but remember that SPBs are based on sparse PCA which forms a compromise between explained variance and sparsity of the loading matrix.
The method SPB results in balances which involve non-overlapping groups of variables. The balances thus have a certain number of non-zeros, and Figure 3 displays for the above simulation results the proportion of non-zeros in each balance, cumulated by subsequent balances. This information is useful for evaluating the level of sparseness of the algorithm. For example, for D50 (upper left plot), about 30% of the entries of the first balance are non-zero, for the second balance we obtain slightly less than 30% non-zeros, which gives in total somewhat less than 60% of non-zeros for the first two balances. With the first five balances, about 95% of the variables have entered the SPBs. Also in higher dimension, these percentages are comparable. Note that the data are not simulated with inherent sparseness structure.


Here, we are only interested in the first k2 components. Similar as before, the tuning parameters for SPB are selected in order to maximize the explained variance for k2. Now, we divide the cumulative explained variances for k2 obtained from the SPB by those from HC after each simulation run. The results are shown by boxplots in Figure 4. It turns out that only for D10, HC leads to a higher median value, for D50 and D100 the method SPB achieves higher values in the vast majority of runs, and for even higher values of D, SPB is clearly improving over HC.
Application to metabolomics data
The data set considered here contains NMR metabolomic spectra from urine samples of 18 mice, each belonging to one of two treatment groups. Each spectrum has 189 spectral bins, and they are measured in parts per million (ppm). This already clearly emphasizes the compositional nature of the data. The data set is described in detail in Nyamundanda et al. (2010), and it is available in the R package MetabolAnalyze as data set UrineSpectra (Nyamundanda et al., 2012).
Our goal is a two-dimensional representation of the spectral information. We compare the CoDa-PCA method with the hierarchical clustering (HC) and the sparse principal balances (SPB) approaches. Table 1 shows the explained variances for the three methods when using two components (balances). There is of course an information loss compared to the clr-based CoDa-PCA approach. The explained variance of the first SPB component is much more than that of HC, and the cumulative explained variance of HC and SPB for two components is comparable, although SPB aims at a simpler interpretation than HC.
Cumulative explained variances for CoDa-PCA, hierarchical clustering (HC) and sparse principal balances (SPB) for the urine data set.
Cumulative explained variances for CoDa-PCA, hierarchical clustering (HC) and sparse principal balances (SPB) for the urine data set.
Figure 5 shows the original spectral data, overlaid with the information of the signs of the first two balances from HC. Specifically, the vertical lines refer to the information of the sign: if the sign of the balance for a part is positive, we see a light grey vertical line (red, in the online-version) at the corresponding position in the plot, and if the sign is negative, the vertical line is shown in dark grey (blue). It can be seen that for the first balance, there are few negative signs, and all remaining signs are positive. For the second balance we have the reverse behaviour, but here we also get zeros at positions where the signs of the first balance are negative (no vertical lines). Overall, an interpretation of the balances is not obvious.


In comparison to Figure 5 for HC, Figure 6 shows the information of the first two balances from SPB. One can see that only few non-zero entries exist (light and dark grey lines), allowing for a severe simplification of the interpretation. The upper plot for the first balance shows non-zero entries essentially at the peaks of the signal. Non-zeros in the second balance are at smaller peaks or in ranges of larger variability of the signals.
Finally, Figure 7 shows the projection of the data in the space of the first two PCs (balances). More specifically, the left plot shows the first two PCA scores of CoDa-PCA, the middle plot shows the first two coordinates derived from HC and the right plot presents the first two coordinates from SPB. The plot symbols are according to the grouping information (control and treatment group). The groups show a clear pattern in all three plots, which means that the between-group variability is reasonably high in order to be visible in the first two main directions. For CoDa-PCA, the group separation is visible only along the first PC, while for HC and SPB, both components are informative for distinguishing the groups. In particular for SPB, the groups have an easier interpretation, since only very few variables contribute to the resulting components, see Figure 6.

In many applied fields like analytical chemistry, metabolomics or genomics, the experimental measurement processes yield high-dimensional data where the reported data values are not of direct interest, but rather the ratios between the variables. This type of data is called compositional data, and the logratio approach for their analysis allows for a more reliable insight into the real data structure. The main aim of this article is to show that the compositional data analysis method of principal balances leads to reasonable and easily interpretable results, if sparse principal component analysis is employed for dimensionality reduction. As a result of this, the necessity of working in the clr-space, leading to keep the zero-sum constraint and, thus, to limited possibilities of data manipulation, affected also the resulting algorithm. Despite that, both simulations and the real data example suggest that sparsity provides a good trade-off between maximization of the explained total variance and interpretability.
Especially for spectral data, where the order of the variables has a meaning, this work could be extended in the direction of sparse fused PCA (Guo et al., 2010). This method yields non-zero loadings for neighbouring spectra, resulting in an even clearer interpretation. Also in the context of compositional data, non-zero loadings for neighbouring spectra would lead to the interpretation that a whole range of spectra, related to an inherent property of the measurement, is relevant for a sparse principal balance.
Another possible extension is to use a robust sparse PCA method for the clr-transformed data as a starting point for the SPB algorithm. Robust sparse principal components are less sensitive to data outliers (Croux et al., 2013).
In conclusion, we feel that sparsity constraints on compositional data analysis methods offer a practical way to gain insight into high-dimensional compositions.
Footnotes
Acknowledgements
The paper is supported by the Operational Program Education for Competitiveness-European Social Fund (project CZ.1.07/2.3.00/20.0170 of the Ministry of Education, Youth and Sports of the Czech Republic) and the grant IGA_PrF_2014_028 Mathematical Models of the Internal Grant Agency of the Palacký University in Olomouc.
