Abstract
Abstract
With the emergence of droplet-based technologies, it has now become possible to profile transcriptomes of several thousands of cells in a day. Although such a large single-cell cohort may favor the discovery of cellular heterogeneity, it also brings new challenges in the prediction of minority cell types. Identification of any minority cell type holds a special significance in knowledge discovery. In the analysis of single-cell expression data, the use of principal component analysis (PCA) is surprisingly frequent for dimension reduction. The principal directions obtained from PCA are usually dominated by the major cell types in the concerned tissue. Thus, it is very likely that using a traditional PCA may endanger the discovery of minority populations. To this end, we propose locality-sensitive PCA (LSPCA), a scalable variant of PCA equipped with structure-aware data sampling at its core. Structure-aware sampling provides PCA with a neutral spread of the data, thereby reducing the bias in its principal directions arising from the redundant samples in a data set. We benchmarked the performance of the proposed method on ten publicly available single-cell expression data sets including one very large annotated data set. Results have been compared with traditional PCA and PCA with random sampling. Clustering results on the annotated data sets also show that LSPCA can detect the minority populations with a higher accuracy.
1. Introduction
R
Principal component analysis (PCA) is widely used as a dimension reduction tool when investigating a transcriptome data set. Owing to the mechanics of PCA, for a sc-RNAseq data set with typically ∼25K genes and a few hundreds to thousands of samples, the principal directions obtained from a traditional PCA is likely to be biased by the larger groups in the data set. To put it simply, the top principal components are dominated by the sheer number of data points in larger groups, thereby superseding the effect of small subpopulations. This poses a major setback in the analysis of sc-RNAseq data sets.
Inspired by the benefits of sampling in imbalanced data sets (Chawla et al., 2002; Martinsson et al., 2011), we propose to improvise PCA by introducing a systematic down-sampling step before PCA. We identify this step as the structure-aware data sampling. To this end, we have built a scalable dimension reduction framework—locality-sensitive principal component analysis (LSPCA).
We evaluated LSPCA on multiple single-cell data sets including a large data set of ∼68K peripheral blood mononuclear cell (PBMC) transcriptomes, where the true cell type annotations were available. When evaluating the clustering outcome on the annotated data set using LSPCA, we obtained better adjusted Rand index (ARI) than classic PCA. As a noteworthy benefit of LSPCA, we also observed that LSPCA performs better in discovering the minority cell groups. Besides that, we validated on nine unannotated data sets, that the principal components resulting from LSPCA are more identical to classic PCA components than to the randomized variant of PCA (Martinsson et al., 2011). Details of the results on all data sets are discussed in Section 6.
2. Background
2.1. Locality-sensitive hashing
Locality-sensitive hashing (LSH) (Indyk et al., 1997; Pauleve et al., 2010; Leskovec et al., 2014) operates on a reduced dimension to find approximate nearest neighbors. LSH uses special locality-sensitive hash functions where the chances of similar objects hashing into a same bucket are more than the collision of dissimilar objects (Bawa et al., 2005).
Each hash function projects a sample point on one random hyperplane and returns a Boolean value. The Boolean value corresponds to a sign (0/1) depending on which side of the hyperplane the projected point lies. By concatenating the output from multiple such hash functions, a string of Boolean values is derived for the corresponding sample. Unique hash strings encode for bucket identifiers that represent boundaries of neighborhood called “regions.” During search, a bucket identifier of the query object is generated using the same set of hash functions. Only the members within a region, called “candidates,” are used to compute for actual similarity. The approximate nearest neighbors are searched within and around the given query object region, which reduces the computation time for n query points from
The performance of the LSH method depends on the choice of a few parameters such as the distance metric between neighbors, number of hash functions, and the number of hash tables. Choosing of these parameters is a nontrivial task that is addressed in an advanced self-tuning version of LSH known as LSHForest (Bawa et al., 2005). We have used the LSHForest for the sampling process in the proposed algorithm.
2.2. Principal component analysis
PCA (Karamizadeh et al., 2013) is a statistical technique that converts the observation points of correlated variables to linearly uncorrelated variables using orthogonal transformations. PCA is widely used as a dimension reduction technique (Li et al., 2008). Mathematically, PCA is defined as an orthogonal linear transformation of data into a new coordinate system by iteratively projecting instances on the direction of maximum variance (Jolliffe, 1986). PCA of a data matrix
Singular value decomposition (SVD) is a matrix factorization method providing a robust computational framework to compute the principal components accurately for a variety of data sets. It is the generalization of the eigen decomposition. It is used to obtain
The steps to transform the original matrix into the new space are enumerated hereunder.
Construct the covariance matrix,
Compute the eigenvalue matrix (
Compute
Transform
Note that the first principal component of
where
3. Data Preprocessing
Data sets containing raw unique molecular identifier (UMI) read counts were downloaded from multiple sources: (1) using the Bioconductor recount package by Collado-Torres et al. (2017), (2) from the www.support.10xgenomics.com website, and (3) the adult and fetal human brain scRNA-seq data set by Darmanis et al. (2015). Each data set was preprocessed using the steps outlined in Figure 1.
Cell and gene filtering: Raw count matrix was filtered for poor quality cells. Typically single-cell data sets have >20,000 genes. Genes having count >2 in at least four cells were kept for subsequent preprocessing.
Median normalization: The filtered data matrix was median normalized. The median normalization step then involved division of UMI counts in the filtered matrix by the total UMI counts in each cell. These scaled counts were multiplied by the median of the total UMI counts across cells (Zheng et al., 2017).
Gene selection: To select the top dispersed genes, 2000 most variable genes were selected based on their relative dispersion (variance/mean) with respect to the expected dispersion across genes with similar average expression (Macosko et al., 2015; Zheng et al., 2017).
Log normalization: The resulting matrix (of dimension n cells × 2000 genes) was further subjected to

Data preprocessing stages.
4. The LSPCA Algorithm
The LSPCA framework is carried out in three major steps: (1) LSH-based sampling, (2) computing principal components, and (3) post hoc projection of all data points onto the PCA space. The details of each step are described in this section. The flowchart of the whole process is outlined in Figure 2.

A pictorial view of the LSPCA algorithm. The solid circles represent individual samples of a data set. The numbers adjacent to each sample indicate the respective index. LSPCA, locality-sensitive principal component analysis.
4.1. LSH-based sampling
Input: Preprocessed data matrix,
LSHForest: In this step, the hash codes of the input data points are produced. Unique hash codes that depict local regions or neighborhood are then computed. The python sklearn implementation of LSHForest module is used. The LSHForest is applied on the preprocessed data set with n_estimators = 30 and rest of the parameters are set to their defaults.
K-NN graph: An approximate neighborhood graph is obtained from the LSH tables. It is followed by searching for the 5-nearest neighbors (K-NN) of each point. This involves computing the Euclidean distances between the query point and its candidate neighbors.
Sampling: Sampling is carried out in a “greedy” manner. Each data point is visited sequentially in the same order as it appears in the original data set. During each visit, its respective 5-NN are flagged and never visited again. In this way, a subset of samples is obtained. The residue is further downsampled by performing the visit step recursively. For example, in the 68K PBMC data set, after 5 such iterations, ∼2000 samples were retained.
4.2. Computing LSPCA rotation matrix
LSPCA uses the structure preserving samples,
4.3. Post hoc projection onto the PCA space
The obtained transformation matrix,
5. Data Description
Two types of single-cell data sets were used for the evaluation. The unannotated recount single-cell data sets comprised nine data sets sourced from the recount2 project (Collado-Torres et al., 2017). The second type contained two data sets with annotations. The summary of the data sets is provided in Table 1.
Recount is a multiexperiment resource of analysis-ready RNA-seq gene, and exon count data sets, recount2, are the later version of the recount project. This online resource consists of RNA-seq gene and exon counts along with their corresponding coverage bigWig files for 2041 different studies. All unannotated data sets used in our experiments were downloaded from (https://jhubiostatistics.shinyapps.io/recount).
Summary of the Data Sets Used in the Experiments
PBMC, peripheral blood mononuclear cell.
One of these data sets, GSE67835, as also annotated and originated from profiling of 466 single cells isolated from the adult and fetal human brain tissues. The mixture contained populations of oligodendrocyte progenitors (OPC), oligodendrocytes, astrocytes, microglia, neurons, endothelial cells, replicating neuronal progenitors, and quiescent newly born neurons.
The PBMC data set prepared by Zheng et al. (2017) was downloaded from (https://support.10xgenomics.com/single-cell-gene-expression/data sets. The data are sequenced on Illumina NextSeq 500 high output with 20,000 reads per cell.
6. Results and Discussion
LSPCA was compared with (1) a standard PCA on whole data set and (2) PCA on a subset of data obtained through random sampling (RSPCA). The random sampling PCA uses random sampling instead of LSH-based sampling. The number of random samples was kept equal to the number of samples obtained from LSH-based sampling. We included this method to contrast the effectiveness of LSH-based sampling against random sampling. The PCA module from the python sklearn implementation was used for the standard PCA with svd_solver = “full.”
Every raw data set was processed as per the steps as outlined in Figure 1. The preprocessed data sets were then subjected to the three feature extraction variants: PCA, LSPCA, and RSPCA. For the clustering experiments, k-means was applied to the top 50 principal components obtained from the respective PCA variant. The best k for each of the unannotated data sets was determined using two strategies, namely Bayesian Information Criterion (Bhat and Kumar, 2010), and silhouette scores (Rousseeuw, 1987). The k corresponding to the highest score was chosen. Both the strategies reported the same value of k for each data set. In contrast, the k for the annotated data sets was set according to their respective known number of cell types.
6.1. Clustering accuracy
The clustering performance by respective PCA methods was reported using the silhouette score for the unannotated data sets and the ARI for the annotated data sets. In addition, the clustering accuracy of the minor cell type populations of the annotated data sets was examined exclusively.
6.1.1. Silhouette score
The silhouette score (or index) is an internal validity measure based on similarity of points within a cluster and separation across other clusters (Rousseeuw, 1987). It is used for validation of consistency within clusters of data when data labels are not present. A high average silhouette score indicates that the objects within a cluster are tightly placed and the individual clusters are well separated. Figure 3 shows the silhouette score for nine single-cell recount data sets clustered on the top 50 components obtained from the respective PCA variants. The comparative performance shows that the proposed LSPCA method gives a better lower dimensional representation of original data sets than the alternative variants of PCA.

Silhouette index of the unannotated data sets. The values adjacent to the bubble indicate the number of clusters for which the maximum silhouette score could be attained. The x-axis indicates the silhouette score.
6.1.2. Adjusted Rand index
Adjusted Rand Index (ARI) metric is a suitable choice to validate obtained clusters when the data is annotated. The PBMC and the Human Brain Cell data sets were used to demonstrate the concordance of the obtained clusters with the biological annotations. While all the ∼68K PBMC cells were annotated, it must be noted that in the Human Brain Cell data set (GSE67835), the annotations of neurons and OPCs cell types were found to be ambiguous, Darmanis et al. (2015), and were therefore removed from our cluster validation analysis. With this restriction, 268 out of 466 human brain cells were retained featuring astrocytes, endothelial, fetal-quiescent, fetal-replicating, microglia and oligodendrocyte cells.
The value of ARI is close to 0 when the clustering prediction is random, whereas the ARI is 1 when the clustering is perfectly identical to the true labels (Yeung and Ruzzo, 2001). Figure 4 shows the clustering performance on the PBMC and Human Brain Cell data sets for the different PCA methods. It is observed that the ARI scores obtained from PCA with LSH-based sampling are higher than traditional PCA or PCA with random sampling.

Comparing the k-means clustering accuracy evaluated using ARI of three PCA variants on
6.1.3. Discovery of small clusters
The effectiveness of sampling is more pronounced when the performance of clustering is measured only for the minority groups. For this measurement, we selected only those transcriptomes that shared the annotated groups of size ≤5% in the entire data set. Figure 5A shows the ARI computed for the transcriptomes appearing in minor groups of the ∼68K PBMC data set for all the methods. For all the categories, more accurate predictions were made when LSPCA was used.

ARI to compare discordance of the predicted clusters discovered by the respective PCA variants with reference to the known annotations.
For the Human Brain data set (GSE67835), minor cell types were simulated by gradually reducing the number of cells for one selected cell type. To this end, three different data subsets were derived by varying the concentration of oligodendrocytes at 5% (5 out of 244 cells), 3% (8 out of 239 cells) and 1% (3 out of 234 cells) respectively. As expected, in all the cases LSPCA maximized the ARI scores (Figure 5B).
6.2. Correlation with conventional principal components
This experiment measures the linear association between the first five principal components of LSPCA with the respective components obtained using the traditional PCA. The linear association was measured using Pearson's correlation coefficient. All the data sets were preprocessed before any PCA as shown in Figure 1. The RSPCA was also used to compare the effect of sampling. The correlation between the components of both the methods versus the respective first five components of standard PCA was recorded as shown in Figure 6. It is observed that the top LSPCA components correlate more with the traditional PCA than RSPCA for all the data sets.

The box-plots show the distribution of Pearson's correlation coefficients between the respective components of a traditional PCA and LSPCA (dark) or RSPCA (light) across all the data sets. RSPCA, random sampling PCA.
6.3. Execution time
PCA took ∼1 hour on the PBMC data set compared with only ∼11 minutes by LSPCA. All experiments were carried out on a PC having an Intel Core i7-3770 3.40 GHz processor and 32 GB of RAM.
7. Conclusion
Overall, LSPCA is applicable wherever PCA is used, more specifically useful for large data sets like single-cell expression matrix. It is fast and produces components almost identical to the vanilla PCA components. The method provides flexibility to a user to adjust the number of samples to train the PCA with. It must be noted that LSPCA is not a new dimension reduction method, but when the data set is large, it assists PCA to work on a smaller subset whose members adequately represent the whole data set. Compared with random sampling, structure-aware sampling is a more effective way to sample from a large data set. LSPCA performs dimension reduction by operating on a subset of less redundant samples without significantly altering the performance of the classic PCA.
Software Availability
Footnotes
Author Disclosure Statement
The authors declare that no competing financial interests exist.
