Penalized Supervised Star Plots: Example Application in Influenza-Specific CD4+ T Cells

Abstract

An immune cell's phenotype is expressed through its high-dimensional marker signature. Cluster analyses of data from high-throughput mass and flow cytometry marker panels permit discovery of previously undescribed immune cell phenotypes. Impactful reporting of new phenotypes demands low-dimensional visualization tools that faithfully preserve the phenotypes' original high-dimensional structure. For this purpose, we introduce penalized supervised star plots. As designed and as we demonstrate, penalized supervised star plots are two-dimensional projections that tend to preserve separation of clusters as well as information on the relative contributions of various markers in differentiating phenotypes. The new method is robust to markers that do not differentiate phenotypes at all, as shown in a challenge data set. Results include comparison with other popular procedures. Penalized supervised star plots incorporate cross-validation to permit portability of estimated optimal projections to new samples. Supervised star plots are further illustrated with a featured influenza-specific T cell data set as well as a peripheral blood mononuclear cell phenotyping data set.

Introduction

Mass and flow cytometry data are becoming increasingly high dimensional, with quantities of markers ranging from 10 to 40 or more. Because an immune cell's phenotype is quantifiable through its cytometry marker signature, cluster analyses [sensu (10)] of high-dimensional, high-throughput mass or flow cytometry marker panels permit discovery of previously undescribed immune cell phenotypes. When coupled with viral antigen stimulation experiments, discovery of new virus-specific immune cell phenotypes becomes possible.

While any such newly discovered phenotypes can be described in words or in row-and-column tabular format, impactful reporting of new phenotypes demands visualization tools that immediately, clearly, and intuitively convey salient marker signatures. A particularly intuitive approach treats each distinct phenotype as a cohesive “cloud” of cells occupying a distinct local neighborhood within the original high-dimensional space, in which each constituent dimension is a separate marker that helps to define the phenotype. The challenge is to display this high-dimensional cloud in low-dimensional space (2D or 3D) and yet retain this intuitive visual summary of marker expression. Various information-preserving low-dimensional projections have been developed, including principal component analysis (PCA) (24), self-organizing maps (17), t-distributed stochastic neighbor embedding (tSNE) (20), radial visualization (Radviz) (12), discriminant analysis of principal components (DAPC) (14), and star coordinates (15). PCA, self-organizing maps, Van Long and Linsen's star plot method (33), and tSNE are all unsupervised techniques that attempt to portray high-dimensional clusters with fidelity in two or three dimensions. In stark contrast, the approach presented here is fully supervised. Penalized supervised star plots (PSSs) capitalize on existing estimates of high-dimensional cluster structure to optimize their display and marker information in two dimensions. Supervision has the important advantage that it allows the user complete freedom for estimating an optimal cluster solution (or multiple solutions by different clustering methods if comparison is of interest) in the original high-dimensional marker space. Then, having identified that solution, the user can portray (project) that solution with fidelity into two dimensions using PSSs. 
The PSS projection also retains information, via a penalty term, on how different markers contribute to separation of the high-dimensional clusters, scaled by within-group dispersion (23). This designed property of preservation of original high-dimensional information within two dimensions is crucial to accurate biological interpretation. The proposed method includes an additional penalty term to diminish markers that do not contribute to cluster structure, as demonstrated in this study.

Clustering is an unsupervised procedure that seeks to estimate the true quantity of clusters and their composition, as defined by the input markers. If clusters are provided as “known” (i.e., input), classification procedures (such as linear discriminant analysis [LDA]) can be applied to estimate those boundaries that optimally separate the clusters in the space of the input markers. As such, the goals of clustering and classification procedures extend far beyond two-dimensional (2D) visualization. For example, LDA is a general classification procedure designed to estimate boundaries that separate clusters (classes) in two (boundaries are lines) or more dimensions (boundaries are hyperplanes) (9). PSS is not a clustering procedure but rather a specialized 2D visual classifier of clustering results. PSS has two specialized objectives: (1) strictly 2D maximal visual separation of high-dimensional preexisting clusters (2) overlain with 2D marker-specific vectors of lengths and directions that facilitate biological interpretation of the preexisting high-dimensional cluster solution. One can constrain other general utility classifiers to “force” a 2D classification (as shown in the Results section and Supplementary Data); but PSS goes further and (via two penalty terms) preserves and displays biologically interpretable information in two dimensions for each of the preexisting clusters and their markers. In this study, we compare PSS to the popular method of DAPC (14).

Materials and Methods

Proposed star plot algorithm

Let $\mathbf{R}$ be the $n \times p$ matrix of readouts on n observations and $i = 1, \ldots, p$ markers; and let $\mathbf{Q}$ be a $p \times 2$ projection matrix into the 2D real space $\mathbb{R}^2$. Define an $n \times 2$ supervision matrix $\mathbf{Y}$, such that $\mathbf{Y} = \mathbf{RQ} + \mathbf{E}$, where $\mathbf{E}$ is an $n \times 2$ residual matrix that contains the signed departures of the projection $\mathbf{RQ}$ from supervisor $\mathbf{Y}$. The ith row of $\mathbf{Q}$ represents the contribution of the ith marker to the projection toward the supervisor $\mathbf{Y}$. Suppose further that the n observations have been previously exhaustively and mutually exclusively partitioned within $\mathbb{R}^p$ into $\kappa$ clusters, indexed by $c \in \{1, 2, \ldots, \kappa\}$, $3 \le \kappa$, where $n = \sum_c n_c$, so that, on expanding,

$$\begin{bmatrix} \mathbf{Y}_1 \\ \vdots \\ \mathbf{Y}_\kappa \end{bmatrix} = \begin{bmatrix} \mathbf{R}_1 \\ \vdots \\ \mathbf{R}_\kappa \end{bmatrix} \mathbf{Q} + \begin{bmatrix} \mathbf{E}_1 \\ \vdots \\ \mathbf{E}_\kappa \end{bmatrix}.$$

The rows in $\mathbf{Q}$ are ordered top to bottom and columns in $\mathbf{R}$ are ordered left to right by marker in nonincreasing sequence from highest to lowest “importance” values $r_1 \ge r_2 \ge \ldots \ge r_p$, where $r_i$, $i \in \{1, \ldots, p\}$, is the ratio of among-cluster to within-cluster sample variances for the ith marker, after centering and scaling. By design, the rows of each $\mathbf{Y}_c$ are $n_c$ identical repeats of the same supervision direction vector $\mathbf{y}_c$. To maximize separation of clusters in $\mathbb{R}^2$, supervisor $\mathbf{Y}$ is structured such that

$$\mathbf{y}_{c+1} = \mathbf{y}_c \begin{bmatrix} \cos(2\pi/\kappa) & -\sin(2\pi/\kappa) \\ \sin(2\pi/\kappa) & \cos(2\pi/\kappa) \end{bmatrix}$$

with convention $\mathbf{y}_1 = \begin{bmatrix} 0 & 1 \end{bmatrix}$; and, without loss of generality, $\mathbf{y}_c' \mathbf{y}_c = 1 \;\forall\; c \in \{1, 2, \ldots, \kappa\}$ (15). To avoid complications of differential scaling, we column center using sample means and scale using sample standard deviations per $\mathbf{S}_Y^{-1}(\mathbf{Y} - \overline{\mathbf{Y}}) = \mathbf{S}_R^{-1}(\mathbf{R} - \overline{\mathbf{R}})\mathbf{Q} + \mathbf{E}$, where all off-diagonal elements of $\mathbf{S}_Y^{-1}$ and $\mathbf{S}_R^{-1}$ are zero.
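The construction of the supervisor can be illustrated with a minimal sketch. It assumes only the rotation rule and the starting convention (first direction equal to [0, 1]) stated above; the function names `supervisor_directions` and `supervision_matrix` are hypothetical and are not taken from the authors' R script, and the centering and scaling step is omitted.

```python
import math


def supervisor_directions(kappa):
    """Return kappa unit row vectors y_1, ..., y_kappa.

    y_1 = [0, 1]; each later vector is the previous one right-multiplied
    by the 2 x 2 rotation matrix with angle 2*pi/kappa.
    """
    if kappa < 3:
        raise ValueError("the text requires at least three clusters")
    theta = 2.0 * math.pi / kappa
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    rotation = [[cos_t, -sin_t], [sin_t, cos_t]]
    y = [0.0, 1.0]
    directions = [y]
    for _ in range(kappa - 1):
        # Row vector times matrix: y_new[j] = sum_k y[k] * rotation[k][j].
        y = [y[0] * rotation[0][j] + y[1] * rotation[1][j] for j in range(2)]
        directions.append(y)
    return directions


def supervision_matrix(labels, kappa):
    """Stack the direction of each observation's cluster (labels in 1..kappa)."""
    directions = supervisor_directions(kappa)
    return [list(directions[c - 1]) for c in labels]


if __name__ == "__main__":
    # Hypothetical cluster labels for six observations in three clusters.
    for row in supervision_matrix([1, 1, 2, 2, 3, 3], kappa=3):
        print(row)
```

Because the directions are equally spaced on the unit circle, they sum to the zero vector, which is one way to check the construction.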

Estimation proceeds via

$$\mathop{\mathrm{argmin}}\limits_{\{\mathbf{Q}\}} \left\{ \frac{\|\mathbf{E}\|_{\mathrm{F}}}{n_{\mathrm{t}}} + \lambda_1 \frac{\mathrm{tr}(\mathbf{Q}'\mathbf{Q})}{p} + \frac{\lambda_2}{p - 1} \sum_{i=2}^{p} \sum_{j=1}^{2} \left( q_{i,j} - q_{i-1,j} \right)^2 \right\} \mathop{\to}\limits^{\mathrm{CV}} \hat{\mathbf{Q}}, \tag{1}$$

for selected real scalars $\lambda_1$ and $\lambda_2$; and $\|\mathbf{E}\|_{\mathrm{F}}$ denotes the Frobenius norm of $\mathbf{E}$. Term $\lambda_1 \mathrm{tr}(\mathbf{Q}'\mathbf{Q})/p$ facilitates strongest shrinkage toward the origin of those marker rows in $\mathbf{Q}$ that contribute least to prediction of supervisor $\mathbf{Y}$. The second “fused” [sensu (32)] penalty term is designed to sustain ordering of importance values $r_1 \ge r_2 \ge \ldots \ge r_p$ from $\mathbb{R}^p$ when mapped into the star plot structure within $\mathbb{R}^2$, in terms of the respective lengths of the markers' coefficient vectors. To help make optimal estimate $\hat{\mathbf{Q}}$ portable to new samples, the procedure repeatedly (16) (i.e., for each new grid pair $\lambda_1$, $\lambda_2$) pseudorandomly allocates 60% of the total sample of size $n = n_{\mathrm{t}} + n_{\mathrm{v}}$ observations to training ($n_{\mathrm{t}}$ observations) and 40% to validation ($n_{\mathrm{v}}$ observations), thereby performing twofold cross-validation (CV), and selects those $\lambda_1 = \lambda_1^*$ and $\lambda_2 = \lambda_2^*$ that minimize estimated $\|\mathbf{E}\|_{\mathrm{F}} / n_{\mathrm{v}}$ in the validation sample.
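To make the three terms of Equation (1) and the 60%/40% allocation concrete, the following is a minimal sketch in plain Python. It only evaluates the criterion for a given projection matrix and splits the row indices; it does not perform the BFGS minimization. The function names, the seed argument, and the list-of-lists matrix representation are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def project(R, Q):
    """Matrix product RQ for R (n x p) and Q (p x 2), as lists of rows."""
    return [[sum(r[k] * Q[k][j] for k in range(len(Q))) for j in range(2)]
            for r in R]


def scaled_residual_norm(R, Y, Q):
    """Frobenius norm of E = Y - RQ, divided by the number of rows."""
    RQ = project(R, Q)
    squared = sum((Y[i][j] - RQ[i][j]) ** 2
                  for i in range(len(R)) for j in range(2))
    return math.sqrt(squared) / len(R)


def pss_objective(R_train, Y_train, Q, lam1, lam2):
    """Evaluate the penalized criterion of Equation (1) on training rows."""
    p = len(Q)  # number of markers; the fused term needs p >= 2
    ridge = sum(q * q for row in Q for q in row) / p  # tr(Q'Q) / p
    fused = sum((Q[i][j] - Q[i - 1][j]) ** 2
                for i in range(1, p) for j in range(2)) / (p - 1)
    return scaled_residual_norm(R_train, Y_train, Q) + lam1 * ridge + lam2 * fused


def split_train_validation(n, train_fraction=0.6, seed=0):
    """Pseudorandomly allocate row indices to training and validation."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = round(train_fraction * n)
    return sorted(indices[:n_train]), sorted(indices[n_train:])
```

In a full procedure, one would minimize `pss_objective` over Q on the training rows for each grid pair of penalty weights, then compare `scaled_residual_norm` on the validation rows to choose the pair.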

CyTOF experiments

For the peripheral blood mononuclear cell (PBMC) phenotyping data set, a cryopreserved healthy donor PBMC sample was thawed and stained with a 38-antibody panel, as described previously (19). CyTOF® data were collected on a Helios™ instrument from Fluidigm (South San Francisco, CA). Manual gating in FlowJo version 10.2 from TreeStar, Inc. (Ashland, OR) was used to create five clusters: CD4+ T cells, CD8+ T cells, B cells, NK cells, and monocytes.

For the influenza-specific T cell data set, PBMC from 24 healthy participants, who received an influenza vaccine 7 days prior, were stimulated for 8 h with overlapping peptides corresponding to influenza HA and M1 proteins (JPT, Berlin, Germany), along with 5 μg/mL each of brefeldin A and monensin (Sigma, St. Louis, MO). Cells were then prepared and stained for CyTOF mass cytometry as described (19). Approximately 10⁶ total cells were collected per participant; and those CD4+ T cells at 7 days expressing any cytokine(s) under HA+M1 stimulation were isolated as influenza-specific cells for cluster analysis and subsequent PSS visualization. A phenotyping solution was obtained via a ragged pruning (5) of an agglomerative hierarchical clustering applied to the denoised data (25), with cluster quality assessed with the silhouette index (27).

PSS construction was performed in R via the function optim, using method = “BFGS,” with additional utilities from the packages matrixcalc (22), plotrix (18), JPEN (21), and VCA (28). An example R script is available online at the Stanford Medicine ME/CFS Initiative web page (30).

PCA was performed in R. viSNE, a graphical user interface tool based on tSNE and available on Cytobank, was used on both CyTOF data sets. Radviz and the popular supervised method DAPC were run in R using software packages publicly available on the Comprehensive R Archive Network (CRAN) (1,13).

Results

Application of PSSs to three independent data sets

We first applied PSSs to a “challenge” data set. Although not immunological, Fisher's iris phenotype data set was selected for its public availability in base R. These data consist of 50 observations each on four phenotypic traits (sepal and petal length and width) in three species. To examine the designed robustness of our method, we augmented this iris data set with 50 additional columns of noise variables, correlated with each other but not with phenotype, generated as pseudorandom draws from a multivariate standard normal distribution with correlation structure created using R package clusterGeneration (26). As detailed in Figure 1, the PSS method shrank all noise “marker” projection coefficients (components of $\hat{\mathbf{Q}}$) strongly toward zero and properly retained and ordered (in terms of vector length) three of four phenotypic traits (petal length and width, sepal length) that distinguish these phenotypes. This finding indicates that, as designed, PSSs are robust to noisy markers.

FIG. 1.

Application of PSS to Fisher's iris phenotype data set. Result shows separation of phenotypes in 2D space. Vectors are the projection coefficients (from $\hat{\mathbf{Q}}$) for each individual marker. Larger coefficients contribute more to separation of clusters within the star plot. Clearly, all noise markers (labeled X1 through X50) have coefficients that have been strongly shrunken toward the origin (0, 0), as designed. Importance values, as quantified by the ratio of among-cluster to within-cluster variance r in the original four dimensions, are 23.6 for petal length, 19.2 for petal width, 2.4 for sepal length, and 0.96 for sepal width. The PSS correctly identifies importance ordering for three of these four features (petal length and width, sepal length) but not sepal width. R script for star plot construction from this data set of 150 observations, 54 markers, and repeated CV over a grid of 441 combinations of $\lambda_1$ and $\lambda_2$ ran in ∼6.3 min on a 3.5 GHz processor. 2D, two dimensional; CV, cross-validation; PSS, penalized supervised star plots.

We then applied the PSS method to a simple immunological data set consisting of PBMC phenotyping data (Fig. 2). The complete staining panel used can be found in the Supplementary Data (Supplementary Table S1). To cluster this data set, we first manually gated PBMC CyTOF data on major lineage markers and identified five major clusters: CD4+ T cells (CD14− CD33− CD3+ CD4+ CD8−, Cluster 1), CD8+ T cells (CD14− CD33− CD3+ CD4− CD8+, Cluster 2), B cells (CD14− CD33− CD3− CD19+ CD20+, Cluster 3), NK cells (CD14− CD33− CD3− CD16+ CD56+, Cluster 4), and monocytes (CD14+ CD33+, Cluster 5). The PSS from this data set showed clear separation of the five clusters. Also, we saw that the vectors represented important lineage markers, which were, in fact, the markers used to define cluster identity.

FIG. 2.

Application of PSS to peripheral blood mononuclear cell (PBMC) phenotyping data set. Using PSS, we visualized a PBMC data set gated on five basic cell lineages. Counterclockwise from right: CD4+ T cells (Cluster 1, navy blue), CD8+ T cells (Cluster 2, blue), B cells (Cluster 3, green), NK cells (Cluster 4, orange), and monocytes (Cluster 5, brown). The indicated vectors clearly show lineage markers such as CD4, CD8, CD3, CD19, CD56, and CD33, which are major drivers of cluster separation. R script for star plot construction from this data set of 8,234 observations, 38 markers, and repeated CV over a grid of 441 combinations of $\lambda_1$ and $\lambda_2$ ran in ∼180 min on a 2.8 GHz processor.

Finally, we applied PSS to our most challenging immunological data set, consisting of a relatively small number of influenza-specific CD4+ T cells identified by CyTOF, where all cells shared major lineage markers (CD3, CD4, etc.), but with potentially subtle differences in a variety of other markers, some of them dimly expressed. Clustering was performed via ragged pruning of an agglomerative hierarchical clustering, as described in the Materials and Methods section. Of the nine total clusters obtained, three were found to be influenza specific based on a significantly higher abundance under influenza peptide stimulation compared with no stimulation. Figure 3 shows the PSS for these three clusters. From the PSS, it is immediately apparent that MIP1b is the most discriminatory marker, which agrees with its ranking as the most important variable, as quantified by the ratio r of the among-phenotype to within-phenotype variances, in the original 32-dimensional marker space (Table 1). As per the method's design, other variables of high importance values r also have larger coefficient vectors within the star plot (e.g., CD3, CD8, CD69, CD4, CD85j, and CD154). In contrast, tumor necrosis factor (TNF) has the second largest coefficient vector within the star plot and is ranked 10th in importance in separating clusters, indicating that the fused penalty term facilitates but does not strictly enforce importance ordering.

FIG. 3.

Application of PSS to influenza-specific T cell data set. The original clustering was performed in 32D marker space. The PSS preserves the separation of the three phenotypes in two dimensions (Cluster 1, navy blue; Cluster 2, cyan; Cluster 3, brown). Vectors are the projection coefficients (from $\hat{\mathbf{Q}}$) for each individual marker. Larger coefficients contribute more to separation of clusters within the star plot and also tend to have greater importance, as quantified by the ratio r of among-cluster to within-cluster variance, in separating clusters in the original 32D space. See Table 1. R script for star plot construction from this data set of 6,273 observations, 32 markers, and repeated CV over a grid of 361 combinations of $\lambda_1$ and $\lambda_2$ ran in ∼31 min on a 3.5 GHz processor. 32D, 32 dimensional.

Table 1.

The 32 Markers Used to Identify Phenotypes Among Influenza-Specific CD4+ T Cells

Marker	r
MIP1b	1.80
CD8	0.58
CD3	0.50
CD69	0.46
CD4	0.45
CD85j	0.39
CD154	0.37
CD57	0.33
CD107a	0.30
TNF	0.28
IL2	0.24
IL17	0.21
CD27	0.19
CD127	0.17
Granzyme	0.17
IFNg	0.16
HLADR	0.15
CCR7	0.14
CD45RA	0.10
CD16	0.10
GMCSF	0.08
CCR6	0.06
Perforin	0.06
CXCR5	0.06
CD25	0.05
PD1	0.05
ICOS	0.05
IL21	0.04
CD56	0.03
CD38	0.01
Ki67	0.01
CXCR3	0.01

Markers are ordered from highest to lowest importance value r, which is the ratio of the among-phenotype to within-phenotype variances in the original 32-dimensional marker space.
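
As a rough illustration of how an importance value of this kind can be computed for a single marker, the following Python sketch uses the ratio of the among-cluster to the within-cluster sum of squares. This particular estimator is an assumption chosen for simplicity; the article's exact variance estimates (which may rely on a formal variance component analysis) could differ. The toy values and the names `marker_a` and `marker_b` are hypothetical.

```python
def importance_ratio(values, labels):
    """Among-cluster over within-cluster sum of squares for one marker.

    Assumes at least some within-cluster spread; a marker that is constant
    within every cluster would give a zero denominator.
    """
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    grand_mean = sum(values) / len(values)
    among = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in groups.values()
    )
    within = sum(
        (v - sum(vs) / len(vs)) ** 2 for vs in groups.values() for v in vs
    )
    return among / within


# Toy data: six cells in two clusters, two hypothetical markers.
labels = [1, 1, 1, 2, 2, 2]
separating = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]    # cluster means differ strongly
uninformative = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]    # identical cluster means

scores = {
    "marker_a": importance_ratio(separating, labels),     # 37.5
    "marker_b": importance_ratio(uninformative, labels),  # 0.0
}
ranked = sorted(scores, key=scores.get, reverse=True)     # most important first
```

Sorting markers by this ratio gives an ordering analogous to that in Table 1, with markers that differ strongly among clusters relative to their spread within clusters at the top.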

The directions of coefficient vectors are also meaningful. For example, in agreement with observed relative expression levels (Supplementary Table S2), the PSS (Fig. 3) shows that MIP1b is expressed least in Cluster 2 and most strongly in Cluster 3, CD69 is expressed least in Cluster 2 and most strongly in Cluster 1, and so on. Using PSS, we were able to visualize a number of key features that define each of the three influenza-specific T cell clusters. These prominent features are discerned by the length and direction of the star plot vectors and are confirmed by the mean expression levels in Supplementary Table S2. Cluster 1 showed some CCR7 expression and low CD45RA, akin to central memory T cells. It also expressed high levels of TNF-α, CD69, CD154, and IL-2. Cluster 2 displayed a more effector memory phenotype, as seen by its low levels of CCR7 and CD45RA, with high levels of IL-2, CD154, and PD-1. Finally, Cluster 3 had more of a CD4+ TEMRA phenotype (effector memory re-expressing CD45RA) and expressed high levels of MIP1b, CD57, perforin, and granzyme B, as well as coexpressing CD8.

Comparison of PSS with PCA, viSNE, Radviz, and DAPC

On the augmented Fisher's iris data set, PCA was less successful than PSS both at spatially separating phenotypes into compact clusters and at distinguishing true phenotypic traits from noise (e.g., compare the vector length for noise variable X21 with the vector lengths for sepal length and petal width). PCA also appears to have some difficulty ordering true traits properly, given that the importance of petal width ($r = 19.2$) exceeds that of sepal length ($r = 2.4$) (Supplementary Fig. S1). This is in contrast to PSSs, which are robust even when the data set contains noisy markers (Fig. 1). We also ran PCA on the PBMC phenotyping and the influenza-specific T cell data sets (Supplementary Fig. S2). There was some separation, but with a high degree of overlap between the clusters in both cases. Thus, PCA was less successful in clearly separating clusters, even in a PBMC phenotyping data set containing lineage markers with clear differential expression.

We also ran viSNE (3) on the influenza-specific T cell and PBMC phenotyping data sets to compare results with PSS (Supplementary Fig. S3). With viSNE, we saw a clear separation of the five clusters in the PBMC phenotyping data set. In contrast, in the influenza-specific T cell data set, there was moderate separation with some overlap and fragmentation of phenotypes.

We ran Radviz, which is a tool that places anchor points along a circle (11,12). Using our PBMC phenotyping data set, we found that there was some separation of clusters, but it was difficult to visually determine which parameters are the main drivers of cluster separation (Supplementary Fig. S4A). The influenza-specific T cell data set showed almost no separation of clusters (Supplementary Fig. S4B).

Finally, we ran DAPC (14), a popular supervised (discriminant function) method that is currently cited by more than 1,200 documents on Scopus. Despite being supervised, DAPC produced poor visual separation of clusters in both the PBMC phenotyping and the influenza-specific T cell data sets (Supplementary Fig. S5).

Overall, our comparative analyses showed that, in these cluster data sets, PSS provides a clearer and more intuitive visualization, even when the differences between clusters are subtle.

Discussion

Subpopulations of immune cells within a given lineage, such as CD4+ T cells, may express many combinations of markers, which are now readily tracked with high-dimensional flow or mass cytometry. Automated clustering algorithms can define the major subpopulations within these lineages, but visualizing the strength of that high-dimensional clustering solution and the major markers responsible for differentiating the clusters is still challenging. An example where this is relevant is CD4+ T cells responding to viral antigens, such as influenza, as shown in this study.

Interestingly, in the influenza-specific T cell data, we found that markers such as CD3 and CD4 were important in separating the three clusters. The cells used in this experiment were stimulated with influenza peptides to identify virus-specific clusters. Markers such as CD3 and CD4 are known to be downmodulated on activation, which is most likely why their expression levels vary among clusters. Furthermore, low levels of CD8 are known to be expressed on some subsets of CD4+ T cells responsive to viruses (31). This indicates a high degree of complexity in flu-responsive clusters, and PSS was able to visually highlight subtle differences in markers such as CD3, CD4, and CD8.

We also compared PSS to other conventional visualization methods (PCA, viSNE, Radviz, and DAPC). While viSNE was adequate at separating clusters for the simpler immunological data set (PBMC phenotyping), it was less effective than PSS on the more complex influenza-specific T cell data set. Furthermore, viSNE did not provide clear identification of prominent separating features, whereas the vectors in PSS represented them more accurately. PCA, a linear dimensionality reduction method, performed most poorly on all three data sets tested. Similarly, Radviz and DAPC exhibited very limited utility in visualizing this form of cluster data (Supplementary Figs. S4 and S5), especially Radviz. Radviz can be extended to be supervised (29), which could improve separation of clusters, but we did not examine this extension because we did not find this option in R package Radviz (2). In the two data sets on which we tested Radviz, cluster separation was poor, and the radial marker axes failed to provide useful insight into cluster differences. DAPC also did not clearly separate clusters in either data set (Supplementary Fig. S5). DAPC is a more general classification tool and not purely for 2D visualization, which may partly explain its limited visualization performance even with well-separated clusters (e.g., PBMC phenotyping data set, Supplementary Fig. S5A). Taken together, these findings highlight the need for a new supervised method such as PSS.

The strong performance of PSSs across contrasting data sets suggests that they may have wide utility for visualization of high-dimensional cell phenotyping data, especially mass cytometry data. For influenza-specific T cells, this could aid in understanding the basis of vaccine-induced protection (or lack thereof), as well as understanding T cell differentiation in response to different viral antigens.

PSSs are designed to be information-preserving. Specifically, the purpose of PSSs is to preserve with integrity as much information as possible in the projection from p-dimensional to 2D space. The two squared loss (L₂) penalty terms of our objective function E1 are attractive for this purpose because, unlike L₁ penalties, they shrink but do not completely remove (shrink exactly to zero) any of the coefficient vectors. This may be especially important given some recent concerns raised about L₁ penalties (8).
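
The qualitative difference between the two penalty types can be seen in a deliberately simplified scalar setting. The Python sketch below is not the PSS objective. It assumes a single unpenalized coefficient estimate `b` and compares the minimizer of 0.5(b − β)² + 0.5λβ² (a squared, L₂-type penalty) with the minimizer of 0.5(b − β)² + λ|β| (an L₁ penalty, solved by soft thresholding). The function names and numerical values are hypothetical.

```python
import math


def ridge_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + 0.5*lam*beta**2: proportional shrinkage."""
    return b / (1.0 + lam)


def lasso_shrink(b, lam):
    """Minimizer of 0.5*(b - beta)**2 + lam*abs(beta): soft thresholding."""
    magnitude = max(abs(b) - lam, 0.0)
    return 0.0 if magnitude == 0.0 else math.copysign(magnitude, b)


unpenalized = [2.0, 0.5, -0.1]   # hypothetical unpenalized coefficients
lam = 1.0                        # hypothetical penalty weight

l2_result = [ridge_shrink(b, lam) for b in unpenalized]   # [1.0, 0.25, -0.05]
l1_result = [lasso_shrink(b, lam) for b in unpenalized]   # [1.0, 0.0, 0.0]
```

Under the squared penalty every coefficient is reduced but remains nonzero, so a weakly informative marker still appears in the plot as a short vector. Under the L₁ penalty the two smaller coefficients are removed entirely.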

The PSS algorithm does entail a grid search over possible combinations of $\lambda_1$ and $\lambda_2$. For construction of Figure 3, a grid search over 361 combinations of $\lambda_1$ and $\lambda_2$ for this data set of 6,273 observations, 32 markers, and repeated CV (one 60%/40% partition per grid point) ran in ∼31 min on a 3.5 GHz processor. Nearly all of the run time is due to CVs. 
Repeated CV does have the important benefit of making the estimate of the optimized projection matrix $\mathbf{Q}$ more reproducible and portable for constructing star plots from other data sets on the same marker set drawn from the same population of individuals. Reproducibility has always been important in the sciences, including viral immunology. Recently, concerns have been growing about possible neglect of reproducibility (4). The estimated matrix of projection coefficients $\hat{\mathbf{Q}}$ could be published with original findings so that other investigators could perform the projection $\mathbf{R}^{*}\hat{\mathbf{Q}}$ using their marker data $\mathbf{R}^{*}$ and thereby assess the reproducibility of the original findings, even if the new data $\mathbf{R}^{*}$ have a different sample size ($n$) than the original data $\mathbf{R}$. Reproducibility could be strengthened further if PSS is performed on the results of carefully planned consensus clusterings [e.g., Ref. (6)].
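
Applying a published coefficient matrix to new marker data amounts to a single matrix product. The Python sketch below illustrates this under stated assumptions: new cells are stored as rows, the coefficient matrix has one two-element row per marker in the same marker order, and the new data are centered and scaled using their own column means and standard deviations. The article does not specify how new data should be standardized, so that choice, like the helper names and toy numbers, is hypothetical.

```python
import math


def standardize(rows):
    """Center each column to mean 0 and scale it to unit sample standard deviation.

    Assumes every column varies; a constant column would give a zero divisor.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(p)
    ]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]


def project(rows, q_hat):
    """Return the 2D coordinates rows @ q_hat, where q_hat has shape p x 2."""
    p = len(q_hat)
    if any(len(r) != p for r in rows):
        raise ValueError("each row must have one value per marker in q_hat")
    return [[sum(r[j] * q_hat[j][k] for j in range(p)) for k in range(2)] for r in rows]


# Hypothetical published coefficients for three markers (one row per marker).
q_hat = [[1.0, 0.0], [0.0, 0.5], [0.0, 0.0]]

# Hypothetical new sample: three cells measured on the same three markers.
new_cells = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 5.0, 2.0]]
coords = project(standardize(new_cells), q_hat)   # one (x, y) pair per cell
```

The number of new cells is unconstrained; only the marker set and its ordering need to match the rows of the published matrix.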

PSSs are straightforward to apply. After proper row construction of supervisor $\mathbf{Y}$ and projection matrix $\mathbf{Q}$ plus centering and scaling of $\mathbf{Y}$ and $\mathbf{R}$, solution of Equation 1 across a grid of penalty parameters ($\lambda_1$, $\lambda_2$) completes the process. For plotting, resultant estimate $\hat{\mathbf{Q}}$ contains the coefficient vectors, and $\mathbf{R}\hat{\mathbf{Q}}$ gives the data projection into two dimensions. Future improvements to the current implementation could include allowance for CV of other folds (e.g., 5- or 10-fold) and application of thin-plate smoothing splines (7) to reduce variance and permit interpolation for estimation of $\|\mathbf{E}\|_{\mathrm{F}} / n_{\mathrm{v}}$ as a function of $\lambda_1$ and $\lambda_2$ following grid search. A trade study would be required to assess the possibility (and overall desirability) of revising Equation 1 to enforce strictly the inequality constraint $r_1 \ge r_2 \ge \ldots \ge r_p$.
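
The overall grid-search workflow can be outlined in a short Python sketch. Several elements are assumptions rather than details taken from the article: $\mathbf{E}$ is taken to be the validation residual $\mathbf{Y}_{\mathrm{v}} - \mathbf{R}_{\mathrm{v}}\hat{\mathbf{Q}}$, $n_{\mathrm{v}}$ the number of validation rows, and `toy_fit` is a deliberately artificial stand-in for solving Equation 1 (it ignores the training data and simply shrinks a known matrix). The grid values are also hypothetical. Only the structure (one random 60%/40% partition per grid point, followed by selection of the pair with the smallest validation error) follows the description in the text.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of a (n x p) and b (p x k)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def validation_error(y_val, r_val, q_hat):
    """Frobenius norm of E = Y_v - R_v Q_hat divided by the number of validation rows."""
    fitted = matmul(r_val, q_hat)
    squared = sum(
        (y - f) ** 2 for y_row, f_row in zip(y_val, fitted) for y, f in zip(y_row, f_row)
    )
    return math.sqrt(squared) / len(y_val)


def grid_search(y, r, fit, lambdas1, lambdas2, rng, train_frac=0.6):
    """Return (error, lambda1, lambda2) with the smallest hold-out error."""
    n = len(r)
    n_train = int(round(train_frac * n))
    best = None
    for lam1 in lambdas1:
        for lam2 in lambdas2:
            idx = list(range(n))
            rng.shuffle(idx)   # a fresh random 60%/40% partition per grid point
            train, val = idx[:n_train], idx[n_train:]
            q_hat = fit([y[i] for i in train], [r[i] for i in train], lam1, lam2)
            err = validation_error([y[i] for i in val], [r[i] for i in val], q_hat)
            if best is None or err < best[0]:
                best = (err, lam1, lam2)
    return best


def toy_fit(y_train, r_train, lam1, lam2):
    """Artificial stand-in for Equation 1: shrinks a known 2 x 2 identity matrix."""
    scale = 1.0 / (1.0 + lam1 + lam2)
    return [[scale, 0.0], [0.0, scale]]


rng = random.Random(1)
r = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(10)]
y = [row[:] for row in r]   # toy supervisor equal to the data, so no shrinkage is best
best = grid_search(y, r, toy_fit, [0.0, 0.5, 1.0], [0.0, 0.5, 1.0], rng)
```

In this toy setting the unshrunk solution reproduces the supervisor exactly, so the search selects $\lambda_1 = \lambda_2 = 0$. If the grids used in the article are square, the reported 441 and 361 combinations would correspond to 21 × 21 and 19 × 19 grids, although the article does not state the grid layout here.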

Footnotes

Acknowledgments

This work was funded by NIH/NIAID grant 5U19AI057229 to Dr. Mark Davis, Director of the Institute of Immunity, Transplantation and Infection, Stanford University School of Medicine. We are grateful to Dr. Davis for his encouragement and general support. We also thank Dr. Dongxia Lin for CyTOF analysis, and Dr. Cornelia Dekker, Rohit Gupta, and Janine Sung for sample collection, processing, and banking. We also thank the three anonymous reviewers, whose considered and constructive comments greatly improved the method and article.

Authors' Contributions

T.H.H. and H.T.M. conceived the study. T.H.H. designed the method, wrote R code, conducted analyses, and wrote the article. H.T.M. critically reviewed the article. P.B.S. processed samples, generated CyTOF data, gated, and analyzed them, performed viSNE analysis, and helped write the article. W.W. performed PSS, PCA, RadViz, and DAPC analyses. W.W. also contributed extensions to the original R script for PSS. All authors have read and approved the final version of the article.

Ethical Approval

Samples were collected from human participants under informed consent, through the Institutional Review Board (IRB)-approved protocol at Stanford University (IRB-16390).

Author Disclosure Statement

No competing financial interests exist.

Supplementary Material

Supplementary Figure S1

Supplementary Figure S2

Supplementary Figure S3

Supplementary Figure S4

Supplementary Figure S5

Supplementary Table S1

Supplementary Table S2

References

1. Abraham. Radviz: A R Package for Multi-Dimensional Data Visualization. The Comprehensive R Archive Network (CRAN), 2016. Available at https://cran.cnr.berkeley.edu/ (Last accessed January 23, 2019).

2. Abraham. 2016. Visualizing multivariate data with Radviz. Retrieved from: https://cran.r-project.org/web/packages/Radviz/vignettes/single_cell_projections.html (Accessed on December 5, 2018).

3. Amir, Davis, Tadmor, et al. viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia. Nat Biotechnol, 2013; 31:545–552.

4. Baker. 1,500 scientists lift the lid on reproducibility. Nature, 2016; 533:452–454.

5. Bruggner, Bodenmiller, Dill, et al. Automated identification of stratifying signatures in cellular subpopulations. Proc Natl Acad Sci USA, 2014; 111:E2770-7.

6. Chiu, Talhouk, and Liu. diceR: Diverse cluster ensemble in R. Comprehensive R Archive Network (CRAN), 2018. (Accessed on July 11, 2017).

7. Duchon. Splines minimizing rotation-invariant semi-norms in Sobolev spaces. In: Schempp, Zeller, eds. Constructive Theory of Functions of Several Variables. Berlin, Heidelberg: Springer-Verlag (Dold A, Eckmann B, editors. Lecture Notes in Mathematics), 1977.

8. Hansen. The risk of James–Stein and lasso shrinkage. Econom Rev, 2016; 35:1456–1470.

9. Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009:84–95.

10. Hastie, Tibshirani, and Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer, 2009:501–528.

11. Hoffman, Grinstein, Marx, et al. DNA visual and analytic data mining. In: Proceedings Visualization'97 (Cat No 97CB36155). Phoenix, AZ: IEEE, 1997:437–441.

12. Hoffman, Grinstein, and Pinkney. Dimensional anchors: a graphic primitive for multidimensional multivariate information visualizations. In: Proceedings of the 1999 Workshop on New Paradigms in Information Visualization and Manipulation in Conjunction with the Eighth ACM International Conference on Information and Knowledge Management—NPIVM'99. New York, NY: ACM Press, 1999:9–16.

13. Jombart. adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics, 2008; 24:1403–1405.

14. Jombart, Devillard, and Balloux. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations. BMC Genet, 2010; 11:94.

15. Kandogan. Visualizing multi-dimensional clusters, trends, and outliers using star coordinates. In: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD'01. New York, NY: ACM Press, 2001:107–116.

16. Kim J-H. Estimating classification error rate: repeated cross-validation, repeated hold-out and bootstrap. Comput Stat Data Anal, 2009; 53:3735–3745.

17. Kohonen. The self-organizing map. Proc IEEE, 1990; 78:1464–1480.

18. Lemon. plotrix: a package in the red light district of R. R-News, 2006; 6:8–12.

19. Lin, Gupta, and Maecker. Intracellular cytokine staining on PBMCs using CyTOF™ mass cytometry. Bio Protoc, 2015; 5:e1370.

20. van der Maaten and Hinton. Visualizing data using t-SNE. J Machine Learn Res, 2009; 9:2579–2605.

21. Maurya. JPEN: Covariance and Inverse Covariance Matrix Estimation Using Joint Penalty. Comprehensive R Archive Network (CRAN), 2015. Available at https://cran.cnr.berkeley.edu/ (Last accessed January 23, 2019).

22. Novomestky. matrixcalc: Collection of Functions for Matrix Calculations. Comprehensive R Archive Network (CRAN), 2012. Available at https://cran.cnr.berkeley.edu/ (Last accessed January 23, 2019).

23. Orloci. An agglomerative method for classification of plant communities. J Ecol, 1967; 55:193–206.

24. Pearson. LIII. On lines and planes of closest fit to systems of points in space. Philos Mag Ser 6, 1901; 2:559–572.

25. Peterson and Ford. Random matrix theory and covariance matrix filtering for cancer gene expression. In: Peterson, Masulli, Russo, eds. Computational Intelligence Methods for Bioinformatics and Biostatistics. Berlin, Heidelberg: Springer Berlin Heidelberg (Hutchison, Kanade, Kittler, et al., eds. Lecture Notes in Computer Science), 2013:173–184.

26. Qiu and Joe. clusterGeneration: Random Cluster Generation (with Specified Degree of Separation). Comprehensive R Archive Network (CRAN), 2015. Available at https://cran.cnr.berkeley.edu/ (Last accessed January 23, 2019).

27. Rousseeuw. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math, 1987; 20:53–65.

28. Schuetzenmeister and Dufey. VCA: Variance Component Analysis. Comprehensive R Archive Network (CRAN), 2018. Available at https://cran.cnr.berkeley.edu/ (Last accessed January 23, 2019).

29. Sharko, Grinstein, and Marx. Vectorized Radviz and its application to multiple cluster datasets. IEEE Trans Vis Comput Graph, 2008; 14:1444–1451.

30. Stanford Medicine ME/CFS Initiative. R Code for a High-Dimensional Cell-Phenotype Visualization Tool. http://med.stanford.edu/chronicfatiguesyndrome/research/Rcode-Visualization-Tool.html

31. Suni, Ghanekar, Houck, et al. CD4(+)CD8(dim) T lymphocytes exhibit enhanced cytokine expression, proliferation and cytotoxic activity in response to HCMV and HIV-1 antigens. Eur J Immunol, 2001; 31:2512–2520.

32. Tibshirani, Saunders, Rosset, et al. Sparsity and smoothness via the fused lasso. J R Stat Soc Ser B (Stat Method), 2005; 67:91–108.

33. Van Long and Linsen. Visualizing high density clusters in multidimensional data using optimized star coordinates. Comput Stat, 2011; 26:655–678.
