Abstract
Social networks in education commonly involve some form of grouping, such as friendship cliques or teacher departments, and blockmodels are a type of statistical social network model that accommodates these groupings, or blocks, by assuming different within-group tie probabilities than between-group tie probabilities. We describe a class of models, covariate stochastic blockmodels (CSBMs), that incorporates covariates into blockmodels. These models not only estimate the effects of covariates in the presence of the block structure but also can determine differential covariate effects, such as within blocks versus between blocks. For example, education researchers can now determine those factors that mitigate relationships both within schools and between schools. We introduce several CSBMs as examples, present a series of simulation studies to investigate both the feasibility and some operating characteristics of these models, and fit CSBMs to real network data.
1. Introduction
In educational settings, school professional networks offer insight into how social resources are shared and information is exchanged among teachers and other staff. For example, advice-seeking social networks can reveal which teachers are considered experts, whose advice is highly sought, which teachers are well-connected in their school’s network, and which teachers are completely isolated. Often substantive researchers are interested in the factors associated with the presence of a particular relationship, since collegial interactions are associated with changes in teacher practice (Frank, Zhao, & Borman, 2004; Moolenaar, Daly, & Sleegers, 2010). For example, Spillane, Kim, and Frank (2012) and Hopkins, Lowenhaupt, and Sweet (2015) both find that being a formal leader increases the likelihood of providing advice and that teachers tend to seek and provide advice to other teachers who teach the same grade. Thus, it is not surprising that many standard statistical social network models, such as exponential random graph models (Wasserman & Pattison, 1996) and latent space models (Hoff, Raftery, & Handcock, 2002), accommodate covariates for inference about individual- or pair-level variables.
Current social network models, however, do not adequately accommodate ties between teachers in different schools. Analyzing ties across schools is relevant in many contexts, since schools do not operate in isolation and may share teachers, coaches, or other personnel. Research on those ties that do exist across schools may be important for understanding the functionality of the district as a whole in addition to other factors such as which schools have more across-school ties, which teachers’ attributes are related to these ties, or the general relationship between across-school ties and within-school ties. Consider a school district with k schools. The framework introduced by Sweet, Thomas, and Junker (2013) for modeling the networks in these k schools is only useful if we assume that these k schools are isolated. Although it is useful for many research questions, this framework does not currently allow for ties across schools.
Instead, we propose to treat the district as a single network and model the k school networks as subgroups or blocks within the larger district network. Social network models that accommodate subgroup structure are commonly known as blockmodels (White, Boorman, & Breiger, 1976) or stochastic blockmodels (Holland, Laskey, & Leinhardt, 1983), and these models generally assume that within-block ties have a different probability than across-block ties. The terms a priori stochastic blockmodel and a posteriori stochastic blockmodel refer to whether block membership is known or unknown, respectively, and are common in the blockmodeling literature (Anderson & Wasserman, 1992; Nowicki & Snijders, 2001; Wasserman & Anderson, 1987).
Because blockmodels historically have not included covariates, the purpose of this article is to define a class of models, covariate stochastic blockmodels (CSBMs), explore a variety of CSBMs, and illustrate how these models can be applied in educational research. A covariate extension of the stochastic blockmodel was originally introduced by Airoldi, Choi, and Wolfe (2011), and our work will explore similar models as well as some of the operating characteristics of these models, which have not been studied. There are several other models that incorporate covariates into extensions of stochastic blockmodels (Sweet, Thomas, & Junker, 2014; Tallberg, 2004; White & Murphy, 2014), but these extensions are relatively rare and these models have not been applied in education.
In Section 2, we formally introduce several CSBMs and describe our model-fitting algorithm. We then discuss possible applications for these models as well as consider potential issues in Section 2.2. Then, to illustrate the feasibility of these models as well as to assess potential model-fitting issues and parameter recovery, we present a series of simulation studies in Section 3, followed by an application using real advice-seeking network data in Section 4. We conclude by describing future work and additional applications in Section 5.
2. Model
The CSBM framework is a class of blockmodels that accommodates covariates in an additive way, so that tie probability is influenced by both block membership and covariates. We then assume covariates influence the probability of a tie independently of the block membership, which is similar to the model proposed by Airoldi et al. (2011) and unlike other models that incorporate covariates into the block membership assignment mechanism (Tallberg, 2004; White & Murphy, 2014).
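As a rough illustration of the additive structure described above, the following Python sketch computes a tie probability by adding a covariate term to the log odds of a block-level tie probability and then simulates a directed binary network. The functional form, the function names (`tie_probability`, `simulate_network`), and the layout of the inputs are assumptions made for this illustration and are not taken from the authors' supplemental R code.

```python
import math
import random


def logit(p):
    """Map a probability in (0, 1) to the log-odds scale."""
    return math.log(p / (1.0 - p))


def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def tie_probability(B, g_i, g_j, beta, x_ij):
    """Assumed additive form: logit(p_ij) = logit(B[g_i][g_j]) + beta . x_ij."""
    eta = logit(B[g_i][g_j]) + sum(b * x for b, x in zip(beta, x_ij))
    return inv_logit(eta)


def simulate_network(B, g, beta, X, seed=0):
    """Draw a directed binary sociomatrix; X[i][j] is a list of pair-level covariates."""
    rng = random.Random(seed)
    n = len(g)
    Y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:  # self-ties are not modeled
                p = tie_probability(B, g[i], g[j], beta, X[i][j])
                Y[i][j] = 1 if rng.random() < p else 0
    return Y
```

With all coefficients set to zero, the sketch reduces to an ordinary stochastic blockmodel in which the tie probability is simply the corresponding entry of B.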
We define a CSBM using a hierarchical Bayesian model. Let Y denote the binary sociomatrix such that $Y_{ij} = 1$ if there is a tie from node i to node j and $Y_{ij} = 0$ otherwise.
CSBMs incorporate covariates into the model in an additive way through the log odds of a tie. Since the probability of a tie has support (0, 1), we transform the tie probability to the log odds or logit scale, so that the support is now $(-\infty, \infty)$.
An example of such a CSBM is given as:
Note that when block membership is not known a priori, it must be estimated, and in Equation 1, we use a multinomial prior for each
There are many situations in education in which block membership is known. For example, given a network of schools, each school could be considered as a subgroup, and individual membership to each school is known and does not need to be part of the model. If the block memberships are known, then the model is called the a priori CSBM and is given below as:
Note that the a priori CSBM can be thought of as a reparameterization of a logistic regression model, although we caution that this model should be interpreted carefully. Instead of using a B matrix or in our case a logit (B) to describe the additive effects of block membership, we can incorporate group–group probabilities through indicators and regression coefficients. Such a model is given as:
It is important to realize that without informative priors on the entries of the B matrix, the model is not identified. Without these priors, we make no assumption about the probability of within-block ties as compared to across-block ties. Without this or an analogous assumption, block structure no longer is meaningful and block membership does not contribute to the variability in network structure. Thus, if one chooses to fit the a priori CSBM using logistic regression, it should be done in a Bayesian framework taking advantage of informative priors, or if done in a frequentist setting, it should employ some kind of constraint to accommodate the block structure of the network.
Turning our attention to the covariate part of the CSBM, there are situations in which a covariate is believed to have a differential effect depending on whether the tie occurs within a block or between blocks. For example, elementary school teachers tend to seek advice from teachers in their school, who teach the same grade (Spillane, Kim, & Frank, 2012), but this factor may not be a strong predictor of across-school advice ties. Alternatively, we might find that principals tend to seek advice outside their home schools because they are communicating with other principals, making it unlikely they will ask advice within their own schools. In these situations, we should model the covariates for ties within blocks and across blocks separately. In other situations, we may choose to model covariates as having constant effects across all individuals regardless of block membership.
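One way to picture a differential covariate effect is to let the coefficient depend on whether the two nodes share a block. The Python sketch below does this for a single pair-level covariate. The names `beta_within` and `beta_between` and the exact functional form are hypothetical and chosen only for illustration; the sketch is not the authors' specification.

```python
import math


def inv_logit(x):
    """Map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def tie_probability(B, g_i, g_j, beta_within, beta_between, x_ij):
    """Tie probability with a covariate effect that differs within and between blocks.

    Assumed form: logit(p_ij) = logit(B[g_i][g_j]) + beta * x_ij, where beta is
    beta_within when g_i == g_j and beta_between otherwise.
    """
    beta = beta_within if g_i == g_j else beta_between
    base = math.log(B[g_i][g_j] / (1.0 - B[g_i][g_j]))  # logit of the block probability
    return inv_logit(base + beta * x_ij)
```

For instance, a covariate such as "teaches the same grade" could be given a positive `beta_within` and a `beta_between` near zero, so that it raises tie probability inside a school but leaves across-school ties at their baseline rate.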
To accommodate covariates that vary depending on the nature of the tie (whether it is within a block or across blocks), we introduce the a priori random covariate stochastic blockmodel (RCSBM) along with the a posteriori RCSBM. The a posteriori RCSBM is given as:
Within each of these models, additional specifications are possible depending on the situation. For example, the covariate effect may vary depending on whether the tie occurs within blocks or between blocks, and we could also allow within-block covariate effects to vary by block. Then,
Additional parameters can also be added to the model depending on prior specification. For example, we specify
2.1. Model Estimation
To fit CSBMs, we use a Markov chain Monte Carlo algorithm (MCMC; Gelman, Carlin, Stern, & Rubin, 2004), where the joint likelihood of the model can be written as:
If block memberships are estimated, they are updated using Gibbs updates and the complete conditional distribution is given as:
All other parameters in the model are updated using Metropolis–Hastings updates. The regression coefficients β can be updated using a random walk proposal distribution, and depending on the specific model, it may be possible to update hyperparameters with Gibbs updates. To estimate B, we reparameterize B and update instead the log odds of B again using a normal random walk proposal distribution. Without covariates, closed form complete conditionals do exist and B can be updated using a Gibbs step (Sweet et al., 2014).
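To make the random walk update concrete, here is a minimal Python sketch of a Metropolis–Hastings sampler for a single regression coefficient, holding block memberships and the log odds of B fixed. The Normal prior, the step size, the single-coefficient setup, and all function names are assumptions for illustration; the authors' R implementation may differ in every one of these respects.

```python
import math
import random


def log_posterior(beta, Y, g, logit_B, X, prior_sd):
    """Log-likelihood plus log-prior (up to a constant) for one coefficient.

    Assumes logit P(Y[i][j] = 1) = logit_B[g[i]][g[j]] + beta * X[i][j]
    and a Normal(0, prior_sd^2) prior on beta.
    """
    total = -0.5 * (beta / prior_sd) ** 2
    n = len(Y)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-ties are not modeled
            eta = logit_B[g[i]][g[j]] + beta * X[i][j]
            # Bernoulli log-likelihood written on the logit scale
            total += Y[i][j] * eta - math.log1p(math.exp(eta))
    return total


def random_walk_mh(Y, g, logit_B, X, n_iter=1000, step_sd=0.5, prior_sd=10.0, seed=0):
    """Return the chain of beta draws and the acceptance rate."""
    rng = random.Random(seed)
    beta = 0.0
    current = log_posterior(beta, Y, g, logit_B, X, prior_sd)
    draws, n_accepted = [], 0
    for _ in range(n_iter):
        proposal = beta + rng.gauss(0.0, step_sd)  # symmetric normal proposal
        candidate = log_posterior(proposal, Y, g, logit_B, X, prior_sd)
        # Symmetric proposal, so the acceptance ratio is the posterior ratio
        if math.log(1.0 - rng.random()) < candidate - current:
            beta, current = proposal, candidate
            n_accepted += 1
        draws.append(beta)
    return draws, n_accepted / n_iter
```

The same pattern would apply to the log odds of an entry of B: propose a normal perturbation on the logit scale and accept or reject using the posterior ratio.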
The MCMC algorithm is coded in R (R Development Core Team, 2010) and this code is available as supplemental material. Note that our code is for binary network data only. These models can also be estimated using the R package CIDnetworks (Dabbs, Junker, Sweet, & Thomas, 2014), which uses a probit link function and accommodates any type of tie data.
2.2. Considerations
2.2.1. Applications and models
CSBMs are a useful class of models for researchers interested in identifying and measuring factors associated with network ties in the presence of subgroups. Subgroup structure is common in many social organizations, either through formal organization (e.g., departments in a high school) or as an outcome of social interaction (e.g., proximity or homophily). Given a friendship network of students, students may self-group by grade but choose their friends based on a number of other characteristics such as race, gender, or common interests. We could also estimate attributes associated with friendships across classrooms. Are these friendships based on other shared activities such as sports teams or neighborhood playgroups or are there certain attributes that these students possess such as age or having a leadership role?
Specifically, these models are best-suited for covariates that are unrelated to the subgroup. For example, consider an advice-seeking network of teachers in a high school and suppose the blocks represent departments. That is, teachers tend to seek out other teachers within their own department for advice. Then, we could use CSBMs to investigate the effects of teacher attributes such as race, gender, experience, and instructional practices on advice-seeking. Because college major tends to be strongly associated with teaching department, including that as a node-level or even as a dyad-level covariate is problematic since the block membership is based on department. This is analogous to the practice of removing collinear covariates in multiple regression (Weisberg, 2005) and does not detract from the utility of the model.
CSBMs are also a flexible class of models. CSBMs can accommodate a variety of block-dependent covariate effects, such as differential effects for within-group ties and between-group ties. In addition, CSBMs can accommodate other random effects for covariates. For example, we might need a model in which covariate effects vary by block. Consider elementary school friendship networks in which students are blocked by grade. We might hypothesize that the effect of gender on friendship ties becomes increasingly positive as students get older, that is, students select students of the same gender for friendship ties, so the effect of the tie-level covariate of “being the same gender” varies across the grade-level blocks. Such a block-dependent covariate could also be accomplished by incorporating block-specific covariates directly into the B matrix.
2.2.2. Estimation and identifiability
One issue common to all a posteriori blockmodel-fitting algorithms is that of label switching. When block membership is unknown, the CSBM algorithm estimates the group identity for all individuals, creating several identical solutions. Such an identifiability issue is handled in a number of ways (Celeux, 1998; Jasra, Holmes, & Stephens, 2005; Stephens, 2000). We choose to postprocess the block memberships and relabel in order of average node ID, which is a simple way to obtain a unique solution. Labeling of blocks becomes an issue when calculating classification rates, especially when block membership is not recovered with near-perfect accuracy.
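The relabeling rule described above can be sketched in a few lines of Python. This is an illustrative sketch, assuming nodes are numbered 0, 1, 2, ... in the order of the label list; the function name is hypothetical.

```python
def relabel_by_mean_node_id(labels):
    """Relabel blocks 0, 1, 2, ... in increasing order of their members' average node ID."""
    members = {}
    for node_id, label in enumerate(labels):
        members.setdefault(label, []).append(node_id)
    # Order the original labels by the mean node ID of each block
    order = sorted(members, key=lambda lab: sum(members[lab]) / len(members[lab]))
    mapping = {old: new for new, old in enumerate(order)}
    return [mapping[label] for label in labels]
```

Two labelings that describe the same partition, such as `[2, 2, 0, 0, 1, 1]` and `[1, 1, 2, 2, 0, 0]`, are both mapped to `[0, 0, 1, 1, 2, 2]`, which removes the arbitrary choice of label names.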
Another potential identifiability issue that is more particular to CSBMs involves the B matrix and the number of ties across groups, since the inclusion of covariates may increase the number of between-block ties. Separating the effects of the covariate from the estimates of the B matrix may prove challenging in some circumstances, and for our particular applications, we assume between-block tie probabilities to be low (i.e., the off-diagonal elements of the B matrix are small) and suggest using informative priors or even fixing the off-diagonal values of the B matrix to ensure that covariate effects are not affected.
We also consider the challenge of estimating group membership. Local modality is certainly an issue with group membership estimation and we recommend using a clustering algorithm to generate starting values for group membership in the MCMC algorithm. If starting values for group membership are selected randomly or fixed so that all individuals begin in the same group, MCMC convergence to a local mode is more likely, and this issue worsens as subgroup density decreases and the number of blocks increases.
It is unsurprising that increasing the number of parameters in any CSBM increases the number of possible identification issues and we conclude our discussion with the a posteriori RCSBM in which both group membership is unknown and covariate effects differ for across-block and within-block ties. This particular model is likely to be the most difficult to fit of the CSBMs presented in this article. Estimates for covariates are necessarily dependent on block membership estimates and errors in one create errors in the other. These models are likely best-suited for networks with dense subgroups and most likely small networks, for example, a class friendship network or small organization network.
Despite these issues, which are not difficult to circumvent, these models are quite useful. In Section 3, we explore these issues in more detail through simulation studies to illustrate under which conditions these models can and should be used. Then, we present empirical examples of fitting these models in Section 4.
3. Simulations
We conducted three simulation studies to illustrate the utility of these models and to investigate some underlying operating characteristics that may affect parameter estimation. We first explore whether estimating the B matrix influences covariate effect parameter recovery in a CSBM. In the next simulation study, we explore the effect of covariates on recovering group membership, and in the third study, we explore the effects of subgroup density on recovering group membership. These simulations are not meant to be exhaustive; they are merely an exploration of these models, and the preliminary results can be used to inform future work in this area.
3.1. Simulation 1: Investigating Possible Identifiability Issues Between the Group Tie Probability Matrix and Covariate Parameters
There may be an identifiability issue when simultaneously estimating the within- and between-group tie probability matrix (B) and regression coefficients. When all elements of B can vary and β is unconstrained, changes in within-block probability of a tie may result from a change in β or a change in entries of B. For example, a positive covariate may increase the number of both within- and between-group ties and this effect may instead be captured in the estimates of the entries of B as opposed to the covariate effect. Thus, to investigate this possibility, we explore covariate parameter recovery when B is fixed or estimated.
We simulated 20-node networks of three subgroups and varied both the effects and the numbers of covariates. Each covariate was then included in our data generating model as both a sender covariate and a receiver covariate, that is,
We simulated 100 data sets for each simulation for each of the four covariate combinations. We then fit the simulated data set using three different models that differ only with respect to the B matrix estimation. Model 1 assumes the B matrix is known, Model 2 assumes the off-diagonal elements of the B matrix are known, but the diagonal elements must be estimated, and Model 3 assumes the entire B matrix is unknown. Thus, Model 3 is given as:
Table 1 displays the parameter recovery rates and we find that parameters are recovered approximately 88–98% of the time. Coverage probabilities do not appear to be influenced by the model, which suggests that estimating B does not affect covariate parameter estimation. Coverage may decrease slightly as the number of covariates increases, but it is not obvious from these results. We also note that coverage does not seem to vary based on the variance of the X.
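For readers who want to reproduce this kind of summary, the Python sketch below shows one way to compute a coverage rate: form an equal-tailed interval from each replicate's posterior draws and record the share of intervals that contain the true parameter value. The 95% level and the simple order-statistic interval are assumptions made for illustration and may not match the authors' exact computation.

```python
import math


def equal_tailed_interval(draws, level=0.95):
    """Approximate equal-tailed interval from posterior draws using order statistics."""
    ordered = sorted(draws)
    alpha = 1.0 - level
    lo_idx = math.floor((alpha / 2.0) * (len(ordered) - 1))
    hi_idx = math.ceil((1.0 - alpha / 2.0) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]


def coverage_rate(intervals, true_value):
    """Proportion of (lower, upper) intervals that contain the true value."""
    hits = sum(1 for lower, upper in intervals if lower <= true_value <= upper)
    return hits / len(intervals)
```

In a study with 100 simulated data sets per condition, `intervals` would hold 100 pairs, one per fitted model, and a well-calibrated 95% interval should yield a coverage rate near .95.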
Table 1. (Simulation 1) Coverage Probabilities for Covariate Parameters
Note. We fit each network with three models: B is fixed (Model 1), the diagonal elements of B are estimated (Model 2), and all of B is estimated (Model 3). CSBM = covariate stochastic blockmodel.
3.2. Simulation 2: Effects of Covariates on Group Membership Recovery
Another potential estimation issue may occur when jointly estimating group membership and covariates. We suspect that increasing numbers of covariates would decrease the accuracy with which we can estimate group membership. Thus, in this simulation, we explore the effects of additional covariates on the group membership estimation.
We generated networks of size 30 and 75 with three and five subgroups, respectively, and included either one, two, or four pairs of sender and receiver covariates. For each combination of group number and covariate effects, we simulated 100 data sets from the following CSBM:
We then fit the simulated network data and covariates using an a posteriori CSBM. To optimize convergence, we used a community detection random walk algorithm through the
Our MCMC algorithm generates draws for group membership for each node and we use the posterior mode as the estimated block membership. We then compare the estimated group membership with the true group membership, using an algorithm that permutes the rows to optimize the trace of the classification matrix. This then generates the labels that best align with the true groups. To assess our group membership estimates, we use the classification rate defined as the proportion of nodes that were correctly classified.
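The label alignment and classification rate described above can be sketched as follows in Python. The sketch tries every permutation of the estimated labels and keeps the one that maximizes the trace of the classification matrix. The brute-force search and the function name are choices made for this illustration, not necessarily the algorithm the authors used.

```python
from itertools import permutations


def classification_rate(true_labels, est_labels, n_blocks):
    """Best-case proportion of correctly classified nodes over all label permutations."""
    # counts[t][e] = number of nodes with true block t and estimated block e
    counts = [[0] * n_blocks for _ in range(n_blocks)]
    for t, e in zip(true_labels, est_labels):
        counts[t][e] += 1
    # A permutation assigns one estimated label to each true block; the matched
    # count is the trace of the classification matrix after that relabeling.
    best = max(
        sum(counts[t][perm[t]] for t in range(n_blocks))
        for perm in permutations(range(n_blocks))
    )
    return best / len(true_labels)
```

Because the number of permutations grows factorially, this brute-force version is practical only for a handful of blocks; with many blocks, an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be a more suitable way to maximize the same trace.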
Classification rates for node membership are summarized in Table 2. The mean classification rate is determined by taking the average of the 100 classification rates computed for each model fit. Along with the mean, we include the standard deviation (SD) and two tie density measures: the within-group and between-group tie density. We define the within-group density as the number of observed ties that occur within the same block divided by the total number of possible ties within the same block and between-group density as the number of observed ties that occur between different blocks divided by the total number of possible ties between different blocks. This is not based on the model since we use the true block memberships to calculate these values.
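The two density measures can be computed directly from a sociomatrix and the true block memberships. The following Python sketch assumes a directed network without self-ties, so every ordered pair of distinct nodes counts as one possible tie; the function name is hypothetical.

```python
def block_densities(Y, g):
    """Return (within-group density, between-group density) for a directed network.

    Y is a binary sociomatrix (list of lists) and g[i] is the block of node i.
    """
    within_ties = within_pairs = 0
    between_ties = between_pairs = 0
    n = len(Y)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-ties are not possible ties
            if g[i] == g[j]:
                within_pairs += 1
                within_ties += Y[i][j]
            else:
                between_pairs += 1
                between_ties += Y[i][j]
    within = within_ties / within_pairs if within_pairs else float("nan")
    between = between_ties / between_pairs if between_pairs else float("nan")
    return within, between
```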
Table 2. (Simulation 2) A Posteriori CSBM Block Membership Classification Rates and Within- and Between-Block Densities
Block membership recovery does not vary greatly across conditions, but there are some detectable patterns. When the variance of the covariates increases to 1, block classification rates are slightly lower. Similarly, we also notice that in general, the eight covariate models have the best classification rates, but there isn’t a noticeable difference between two and four covariates. The models when
We also note that the five-block model fits have slightly better classification rates than the three-block model fits. Although we believe that increasing the numbers of blocks generally increases the difficulty in estimating block membership, the five-block networks had 75 nodes and the three-block networks had 30 nodes. Thus, on average, the five-block networks had 15 nodes per block and the three-block networks had 10 nodes per block. While an increase from 30 nodes to 75 nodes may not seem like a huge leap, this corresponds to an increase in the number of possible ties from 870 (30 × 29) to 5,550 (75 × 74). In addition, there appears to be little relationship between tie densities and classification rates. In our experience fitting noncovariate versions of SBMs, we find classification rates are generally best when there are large differences in within-block and between-block tie densities and the between-block tie density is very low, so it may at first seem surprising that this pattern did not hold for the CSBMs. This is likely due to the fact that covariate effects affect both within-block and between-block tie densities, so the observed within-block and between-block tie densities are not informative in the presence of covariates. The converse is also likely true: When we condition on the covariates, classification rates improve.
We also estimated parameter recovery with the a posteriori CSBM to explore the effects of estimating group membership on covariate parameter recovery. Table 3 reports coverage rates which noticeably decrease as covariate variance increases, and this result aligns with our finding of reduced classification rates with larger variances. A decrease in parameter recovery as covariate variance increases was not observed in Section 3.1, but covariate effect parameter recovery is likely correlated with block membership recovery.
Table 3. (Simulation 2) Covariate Effect Parameter Recovery in the A Posteriori CSBM
We also notice that
3.3. Simulation 3: Effects of Subgroup Density and Number of Subgroups on Group Membership Recovery
Networks are generally sparse with subgroups being quite dense. There are situations in which the subgroups are also somewhat sparse, so we investigate what ranges of subgroup density can be used to accurately recover group membership. In this simulation, we consider networks with a variety of numbers of nodes, within-block tie probabilities, and numbers of blocks; we consider networks generated from 3-, 5-, and 10-block CSBMs; and we consider network sizes that are 10 times, 15 times, or 20 times the number of blocks. For example, we generate networks with 30, 45, and 60 nodes for a three-block CSBM. The data generating model is given as:
For each combination of network size, number of blocks, and B, we simulated 100 networks. Table 4 shows the mean classification rate and SD for block membership, and in general, increasing the number of nodes per block (i.e., increasing network size and keeping the number of blocks fixed) improves block membership estimation. We also notice that block classification rates dramatically improve, as the within-block tie probability increases from .2 to .35, and classification rates do not generally increase when
Table 4. (Simulation 3) The Average Block Membership Classification Rates and Standard Deviations
We again consider regression coefficient parameter recovery, and the patterns for covariate effect recovery given in Table 5 are less consistent with the patterns observed for block membership classification rates. For example, we notice that block estimation improves as the number of nodes per block increases, but this pattern is not consistent for covariate effects. We also saw block classification rates decline as the number of blocks increases while keeping the expected number of nodes per block constant, for example, the first row of Table 5 compares networks with an average of 10 nodes per block and we see a slight decrease in parameter recovery. But for covariate effects, this decline is not steep nor is it constant for every covariate parameter. In fact, we notice such a monotonic decline most with
Table 5. (Simulation 3) Covariate Parameter Recovery Rates
However, Table 5 indicates that covariate parameter recovery is generally quite good even in situations in which block membership estimation was less than ideal. For example, consider rows 1, 2, and 4 of Table 5. For all cells but one, block membership classification rates are less than .80, but parameter recovery rates suggest decent parameter recovery. These patterns suggest that covariate estimates may indeed be affected by extremely poor block membership, but that block membership recovery is not necessary for decent covariate estimation. Finally, we note that we did not observe the poor parameter recovery seen in Simulation 2 with the five-block network and
4. Fitting CSBMs to Education Data
To illustrate the a priori CSBM and a posteriori CSBM, we use advice-seeking network data (Hopkins, Spillane, Jakopovic, & Heaton, 2013; Spillane & Hopkins, 2013) taken from one suburban district in the Midwestern United States in 2013, and we use the pseudonym Auburn Park for the name of the district. The data include school staff surveys as well as social network data from 14 elementary schools. Staff members were asked to name the individuals to whom they seek instructional advice and information, and they were able to nominate any other staff member in the district.
The advice-seeking network is given in Figure 1, and this particular representation is simply an abstraction of the adjacency matrix. A black box in row s and column r corresponds to

Figure 1. The Auburn Park advice-seeking adjacency matrix represented visually. A black box represents a tie and a white box represents the absence of a tie. Individuals are ordered by school. Note that within-school ties are more likely than between-school ties although within-school ties are quite sparse even in some of the smaller schools.
Note also that the network is not symmetric since advice seeking is not always reciprocated. Thus, in our context, our senders are the individuals seeking advice or information and the receivers are the individuals providing advice or information. The tie then indicates an advice/information relationship from the seeker to the provider.
The network is overall quite sparse, which is unsurprising given the size of the network and the fact that most individuals have few opportunities to interact with one another because they work in different schools. Table 6 shows the number of staff in each school and several measures of density at both the network and school levels. The overall density and within-school and between-school tie densities for the full network of 389 professionals are shown along with within-school tie densities for each school. Note that we define density as the proportion of observed ties out of all possible ties which in this situation is
Table 6. Density Estimates for the Auburn Park Advice-Seeking Network: Overall Density and Within-School and Between-School Densities Indicate Teachers Interact With Teachers Within Their Own School More Than Teachers Outside Their School
Note. Within-school densities for each school are also provided to show variability by school and smaller networks tend to be more dense.
The overall tie density is very low (.01) with a noticeable difference between within-school (.11) and between-school (.002) densities. Within-school tie density also varies slightly, ranging from .09 to .20, and the higher densities are to some degree associated with smaller schools (r = −.72). Despite these low densities, there is still a substantial difference between within-school and between-school tie densities, which suggests that a CSBM is an appropriate model.
In addition, we explore two measures of node centrality: in-degree and out-degree. Out-degree is the number of ties each node sends and in-degree is the number of ties received by a node. Mean in-degree and out-degree across all 389 nodes are both 3.77, but the distributions of in-degree and out-degree are quite different. Figure 2 shows histograms for in-degree (left) and out-degree (right). In general, staff members tend to seek advice from two to six people, whereas the majority of staff are not solicited for advice at all. The staff members who do provide advice tend to provide advice to a large number of staff members.
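The two degree measures are simply the row and column sums of the sociomatrix, which also explains why their means must agree: each tie adds one to the sender's out-degree and one to the receiver's in-degree. A small Python sketch, using a hypothetical function name and a toy four-node network rather than the Auburn Park data:

```python
def degrees(Y):
    """Return (in_degree, out_degree) lists for a directed binary sociomatrix Y."""
    n = len(Y)
    out_degree = [sum(Y[i]) for i in range(n)]  # ties sent: row sums
    in_degree = [sum(Y[i][j] for i in range(n)) for j in range(n)]  # ties received: column sums
    return in_degree, out_degree


# Toy example: node 0 is sought by two others, node 2 by no one
Y = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [1, 0, 0, 0]]
in_degree, out_degree = degrees(Y)
mean_in = sum(in_degree) / len(Y)
mean_out = sum(out_degree) / len(Y)
```

Even in this toy network the means coincide while the distributions differ: every node sends exactly one tie, but the ties received are concentrated on node 0.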

Figure 2. Histograms of in-degree (left) and out-degree (right) suggest that staff members seek advice from several people, but that very few people receive requests for advice and those who do are likely to receive requests from a very large number of people.
4.1. Fitting A Priori CSBM to Auburn Park Data
We first fit an a priori CSBM to the advice network and include four individual-level indicator covariates: two are sender-related and two are receiver-related. We have two leader indicator variables: an individual is a content leader and an individual is a noncontent leader, such as a principal. For each of these two variables, we have sender covariates and receiver covariates. The covariates are then: sender is a content leader, receiver is a content leader, sender is a noncontent leader, and receiver is a noncontent leader. While we expect that content leaders are no more or less likely to seek advice, we do expect that they would be sought for advice more often than other teachers.
Furthermore, we might hypothesize that these two classes of formal leaders influence advice seeking differentially depending on whether individuals are in the same or different schools, and CSBMs allow us to examine ties across blocks. Thus, we included an indicator Z for whether two individuals are in the same block, that is, $Z_{ij} = 1$ if i and j are in the same block and 0 otherwise. The full CSBM is given as:
We consider group membership fixed to the home school for each staff member. If teachers divided their time among multiple schools, we used the school reported by the teacher as their designated school. Note that for this particular model, we only estimate the diagonal entries of the B matrix as parameters and constrain the off-diagonal elements to be fixed at .002. We also chose a fairly uninformative prior, a Beta(2, 1) distribution, for the diagonal elements. Similarly, we used a weak prior, N(0,10), for the covariate effects. To obtain an adequate posterior sample of 360 draws, we excluded a burn-in sample of 2,000 iterations and retained every 50th sample of our 20,000-step chain.
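The posterior sample size follows from the chain length, burn-in, and thinning interval: (20,000 − 2,000) / 50 = 360. The short Python sketch below makes the bookkeeping explicit. Exactly which iterations are kept after burn-in is an assumption here, since only the counts are reported.

```python
def retained_indices(n_iter, burn_in, thin):
    """Indices of the draws kept after discarding burn-in and thinning.

    Assumes the first retained draw is the one immediately after burn-in.
    """
    return list(range(burn_in, n_iter, thin))


kept = retained_indices(n_iter=20000, burn_in=2000, thin=50)
```

Here `len(kept)` is 360, matching the reported posterior sample size.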
To estimate regression coefficients in the CSBM, we use expected a posteriori (EAP) estimates along with 95% equal-tailed CIs. These estimates are given in Table 7 and can generally be interpreted in a similar way as logistic regression coefficients. As predicted, staff members sought leaders for advice or information throughout the Auburn Park district, but there is a difference between content leaders and noncontent formal leaders. Content leaders were much more likely than other staff to be sought for advice within their own school (2.10), but only slightly more likely to be sought for advice from someone at a different school (0.44), whereas noncontent formal leaders were only slightly more likely than other staff to be sought for advice within their own school (0.40) and more likely to be sought for advice from staff in other schools (0.80). Regarding seeking advice, we found that both content and noncontent formal leaders were more likely to seek advice outside of their home schools (1.16 and 2.12, respectively) but were somewhat less likely to seek advice within their home schools (
Posterior Means (EAP) and Equal-Tailed Credible Intervals From the A Priori CSBM Fit
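As a rough illustration of how such summaries can be obtained, the sketch below computes a posterior mean (EAP) and an equal-tailed interval from a list of retained draws and converts a coefficient to the odds scale. The function names and the simple order-statistic quantile rule are assumptions, the draws are synthetic, and the odds-ratio reading presumes the covariate enters additively on the logit scale.

```python
import math
import random
import statistics


def eap_and_interval(draws, level=0.95):
    """Posterior mean and equal-tailed credible interval from MCMC draws."""
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    # Simple empirical quantiles: take the order statistic at each tail.
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    return statistics.fmean(ordered), lower, upper


def odds_ratio(coefficient):
    """Multiplicative change in the odds of a tie for a one-unit covariate change."""
    return math.exp(coefficient)


# Synthetic draws for illustration only; these are NOT the article's posterior sample.
rng = random.Random(1)
synthetic_draws = [rng.gauss(2.0, 0.3) for _ in range(360)]
mean, lower, upper = eap_and_interval(synthetic_draws)
print(round(mean, 2), round(lower, 2), round(upper, 2))

# A coefficient of 2.10 (the within-school content-leader value quoted above)
# corresponds to roughly eightfold higher odds under the logit assumption.
print(round(odds_ratio(2.10), 1))
```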
4.2. Fitting a Posteriori CSBM to Auburn Park Data
When block memberships are unknown, we can use an a posteriori CSBM to estimate them. To illustrate fitting this model, we again use the Auburn Park advice network data. To explore how group membership is recovered, we now treat covariate effects as being constant across all blocks, but respective covariates are the same as before, indicators that the sender is a content leader, receiver is a content leader, sender is a noncontent leader, and receiver is a noncontent leader.
The fitted a posteriori CSBM is given as:
We first examine group membership estimates. The posterior distributions for g for the first 60 individuals in the network are shown in Figure 3. The first 30 individuals all report belonging to the same school, but their estimated group membership is unclear: the posterior distributions spread many of these individuals across two or three groups, or even more in some extreme cases. Group membership is clearer for Individuals 31–60, almost all of whom are estimated to belong to Group 13 with high probability.

Figure 3. Group membership posterior distributions for the first 60 individuals in the network. In many cases, membership in a single group is clear, but for many other individuals, group membership appears to be divided between several groups or, in some extreme cases, among all 14 groups. Note that Individuals 1–30 report belonging to the same school and Individuals 31–60 report belonging to the same school.
Our classification matrix showing the true block (school) membership for each individual versus the estimated block membership is given in Table 8. The classification rate is 62.7%, which is similar to the classification rates found in Section 3.3.
Furthermore, Table 8 reveals that some of the schools are being grouped together as one school and others are being divided. For example, individuals in Schools 3 and 9 are estimated as belonging to the same school. The same is true for Schools 5 and 12 and Schools 7, 11, and 13. Such a lack of distinction between blocks is likely due to both the low within-school density and the large number of blocks. For example, both Schools 5 and 12 have slightly lower within-school density and School 5 has relatively high across-group density. In other circumstances, it is less clear. For example, there does not appear to be a particular reason that School 7 is grouped with Schools 11 and 13. As a result of these consolidations, the model estimates that the 389 school staff belong to 10 or 11 blocks.
Classification Matrix Showing the True Block (School) Membership for Each Individual Versus the Estimated Group Membership
Note. The school size
To put these classification rates in context, we also calculated lower bounds, that is, the rates expected by chance. If block membership were assigned at random, we would expect a classification rate near .155, a value derived empirically through Monte Carlo simulation. Thus, our model is detecting a substantial amount of subgroup structure. Another metric is the proportion of pairs of nodes that are correctly assigned as belonging to the same group or to different groups. We found that 93.7% of pairs of nodes were correctly classified in this sense, compared to an expected value of 82.5% when group membership is assigned at random. Regardless of which measure we use, the a posteriori CSBM correctly estimates some of the observed subgroup structure.
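The pairwise measure can be sketched in a few lines. The code below computes the share of node pairs on which two partitions agree (same group versus different groups) and approximates its chance level by Monte Carlo. The function names, the toy labels, and the choice to draw random memberships uniformly over groups are assumptions for illustration; the randomization scheme behind the reported baselines may differ.

```python
import random
from itertools import combinations


def pair_agreement(true_labels, est_labels):
    """Share of node pairs classified consistently as same-group or different-group."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = sum(
        (true_labels[i] == true_labels[j]) == (est_labels[i] == est_labels[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def random_baseline(true_labels, n_groups, n_sims=200, seed=0):
    """Monte Carlo estimate of pair agreement under random group assignment."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sims):
        # Assumed scheme: each node is assigned to a group uniformly at random.
        random_labels = [rng.randrange(n_groups) for _ in true_labels]
        total += pair_agreement(true_labels, random_labels)
    return total / n_sims


# Toy labels for illustration only (not the Auburn Park data).
true_groups = [0, 0, 0, 1, 1, 1, 2, 2, 2]
estimated = [4, 4, 4, 7, 7, 4, 9, 9, 9]
print(pair_agreement(true_groups, estimated))
print(random_baseline(true_groups, n_groups=3))
```

Because the measure compares pairs rather than labels, it is unaffected by how the estimated groups happen to be numbered, which sidesteps the label-matching step that a node-level classification rate requires.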
Despite underwhelming classification rates, the covariate effects appear to be well recovered. To assess how estimating block membership affects the covariate effects, we also fit the analogous a priori CSBM using the same four covariates. The results from each model are given in Table 9. In general, the covariate estimates from the a posteriori CSBM are very similar to the analogous estimates from the a priori CSBM, and the 95% CIs overlap, substantially in some cases. The a posteriori CSBM 95% CIs are slightly wider than the corresponding a priori CSBM intervals, which is unsurprising given the additional parameters in the a posteriori CSBM. In addition, we compare B_mm, the diagonal entries of B, in the a priori and a posteriori CSBM fits. The mean EAP estimates for B_mm are .09 and .06 for the a priori and a posteriori CSBM fits, respectively. Although the entries of the B matrix adjusted for some of the block misclassification, we also notice that the a posteriori CSBM point estimates and corresponding intervals are slightly more positive than those of the a priori CSBM, suggesting that incorrect classification of individuals may bias covariate effects.
Regression Coefficient Estimates for Indicator Variables
5. Discussion
We presented a class of models, CSBMs, which allows researchers to model social network data in the presence of existing subgroups or blocks. Students and teachers in schools often form subgroups, either through formal organization, such as departments or schools, or through informal selection, such as working in the same area or teaching the same group of students. While models that relate individual traits to these subgroups are useful, researchers are also interested in the relationships that form or persist in the presence of this subgroup structure, and CSBMs are particularly suited for these types of research questions.
One particularly novel aspect of the CSBM framework is the ability to estimate covariate effects between subgroups, which has not been addressed in the education literature; in fact, a similar model was only recently introduced in the statistical data mining literature (Airoldi et al., 2011). Because CSBMs incorporate covariates independently of subgroup structure, they are unique in that covariates can vary both between and within blocks. In fact, understanding why individuals form ties outside of their subgroup (and with whom) is of particular interest to education researchers (Spillane, Hopkins, & Sweet, in press).
In addition to illustrating model feasibility with real network data, we conducted several simulation studies not only to further demonstrate the utility of these models in practical settings but also to show how robust these models are, specifically with respect to covariate parameter recovery. For example, we hypothesized a possible identifiability issue with estimating the B matrix, but we found that neither estimating B nor the variance of the covariate had much effect on covariate parameter recovery. We did find that parameter recovery generally worsens when group membership is estimated, especially when group membership estimation is poor, but parameters are generally recovered with decent probability, even as the variance of the covariates increases. Finally, we found that very low within-group tie density negatively impacted block membership estimation. Our exploration with a real-world data set also corroborated these findings, since covariate effects were quite similar under the a priori CSBM and the a posteriori CSBM, where group membership was not well recovered.
Regarding methodological research, there are many open areas for future work. Given the proposed models, we examined block membership recovery and covariate parameter recovery under certain conditions, but these simulations could be extended. In particular, we employed equal-sized blocks, and one open question is whether variability in relative block size affects block membership estimation. In the Auburn Park example, block sizes varied and we found that block size did not appear to be related to block recovery. In fact, the smallest school was recovered fairly well (1 node was misclassified and 7 other nodes were incorrectly estimated as belonging to this school). In our own experience, we find within-block density to be the strongest predictor of block recovery even when working with blocks very different in size. Generally, if the block has at least a few nodes and the within-block density is above a certain threshold, it will be recovered even when it sits among blocks several times larger in size.
Another area of research is block-varying covariate effects. We fit models in which covariate effects varied within and between blocks, but certainly there are situations in which we would want covariate effects to vary across all combinations of blocks. Our simulation studies were largely not affected by identifiability issues, but identifiability is not a trivial issue, and determining the circumstances in which parameter recovery is compromised is very important. Attention should also be paid to cases in which covariates are confounded with block structure. Finally, we assumed independence between block membership and covariates when generating data, but we might consider how including a relationship between block membership and covariates affects not only parameter estimation but also model fit in general.
Other models are also possible within this framework. We may be interested in introducing a mixture component and investigating covariate recovery when using a mixed membership stochastic blockmodel rather than a stochastic blockmodel. Sweet et al. (2014) introduced such a model but did not explore any of the operating characteristics or consider possible estimation issues. Other models for clustered network data, such as the latent position cluster model (Handcock, Raftery, & Tantrum, 2007), could also be considered.
The CSBM is a conditionally independent network model in that we assume the ties are independent of one another given the parameters in the model (covariate effect and block membership). In fact, one of the unique benefits to fitting network data with conditionally independent network models is that covariates, random effects, and latent structural components such as blocks or latent spaces (Hoff et al., 2002) can be combined in additive ways. Depending on the network structure, assumptions about the network nodes and ties, and specific research question, we could combine these components to build a very specific social network model to address that research question. There is currently some work in this area (Dabbs, Adhikari, et al., 2014; Dabbs, Junker, et al., 2014) and we believe this to be an area of active research in the future.
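To make the conditional independence assumption concrete, the sketch below simulates a small directed network in which each tie is an independent Bernoulli draw given block membership and a covariate. The additive form used here, logit(p_ij) = logit(B[g_i][g_j]) + beta * x_ij, is an assumption chosen to match the additive description above, and all names and values are hypothetical; the model specification given earlier in the article takes precedence.

```python
import math
import random


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def tie_probability(block_prob, x_ij, beta):
    # Assumed additive form: block and covariate effects add on the logit scale.
    return inv_logit(logit(block_prob) + beta * x_ij)


def simulate_ties(groups, B, x, beta, seed=0):
    """Draw a directed adjacency matrix with ties independent given (groups, x)."""
    rng = random.Random(seed)
    n = len(groups)
    y = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-ties
            p = tie_probability(B[groups[i]][groups[j]], x[i][j], beta)
            y[i][j] = 1 if rng.random() < p else 0
    return y


# Hypothetical example: two blocks of three nodes, dense within and sparse between.
groups = [0, 0, 0, 1, 1, 1]
B = [[0.30, 0.02], [0.02, 0.30]]
leader = [1, 0, 0, 0, 1, 0]
# Pair-level covariate: 1 if the receiver j is a (hypothetical) leader.
x = [[leader[j] for j in range(6)] for _ in range(6)]
y = simulate_ties(groups, B, x, beta=1.0)
for row in y:
    print(row)
```

Because each tie depends only on its own linear predictor, further components such as random effects or latent positions could be added as extra terms inside `tie_probability` without changing the rest of the simulation.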
Modeling social network ties as independent is a poor assumption, as these relationships rarely form or persist in isolation. In the case of the stochastic blockmodel, we believe that incorporating relevant covariates into the model will improve the model in two important ways. First, combining covariates and a stochastic blockmodel should improve the power of the model to detect both covariate effects and block membership, and this work is currently being explored in Dabbs, Adhikari, et al. (2014). Second and perhaps more importantly, adding relevant covariates makes the assumptions of the model more believable. Network ties, particularly in a large network such as Auburn Park (Section 4), are unlikely to be independent conditional on only block membership; it is more accurate to model ties as independent given both block membership and several important covariates. In fact, Spillane et al. (in press) use sender, receiver, and edge-level covariates in their CSBM.
Thus, CSBMs have great potential to inform future educational research involving relational or network data in addition to methodological research on social networks. Education researchers have a class of flexible models that can both cluster individuals and provide information about relations within and between these clusters. For example, Spillane, Hopkins, and Sweet (in press) fit CSBMs to illustrate differences in covariate effects within and across schools with regard to how advice and information are shared throughout the district. They found that subject-specific leaders tend to seek advice outside of their own schools but rarely within their own schools, even though they are much more likely to provide advice than other staff. Understanding how social resources are shared within a school district then informs district policies and decisions. Since school leaders tend to obtain information outside of their schools, district leaders may encourage organizational routines to enable these interactions. Finally, these models are especially easy to understand because covariates are incorporated in an additive way, so the covariate effects can be interpreted in a similar way as generalized linear model coefficients.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was in part supported by the Institute for Education Sciences under Award #R305D12004.