Abstract
The relational event model (REM) solves a problem for organizational researchers who have access to sequences of time-stamped interactions. It enables them to estimate statistical models without collapsing the data into cross-sectional panels, a step that removes timing and sequence information. However, there is little guidance in the extant literature regarding issues that may affect the REM’s power, precision, and accuracy: How many events or actors are needed? How large should the risk set be? How should statistics be scaled? To gain insights into these issues, we conduct a series of experiments using simulated sequences of relational events under different conditions and using different sampling and scaling strategies. We also provide an empirical example using email communications in a real-life context. Our results indicate that, in most cases, the power and precision of REMs are good, making the REM a strong explanatory model. However, the REM suffers from issues of accuracy that can be severe in certain cases, making it a poor predictive model. We provide a set of practical recommendations to guide researchers’ use of REMs in organizational research.
Research in organizational networks is starting to undergo a profound transformation. The increasing availability of digital data about human behavior in and across organizations provides opportunities to study organizational phenomena at a larger scale and with finer granularity than was ever possible before (Lazer et al., 2009; Wenzel & Van Quaquebeke, 2018). A growing literature is leveraging these digital trace data to understand organizational phenomena, such as teamwork and group dynamics (Leenders et al., 2016; Onnela et al., 2014; Schecter et al., 2018), intraorganizational network dynamics (e.g., Aral & Van Alstyne, 2011; Goldberg et al., 2015; Kleinbaum, 2012; Kossinets & Watts, 2006; Liu et al., 2016), or interorganizational processes (Ng, 2017; Valeeva et al., 2020). The increasing availability of digital data has been accompanied by the development of suitable statistical frameworks (see Golder & Macy, 2014; Tay et al., 2018).
The relational event model (REM—Butts, 2008) is a statistical framework that has made important inroads in social inquiry (Brandenberger, 2019; Kitts et al., 2017; Vu et al., 2015), and in organizational contexts in particular (e.g., Leenders et al., 2016; Pilny et al., 2016; Quintane & Carnabuci, 2016). Compared to other inferential statistical frameworks such as exponential random graph models (ERGMs—Lusher et al., 2013) or stochastic actor-oriented models (SAOMs—Kalish, 2020; Snijders, 1996), REM is specifically designed to analyze sequences of time-stamped interactions between social actors without needing a priori aggregation. That is, REM can estimate a full sequence of relational events without collapsing the sequence into a cross-sectional network (Butts, 2008). This feature enables retention of the sequence and timing of relational events (Quintane et al., 2013), which are critical to examining network dynamics. In a similar way to ERGM and SAOM, REM enables researchers to specify and compare the differential effect of social mechanisms (e.g., reciprocity, transitivity; see, e.g., Wimmer & Lewis, 2010, for ERGM, Tröster et al., 2019, for SAOM, and Quintane et al., 2013, for REM). However, REM also enables researchers to examine the temporal dimension of these social processes (see Kitts et al., 2017; Quintane & Carnabuci, 2016). As such, REM opens opportunities for organizational researchers to gain new insights into questions around network stability and change, the timing of social and behavioral processes, and more generally network dynamics.
However, researchers’ ability to test hypotheses regarding the prevalence of these temporal social processes in a given empirical context rests on the reliability of the REM’s parameter estimates, which is based on the model’s power, accuracy, and precision. To achieve sufficient power, accuracy, and precision, there are three factors researchers must consider, each of which is based on decisions that researchers make during data collection, calculation of statistics, and model implementation. These factors are (a) the number of events and actors in a sequence, (b) the sampling of the risk set, and (c) scaling strategies for the statistics. When accessing sequences of relational events, researchers have to decide at a minimum how many actors and how many events they need in order to ensure sufficient power, accuracy, and precision of their REM. Furthermore, researchers who have obtained access to long event sequences with many actors may still need to decide how many events they want to sample from the risk set in order to make its computation feasible (see Lerner & Lomi, 2020; Vu et al., 2015). Finally, long event sequences also require researchers to use a scaling strategy to make their count network statistics comparable over time (Brandes et al., 2009; Kitts et al., 2017; Quintane et al., 2014). These issues are important because misspecifications might lead to biased estimates, skewed standard errors, or incorrect significance levels, each of which could affect the substantive results of research projects (Block et al., 2018; Stadtfeld et al., 2018; Wang et al., 2014). However, the extant literature offers little guidance to researchers to ensure appropriate power, accuracy, and precision of the results provided by REM.
In this article, we aim to provide concrete recommendations to researchers regarding the number of events and actors, as well as sampling and scaling strategies to ensure appropriate power, accuracy, and precision of REMs. To do so, we conduct simulation experiments to understand REMs’ power, accuracy, and precision on five endogenous network statistics commonly used in statistical modeling of network data (inertia, reciprocity, activity, popularity and transitive closure). Our experiments vary the characteristics of the event sequences being modeled (number of actors and number of events) as well as the effect size of the endogenous statistics to establish lower bounds on the data required to conduct reliable organizational studies with REM. Using the simulations, we also examine the effect of choosing different sample sizes and different methods of scaling statistics on power, precision, and accuracy. Furthermore, we augment the simulation study with a real-life example using a dataset of over 5,000 emails among 33 employees in an organization in Australia. The case study enables us to expand the analyses done in the simulation to a real empirical context.
Our results show that REMs are versatile models that can be applied to event sequences with a wide range of characteristics. We found that the power and precision of REMs are generally good, even when event sequences have relatively few actors. Based on our simulations, we propose a set of thresholds for the minimum number of events required given a specific number of actors in order to reach acceptable power. Similarly, there is little loss of power or precision when sampling as few as five potential events for each real event (when the sequence is over 5,000 events). However, our results suggest that the accuracy of the model is generally problematic and is negatively affected by sampling and proportional scaling. Table 8 provides a summary of our results.
The Relational Event Model
The relational event framework (Butts, 2008; Butts & Marcum, 2017) was developed as a way to model sequences of relational events. A relational event is defined as “a discrete event generated by a social actor (the ‘sender’) and directed toward one or more targets (the ‘receivers’), who may or may not be actors themselves” (Butts, 2008, p. 159). REMs predict the occurrence of the next event in a temporally distributed sequence of events (Marcum & Butts, 2015). This means that, in REM, the dependent variable is the occurrence of the next event in a sequence, which is modeled as a function of the sequence of past events. Butts’s (2008) article provides an example using REM to describe radio communications during the World Trade Center disaster. REM predicts who communicates via radio with whom, based on the history of past communications. More specifically, assuming that the next event in the sequence of radio communications is A reaching out to B, the REM predicts the likelihood of this event (A communicating with B) occurring next, based on predefined factors such as the history of past communications from A to B (demonstrating inertia in communications) or the history of communications from B to A (demonstrating reciprocity: people tend to respond when addressed).
REM allows researchers to examine how individual behavior is shaped by social structure, characterized through a set of mechanisms that operate over time. For example, do hospitals engage in the social norm of reciprocity when exchanging patients, instead of sending them to the hospital that can offer the best service for the patient? Using REM, Kitts et al. (2017) examined reciprocity in over 4,000 patient exchanges between 21 hospitals in a region of Italy, spanning 5 years. They show that hospitals do reciprocate patient exchanges over time in ways that are not explained by the availability of beds, the quality of service, or the specialization of hospitals.
REM departs from a typical assumption of other network statistical frameworks, such as ERGMs or SAOMs, for which the issue of dependence between observations is critical (Kalish, 2020; Lusher et al., 2013). In REM, each event is considered to be conditionally independent of all other events in the sequence, given the realized history of past events. In other words, while REM does not assume dependence between observations across actors, it does assume temporal dependence: each event occurs conditionally on the events that preceded it. This makes the framework appropriate for contexts in which past behavior influences future behavior (i.e., most organizational contexts), and it also enables researchers to examine how social mechanisms operate over time. Continuing the same example, Kitts et al. (2017) distinguish between two forms of reciprocity between hospitals: organizational embedding and resource dependence. They demonstrate that these forms of reciprocity operate over different time horizons; embedding operates over longer-term histories of interactions (i.e., one year), while dependence operates over shorter-term histories of interactions (i.e., one month).
Fitting a REM requires the estimation of the probability that a particular sequence of events transpired as a function of exogenous and endogenous factors. To capture this probability, each event is given a rate, or a frequency of occurrence; events that are common have high rates, and events that are rare have low rates. The rate for each dyad is a function of statistics corresponding to social processes such as inertia or reciprocity, as well as parameters that represent the sign and strength of the statistics’ effects. For instance, Schecter et al. (2018) found that group members searching for information in military simulations tended to exhibit inertia in communication, but avoided preferential attachment (that is, communicating with individuals that have both high indegree and outdegree). More specifically, inertia means that the more messages individual A sent to individual B in the past, the higher the likelihood of A sending another message to B in the future. By contrast, for preferential attachment, the parameter is negative, indicating that a future message from A to B is less likely if B has both sent messages to and received messages from many people in the past.
Extending this logic, the REM models every potential pairing of individuals in terms of rates, which are functions of network statistics and parameters. The relative likelihood of an event (link between two nodes) can then be calculated by comparing the rate values for all possible dyads (Butts, 2008). Each time an event occurs, the statistics are updated, and the comparison of dyads is repeated. As the sequence continues to unfold, the rates of events are continuously updated to reflect the new network structure. In this way, REM captures the probability of the full sequence by tuning the rate parameters and maximizing the likelihood of each observed event.
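The comparison of rates across dyads can be illustrated with a minimal sketch. It assumes a log-linear rate (the exponential of a weighted sum of statistics), which is a common specification but is not spelled out in the passage above; the function name, the statistics, and the parameter values are all hypothetical.

```python
import numpy as np

def event_probabilities(stats, theta):
    """Return the probability that each candidate dyad is the next event.

    stats: array of shape (n_dyads, n_statistics), one row per candidate dyad.
    theta: array of shape (n_statistics,), the effect parameters.
    Assumption: log-linear rates, rate = exp(stats @ theta).
    """
    log_rates = stats @ theta
    log_rates = log_rates - log_rates.max()  # stabilise the exponentials
    rates = np.exp(log_rates)
    return rates / rates.sum()               # relative likelihood of each dyad

# Hypothetical example: three dyads described by (inertia, reciprocity) counts.
stats = np.array([[3.0, 1.0],   # A -> B
                  [0.0, 2.0],   # B -> A
                  [0.0, 0.0]])  # A -> C
theta = np.array([0.5, 0.2])    # illustrative parameter values
probs = event_probabilities(stats, theta)
print(probs)
```

With positive parameters, the dyad with the richest history (A to B) receives the highest probability; after each new event the statistics would be recomputed and the comparison repeated.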
Power, Accuracy, and Precision
The objective of using statistical models such as REM to conduct hypothesis testing is to obtain an estimate,
During hypothesis testing, two types of errors are possible: false positives (Type I), where the model indicates a statistically significant estimate
Possible Outcomes of Relational Event Model.
Note: This table is adapted from Wang et al. (2014, p. 90).
While achieving strong statistical power is important for hypothesis testing, obtaining high accuracy and precision is critical for the interpretation of estimated effects (Block et al., 2018). The accuracy of an estimate refers to the discrepancy between the fitted value
Three Main Issues
Issues of Size
Statistical power, accuracy, and precision are contingent on three variables: the significance level, the sample size, and the population effect size (Cohen, 1992). The significance level refers to the probability of a Type I error, sample size refers to the number of observations used to test the hypothesis, and effect size represents the magnitude of the underlying measure. Generally speaking, if one can identify those three variables, then power can be determined. For REM, the significance level and effect size are straightforward to specify. However, the number of events (e.g., the number of emails in an email dataset), even in conjunction with the other two factors, is not sufficient to determine power. We also need some measure of the size of the risk set, which is a function of the number of actors.
We consider that a sequence of relational events is characterized by the number of events (E) in the sequence and by the number of actors (N) that are involved in the sequence at any point in time. 1 These two characteristics (number of events and number of actors) together have the potential to affect power, accuracy, and precision in a REM. Additionally, these two characteristics affect the size of the risk set. Estimating a REM requires the construction of a risk set for each event, which consists of all the potential events that could have occurred instead of each event being observed in the sequence. 2
Hence, power, accuracy, and precision are affected by the number of events and the number of actors in the event sequence. While it is intuitive that a smaller number of actors and a smaller number of events will result in lower power, precision, and accuracy, we do not know what the lower bounds are for the number of events and the number of actors. Furthermore, we do not know how different combinations of the number of events and the number of actors together affect the lower bound (for example, a small number of actors with a large number of events, or a small number of events with a large number of actors). Finally, the number of events affects the size of count statistics, which makes scaling important.
Issues of Risk Set and Sampling
Estimation of the REM requires the computation of statistics for a risk set (Butts, 2008). The risk set is determined for each event in the sequence (and it is therefore a function of the number of events in the sequence). For each event, the risk set is composed of all potential events that could have occurred at the same time as the observed event. Because the events are dyadic, this means that the risk set contains all the potential dyads that could have existed instead of the event that actually occurred (hence the risk set is also a function of the number of actors). For instance, in a sequence with three actors (A, B, and C), we observe an interaction from A to B. The events that could have occurred instead of event AB are all other dyadic combinations (BA, AC, CA, BC, and CB) between the three actors. In general, the number of possible directed dyads is the number of actors N multiplied by N−1 (i.e., we do not allow events from an actor to him or herself), which gives six dyads for three actors. Because we also remove the observed event from the risk set, the size of the risk set for each observation is N*(N−1)−1, or five events for three actors. The total size of the risk set for a full sequence is therefore E*(N*(N−1)−1) potential events.
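The three-actor example can be reproduced with a short sketch. The function name is hypothetical; the enumeration simply lists every ordered pair of distinct actors and drops the observed event.

```python
from itertools import permutations

def risk_set(actors, observed):
    """All directed dyads (sender, receiver) that could have occurred instead of `observed`."""
    return [dyad for dyad in permutations(actors, 2) if dyad != observed]

actors = ["A", "B", "C"]
alternatives = risk_set(actors, observed=("A", "B"))
print(alternatives)

# The count matches N * (N - 1) - 1 for N = 3 actors.
n = len(actors)
print(len(alternatives), n * (n - 1) - 1)
```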
Identifying the size of the risk set is critical to determining the statistical power, accuracy, and precision of the REM because every element of the risk set—that is, potential dyadic event—is essentially an observation of an event that did not occur. This problem is analogous to logistic regression, where populations associated with both responses (1s and 0s) must be accounted for when computing power (Hsieh et al., 1998). As the size of the risk set increases for each event (i.e., the number of actors increases), the ratio of observed to unobserved events will become very small. For example, in a network with 100 actors, for each observed event there are N*(N−1)−1 = 9,899 unobserved events. Thus, if we have a short sequence but many actors, identifying the mechanisms that contribute to events occurring becomes increasingly difficult. As a result, as the size of the risk set increases (N increases), we likely need the number of events (E) to increase as well.
A key problem with the computation of the risk set is that the number of potential events in a dyadic dataset increases in the order of the square of the number of actors in the dataset. This problem is compounded by the potentially large number of observations needed to identify meaningful effects. The computation of statistics for each potential event in the risk set becomes quickly intractable even with sequences that have a reasonable number of actors. For instance, in an organization of 129 employees that exchange 75,308 emails (see Quintane & Carnabuci, 2016), the size of the risk set for each event is (129*128−1) = 16,511 potential events, and the total size of the risk set across the full sequence is 1,243,410,388 potential events. Because we would have to calculate the values of statistics, such as reciprocity or transitivity, for each potential event, this computational problem becomes unmanageable when the number of actors and the number of events become too large (Butts, 2008; Vu et al., 2015).
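A quick calculation shows how fast the risk set grows with the number of actors and confirms the figures for the email example. The helper name is hypothetical; the intermediate actor counts are chosen only for illustration.

```python
def risk_set_size(n_actors):
    """Potential events per observed event: N * (N - 1) - 1."""
    return n_actors * (n_actors - 1) - 1

# Quadratic growth in the number of actors.
for n in (10, 50, 100, 129):
    print(n, risk_set_size(n))

# Email example: 129 employees exchanging 75,308 emails.
per_event = risk_set_size(129)
total = 75_308 * per_event
print(per_event, total)  # 16511 1243410388
```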
To remedy this issue, random samples of the risk set may be drawn (Lerner & Lomi, 2020; Vu et al., 2015). This means that researchers randomly select a few (a fixed number such as 5, 10, 20 or a percentage such as 10%) potential events out of the universe of potential events and let only this random sample constitute the risk set. While this strategy is very effective in reducing the size of the risk set, we have no understanding of its effect on the power, accuracy, and precision of the model.
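One way to draw such a sample without enumerating the full risk set is rejection sampling of ordered actor pairs. This is a minimal sketch under the assumption of uniform sampling without replacement; the function name, actor identifiers, and seed are hypothetical, and published implementations may use other sampling designs.

```python
import random

def sample_risk_set(actors, observed, k, rng):
    """Draw k distinct non-observed dyads uniformly at random."""
    max_size = len(actors) * (len(actors) - 1) - 1
    k = min(k, max_size)                          # cannot sample more than exists
    sampled = set()
    while len(sampled) < k:
        sender, receiver = rng.sample(actors, 2)  # two distinct actors, ordered
        if (sender, receiver) != observed:
            sampled.add((sender, receiver))       # duplicates are ignored by the set
    return sorted(sampled)

rng = random.Random(0)          # fixed seed so the sketch is reproducible
actors = list(range(100))       # hypothetical actor ids
controls = sample_risk_set(actors, observed=(0, 1), k=5, rng=rng)
print(controls)
```

With 100 actors, the statistics would then be computed for the observed event and these 5 sampled alternatives rather than for all 9,899 potential events.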
Issues of Scaling
The longitudinal dimension of digital datasets requires researchers to use some form of scaling to make network statistics comparable over time (Brandes et al., 2009; Kitts et al., 2017; Quintane et al., 2014). This is because when sequences are long, the values of network statistics based on counts (i.e., most endogenous network statistics) can become very large, resulting in poor estimation of the REM (DuBois et al., 2013). There are two primary reasons for this problem; first, variation across dyads and, second, variation across statistics. Assume we have an event sequence with 1,000 events between three actors A, B, and C. To calculate inertia for a given dyad (say, A to B), we would count the number of events in the sequence in which A interacted with B. For example, if our dependent variable was the fifth event and we are using inertia to predict that event, then we would count the instances of A interacting with B in the first four events. Clearly, the value of inertia would only range from 0 to 4. Similarly, calculating inertia at event 1,000 means counting the number of events from A to B that have occurred in the previous 999 events. Consequently, inertia could take any value from 0 to 999. Hence, the value of the statistic for inertia will be systematically larger at the end of the dataset than at the beginning of the dataset. As a result, the differences across dyads will also be amplified due to computation of statistics, rather than some underlying temporal process. Likewise, for long sequences the values of different statistics may diverge in magnitude. For instance, actor-level measures like activity or preferential attachment will grow much faster than dyadic measures like inertia or reciprocity. These differences arise purely due to the nature of the counting process used to compute statistics, leading to inconsistent measurement between different portions of the sequence.
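The growth of an unscaled count statistic can be seen in a small sketch. It uses a hypothetical extreme sequence in which A always addresses B, so that inertia reaches its upper bound at every point; the function name is an assumption.

```python
from collections import Counter

def inertia_before(events, index, sender, receiver):
    """Count prior events from sender to receiver, using events[0:index] as the history."""
    return Counter(events[:index])[(sender, receiver)]

# Hypothetical sequence of 1,000 events, all from A to B.
events = [("A", "B")] * 1000
early = inertia_before(events, 4, "A", "B")     # history for the fifth event
late = inertia_before(events, 999, "A", "B")    # history for event 1,000
print(early, late)  # 4 999
```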
Researchers have used three main forms of scaling to control the magnitude of network statistics over time (Brandes et al., 2009; Kitts et al., 2017; Quintane et al., 2014): proportional, exponential decay, and sliding window. Proportional scaling involves dividing the statistic for each dyad by the sum of the statistic across dyads, yielding values between zero and one (Butts, 2008; Quintane et al., 2014). For instance, the scaled version of inertia would be
Simulation Study
We examine the power, accuracy, and precision of the REM using different combinations of sequence characteristics (number of events and number of actors). Because we want to understand the extent to which the REM can recover a given effect size at a certain level of significance depending on characteristics of the sequence, our first analytical strategy is to do a simulation study. This enables us to specify explicitly the effect size for each variable of interest and to isolate the parameter estimate of each variable, which might be confounded in a real dataset. We are also able to vary the number of events and the number of actors in the sequence as well as the sampling and the scaling strategies to systematically examine the effect of combinations of these factors on power, precision, and accuracy.
Simulation Process
The simulation process is composed of two sequential steps. In the first step, we generate a set of synthetic event sequences with varying characteristics. Because we generate these synthetic sequences using values that we specify for a given set of parameters, we know exactly what the true value of
Step 1: Generating Sequences and Statistics
We generated a series of sequences to assess the statistical power, accuracy, and precision of the REM under a variety of conditions. When generating the sequences, we varied the statistic used to generate the sequences and the magnitude of the effect. 3 We tested each combination for a variety of sequence lengths and risk set sizes. For every combination, we generated 50 sequences and fit a REM to each sequence.
Generating one sequence
A synthetic sequence is created by iteratively drawing events from a probability distribution based on the sequence up to that point. The probability of observing a particular relational event
In the above expression,
where P is the number of statistics, and Initialization: Set the sequence length E, risk set size Compute Draw a new event Set
In the following sections, we detail each of the components that need to be specified in order to generate the sequence.
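The iterative generation process can be sketched as a simple loop. This is an illustration under stated assumptions rather than the procedure used in the study: it assumes log-linear rates, a single inertia effect with raw (unscaled) counts, and a risk set containing every ordered dyad. The function name and parameter values are hypothetical.

```python
import numpy as np

def simulate_sequence(n_actors, n_events, theta, seed=0):
    """Simulate a relational event sequence driven by a single inertia effect.

    Assumptions (not from the source): rates are exp(theta * inertia),
    counts are unscaled, and every ordered dyad is always at risk.
    """
    rng = np.random.default_rng(seed)
    dyads = [(i, j) for i in range(n_actors) for j in range(n_actors) if i != j]
    inertia = np.zeros(len(dyads))                # past i -> j counts, one per dyad
    sequence = []
    for _ in range(n_events):
        logits = theta * inertia
        probs = np.exp(logits - logits.max())     # compute rates, stabilised
        probs /= probs.sum()
        chosen = rng.choice(len(dyads), p=probs)  # draw the next event
        sequence.append(dyads[chosen])
        inertia[chosen] += 1                      # update the statistic
    return sequence

sequence = simulate_sequence(n_actors=5, n_events=200, theta=0.2)
print(sequence[:5])
```

Because the true parameter is set by hand, a model fitted to `sequence` can be compared directly against it.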
Sufficient statistics
In each synthetic sequence, we only specify one statistic at a time out of five statistics in total. Put another way, the parameter
For each pair of actors in the network, we would count the number of instances in the prior sequence
The five statistics that we chose—inertia, reciprocity, activity, popularity, and transitivity—are commonly used in studies utilizing REMs (Brandes et al., 2009; Butts, 2008; Quintane et al., 2013) and statistical modeling of networks more generally (Lusher et al., 2013). Each of these statistics describes a type of sequential behavior that individuals in an organizational setting may exhibit (Pilny et al., 2016; Quintane & Carnabuci, 2016; Schecter et al., 2018). We calculate the five statistics with respect to the prior sequence of events, as described above. In Table 2, we present these statistics.
Statistic Definitions and Formulae.
Note: i is the sender, j is the receiver, and k is a third party. Arrows indicate direction of events. Past interactions are represented as solid arrows, and a future event is represented as a dashed arrow.
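The five statistics can be computed from a list of past events as in the sketch below. The definitions used here are common raw-count versions (in particular, transitivity is counted as two-paths through third parties using the minimum of the two legs) and are assumptions for illustration; they may differ in detail from the formulae in Table 2. The function name and the example history are hypothetical.

```python
def endogenous_statistics(history, i, j):
    """Raw count statistics for a candidate event i -> j given past (sender, receiver) events."""
    sent = {}
    for s, r in history:
        sent[(s, r)] = sent.get((s, r), 0) + 1
    actors = {a for event in history for a in event}
    return {
        "inertia": sent.get((i, j), 0),                                # past i -> j
        "reciprocity": sent.get((j, i), 0),                            # past j -> i
        "activity": sum(c for (s, _), c in sent.items() if s == i),    # events sent by i
        "popularity": sum(c for (_, r), c in sent.items() if r == j),  # events received by j
        # two-paths i -> k -> j through third parties k (assumed definition)
        "transitivity": sum(min(sent.get((i, k), 0), sent.get((k, j), 0))
                            for k in actors - {i, j}),
    }

history = [("A", "B"), ("B", "A"), ("A", "C"), ("C", "B"), ("A", "B")]
result = endogenous_statistics(history, "A", "B")
print(result)
```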
Inertia, also referred to as persistence (Butts, 2008), is a measure of how often events occur within the same dyad over time. In other words, if i sends more events to j, i will become more (less) likely to send subsequent events to j. Reciprocity is a measure of how often dyad
Effect sizes
To ensure consistency across simulations, we use standard values for the parameters
Sequence characteristics
Our first objective is to evaluate the REM across various sequence characteristics (size and length) in conjunction with various effects and effect sizes. We generated sequences with 5, 10, 20, 40, and 50 actors; the risk set contained all possible dyads, leading to a size of
Step 2: Estimating REMs
Once we generate a synthetic event sequence, we need to apply REM to the data to estimate the parameters
The expression above is the product of probabilities for each event in the sequence, with the rates, parameters, and risk set equivalent to those described previously. Parameter estimates
Outcomes
Across all conditions, we determined the statistical power by counting the number of times the fitted parameter for the relevant statistic
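The three outcome measures can be summarized from repeated simulations as in the following sketch. It assumes a two-sided Wald test at the 5% level for significance, measures accuracy as the average deviation of the estimates from the true parameter, and measures precision as the average standard error. The function name, the test, and the example values are assumptions for illustration.

```python
import numpy as np

def summarize_replications(estimates, std_errors, true_value, z_crit=1.96):
    """Summarise fitted parameters from repeated simulations of one condition."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    significant = np.abs(estimates / std_errors) > z_crit  # assumed two-sided Wald test
    return {
        "power": float(significant.mean()),               # share of runs detecting the effect
        "bias": float(estimates.mean() - true_value),     # accuracy: deviation from truth
        "mean_se": float(std_errors.mean()),              # precision: average standard error
    }

# Hypothetical estimates from four replications with a true parameter of 0.5.
summary = summarize_replications([0.45, 0.40, 0.55, 0.10],
                                 [0.10, 0.15, 0.20, 0.20],
                                 true_value=0.5)
print(summary)
```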
Scaling
As mentioned earlier, three main forms of scaling have been used in the existing literature: proportional scaling, exponential decay, and sliding window. With the proportional scaling approach, each sufficient statistic is divided by some relevant value to ensure each statistic varies between 0 and 1 (Quintane et al., 2014). For instance, inertia captures the frequency with which actor i sends messages to actor j up to time t. Inertia can thus be scaled by dividing
The next method, exponential decay, involves iteratively reducing the weight of messages that have occurred in the past. With the exponential decay method, each dyad will now have a weight,
Here,
Finally, the sliding window scaling approach involves calculating the statistics using their typical formulae (see Table 2), but only considering events that occurred during a specified time period. We define a dyadic weight
In Table 3 we give the formulae for our five variables under various scaling methods. We tested the effect of scaling on sequences with fixed characteristics. Each network was composed of 10 nodes, and the sequence length was 500 events. For proportional scaling we used the specification described in Table 3. For exponential decay, we applied an iterative weighting scheme with a discount factor of
Formulae for Statistics Under Various Scales.
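The three scaling approaches can be contrasted on a toy sequence. This sketch makes several assumptions that are not taken from the study: proportional scaling divides by the total across dyads, exponential decay multiplies all existing weights by a discount factor before adding each new event, and the sliding window is defined over the most recent events rather than over clock time. The function names, the discount of 0.5, and the window of 3 are hypothetical.

```python
import numpy as np

def proportional(counts):
    """Divide each dyad's count by the total across dyads, giving values in [0, 1]."""
    total = counts.sum()
    return counts / total if total > 0 else counts.astype(float)

def decayed_counts(event_dyads, n_dyads, discount):
    """Shrink all weights by `discount` at each step, then add the new event."""
    weights = np.zeros(n_dyads)
    for d in event_dyads:
        weights *= discount
        weights[d] += 1.0
    return weights

def window_counts(event_dyads, n_dyads, window):
    """Counts restricted to the most recent `window` events (window >= 1)."""
    return np.bincount(event_dyads[-window:], minlength=n_dyads)

event_dyads = [0, 0, 1, 0, 2, 2]             # hypothetical dyad indices, oldest first
raw = np.bincount(event_dyads, minlength=3)  # unscaled counts per dyad
print(raw, proportional(raw))
print(decayed_counts(event_dyads, 3, discount=0.5))
print(window_counts(event_dyads, 3, window=3))
```

In the raw counts dyad 0 dominates, whereas under decay or a short window the recently active dyad 2 carries the most weight.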
Sampling
While scaling remedies the issue of long sequences, sequences including a large number of actors pose computational issues. Specifically, when the risk set contains numerous potential events, the denominator of the likelihood function becomes difficult to compute directly (Butts, 2008). This issue is akin to the computational problems of ERGMs (Lusher et al., 2013). Following Butts (2008) and Vu et al. (2015), we approximate the denominator by randomly sampling from the risk set and only computing the statistics for those samples. Vu et al. (2015) suggest the number may be as few as 5 to 10 samples from the risk set. In a more recent study of large relational event networks, Lerner and Lomi (2020) find similar support for small samples from the risk set. To test the effect of sampling, we generated sequences with 100 actors (risk set of 9,899 dyads) and sequence lengths of 5,000 and 10,000. 7 We varied the number of samples taken from the risk set; we tested sample sizes of 1, 5, 10, 20, 30, 50, 100, and 200. All five statistics and all three effect sizes were tested for every combination of number of events and sample size.
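The sampled approximation of the denominator can be sketched as follows. The sketch assumes log-linear rates and replaces the sum over the full risk set with a sum over the observed event plus its sampled alternatives; the function name and array layout are hypothetical, and sampling weights or other corrections that a full implementation might apply are omitted.

```python
import numpy as np

def sampled_log_likelihood(theta, observed_stats, sampled_stats):
    """Log-likelihood with the denominator approximated by sampled non-events.

    observed_stats: (n_events, n_statistics) statistics of each observed event.
    sampled_stats:  (n_events, n_samples, n_statistics) statistics of sampled alternatives.
    """
    observed_score = observed_stats @ theta    # shape (n_events,)
    sampled_score = sampled_stats @ theta      # shape (n_events, n_samples)
    all_scores = np.concatenate([observed_score[:, None], sampled_score], axis=1)
    # log of the summed rates over the observed event and its sampled alternatives
    peak = all_scores.max(axis=1, keepdims=True)
    log_denominator = peak[:, 0] + np.log(np.exp(all_scores - peak).sum(axis=1))
    return float((observed_score - log_denominator).sum())
```

When all parameters are zero, every candidate is equally likely, so each event contributes the log of one over the number of candidates (the observed event plus the samples), which offers a simple sanity check.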
Simulation Results
Sequence Characteristics
We used a first set of simulations to determine lower bounds for the number of events and number of actors that would lead to sufficient power, accuracy, and precision across different effects and effect sizes. Figure 1 provides a partial summary of our findings; because of the large number of scenarios tested, we report the full results in Online Appendix A.

Results for standard relational event model. Note: Power, accuracy (bias), and precision (standard errors) for inertia, reciprocity, activity, popularity, and transitivity. Horizontal axis is the number of events E. All values are averages across 50 simulations.
In each figure, the horizontal axis represents the number of events E (from 0 to 1,000). For purposes of illustration, event sequences with 10 actors (N = 10) are considered small, while event sequences with 40 actors (N = 40) are considered large. Only results for small and large effect sizes are included because moderate effect sizes consistently fall in the middle for all outcome measures.
Power
From Figure 1 we observe that there are a few general trends in terms of statistical power. More events, more actors, and bigger effect sizes all lead to greater power, regardless of the statistic generating the sequence. However, there is some variability across different statistics. In particular, inertia and reciprocity are relatively hard to detect (∼50% power) when the effect size is small and there are many actors. By contrast, the REM detects activity and popularity with high power for sequences with relatively low number of events regardless of the number of actors. Overall, when the number of actors is small, number of events is small, and/or the effect sizes are small, we cannot be fully confident that we will detect an effect. However, increasing any of these variables improves statistical power.
To summarize our analyses and provide recommendations regarding sufficient N and E, we follow Wang et al. (2014) and fit a linear regression model to our simulated data. We regressed our estimates of power on
Regression Equations for Predicting Power.
Note: OLS regression of estimated power on input variables. Standard errors in parentheses.
*p < .01. **p < .001.

Sequence length (E) thresholds across actors (N), effects, and effect sizes. Note: Values are estimated from regression equations in Table 4. Threshold is to achieve power of 0.80 at 95% confidence level.
Our models in Table 4 support the high-level trends evident in our figures; power is significantly enhanced when
Applying the regression results, in Figure 2 we provide concrete recommendations on the number of observations required to achieve good power for a given number of actors and a specific effect. The figure answers questions of the following kind: to achieve 80% power at a 95% confidence level with 10 actors and a medium effect size, what is the minimum sequence length that should be considered? For transitivity, Figure 2 gives the answer: 217 events. Alternatively, the results can be interpreted as the number of events per actor required to detect an effect. For instance, with N = 5 actors, Figure 2 indicates that 520 events are necessary to detect a small inertia effect with 80% power. This threshold corresponds to 104 events per actor (E/N = 520/5). In fact, we find that this is the highest number of events per actor required according to our regression models. Thus, a conservative rule of thumb for determining an appropriate number of events for a given number of actors is to collect at least 100 events per actor. We should note, however, that these results are likely lower bounds given that we did not test for interactions between variables.
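The events-per-actor rule of thumb amounts to a one-line calculation. The helper name is hypothetical, and the 33-actor case is used only as an illustration of applying the rule.

```python
def minimum_events(n_actors, events_per_actor=100):
    """Conservative sequence length under the events-per-actor rule of thumb."""
    return n_actors * events_per_actor

worst_case_ratio = 520 / 5        # most demanding case reported: small inertia effect, N = 5
needed = minimum_events(33)       # illustrative network of 33 actors
print(worst_case_ratio, needed)   # 104.0 3300
```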
Accuracy
Turning to the accuracy of the model (second column of graphs in Figure 1), we find a slight underestimation of the effects across all five measures. The only exception is an overestimation of the large effect of inertia when the sequence has a small number of actors. The underestimation bias is most pronounced for activity and popularity, and larger effects are consistently and substantially underestimated, with the underestimation reaching more than 50% of the true parameter value. Here, the number of actors and the number of events have little effect on accuracy.
Precision
Finally, our results indicate that the REM achieves a high degree of precision (i.e., small standard errors). When the model is applied to sequences with more actors or more events, precision increases, though the marginal benefit tapers off significantly after approximately 100 events. This finding is true across all statistics and effect sizes, though reciprocity tends to exhibit the greatest variation. Thus, we can conclude that the REM yields extremely precise coefficient estimates once a relatively small threshold of events is crossed.
Sampling
We next explore the impact of sampling the risk set on power, accuracy, and precision; our results are provided in Figure 3. As before, Figure 3 illustrates the key trends, while the full results are reported in Online Appendix B.

Figure 3. Results for sampling of risk set. Note: Power, accuracy, and precision for inertia, reciprocity, activity, popularity, and transitivity. The horizontal axis is the number of samples from the risk set. All values are averages across 50 simulations. All sequences have N = 100 actors.
Regardless of number of events, effect size, or statistic, we find that power tends to stabilize once the sampled amount reaches 5 to 10 potential events. Consistent with our prior results, our findings suggest that power is good across sample sizes. We should note however that we are assuming constant values of the parameters over the duration of the sequence. If the parameter values were time-varying, we likely would need larger samples from the risk set at each observation to effectively capture these changes. Thus, the result of 5 to 10 samples is likely a lower bound, and in more complex datasets more samples may be required.
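To make the idea of drawing a handful of potential events from the risk set concrete, the following is a minimal sketch. It assumes a simple scheme in which, for each observed event, k non-event dyads are drawn uniformly without replacement; this section does not spell out the exact sampling procedure, so the scheme and all names here are assumptions for illustration.

```python
import random
from typing import List, Tuple

Dyad = Tuple[int, int]  # (sender, receiver)


def full_risk_set(n_actors: int) -> List[Dyad]:
    """All ordered sender-receiver pairs, excluding self-loops."""
    return [(s, r) for s in range(n_actors) for r in range(n_actors) if s != r]


def sample_risk_set(observed: Dyad, n_actors: int, k: int,
                    rng: random.Random) -> List[Dyad]:
    """Return the observed dyad followed by k sampled non-event dyads.

    Assumption: controls are drawn uniformly without replacement from
    the dyads that did not occur at this step.
    """
    candidates = [d for d in full_risk_set(n_actors) if d != observed]
    controls = rng.sample(candidates, k)
    return [observed] + controls


rng = random.Random(42)  # fixed seed so the draw is reproducible
sampled = sample_risk_set(observed=(3, 7), n_actors=100, k=10, rng=rng)
print(len(sampled))  # the observed event plus k = 10 sampled alternatives
```

With N = 100, each event has 9,899 alternative dyads, so evaluating statistics for only 10 of them per event reduces the work by roughly three orders of magnitude.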
We do note some differences in power across statistics. The REM consistently detects activity and popularity with samples from the risk set, even when the effect is weak. For inertia, reciprocity, and transitivity, the REM has only moderate power for detecting weak effects. Concerning accuracy, when we sample the risk set there is again a slight tendency to underestimate the effect size, regardless of the number of events, sample size, or effect. The bias is smaller in magnitude when the effect sizes are small and larger when the effect sizes are large. This pattern suggests that the parameter estimate from the REM is approximately the same regardless of the magnitude of the underlying effect. In terms of precision, we identify two trends. First, sequences with more events tend to lead to smaller standard errors, regardless of the sample size. Second, estimates of coefficients for larger effects have smaller standard errors in general.
Scaling
Finally, we consider the impact of various scaling methods on power, accuracy, and precision; the results are illustrated in Figure 4. A more detailed reporting of our findings is provided in Online Appendix C.

Figure 4. Results for scaling of statistics. Note: Power, accuracy, and precision values for inertia, reciprocity, activity, popularity, and transitivity. Blue bars indicate a small effect, and red bars indicate a large effect. All values are averages across 50 simulations. All sequences have 500 events and 10 actors.
We generated sequences with 500 events and 10 actors, and then fit the REM using each of the three scaling methods. Overall, the only scaling method that leads to power close to 100% for all combinations is the sliding window approach. Proportional scaling has power above 80% for all statistics with a large effect, and for statistics with a smaller effect this method has power of around 50% for inertia and reciprocity. Power is at or close to 100% for activity, popularity, and transitivity, even for smaller effects. Exponential scaling has power between 60% and 90% for all statistics with small effects. With large effects, exponential scaling has power around 100% for inertia, reciprocity, and transitivity. However, power drops to 50% or below for activity and popularity.
In terms of model accuracy for various scaling approaches, we find that the REM significantly overestimates proportionately scaled statistics, regardless of effect or effect size. By contrast, an exponential or sliding window approach leads to a relatively small underestimation with smaller effects and a larger underestimation for activity, popularity, and transitivity for large effect sizes. Thus, all methods lead to some bias, but the inaccuracy is most severe under proportional scaling. Last, when different scaling approaches are used, we find that the standard errors are largest when applying proportional scaling. Again, this pattern holds across effects and effect sizes. Overall, proportional scaling leads to systematic inaccuracy in the REM, with a tendency to generate large parameters with large standard errors. However, given the relatively high power, we can conclude that the scaling issue could be addressed by standardizing the statistics.
Empirical Example: Corporate Communications
We obtained data from an organization in order to replicate, in a real setting, the simulation analyses of sampling the risk set and scaling the sufficient statistics. The organization is an IT recruitment company operating in Australia. We obtained all email communications between all members of the IT department (N = 33 employees) during a full month (October 2012: E = 5,391 email exchanges). We removed all emails sent to or received from email addresses external to the company.
The empirical example supplements the simulation in three key ways. First, the number of events in the dataset is much larger than in the simulation. Because of this large E, the need to conduct sampling of the risk set and scaling of the statistics is more salient. Second, in a real dataset, the specified effects do not exist in isolation, but rather interact with one another simultaneously. Finally, our empirical example provides a tangible context for interpreting REM results. However, a drawback of empirical data relative to simulated data is the lack of a “ground truth.” In other words, there is no way to tell what the real effect sizes are, and whether the statistics actually exert influence on the pattern of interactions.
Analysis Procedure
Our first step was to fit a model using all five sufficient statistics (inertia, reciprocity, activity, popularity, and transitivity) together, with no scaling or sampling. The resulting model served as our baseline for evaluating sampling and scaling strategies. We next fit the REM to the same data using all five statistics, but taking only samples from the risk set. To be consistent with the simulations, the number of samples was set to 1, 5, 10, 20, 30, 50, 100, and 200. To account for sampling variability, we repeated each step 20 times. We then computed the average power, accuracy, and precision of the model, relative to the baseline result. For scaling, we fit the REM to the original dataset using the same five statistics, with each of the three scaling strategies implemented. We tested the proportional scaling method, a sliding window of 500 events, and exponential decay with a half-life of three days. The formula for the weight is
Empirical Results
Our first REM measures the effects of inertia, reciprocity, activity, popularity, and transitivity on corporate communication. The results are presented in Table 5.
Table 5. Relational Event Results for Organizational Data.
Note: N = 33, E = 5,391.
*p < .01. **p < .001.
We find that all five statistics are positive and significant, and that the model is a significantly better fit to the data than the null model, that is, a random sequence. The positive effect of inertia indicates that members of the organization were more likely to send emails to other individuals whom they have contacted more in the past. Likewise, the positive effect of reciprocity suggests that people are significantly more likely to send emails to recipients from whom they have received many emails in the past. The activity and popularity effects indicate that emails are more likely to originate from active individuals, and are more likely to be targeted toward popular individuals. Finally, the positive transitivity effect can be translated as a tendency for email communications to occur as part of small groups, which may reflect the task structure of the organization (Quintane et al., 2013).
Sampling
We next turn to our analysis of sampling strategies. The risk set is large, with (33 × 32 − 1) × 5,391 = 5,687,505 potential events, so it is costly to compute statistics for the entire sequence at once. By taking a small number of samples we can significantly reduce this computational burden and therefore handle much larger datasets. In Table 6 we present our findings for power, accuracy, and precision across various numbers of samples. All results are averaged across 20 runs.
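The size of the risk set, and the saving from sampling, follows directly from N and E. The sketch below reproduces the count for the email data and compares it with a sampled risk set; the function names are hypothetical, and the formula simply mirrors the arithmetic used in the text (all ordered dyads minus the observed one, per event).

```python
def non_event_dyads(n_actors: int) -> int:
    """Ordered dyads that could have occurred at one step, minus the observed one."""
    return n_actors * (n_actors - 1) - 1


def total_risk_set(n_actors: int, n_events: int) -> int:
    """Potential events to evaluate across the whole sequence without sampling."""
    return non_event_dyads(n_actors) * n_events


def sampled_risk_set(k: int, n_events: int) -> int:
    """Potential events to evaluate when k alternatives are sampled per event."""
    return k * n_events


full = total_risk_set(33, 5391)       # email data: N = 33, E = 5,391
reduced = sampled_risk_set(10, 5391)  # 10 samples per event
print(full, reduced, full / reduced)
```

With 10 samples per event, the number of potential events to evaluate falls from 5,687,505 to 53,910, a reduction by a factor of about 105.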
Table 6. Summary of Results for Sampling of Organizational Data.
Note: All values are averaged over 20 repetitions. Power is computed at 95% confidence level. Accuracy is normalized by estimate values from base model for comparison.
We first observe that for inertia, reciprocity, and transitivity we achieve 100% power at the 95% confidence level with any number of samples from the risk set. In other words, in each case these statistics were positive and statistically significant. Activity and popularity achieved poor power with one sample, but with five samples from the risk set they reached 95% power, and with 10 or more samples they reached 100% power.
Turning to accuracy, we find that in general, the model becomes more accurate with a larger number of samples from the risk set. With 10 samples from the risk set, all variables are within 50% of the baseline values except for inertia, which requires 50 samples. Interestingly, inertia and activity are consistently overestimated, while the other variables are closer to the baseline values. At the extreme end of our test, all five variables are within approximately 10% of the original values with only 200 samples from the risk set. This finding indicates that with 200 samples, our estimates are effectively identical to the results of the full model. Finally, we consider the precision of the REM under sampling. For all five statistics, standard errors decline as the number of samples increases. Consistent with the baseline model, inertia and reciprocity have the least precision (largest standard errors), followed by transitivity, and then activity and popularity. Overall, we find that with any number of samples, the standard errors of all five statistics are comparable to those of the baseline model.
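The three summary measures reported for the sampling runs can be computed from repeated fits roughly as follows. This is a sketch under stated assumptions: power is taken as the share of runs in which a coefficient is significant at the 95% level (two-sided z-test), accuracy as the mean estimate divided by the baseline estimate, and precision as the mean standard error. The note to Table 6 supports the first two readings only in outline, so the exact definitions, the function name, and the example numbers are all assumptions.

```python
from statistics import mean
from typing import Dict, List

Z_CRIT_95 = 1.96  # two-sided critical value at the 95% confidence level


def summarize_runs(estimates: List[float], std_errors: List[float],
                   baseline: float) -> Dict[str, float]:
    """Summarize repeated sampled fits of one coefficient.

    power     : share of runs with |estimate / SE| above the critical value
    accuracy  : mean estimate relative to the baseline (1.0 = no bias)
    precision : mean standard error across runs
    """
    significant = [abs(b / se) > Z_CRIT_95
                   for b, se in zip(estimates, std_errors)]
    return {
        "power": sum(significant) / len(significant),
        "accuracy": mean(estimates) / baseline,
        "precision": mean(std_errors),
    }


# Made-up numbers for illustration only; they are not results from the study.
summary = summarize_runs(
    estimates=[0.30, 0.24, 0.27, 0.05],
    std_errors=[0.10, 0.10, 0.10, 0.10],
    baseline=0.20,
)
print(summary)
```

An accuracy value above 1 in this sketch corresponds to overestimation relative to the baseline model, and a value below 1 to underestimation.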
Scaling
For the last phase of our analysis, we fit the REM to the full sequence, that is, with no sampling, using three alternative scaling strategies. We apply proportional scaling, a sliding window of 500 events, and a half-life decay function with a half-life of three days. Furthermore, we standardized the variables before the estimation of the model by creating z-scores (mean-centering the variables and dividing by their standard deviation), so that all variables have mean zero and standard deviation of one. The results of these REMs are presented in Table 7.
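The sliding window, half-life decay, and z-score steps described above can be sketched for the inertia statistic as follows. The decay weight uses the standard half-life form exp(−ln 2 · Δt / half-life), which is an assumption because the weight formula is not reproduced in this section; proportional scaling is omitted because its exact definition is not given here. All function names are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Event = Tuple[float, int, int]  # (time in days, sender, receiver)


def inertia_sliding_window(history: Sequence[Event], sender: int,
                           receiver: int, window: int = 500) -> int:
    """Count past sender->receiver events among the last `window` events."""
    recent = history[-window:]
    return sum(1 for _, s, r in recent if (s, r) == (sender, receiver))


def inertia_half_life(history: Sequence[Event], sender: int, receiver: int,
                      now: float, half_life: float = 3.0) -> float:
    """Sum of decayed weights of past sender->receiver events.

    Assumed weight: exp(-ln(2) * (now - t) / half_life), so an event that
    is one half-life old counts half as much as a current one.
    """
    decay = math.log(2) / half_life
    return sum(math.exp(-decay * (now - t))
               for t, s, r in history if (s, r) == (sender, receiver))


def standardize(values: List[float]) -> List[float]:
    """Z-scores: subtract the mean and divide by the standard deviation.

    Requires at least two distinct values (standard deviation above zero).
    """
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]


history = [(0.0, 1, 2), (3.0, 1, 2), (6.0, 2, 1)]
print(inertia_sliding_window(history, 1, 2, window=2))
print(inertia_half_life(history, 1, 2, now=6.0))
print(standardize([1.0, 2.0, 3.0]))
```

In the toy history, the two past 1→2 events are six and three days old at time 6, so they contribute weights of 0.25 and 0.5 under a three-day half-life, while a window of two events retains only the more recent of them.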
Table 7. Relational Event Results for Organizational Data With Scaling and Standardization.
*p < .01. **p < .001.
We find that the parameter estimates are consistent in sign and significance across scaling methods; all five statistics have a positive and significant effect on predicting email events. Thus, scaling does not seem to negatively impact statistical power. Similarly, in terms of precision, the standard errors of all models with scaled statistics are either similar in magnitude or smaller than the unscaled benchmark, confirming the general high level of precision of REM. However, the accuracy of REM across different forms of scaling is problematic. All three scaling strategies overestimate reciprocity, activity, and popularity while they underestimate transitivity. This issue is particularly pronounced for proportional scaling, for which the parameter estimate for activity is more than six times higher than the unscaled baseline. Further, the parameter estimate for transitivity is four times lower than the unscaled standard. For inertia, proportional scaling overestimates the parameter estimate while the sliding window and half-life scaling approaches underestimate it.
Discussion
In this article, we used a series of simulations to examine systematically the power, accuracy, and precision of the REM under varying conditions of sequence length, network size and effect sizes, as well as using different sampling thresholds and scaling strategies. To complement the simulation study, we also analyzed a dataset containing over 5,000 emails exchanged between employees of an Australian IT company. The two sets of analyses provided distinct but consistent and complementary insights. In the simulation we varied the characteristics of the sequence N (number of actors) and E (number of events) in order to identify the lower boundaries of power, precision, and accuracy; that is, the smallest number of actors and number of events that can be used with the REM while providing reliable results. In the simulation we were able to provide results for each statistic individually. By contrast, the empirical example enabled us to examine the performance of scaling and sampling strategies in a real empirical context without isolating the effects of each statistic.
The main result identified by our analyses is that REM has generally good power and good precision. We found that REM requires relatively few events per actor to obtain good power and precision (as per Figure 2). At the same time, REM consistently displayed relatively poor accuracy, especially for large effects. For example, Figure 3 showed that a large effect of transitivity is underestimated by about 80%. Furthermore, we found that scaling strategies accentuate these accuracy issues, especially when using proportional scaling. Finally, sampling requires only a few events (5 to 10) taken randomly from the risk set to obtain good power and precision; however, obtaining a satisfactory level of accuracy requires a much larger sample (around 20% of the risk set).
In Table 8, we provide specific and concrete guidance regarding the network size and sequence length needed to recover specific effect sizes, as well as the most common issues affecting power, accuracy, and precision in our simulation.
Table 8. Summary of Findings.
Based on our results, we propose the following guidelines for applying REM. First, researchers should consider using at least 100 events per actor to detect weak effects with sufficient power in small networks. For larger networks or stronger effects, fewer events can potentially be collected. Second, when researchers need to sample their risk set because they have a large number of actors, they can draw a minimum of five samples from the risk set, with a minimum sequence length of over 5,000 events to ensure good power and precision, being aware that accuracy may be a problem. Third, and by contrast, we urge researchers to exert caution when interpreting the magnitude of their parameter estimates, especially when sampling from the risk set and scaling their statistics. To improve accuracy, we recommend sampling at least 20% of the risk set and using either sliding window or exponential decay scaling over proportional scaling. In conclusion, the REM functions well as an explanatory model, given its ability to consistently detect significant effects with high precision. Thus, researchers can confidently apply the model for hypothesis testing. However, the REM functions relatively worse as a predictive model. Essentially, its inability to accurately differentiate between small and large effect sizes would make prediction of future events unreliable. In the following sections, we proceed to discuss our findings in greater depth and justify our recommendations.
Implications of Actors, Events, and Statistics
An important first step in conducting REM analysis is to decide on an appropriate number of events relative to the number of actors being studied. In our baseline simulations, we restricted our analyses to relatively few actors (up to N = 100 people) and shorter sequences (E = 1,000 events). Our findings suggest that for small networks (5 to 10 people) and short sequences (100 to 200 events), the REM may have difficulties in detecting weak or moderate effect sizes, particularly for dyadic effects like inertia and reciprocity. However, when networks reach 20 to 50 people and sequences are longer than 1,000 events, the model will detect most statistics with confidence. Nevertheless, across all approaches, data sizes, and effects, the REM tends to have a negative bias, that is, underestimation, which is more marked with larger effects and does not improve with more events. By contrast, precision of the REM is relatively strong (small SEs relative to effect size), and is better when measuring actor-level measures like activity and popularity. Further, the precision of the REM increases sharply as the number of events increases, particularly for larger effect sizes.
Our findings suggest that researchers can conduct successful studies with REM with relatively little data. Specifically, for research on small groups (e.g., Schecter et al., 2018), sufficient power can be obtained for moderate effect sizes with only a few hundred events. The challenge remains, however, when the effects in question are relatively weak. In such circumstances, the number of events should be increased to guarantee reliable results, particularly when the sequence includes a larger number of actors. On the other end of the spectrum, REM can be applied to datasets with more actors and events (N > 30, E > 1,000) and detect large effect sizes with over 90% power, albeit with a potential lack of accuracy regarding the true effect size. Finally, it is worth noting that our results demonstrate the need to consider both the number of actors and number of events. Specifically, for dyadic effects such as inertia or reciprocity, the REM does not consistently detect weak effects for E < 1,000 and N = 40, while it does detect these effects for the same E when N = 10. This conclusion diverges from cross-sectional network research where the emphasis is on actors as well as panel regression-style analyses where the emphasis is on the number of observations.
Implications of Sampling
We next consider the implications of sampling the risk set for research using REM. Though the networks we consider in this study are not particularly large (cf. Lerner & Lomi, 2020), even a network of N = 100 nodes with a sequence of E = 1,000 events implies a total risk set of (99 × 100 − 1) × 1,000 = 9,899,000 potential events, which carries a substantial computational burden. Examining the impact of different sampling strategies on networks of this size and sequences of this length is therefore already highly meaningful. As E and N grow, this issue becomes even more salient.
For both our simulations and email dataset, we find that only a small number of samples are required to achieve sufficient power in most circumstances. In our simulation study, drawing 5 to 10 samples from the risk set was sufficient to achieve good power. Likewise, in our empirical analysis we found that with at least five samples from the risk set, we were able to achieve high power for all five statistics simultaneously. Precision was also strong in both cases, and improved somewhat with more samples. Further, our simulation findings suggest that precision will improve with longer sequences and larger effects, regardless of sample size. Interestingly, our results for accuracy diverged between the simulation and empirical analysis. The simulation study results suggest that the REM will consistently underestimate the effect size of statistics, regardless of the number of samples. When analyzing our empirical data, we find that some effects are overestimated, while others are underestimated. Accuracy improved with more samples, but the number of samples required to achieve high accuracy was relatively large (over 200 samples or 20% of the risk set for each event). These results could be due to the interactions between the variables which did not exist in the simulations, or to unobservable intricacies in the empirical data.
Taken collectively, our results highlight an interesting and important tradeoff when conducting studies with REM on large-scale data. Specifically, much of modern organizational research focuses on large datasets with many actors and numerous events. As a result, the risk sets associated with these data will make direct computation of the REM intractable, and some form of sampling strategy becomes necessary. On one hand, a small number of samples is sufficient to achieve good power. As a result, researchers can apply the REM to very large datasets and have confidence that they will not miss any significant effects. Further, we see little evidence of false positives. On the other hand, sampling the risk set consistently produces results that are not accurate with regard to the baseline or ground truth values. Further, our empirical analyses suggest that any loss of accuracy could be an under- or over-estimation.
For example, Table 6 shows that with a sample size of 10 events, inertia, activity and transitivity are captured with full power and acceptable precision, but their accuracy is deficient. In our organizational example, this would mean overestimating the extent to which individuals keep communicating with the same partners, as well as the importance of central individuals that connect with alters across the network. By contrast, we would be underestimating the effect of transitivity in generating the sequence of events. More concretely, we might explain the pattern of communication due to the persistence of communication as well as the emergence of central actors, while in fact transitivity plays a more crucial role. This misspecification would be problematic if trying to predict which actors the next exchange will occur between (i.e., to identify transfer of knowledge or the spread of ideas).
Implications of Scaling
The simulation and the empirical example provided similar results in terms of scaling. Proportional scaling leads to an overestimation of parameter estimates and to problems of precision for inertia, reciprocity, and transitivity. Furthermore, the simulation signals problems of power with inertia and reciprocity for small effects with proportional scaling. Exponential scaling and the sliding window have some isolated issues with accuracy, precision, and power for large effects, but overall they provide comparable estimates.
The empirical example also reveals more subtle differences between the scaling methods, potentially reflecting the theoretical assumptions regarding time that are embedded in each scaling method. Proportional scaling tends to place more weight on long-term patterns of past behavior, since each additional event has a smaller marginal impact on the value of the statistics. By contrast, the sliding window method places greater weight on more recent events (assuming that the researcher chooses a relatively short sliding window as we do here). The overestimation of inertia by proportional scaling, compared to the unscaled baseline, and its underestimation when using the sliding window can be interpreted in light of this. Inertia has a tendency to build up continuously over time, which means that differences in long-term rates of activity between individuals are relatively good predictors of future activity differences (Karsai et al., 2014). Hence, the longer the observation period over which inertia is observed, the stronger the effect of the statistic in predicting future events, which is consistent with how proportional scaling overemphasizes the longer-term patterns compared to a shorter sliding window. Half-life scaling also places more weight on recent events, but to a somewhat lesser extent than the sliding window approach. This result would suggest that identifying a scaling strategy that balances the importance given to longer-term trends compared to shorter-term variations in these trends is critical in capturing how social processes operate in a social setting. Future research should identify conditions under which the REM can detect information from different lengths of past histories (e.g., longer-term vs. shorter-term history of past events).
Limitations and Future Directions
There are a few limitations to this study that suggest avenues for further research. First, in this article we focus on the ordinal version of the REM. REMs can incorporate exact measurements of time (Butts, 2008), rather than ordinal information alone. However, we chose to focus on the ordinal version for two reasons. The first is that many organizational studies utilizing event data do not use exact timing for substantive reasons, namely, that communication is happening asynchronously and thus exact times are less relevant. For instance, REM has been used to study software development (Brunswicker & Schecter, 2019; Quintane et al., 2014). The second is that introducing a temporal variable brings in an additional factor that could influence power, accuracy, and precision. For instance, it is unclear what the effect of long versus short time intervals is on the REM. Further, the distribution of interevent times (e.g., exponential, Weibull, normal) could also have an impact. Because our focus is on the appropriate selection of dataset, sampling strategy, and scaling strategy, we argue that temporal information is beyond the scope of our article. Future work should consider the role of time in determining statistical power.
A second limitation of our study is that it focused on only five statistics (inertia, reciprocity, activity, popularity, and transitivity). While empirical applications of the REM have used many other statistics such as participation shifts (Leenders et al., 2016), temporal effects (Vu et al., 2015), and actor fixed effects (Butts, 2008), we believe that these statistics are the most common and representative ones. Further, these statistics are most closely related to frequently used static network structures (e.g., Lusher et al., 2013). Finally, the statistics we tested are all time-varying count variables, that is, they are continuously changing over the course of a sequence. Because of this, we anticipate that they will be at least as difficult to estimate accurately as other statistics, if not more so.
Third, there are several related but distinct ways to model event networks. In particular, the REM is tie-oriented in that probabilities are based upon links forming in the network. Actor-oriented methods such as DyNAM (Stadtfeld & Block, 2017; Stadtfeld et al., 2017) operate under somewhat different mathematical and theoretical assumptions. Accordingly, it is not clear how our results would extend to an actor-oriented rather than tie-oriented model. We encourage future work to explore the differences in these two modeling paradigms.
Conclusion
REMs are statistical models specifically suited for sequences of relational events (i.e., time-stamped interactions between social actors, such as emails, phone calls, etc.). These data structures are becoming increasingly popular among organizational network researchers due to the depth and breadth of information that they contain about interactions between social actors. REMs enable researchers to estimate the full sequence of interactions using well-known network concepts as well as individual or dyadic covariates. However, the REM is still relatively new, and there are no systematic studies of its power, accuracy, and precision. In this study, we explored the boundary conditions for achieving sufficient power, precision, and accuracy with REMs. Further, we determined the impact of using scaling and sampling strategies to estimate large sequences of relational events. Our results shed light on the utility of the REM and can serve as a foundation for future research.
Supplemental Material
Supplemental Material, REM_Power_Appendix for The Power, Accuracy, and Precision of the Relational Event Model by Aaron Schecter and Eric Quintane in Organizational Research Methods
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.