The Power, Accuracy, and Precision of the Relational Event Model

Abstract

The relational event model (REM) solves a problem for organizational researchers who have access to sequences of time-stamped interactions. It enables them to estimate statistical models without collapsing the data into cross-sectional panels, which removes timing and sequence information. However, there is little guidance in the extant literature regarding issues that may affect REM’s power, precision, and accuracy: How many events or actors are needed? How large should the risk set be? How should statistics be scaled? To gain insights into these issues, we conduct a series of experiments using simulated sequences of relational events under different conditions and using different sampling and scaling strategies. We also provide an empirical example using email communications in a real-life context. Our results indicate that, in most cases, the power and precision levels of REMs are good, making it a strong explanatory model. However, REM suffers from issues of accuracy that can be severe in certain cases, making it a poor predictive model. We provide a set of practical recommendations to guide researchers’ use of REMs in organizational research.

Keywords

construct validation reliability validity criterion predictive validity nonlinear modeling quantitative research longitudinal analysis

Research in organizational networks is starting to undergo a profound transformation. The increasing availability of digital data about human behavior in and across organizations provides opportunities to study organizational phenomena at a larger scale and with finer granularity than was ever possible before (Lazer et al., 2009; Wenzel & Van Quaquebeke, 2018). A growing literature is leveraging these digital trace data to understand organizational phenomena, such as teamwork and group dynamics (Leenders et al., 2016; Onnela et al., 2014; Schecter et al., 2018), intraorganizational network dynamics (e.g., Aral & Van Alstyne, 2011; Goldberg et al., 2015; Kleinbaum, 2012; Kossinets & Watts, 2006; Liu et al., 2016), or interorganizational processes (Ng, 2017; Valeeva et al., 2020). The increasing availability of digital data has been accompanied by the development of suitable statistical frameworks (see Golder & Macy, 2014; Tay et al., 2018).

The relational event model (REM—Butts, 2008) is a statistical framework that has made important inroads in social inquiry (Brandenberger, 2019; Kitts et al., 2017; Vu et al., 2015), and in organizational contexts in particular (e.g., Leenders et al., 2016; Pilny et al., 2016; Quintane & Carnabuci, 2016). Compared to other inferential statistical frameworks such as exponential random graph models (ERGMs—Lusher et al., 2013) or stochastic actor-oriented models (SAOMs—Kalish, 2020; Snijders, 1996), REM is specifically designed to analyze sequences of time-stamped interactions between social actors without needing a priori aggregation. That is, REM can estimate a full sequence of relational events without collapsing the sequence into a cross-sectional network (Butts, 2008). This feature enables retention of the sequence and timing of relational events (Quintane et al., 2013), which are critical to examining network dynamics. In a similar way to ERGM and SAOM, REM enables researchers to specify and compare the differential effect of social mechanisms (e.g., reciprocity, transitivity; see, e.g., Wimmer & Lewis, 2010, for ERGM, Tröster et al., 2019, for SAOM, and Quintane et al., 2013, for REM). However, REM also enables researchers to examine the temporal dimension of these social processes (see Kitts et al., 2017; Quintane & Carnabuci, 2016). As such, REM opens opportunities for organizational researchers to gain new insights into questions around network stability and change, the timing of social and behavioral processes, and more generally network dynamics.

However, researchers’ ability to test hypotheses regarding the prevalence of these temporal social processes in a given empirical context rests on the reliability of the REM’s parameter estimates, which is based on the model’s power, accuracy, and precision. To achieve sufficient power, accuracy, and precision, there are three factors researchers must consider, each of which is based on decisions that researchers make during data collection, calculation of statistics, and model implementation. These factors are (a) the number of events and actors in a sequence, (b) the sampling of the risk set, and (c) scaling strategies for the statistics. When accessing sequences of relational events, researchers have to decide at a minimum how many actors and how many events they need in order to ensure sufficient power, accuracy, and precision of their REM. Furthermore, researchers who have obtained access to long event sequences with many actors may still need to decide how many events they want to sample from the risk set in order to make its computation feasible (see Lerner & Lomi, 2020; Vu et al., 2015). Finally, long event sequences also require researchers to use a scaling strategy to make their count network statistics comparable over time (Brandes et al., 2009; Kitts et al., 2017; Quintane et al., 2014). These issues are important because misspecifications might lead to biased estimates, skewed standard errors, or incorrect significance levels, each of which could affect the substantive results of research projects (Block et al., 2018; Stadtfeld et al., 2018; Wang et al., 2014). However, the extant literature offers little guidance to researchers to ensure appropriate power, accuracy, and precision of the results provided by REM.

In this article, we aim to provide concrete recommendations to researchers regarding the number of events and actors, as well as sampling and scaling strategies to ensure appropriate power, accuracy, and precision of REMs. To do so, we conduct simulation experiments to understand REMs’ power, accuracy, and precision on five endogenous network statistics commonly used in statistical modeling of network data (inertia, reciprocity, activity, popularity and transitive closure). Our experiments vary the characteristics of the event sequences being modeled (number of actors and number of events) as well as the effect size of the endogenous statistics to establish lower bounds on the data required to conduct reliable organizational studies with REM. Using the simulations, we also examine the effect of choosing different sample sizes and different methods of scaling statistics on power, precision, and accuracy. Furthermore, we augment the simulation study with a real-life example using a dataset of over 5,000 emails among 33 employees in an organization in Australia. The case study enables us to expand the analyses done in the simulation to a real empirical context.

Our results show that REMs are versatile models that can be applied to event sequences with a wide range of characteristics. We found that the power and precision of REMs are generally good, even when event sequences have relatively few actors. Based on our simulations, we propose a set of thresholds for the minimum number of events required given a specific number of actors in order to reach acceptable power. Similarly, there is little loss of power or precision when sampling as few as five potential events for each real event (when the sequence is over 5,000 events). However, our results suggest that the accuracy of the model is generally problematic and is negatively affected by sampling and proportional scaling. Table 8 provides a summary of our results.

The Relational Event Model

The relational event framework (Butts, 2008; Butts & Marcum, 2017) was developed as a way to estimate sequences of relational events. A relational event is defined as “a discrete event generated by a social actor (the ‘sender’) and directed toward one or more targets (the ‘receivers’), who may or may not be actors themselves” (Butts, 2008, p. 159). REMs predict the occurrence of the next event in a temporally distributed sequence of events (Marcum & Butts, 2015). This means that, in REM, the dependent variable is the occurrence of the next event in a sequence, which is modeled as a function of the sequence of past events. Butts’s (2008) article provides an example using REM to describe radio communications during the World Trade Center disaster. REM predicts who communicates via radio with whom, based on the history of past communications. More specifically, assuming that the next event in the sequence of radio communications is A reaching out to B, the REM predicts the likelihood of this event—A communicating with B—occurring next, based on predefined factors such as the history of past communications from A to B (demonstrating inertia in communications) or on the history of communications from B to A (demonstrating reciprocity, people tend to respond when being addressed).

REM allows researchers to examine how individual behavior is shaped by social structure, characterized through a set of mechanisms that operate over time. For example, do hospitals engage in the social norm of reciprocity when exchanging patients, instead of sending them to the hospital that can offer the best service for the patient? Using REM, Kitts et al. (2017) examined reciprocity in over 4,000 patient exchanges between 21 hospitals in a region of Italy, spanning 5 years. They show that hospitals do reciprocate patient exchanges over time in ways that are not explained by the availability of beds, the quality of service, or the specialization of hospitals.

REM departs from a typical assumption of other network statistical frameworks, such as ERGMs or SAOMs, for which the issue of dependence between observations is critical (Kalish, 2020; Lusher et al., 2013). In REM, each event is considered to be conditionally independent of all other events in the sequence. While REM does not assume dependence between observations across actors, it assumes temporal dependence. That is, each event occurs conditionally on the realized history of past events. This makes the framework appropriate for contexts in which past behavior influences future behavior (i.e., most organizational contexts), but also enables researchers to examine how social mechanisms operate over time. Following on the same example, in their analysis, Kitts et al. (2017) distinguish between two forms of reciprocity between hospitals: organizational embedding and resource dependence. They demonstrate that these forms of reciprocity operate over different time horizons; embedding operates over longer-term histories of interactions (i.e., one year), while dependence operates over shorter-term histories of interactions (i.e., one month).

Fitting a REM requires the estimation of the probability that a particular sequence of events transpired as a function of exogenous and endogenous factors. To capture this probability, each event is given a rate, or a frequency of occurrence; events that are common have high rates, and events that are rare have low rates. The rate for each dyad is a function of statistics corresponding to social processes such as inertia or reciprocity, as well as parameters that represent the sign and strength of the statistics’ effects. For instance, Schecter et al. (2018) found that group members searching for information in military simulations tended to exhibit inertia in communication, but avoided preferential attachment (that is, communicating with individuals who have both high indegree and high outdegree). More specifically, inertia means that the more messages individual A sent to individual B in the past, the higher the likelihood of A sending another message to B in the future. By contrast, for preferential attachment, the parameter is negative, indicating that a future message from A to B is less likely if B has both sent messages to and received messages from many people in the past.

Extending this logic, the REM models every potential pairing of individuals in terms of rates, which are functions of network statistics and parameters. The relative likelihood of an event (link between two nodes) can then be calculated by comparing the rate values for all possible dyads (Butts, 2008). Each time an event occurs, the statistics are updated, and the comparison of dyads is repeated. As the sequence continues to unfold, the rates of events are continuously updated to reflect the new network structure. In this way, REM captures the probability of the full sequence by tuning the rate parameters and maximizing the likelihood of each observed event.

Power, Accuracy, and Precision

The objective of using statistical models such as REM to conduct hypothesis testing is to obtain an estimate, $\hat{θ}$ , of some true coefficient $θ$ that represents the magnitude of a particular effect (Stadtfeld et al., 2018). In the case of REM, the estimates $\hat{θ}$ correspond to the effect of particular patterns (e.g., inertia) on the probability of observing an event. For instance, suppose members of an organization exhibit a tendency to email others to whom they have frequently sent emails in the past (i.e., inertia). Then, the true value $θ_{I N E R T I A}$ for the inertia statistic should be positive, and the REM should find a positive and significant estimate ${\hat{θ}}_{I N E R T I A}$ . Further, the estimate ${\hat{θ}}_{I N E R T I A}$ should be as close in value to $θ_{I N E R T I A}$ as possible.

During hypothesis testing, two types of errors are possible: false positives (Type I), where the model indicates a statistically significant estimate $\hat{θ}$ , even though the true value of $θ$ is zero, and false negatives (Type II), where the model returns an estimate of $\hat{θ} = 0$ when the true value of the parameter is not equal to zero. Given these two errors, statistical power is the probability that a true effect ( $θ \neq 0$ ) is detected ( $\hat{θ} \neq 0$ ) at significance level α (Cohen, 1988, 1992), which is one minus the probability of a Type II error at the fixed significance level. Higher power also leads to fewer instances of falsely overlooking an important effect. We summarize these outcomes in Table 1.

Table 1.

Possible Outcomes of Relational Event Model.

Result of Relational Event Model	True Effect
Result of Relational Event Model	Effect Absent $(θ = 0)$	Effect Present $(θ \neq 0)$
Significant effect found $(\hat{θ} \neq 0)$	False positive (Type I error)	True positive (Power)
No effect found $(\hat{θ} = 0)$	True negative	False negative (Type II error)

Note: This table is adapted from Wang et al. (2014, p. 90).

While achieving strong statistical power is important for hypothesis testing, obtaining high accuracy and precision is critical for the interpretation of estimated effects (Block et al., 2018). The accuracy of an estimate refers to the discrepancy between the fitted value $\hat{θ}$ and the true value $θ$ . An accurate estimate of an effect is said to be unbiased because it does not systematically deviate above or below the true value. On the other hand, the precision of an estimate refers to the variability of the measurement. Precise estimates will have smaller standard errors, indicating a greater level of confidence in the value of the coefficient.

Three Main Issues

Issues of Size

Statistical power, accuracy, and precision are contingent on multiple variables: the significance level, the sample size, and the population effect size (Cohen, 1992). The significance level refers to the probability of a Type I error, sample size refers to the number of observations used to test the hypothesis, and effect size represents the magnitude of the underlying measure. Generally speaking, if one can identify those three variables, then power can be determined. For REM, the significance level and effect size are straightforward to specify. However, the number of events (e.g., the number of emails in an email dataset) in conjunction with the other two factors alone is not sufficient to determine power. We also need some measure of the size of the risk set, which is a function of the number of actors.

We consider that a sequence of relational events is characterized by the number of events (E) in the sequence and by the number of actors (N) that are involved in the sequence at any point in time.¹ These two characteristics (number of events and number of actors) together have the potential to affect power, accuracy, and precision in a REM. Additionally, these two characteristics affect the size of the risk set. Estimating a REM requires the construction of a risk set for each event, which consists of all the potential events that could have occurred instead of each event being observed in the sequence.²

Hence, power, accuracy, and precision are affected by the number of events and the number of actors in the event sequence. While it is intuitive that a smaller number of actors and a smaller number of events will result in lower power, precision, and accuracy, we do not know what the lower bounds are for the number of events and the number of actors. Furthermore, we do not know how different combinations of number of events and number of actors together affect the lower bound (for example, a small number of actors with a large number of events, or a small number of events with a large number of actors). Finally, the number of events affects the size of count statistics, which makes scaling important.

Issues of Risk Set and Sampling

Estimation of the REM requires the computation of statistics for a risk set (Butts, 2008). The risk set is determined for each event in the sequence (and it is therefore a function of the number of events in the sequence). For each event, the risk set is composed of all potential events that could have occurred at the same time as the observed event. Because the events are dyadic, this means that the risk set contains all the potential dyads that could have existed instead of the event that actually occurred (hence the risk set is also a function of the number of actors). For instance, in a sequence with three actors (A, B, and C), we observe an interaction from A to B. The events that could have occurred instead of event AB are all other dyadic combinations (BA, AC, CA, BC, and CB) between the three actors. In general, the number of possible directed dyads is the number of actors N multiplied by N−1 (i.e., we do not allow events from an actor to him or herself), which gives 3*2 = 6 dyads for three actors. Because we remove the observed event from the risk set, the size of the risk set for each observation is N*(N−1)−1, that is, five events in the three-actor example. The total size of the risk set for a full sequence is therefore E*(N*(N−1)−1) potential events.

Identifying the size of the risk set is critical to determining the statistical power, accuracy, and precision of the REM because every element of the risk set—that is, potential dyadic event—is essentially an observation of an event that did not occur. This problem is analogous to logistic regression, where populations associated with both responses (1s and 0s) must be accounted for when computing power (Hsieh et al., 1998). As the size of the risk set increases for each event (i.e., the number of actors increases), the ratio of observed to unobserved events will become very small. For example, in a network with 100 actors, for each observed event there are N*(N−1)−1 = 9,899 unobserved events. Thus, if we have a short sequence but many actors, identifying the mechanisms that contribute to events occurring becomes increasingly difficult. As a result, as the size of the risk set increases (N increases), we likely need the number of events (E) to increase as well.

A key problem with the computation of the risk set is that the number of potential events in a dyadic dataset increases in the order of the square of the number of actors in the dataset. This problem is compounded by the potentially large number of observations needed to identify meaningful effects. The computation of statistics for each potential event in the risk set becomes quickly intractable even with sequences that have a reasonable number of actors. For instance, in an organization of 129 employees that exchange 75,308 emails (see Quintane & Carnabuci, 2016), the size of the risk set for each event is (129*128−1) = 16,511 potential events, and the total size of the risk set across the full sequence is 1,243,410,388 potential events. Because we would have to calculate the values of statistics, such as reciprocity or transitivity, for each potential event, this computational problem becomes unmanageable when the number of actors and the number of events become too large (Butts, 2008; Vu et al., 2015).
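The risk set arithmetic above can be checked with a short Python sketch. The function names are illustrative assumptions and do not come from any REM software; the sketch enumerates the three-actor example and reproduces the per-event and total risk set sizes quoted in the text.

```python
from itertools import permutations


def risk_set(actors, observed):
    """All directed dyads that could have occurred instead of `observed`."""
    return [dyad for dyad in permutations(actors, 2) if dyad != observed]


def risk_set_size(n_actors):
    """Per-event risk set size, N*(N-1)-1 (the observed event is removed)."""
    return n_actors * (n_actors - 1) - 1


def total_risk_set_size(n_events, n_actors):
    """Risk set size over a full sequence, E*(N*(N-1)-1)."""
    return n_events * risk_set_size(n_actors)


# Three actors, observed event A -> B: five alternatives remain.
print(risk_set(["A", "B", "C"], observed=("A", "B")))
print(risk_set_size(100))               # 9899
print(risk_set_size(129))               # 16511
print(total_risk_set_size(75308, 129))  # 1243410388
```

Because the per-event size grows with the square of N while the total grows only linearly in E, the number of actors is the dominant driver of computational cost.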

To remedy this issue, random samples of the risk set may be drawn (Lerner & Lomi, 2020; Vu et al., 2015). This means that researchers randomly select a few (a fixed number such as 5, 10, 20 or a percentage such as 10%) potential events out of the universe of potential events and let only this random sample constitute the risk set. While this strategy is very effective in reducing the size of the risk set, we have no understanding of its effect on the power, accuracy, and precision of the model.
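A minimal sketch of this sampling strategy follows, assuming uniform sampling without replacement from the non-events. The function name `sample_risk_set` and the choice to return the observed event together with the sampled non-events are assumptions of this illustration, not a description of the cited implementations.

```python
import random
from itertools import permutations


def sample_risk_set(actors, observed, n_controls, rng):
    """Return the observed event followed by a uniform sample of non-events.

    Assumption of this sketch: non-events are sampled without replacement,
    and the observed event is kept so it can enter the likelihood denominator.
    """
    non_events = [d for d in permutations(actors, 2) if d != observed]
    n_controls = min(n_controls, len(non_events))
    return [observed] + rng.sample(non_events, n_controls)


rng = random.Random(42)
actors = list(range(129))  # e.g., 129 employees, as in the example above
sampled = sample_risk_set(actors, observed=(0, 1), n_controls=5, rng=rng)
print(len(sampled))  # 6: the observed event plus five sampled non-events
```

With five sampled non-events per observed event, the statistics are computed for 6 dyads per event instead of 16,512.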

Issues of Scaling

The longitudinal dimension of digital datasets requires researchers to use some form of scaling to make network statistics comparable over time (Brandes et al., 2009; Kitts et al., 2017; Quintane et al., 2014). This is because when sequences are long, the values of network statistics based on counts (i.e., most endogenous network statistics) can become very large, resulting in poor estimation of the REM (DuBois et al., 2013). There are two primary reasons for this problem; first, variation across dyads and, second, variation across statistics. Assume we have an event sequence with 1,000 events between three actors A, B, and C. To calculate inertia for a given dyad (say, A to B), we would count the number of events in the sequence in which A interacted with B. For example, if our dependent variable was the fifth event and we are using inertia to predict that event, then we would count the instances of A interacting with B in the first four events. Clearly, the value of inertia would only range from 0 to 4. Similarly, calculating inertia at event 1,000 means counting the number of events from A to B that have occurred in the previous 999 events. Consequently, inertia could take any value from 0 to 999. Hence, the value of the statistic for inertia will be systematically larger at the end of the dataset than at the beginning of the dataset. As a result, the differences across dyads will also be amplified due to computation of statistics, rather than some underlying temporal process. Likewise, for long sequences the values of different statistics may diverge in magnitude. For instance, actor-level measures like activity or preferential attachment will grow much faster than dyadic measures like inertia or reciprocity. These differences arise purely due to the nature of the counting process used to compute statistics, leading to inconsistent measurement between different portions of the sequence.

Researchers have used three main forms of scaling to control the magnitude of network statistics over time (Brandes et al., 2009; Kitts et al., 2017; Quintane et al., 2014): proportional, exponential decay, and sliding window. Proportional scaling involves dividing the statistic for each dyad by the sum of the statistic across dyads, yielding values between zero and one (Butts, 2008; Quintane et al., 2014). For instance, the scaled version of inertia would be $X_{i j}^{I N E R T I A} (H_{t}) = n_{i j t} / \sum_{k} n_{i k t}$ . The measure would now be interpreted as “the proportion of i’s messages sent to j, up to time t.” Alternatively, the exponential decay approach reduces the effect of events over time, so that recent events carry more weight than distant events. For instance, some studies have used a half-life (Brandes et al., 2009; Lerner et al., 2013), where the half-life is the time until an event has a weight of one half. Other studies have used a sigmoidal function to decrease the weight of prior events (e.g., Kitts et al., 2017). Finally, a sliding window approach involves computing the statistics using counts, but only considering events within a certain range. For instance, only events within the last week are counted, or only the last 500 events.
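The scaling strategies can be compared on a toy sequence. The sketch below is illustrative only: the function names, the `(time, sender, receiver)` event format, and the convention of returning zero when an actor has sent no messages are assumptions, and the sigmoidal weighting is omitted.

```python
def inertia_count(events, i, j, t):
    """Raw count n_ijt: events from i to j strictly before time t."""
    return sum(1 for (s, a, b) in events if s < t and (a, b) == (i, j))


def inertia_proportional(events, i, j, t):
    """n_ijt divided by all events sent by i before t (0.0 if i sent none)."""
    sent_by_i = sum(1 for (s, a, _b) in events if s < t and a == i)
    return inertia_count(events, i, j, t) / sent_by_i if sent_by_i else 0.0


def inertia_decay(events, i, j, t, half_life):
    """Each past i -> j event weighted by 0.5 ** (elapsed time / half_life)."""
    return sum(0.5 ** ((t - s) / half_life)
               for (s, a, b) in events if s < t and (a, b) == (i, j))


def inertia_window(events, i, j, t, width):
    """Count of i -> j events inside the window [t - width, t)."""
    return sum(1 for (s, a, b) in events
               if t - width <= s < t and (a, b) == (i, j))


events = [(1, "A", "B"), (2, "A", "C"), (3, "A", "B"),
          (10, "B", "A"), (11, "A", "B")]
print(inertia_count(events, "A", "B", t=12))            # 3
print(inertia_proportional(events, "A", "B", t=12))     # 0.75
print(inertia_decay(events, "A", "B", t=12, half_life=1.0))
print(inertia_window(events, "A", "B", t=12, width=5))  # 1
```

In this example the unscaled count treats all three A-to-B events equally, whereas the decay and window versions are driven almost entirely by the most recent one.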

Simulation Study

We examine the power, accuracy, and precision of the REM using different combinations of sequence characteristics (number of events and number of actors). Because we want to understand the extent to which the REM can recover a given effect size at a certain level of significance depending on characteristics of the sequence, our first analytical strategy is to do a simulation study. This enables us to specify explicitly the effect size for each variable of interest and to isolate the parameter estimate of each variable, which might be confounded in a real dataset. We are also able to vary the number of events and the number of actors in the sequence as well as the sampling and the scaling strategies to systematically examine the effect of combinations of these factors on power, precision, and accuracy.

Simulation Process

The simulation process is composed of two sequential steps. In the first step, we generate a set of synthetic event sequences with varying characteristics. Because we generate these synthetic sequences using values that we specify for a given set of parameters, we know exactly what the true value of $θ$ is for each statistic for each sequence. In a second step, we estimate a REM for each sequence and calculate the fitted value $\hat{θ}$ for each statistic. We then compare the averages and standard deviations of estimated values $\hat{θ}$ for each statistic across the corresponding simulated sequences to the true value $θ$ to evaluate power, precision, and accuracy. We detail these two steps below.
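The comparison in the second step can be sketched as follows. This is one plausible operationalization under stated assumptions, not necessarily the exact one used here: significance is judged with a two-sided Wald test at the 5% level (|z| > 1.96), accuracy is summarized as bias, and precision as the spread of estimates and the average standard error. The function name `summarize` is hypothetical.

```python
from statistics import mean, stdev


def summarize(estimates, std_errors, true_theta, z_crit=1.96):
    """Summarize replicated REM fits for one simulation condition.

    Assumes a Wald test: an estimate is significant if |estimate / SE| > z_crit.
    """
    significant = [abs(est / se) > z_crit
                   for est, se in zip(estimates, std_errors)]
    return {
        # Power when true_theta != 0; Type I error rate when true_theta == 0.
        "rejection_rate": sum(significant) / len(significant),
        # Accuracy: systematic deviation of the estimates from the truth.
        "bias": mean(estimates) - true_theta,
        # Precision: variability of estimates and average standard error.
        "sd_estimates": stdev(estimates),
        "mean_se": mean(std_errors),
    }


# Four hypothetical replications with a true effect of 0.2.
result = summarize(estimates=[0.2, 0.3, 0.1, 0.2],
                   std_errors=[0.05, 0.2, 0.1, 0.05],
                   true_theta=0.2)
print(result["rejection_rate"])  # 0.5
```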

Step 1: Generating Sequences and Statistics

We generated a series of sequences to assess the statistical power, accuracy, and precision of the REM under a variety of conditions. When generating the sequences, we varied the statistic used to generate the sequences and the magnitude of the effect.³ We tested each combination for a variety of sequence lengths and risk set sizes. For every combination, we generated 50 sequences and fit a REM to each sequence.

Generating one sequence

A synthetic sequence is created by iteratively drawing events from a probability distribution based on the sequence up to that point. The probability of observing a particular relational event $e_{t} = (i, j, t)$ , assuming ordinal data,⁴ follows from (Butts, 2008):

p (e_{t} = (i, j, t) | H_{t}; θ) = \frac{λ_{i j} (t; θ)}{\sum_{(u, v) \in A_{t}} λ_{u v} (t; θ)} .

In the above expression, $H_{t} = \{e_{1}, \dots, e_{t - 1}\}$ is the history of events up to but not including time t. The value $λ_{i j} (t; θ)$ is the hazard rate for dyad $(i, j)$ at time t, given a set of parameters $θ$. The risk set $A_{t}$ contains all possible events (dyadic interactions) at time t. Each unique dyad has a hazard rate that depends on the parameters $θ$ and a set of sufficient statistics, which are measures describing features of the prior sequence. The hazard rate takes the following functional form:

log (λ_{i j} (t; θ)) = \sum_{p = 1}^{P} θ_{p} X_{i j p} (H_{t}),

where P is the number of statistics, and $X_{i j p} (H_{t})$ is the pth statistic for dyad $(i, j)$ as a function of history $H_{t}$ . The parameters $θ$ correspond to the magnitudes of each effect included in the model; we specify these for each sequence we generate. When a new event is drawn, we add it to the history, recompute the statistics, and repeat. Given this information, the procedure for creating a synthetic sequence is as follows:

1. Initialization: Set the sequence length E, risk set size $| A | = N * (N - 1)$, and effect sizes $θ$. Randomly draw an initial event $e_{1}$ and set $t = 2$. Let $H_{t} = \{e_{1}\}$.

2. Compute $X_{i j p} (H_{t})$ for each $p = 1, \dots, P$ and for every dyad $(i, j) \in A_{t}$.

3. Draw a new event $e_{t} = (i, j, t)$ from the distribution $\frac{λ_{i j} (t; θ)}{\sum_{(u, v) \in A_{t}} λ_{u v} (t; θ)}$. Update the event history $H_{t + 1} = H_{t} \cup \{e_{t}\}$.

4. Set $t = t + 1$. If $t = E$, stop and return sequence $H = H_{t}$. Otherwise, return to Step 2 and repeat.
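The procedure above can be sketched in Python for the single-statistic case of inertia. This is an illustration under stated assumptions rather than the code used for the study: the function name is hypothetical, the loop runs until the history holds exactly E events (one reading of the stopping rule), and the largest linear predictor is subtracted before exponentiating purely to avoid numerical overflow, which leaves the selection probabilities unchanged.

```python
import math
import random
from itertools import permutations


def simulate_inertia_sequence(n_actors, n_events, theta, seed=0):
    """Generate a synthetic event sequence driven by inertia alone."""
    rng = random.Random(seed)
    dyads = list(permutations(range(n_actors), 2))  # risk set, size N*(N-1)
    counts = {d: 0 for d in dyads}                  # n_ijt for every dyad

    # Initialization: the first event is drawn uniformly at random.
    first = rng.choice(dyads)
    history = [first]
    counts[first] += 1

    while len(history) < n_events:
        # Log-rates: log(lambda_ij) = theta * X_ij^INERTIA(H_t).
        log_rates = [theta * counts[d] for d in dyads]
        top = max(log_rates)
        weights = [math.exp(lr - top) for lr in log_rates]
        # Draw the next event with probability lambda_ij / sum(lambda_uv).
        event = rng.choices(dyads, weights=weights, k=1)[0]
        history.append(event)
        counts[event] += 1  # update the statistics with the new event
    return history


sequence = simulate_inertia_sequence(n_actors=5, n_events=100,
                                     theta=math.log(1.86), seed=1)
print(sequence[:5])
```

Other statistics can be substituted by replacing the line that computes `log_rates` with the corresponding function of the event history.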

In the following sections, we detail each of the components that need to be specified in order to generate the sequence.

Sufficient statistics

In each synthetic sequence, we only specify one statistic at a time out of five statistics in total. Put another way, the parameter $θ$ will be greater than 0 for one variable, while the rest will be set to zero. Then, when we generate a sequence, every event is drawn from a probability distribution that is defined only by the isolated effect. For instance, suppose we are generating a sequence according to inertia. At each step, the probability of selecting dyad $(i, j)$ is given by:

\frac{λ_{i j} (t; θ)}{\sum_{(u, v) \in A_{t}} λ_{u v} (t; θ)} = \frac{exp (θ_{I N E R T I A} \times X_{i j}^{I N E R T I A} (H_{t}))}{\sum_{(u, v) \in A_{t}} exp (θ_{I N E R T I A} \times X_{u v}^{I N E R T I A} (H_{t}))}

For each pair of actors in the network, we would count the number of instances in the prior sequence $H_{t}$ that the dyad appeared. Then, the dyad with the highest number of occurrences would be most likely to occur next. In fact, if two dyads differ by exactly one event (i.e., $X_{i j}^{I N E R T I A} (H_{t}) - X_{u v}^{I N E R T I A} (H_{t}) = 1$ ), then the odds ratio for selecting $(i, j)$ over $(u, v)$ is $exp (θ_{I N E R T I A})$ . Thus, a larger effect size increases the odds of selecting frequently occurring dyads. When a new event is selected, it is added to the event history and we proceed.
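The odds ratio stated above follows directly from the selection probabilities, because both dyads share the same denominator. As a short derivation in the notation of this section:

```latex
\frac{p (e_{t} = (i, j, t) | H_{t}; θ)}{p (e_{t} = (u, v, t) | H_{t}; θ)}
  = \frac{exp (θ_{I N E R T I A} \times X_{i j}^{I N E R T I A} (H_{t}))}
         {exp (θ_{I N E R T I A} \times X_{u v}^{I N E R T I A} (H_{t}))}
  = exp \left( θ_{I N E R T I A} \times \left[ X_{i j}^{I N E R T I A} (H_{t}) - X_{u v}^{I N E R T I A} (H_{t}) \right] \right)
  = exp (θ_{I N E R T I A})
  \quad \text{when the difference in counts equals } 1 .
```

More generally, a difference of d events between the two dyads multiplies the odds by $exp (d \times θ_{I N E R T I A})$.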

The five statistics that we chose—inertia, reciprocity, activity, popularity, and transitivity—are commonly used in studies utilizing REMs (Brandes et al., 2009; Butts, 2008; Quintane et al., 2013) and statistical modeling of networks more generally (Lusher et al., 2013). Each of these statistics describes a type of sequential behavior that individuals in an organizational setting may exhibit (Pilny et al., 2016; Quintane & Carnabuci, 2016; Schecter et al., 2018). We calculate the five statistics with respect to the prior sequence of events, as described above. In Table 2, we present these statistics.

Table 2.

Statistic Definitions and Formulae.

Statistic	Definition	Formula
Inertia	The tendency for an actor i to send a message to actor j, based on i’s prior messages sent to j.	$X_{i j}^{I N E R T I A} (H_{t}) = n_{i j t}$
Reciprocity	The tendency for an actor i to send a message to actor j, based on j’s prior messages sent to i.	$X_{i j}^{R E C I P} (H_{t}) = n_{j i t}$
Activity	The tendency for an actor i to send a message to actor j, based on i’s total prior volume of messages sent.	$X_{i j}^{A C T} (H_{t}) = \sum_{k} n_{i k t}$
Popularity	The tendency for an actor i to send a message to actor j, based on j’s prior volume of messages received.	$X_{i j}^{P O P} (H_{t}) = \sum_{k} n_{k j t}$
Transitivity	The tendency for an actor i to send a message to actor j, based on the strength of the two-paths from i to j through a third party k.	$X_{i j}^{T R A N S} (H_{t}) = \sum_{k} \sqrt{n_{i k t} n_{k j t}}$

Note: i is the sender, j is the receiver, and k is a third party; $n_{i j t}$ is the number of events sent from i to j prior to time t. Arrows indicate direction of events. Past interactions are represented as solid arrows, and a future event is represented as a dashed arrow.

Inertia, also referred to as persistence (Butts, 2008), is a measure of how often events occur within the same dyad over time. In other words, if i sends more (fewer) events to j, i will become more (less) likely to send subsequent events to j. Reciprocity is a measure of how often dyad $(i, j)$ occurs as a function of events sent from j to i previously. Put another way, j sending more (fewer) events to i makes i more (less) likely to respond with a subsequent event. Activity describes the tendency for a particular node to initiate a new relational event. In other words, i sending more (fewer) events in the past makes i more (less) likely to send a new event, regardless of the recipient. Similar to activity, popularity represents the tendency for events to be directed toward a particular individual. The more (fewer) events individual j has received, the more (less) likely they are to receive subsequent events, regardless of the sender. Finally, transitivity is a triadic statistic, in that it relates the likelihood of communication between i and j to prior events involving a third party k (Quintane & Carnabuci, 2016). Essentially, transitivity is a measure of how likely i is to send an event to j if i frequently sent events to k in the past, and k also frequently sent events to j.
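The formulae in Table 2 can be illustrated with a small Python sketch. It assumes the event history has already been summarized as a matrix of counts `n`, where `n[a][b]` is the number of past events from a to b and the diagonal is zero; the function name and the example counts are hypothetical.

```python
import math


def dyad_statistics(n, i, j):
    """Unscaled statistics for dyad (i, j) from a matrix of past event counts."""
    actors = range(len(n))
    return {
        "inertia": n[i][j],                               # n_ijt
        "reciprocity": n[j][i],                           # n_jit
        "activity": sum(n[i][k] for k in actors),         # sum_k n_ikt
        "popularity": sum(n[k][j] for k in actors),       # sum_k n_kjt
        "transitivity": sum(math.sqrt(n[i][k] * n[k][j])  # sum_k sqrt(n_ikt n_kjt)
                            for k in actors),
    }


# Counts among three actors (0, 1, 2); rows are senders, columns receivers.
n = [[0, 2, 4],
     [1, 0, 0],
     [0, 9, 0]]
print(dyad_statistics(n, i=0, j=1))
# {'inertia': 2, 'reciprocity': 1, 'activity': 6, 'popularity': 11, 'transitivity': 6.0}
```

Here the only two-path from actor 0 to actor 1 runs through actor 2, with 4 and 9 events on its two legs, so transitivity equals the square root of 36.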

Effect sizes

To ensure consistency across simulations, we use standard values for the parameters $θ$. Specifically, we use small, medium, and large effect sizes. To determine appropriate effect size values, we leverage the relationship between the REM and a Cox proportional hazards model.⁵ Drawing on prior work, hazard ratios of 1.22, 1.86, and 3.00 represent small, medium, and large effect sizes (Olivier et al., 2017). Taking the natural logarithm of these values, we obtain the values of $θ$ corresponding to those effect sizes. We also include an effect size of zero to assess the rate of false positives.
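The conversion from hazard ratios to parameter values described above is a one-line calculation. The short sketch below carries it out; the dictionary labels are chosen for this example, and the zero effect is represented by a hazard ratio of 1.

```python
import math

# Hazard ratios for the effect sizes; a hazard ratio of 1 means no effect.
hazard_ratios = {"none": 1.00, "small": 1.22, "medium": 1.86, "large": 3.00}

# theta is the natural logarithm of the hazard ratio.
theta = {label: math.log(hr) for label, hr in hazard_ratios.items()}

for label, value in theta.items():
    print(f"{label}: theta = {value:.3f}")
```

The resulting values are approximately 0.199, 0.621, and 1.099 for small, medium, and large effects.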

Sequence characteristics

Our first objective is to evaluate the REM across various sequence characteristics (size and length) in conjunction with various effects and effect sizes. We generated sequences with 5, 10, 20, 40, and 50 actors; the risk set contained all possible dyads, leading to a size of $N (N - 1)$. The number of events took the values 50, 100, 500, and 1,000 for each number of actors. All five statistics and all three effect sizes were tested for each combination of number of actors and number of events.

Step 2: Estimating REMs

Once we generate a synthetic event sequence, we need to apply REM to the data to estimate the parameters $\hat{θ}$ . We follow prior work (Butts, 2008; Quintane et al., 2014) in defining the likelihood function as:

$f (H; θ) = \prod_{t = 1}^{E} p (e_{t} = (i, j, t) \mid H_{t}; θ) = \prod_{t = 1}^{E} \frac{λ_{i j} (t; θ)}{\sum_{(u, v) \in A_{t}} λ_{u v} (t; θ)}$

The expression above is the product of probabilities for each event in the sequence, with the rates, parameters, and risk set equivalent to those described previously. Parameter estimates $\hat{θ}$ are found by maximizing the likelihood function $f (H; θ)$, and the standard errors of the estimates are calculated from the inverse of the Fisher information matrix. After obtaining estimates $\hat{θ}$ by applying REM to the generated sequences, we compute the power, accuracy, and precision for each condition.
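To make the likelihood concrete, the sketch below evaluates its logarithm for a single statistic and a single parameter, assuming a log-linear rate $λ_{u v} = \exp (θ X_{u v})$ and a risk set containing every ordered pair of distinct actors. The function names and the toy sequence are hypothetical, and a real analysis would maximize this quantity over $θ$ with a numerical optimizer rather than evaluate it at fixed values.

```python
import math
from collections import defaultdict


def log_likelihood(events, actors, theta, stat_fn):
    """Sum over events of log(rate of observed dyad / sum of rates in risk set)."""
    counts = defaultdict(int)
    total = 0.0
    for sender, receiver in events:
        # Log-rates for every dyad in the full risk set at this point in time.
        log_rates = {
            (u, v): theta * stat_fn(counts, u, v)
            for u in actors
            for v in actors
            if u != v
        }
        log_denominator = math.log(sum(math.exp(x) for x in log_rates.values()))
        total += log_rates[(sender, receiver)] - log_denominator
        # The history is updated only after the event has been scored.
        counts[(sender, receiver)] += 1
    return total


def inertia(counts, u, v):
    """Unscaled inertia: number of past events from u to v."""
    return counts[(u, v)]


# Hypothetical toy sequence with a repeated dyad.
events = [("a", "b"), ("a", "b"), ("b", "a"), ("a", "b")]
actors = ["a", "b", "c"]
ll_null = log_likelihood(events, actors, 0.0, inertia)
ll_inertia = log_likelihood(events, actors, 0.5, inertia)
print(ll_null, ll_inertia)
```

With $θ = 0$ every dyad is equally likely, so the log-likelihood reduces to $- E \log (N (N - 1))$; a positive inertia parameter fits this repetitive toy sequence better than the null.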

Outcomes

Across all conditions, we determined the statistical power by counting the number of times the fitted parameter for the relevant statistic $\hat{θ}$ was significant at the $α = 0.05$ level and dividing by the number of sequences (50). In other words, we estimated the probability of finding a significant effect when there should be one; a value of one is perfect power, while a value of zero is no power. We measured accuracy by taking the average difference between the fitted parameter and the actual effect size used to generate the sequence; we scaled this value by the real effect size to ensure the results are comparable. Negative accuracy is indicative of underestimation, while positive accuracy indicates overestimation. For precision, we recorded the average standard error of the relevant parameter. Values closer to zero indicate greater precision.
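The three outcome measures can be summarized in a few lines. The sketch below assumes that significance is judged with a two-sided Wald test (the estimate divided by its standard error, compared against 1.96), which is an assumption of this example; the text above states only the $α = 0.05$ level. The function name and the input values are hypothetical.

```python
from statistics import mean


def summarize(estimates, std_errors, true_theta, z_crit=1.96):
    """Return (power, accuracy, precision) across replicate fits.

    Accuracy is undefined when true_theta is zero, because the bias
    is scaled by the true effect size.
    """
    # Power: share of replicates with a significant estimate.
    power = mean(
        1.0 if abs(est / se) > z_crit else 0.0
        for est, se in zip(estimates, std_errors)
    )
    # Accuracy: average bias, scaled by the true effect size.
    accuracy = mean(est - true_theta for est in estimates) / true_theta
    # Precision: average standard error.
    precision = mean(std_errors)
    return power, accuracy, precision


# Hypothetical estimates from three replicate sequences.
power, accuracy, precision = summarize(
    estimates=[0.5, 0.7, 0.1], std_errors=[0.1, 0.1, 0.1], true_theta=0.6
)
print(power, accuracy, precision)
```

A negative accuracy value, as in this toy input, corresponds to underestimation of the true effect.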

Scaling

As mentioned earlier, three main forms of scaling have been used in the existing literature: proportional scaling, exponential decay, and sliding window. With the proportional scaling approach, each sufficient statistic is divided by some relevant value to ensure each statistic varies between 0 and 1 (Quintane et al., 2014). For instance, inertia captures the frequency with which actor i sends messages to actor j up to time t. Inertia can thus be scaled by dividing $n_{i j t}$ by the total activity of actor i up to time t, $\sum_{k} n_{i k t}$. The scaled inertia statistic is then calculated as $n_{i j t} / \sum_{k} n_{i k t}$, and can be interpreted as the proportion of i’s prior messages that were directed at j.
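A minimal sketch of proportionally scaled inertia follows. Returning 0 when actor i has not yet sent any message is an assumption of this example, since the ratio is otherwise undefined; the function name and counts are hypothetical.

```python
def proportional_inertia(counts, actors, i, j):
    """Share of i's past messages that were directed at j."""
    sent_by_i = sum(counts.get((i, k), 0) for k in actors if k != i)
    if sent_by_i == 0:
        # Assumption: the statistic is 0 before i has sent anything.
        return 0.0
    return counts.get((i, j), 0) / sent_by_i


# Hypothetical counts: a sent three messages to b and one to c.
counts = {("a", "b"): 3, ("a", "c"): 1}
share = proportional_inertia(counts, ["a", "b", "c"], "a", "b")
print(share)
```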

The next method, exponential decay, involves iteratively reducing the weight of messages that have occurred in the past. With the exponential decay method, each dyad has a weight, $ω_{i j t}$, rather than a pure count of events. One method for calculating exponential decay is a discounting approach, where we calculate $ω_{i j t}$ as $ω_{i j t} = \mathbf{1} \{e_{t} = (i, j, t)\} + γ ω_{i j (t - 1)}$.

Here, $γ$ is a decay factor between 0 and 1 that controls the influence of prior events, and $\mathbf{1} \{\dots\}$ is an indicator function taking a value of 1 if the condition is true, and 0 otherwise. Using this weight, the statistics are computed using their basic formulae from Table 2. Other approaches to exponential decay include half-life decay (Brandes et al., 2009) and sigmoidal decay (Kitts et al., 2017).
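The discounting recursion can be sketched as follows: at each event, every existing dyad weight is multiplied by the decay factor, and the dyad that just occurred gains 1. The function name and the toy sequence are hypothetical, and the decay factor of 0.5 is used only to keep the arithmetic easy to follow.

```python
from collections import defaultdict


def discounted_weights(events, gamma):
    """Apply w_ij(t) = 1{e_t = (i, j)} + gamma * w_ij(t - 1) over a sequence."""
    weights = defaultdict(float)
    for sender, receiver in events:
        # Every dyad observed so far is discounted at each step.
        for dyad in list(weights):
            weights[dyad] *= gamma
        # The dyad of the current event receives the indicator term.
        weights[(sender, receiver)] += 1.0
    return dict(weights)


# Hypothetical toy sequence of (sender, receiver) pairs.
weights = discounted_weights([("a", "b"), ("a", "b"), ("b", "a")], gamma=0.5)
print(weights)
```

After the three events, the dyad (a, b) has weight 0.75 and the dyad (b, a) has weight 1, so the most recent event dominates even though (a, b) occurred twice.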

Finally, the sliding window scaling approach involves calculating the statistics using their typical formulae (see Table 2), but only considering events that occurred during a specified time period. We define a dyadic weight $ν_{i j t} (δ)$ as the count of interactions from i to j during the interval $[t - δ, t)$. For instance, inertia can be calculated only for events in the past day or past week. The sliding window approach is appropriate when events beyond the window likely have little to no influence, but events within the window have equal weight.⁶
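The window weight is a filtered count, as the sketch below shows for a sequence of time-stamped events. The function name, the tuple layout, and the time units are assumptions of this example.

```python
def window_count(events, i, j, t, delta):
    """Count events from i to j with timestamps in [t - delta, t)."""
    return sum(
        1
        for sender, receiver, tau in events
        if sender == i and receiver == j and t - delta <= tau < t
    )


# Hypothetical (sender, receiver, timestamp) events.
events = [("a", "b", 1.0), ("a", "b", 5.0), ("a", "b", 9.0), ("b", "a", 8.0)]
recent = window_count(events, "a", "b", t=10.0, delta=5.0)
print(recent)
```

Here the event at time 1.0 falls outside the window, so only two of the three events from a to b are counted.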

In Table 3 we give the formulae for our five variables under various scaling methods. We tested the effect of scaling on sequences with fixed characteristics. Each network was composed of 10 nodes, and the sequence length was 500 events. For proportional scaling we used the specification described in Table 3. For exponential decay, we applied an iterative weighting scheme with a discount factor of $γ = 0.95$ . Finally, with the sliding window approach, we considered only the prior 100 events in the sequence, as this was the minimum sequence length we tested in our simulations.

Table 3.

Formulae for Statistics Under Various Scales.

Statistic	Base Formula	Proportional Scaling	Exponential Decay	Sliding Window
Inertia	$X_{i j}^{I N E R T I A} (H_{t}) = n_{i j t}$	$= \frac{n_{i j t}}{\sum_{k} n_{i k t}}$	$= ω_{i j t}$	$= ν_{i j t} (δ)$
Reciprocity	$X_{i j}^{R E C I P} (H_{t}) = n_{j i t}$	$= \frac{n_{j i t}}{\sum_{k} n_{j k t}}$	$= ω_{j i t}$	$= ν_{j i t} (δ)$
Activity	$X_{i j}^{A C T} (H_{t}) = \sum_{k} n_{i k t}$	$= \frac{\sum_{k} n_{i k t}}{\sum_{l} \sum_{h} n_{l h t}}$	$= \sum_{k} ω_{i k t}$	$= \sum_{k} ν_{i k t} (δ)$
Popularity	$X_{i j}^{P O P} (H_{t}) = \sum_{k} n_{k j t}$	$= \frac{\sum_{k} n_{k j t}}{\sum_{l} \sum_{h} n_{l h t}}$	$= \sum_{k} ω_{k j t}$	$= \sum_{k} ν_{k j t} (δ)$
Transitivity	$X_{i j}^{T R A N S} (H_{t}) = \sum_{k} \sqrt{n_{i k t} n_{k j t}}$	$= \frac{\sum_{k} \sqrt{n_{i k t} n_{k j t}}}{\sum_{l} \sum_{m} \sum_{h} \sqrt{n_{l h t} n_{h m t}}}$	$= \sum_{k} \sqrt{ω_{i k t} ω_{k j t}}$	$= \sum_{k} \sqrt{ν_{i k t} (δ) \cdot ν_{k j t} (δ)}$

Sampling

While scaling remedies the issue of long sequences, sequences including a large number of actors pose computational issues. Specifically, when the risk set contains numerous potential events, the denominator of the likelihood function becomes difficult to compute directly (Butts, 2008). This issue is akin to the computational problems of ERGMs (Lusher et al., 2013). Following Butts (2008) and Vu et al. (2015), we approximate the denominator by randomly sampling from the risk set and only computing the statistics for those samples. Vu et al. (2015) suggest the number may be as few as 5 to 10 samples from the risk set. In a more recent study of large relational event networks, Lerner and Lomi (2020) find similar support for small samples from the risk set. To test the effect of sampling, we generated sequences with 100 actors (risk set of 9,899 dyads) and sequence lengths of 5,000 and 10,000.⁷ We varied the number of samples taken from the risk set; we tested sample sizes of 1, 5, 10, 20, 30, 50, 100, and 200. All five statistics and all three effect sizes were tested for every combination of number of events and sample size.
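The idea of approximating the denominator can be sketched as follows: for each observed event, draw a small number of non-event dyads from the risk set and normalize the observed dyad's rate over that reduced set only. This example assumes uniform sampling without replacement and applies no reweighting, which is a simplification and not necessarily the procedure used in the studies cited above. All names are hypothetical.

```python
import math
import random


def sampled_log_prob(log_rate, event, risk_set, n_samples, rng):
    """Approximate the log-probability of an event using sampled alternatives."""
    # Candidate alternatives are all dyads other than the observed one.
    controls = [dyad for dyad in risk_set if dyad != event]
    sampled = rng.sample(controls, min(n_samples, len(controls)))
    # The observed dyad always enters the reduced denominator.
    log_rates = [log_rate(event)] + [log_rate(dyad) for dyad in sampled]
    return log_rates[0] - math.log(sum(math.exp(x) for x in log_rates))


actors = ["a", "b", "c", "d"]
risk_set = [(u, v) for u in actors for v in actors if u != v]
rng = random.Random(0)

# With equal rates, the observed dyad is one of six equally likely candidates.
value = sampled_log_prob(lambda dyad: 0.0, ("a", "b"), risk_set, 5, rng)
print(value)
```

The cost per event then scales with the number of samples instead of with $N (N - 1)$, which is what makes sequences with many actors tractable.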

Simulation Results

Sequence Characteristics

We used a first set of simulations to determine lower bounds for the number of events and number of actors that would lead to sufficient power, accuracy, and precision across different effects and effect sizes. Figure 1 provides a partial summary of our findings; because of the large number of scenarios tested, we report the full results in Online Appendix A.

Figure 1.

Results for standard relational event model. Note: Power, accuracy (bias), and precision (standard errors) for inertia, reciprocity, activity, popularity, and transitivity. Horizontal axis is the number of events E. All values are averages across 50 simulations.

In each figure, the horizontal axis represents the number of events E (from 0 to 1,000). For purposes of illustration, event sequences with 10 actors (N = 10) are considered small, while event sequences with 40 actors (N = 40) are considered large. Only results for small and large effect sizes are included because medium effect sizes consistently fall in the middle for all outcome measures.

Power

From Figure 1 we observe a few general trends in statistical power. More events, more actors, and bigger effect sizes all lead to greater power, regardless of the statistic generating the sequence. However, there is some variability across different statistics. In particular, inertia and reciprocity are relatively hard to detect (∼50% power) when the effect size is small and there are many actors. By contrast, the REM detects activity and popularity with high power for sequences with a relatively low number of events, regardless of the number of actors. Overall, when the number of actors is small, the number of events is small, and/or the effect sizes are small, we cannot be fully confident that we will detect an effect. However, increasing any of these variables improves statistical power.

To summarize our analyses and provide recommendations regarding sufficient N and E, we follow Wang et al. (2014) and fit a linear regression model to our simulated data. We regressed our estimates of power on $\sqrt{E}$, $\sqrt{N}$, and the effect size for the different statistics. In total there are 120 data points for each of the five statistics; these points represent the different combinations of five risk set options (N = 5, 10, 20, 40, 50), eight sequence length options (E = 10, 20, 50, 75, 100, 250, 500, 1,000), and three effect sizes (1.22, 1.86, and 3.00). The dependent variable is the estimated power from our simulations, that is, the frequency with which we identified a significant effect for the particular combination of parameters. We use the square root to account for the nonlinear relationships between E, N, and power. Our regression results for each of the five sufficient statistics are presented in Table 4. Further, in Figure 2 we provide the minimum sequence length E needed to achieve power of 0.80 at the $α = 0.05$ significance level for each statistic across effect sizes and number of actors N. To calculate the values in Figure 2, we fixed the number of individuals N and the effect size for a given statistic. Then, we varied the sequence length from E = 1 to 2,000. We used these values of N, E, and effect size in the regressions presented in Table 4 to get an estimate of power. The resulting numbers in Figure 2 are the lowest value of E at which our model predicted power would be at least 80%.⁸ We color each cell based on the number of events, with darker cells indicating a higher number of events required.

Table 4.

Regression Equations for Predicting Power.

Variable	Inertia	Reciprocity	Activity	Popularity	Transitivity
Constant	0.085 (0.093)	0.204* (0.082)	0.236* (0.103)	0.240* (0.103)	0.264* (0.095)
$\sqrt{N}$	–0.067** (0.012)	–0.088** (0.011)	0.022 (0.017)	0.018 (0.014)	–0.059** (0.013)
$\sqrt{E}$	0.028** (0.002)	0.025** (0.002)	0.019** (0.003)	0.017** (0.003)	0.026** (0.003)
Effect Size	0.187** (0.031)	0.220** (0.027)	0.078* (0.034)	0.109* (0.034)	0.186** (0.031)
N	120	120	120	120	120
R ²	.627	.702	.318	.303	.582

Note: OLS regression of estimated power on input variables. Standard errors in parentheses.

*p < .01. **p < .001.

Figure 2.

Sequence length (E) thresholds across actors (N), effects, and effect sizes. Note: Values are estimated from regression equations in Table 4. Threshold is to achieve power of 0.80 at 95% confidence level.

Our models in Table 4 support the high-level trends evident in our figures; power is significantly enhanced when $\sqrt{E}$ is increased for each of the five statistics and for larger effect sizes. For inertia, reciprocity, and transitivity, power decreases when the number of actors N is increased while keeping the sequence length constant. For activity and popularity, power marginally increases with more actors, although these coefficients are not statistically significant. Generally speaking, we are able to detect statistics with greater power when we have longer sequences and larger effects. Further, we typically need as many or more events when we have a greater number of actors.

Applying the regression results, in Figure 2 we provide concrete recommendations on the number of observations required to achieve good power for a given number of actors and a specific effect. The figure should be read in the following way. In order to achieve 80% power with a 95% confidence level given a number of actors of 10 and a medium effect size, what is the minimum sequence length that should be considered? For transitivity, Figure 2 gives the answer: 217 events. Alternatively, the results can be interpreted as the number of events per actor required to detect an effect. For instance, with N = 5 actors, Figure 2 indicates that 520 events are necessary to detect a small inertia effect with 80% power. This threshold corresponds to 104 events per actor (E = 520 / N = 5). In fact, we find that this is the highest number of events per actor required according to our regression models. Thus, a conservative rule for determining an appropriate number of events for a given number of actors is to collect at least 100 events per actor. We should note however that these results are likely lower bounds given that we did not test for the interactions between variables.
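The threshold search described above can be sketched directly from the coefficients in Table 4. The sketch assumes that the effect size enters the regression as the hazard ratio (1.22, 1.86, or 3.00), and it uses the coefficients as printed, which are rounded to three decimals. Its output therefore lands near, but not necessarily exactly on, the values reported in Figure 2. The function names are hypothetical.

```python
import math

# Coefficients from Table 4: (constant, sqrt(N), sqrt(E), effect size).
TABLE_4 = {
    "inertia": (0.085, -0.067, 0.028, 0.187),
    "reciprocity": (0.204, -0.088, 0.025, 0.220),
    "activity": (0.236, 0.022, 0.019, 0.078),
    "popularity": (0.240, 0.018, 0.017, 0.109),
    "transitivity": (0.264, -0.059, 0.026, 0.186),
}


def predicted_power(statistic, n_actors, n_events, hazard_ratio):
    """Linear prediction of power from the Table 4 regression."""
    b0, b_n, b_e, b_effect = TABLE_4[statistic]
    return (
        b0
        + b_n * math.sqrt(n_actors)
        + b_e * math.sqrt(n_events)
        + b_effect * hazard_ratio
    )


def min_events(statistic, n_actors, hazard_ratio, target=0.80, max_events=2000):
    """Smallest E in 1..max_events with predicted power >= target, else None."""
    for n_events in range(1, max_events + 1):
        if predicted_power(statistic, n_actors, n_events, hazard_ratio) >= target:
            return n_events
    return None


print(min_events("transitivity", n_actors=10, hazard_ratio=1.86))
print(min_events("inertia", n_actors=5, hazard_ratio=1.22))
```

Because the prediction is linear, it can exceed 1 or fall below 0 for extreme inputs, so it is best read as a rough planning aid within the range of conditions that were simulated.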

Accuracy

Turning to the accuracy of the model (second column of graphs in Figure 1), we find a slight underestimation of the effects across all five measures. The only exception is an overestimation of the large effect of inertia when the sequence has a small number of actors. The underestimation bias is most pronounced for activity and popularity, and larger effect sizes are consistently and markedly underestimated, with the underestimation exceeding 50% of the true parameter value. Here, the number of actors and the number of events have little effect on accuracy.

Precision

Finally, our results indicate that the REM achieves a high degree of precision (i.e., small standard errors). When the model is applied to sequences with more actors or more events, precision increases, though the marginal benefit tapers off significantly after approximately 100 events. This finding is true across all statistics and effect sizes, though reciprocity tends to exhibit the greatest variation. Thus, we can conclude that the REM yields extremely precise coefficient estimates once a relatively small threshold of events is crossed.

Sampling

We next explore the impact of sampling the risk set on power, accuracy, and precision; our results are provided in Figure 3. As before, Figure 3 illustrates the key trends, while the full results are reported in Online Appendix B.

Figure 3.

Results for sampling of risk set. Note: Power, accuracy, and precision for inertia, reciprocity, activity, popularity, and transitivity. The horizontal axis is the number of samples from the risk set. All values are averages across 50 simulations. All sequences have N = 100 actors.

Regardless of number of events, effect size, or statistic, we find that power tends to stabilize once the sampled amount reaches 5 to 10 potential events. Consistent with our prior results, our findings suggest that power is good across sample sizes. We should note however that we are assuming constant values of the parameters over the duration of the sequence. If the parameter values were time-varying, we likely would need larger samples from the risk set at each observation to effectively capture these changes. Thus, the result of 5 to 10 samples is likely a lower bound, and in more complex datasets more samples may be required.

We do note some differences in power across statistics. The REM consistently detects activity and popularity with samples from the risk set, even when the effect is weak. For inertia, reciprocity, and transitivity, the REM has only moderate power for detecting weak effect sizes. Concerning accuracy, when we sample the risk set there is again a slight tendency to underestimate the effect size, regardless of the number of events, sample size, or effect. We do find that the bias is smaller in magnitude when the effect sizes are smaller and the bias is larger when the effect sizes are large—this suggests that the parameter estimate from the REM is approximately the same, regardless of the magnitude of the underlying effect. In terms of precision, we identify two trends. First, sequences with more events tend to lead to smaller standard errors, regardless of the sample size. Second, estimates of coefficients for larger effects have smaller errors in general.

Scaling

Finally, we consider the impact of various scaling methods on power, accuracy, and precision; the results are illustrated in Figure 4. A more detailed reporting of our findings is provided in Online Appendix C.

Figure 4.

Results for scaling of statistics. Note: Power, accuracy, and precision values for inertia, reciprocity, activity, popularity, and transitivity. Blue bars indicate a small effect, and red bars indicate a large effect. All values are averages across 50 simulations. All sequences have 500 events and 10 actors.

We generated sequences with 500 events and 10 actors, and then fit the REM using each of the three scaling methods. Overall, the only scaling method that leads to power close to 100% for all combinations is the sliding window approach. Proportional scaling has power above 80% for all statistics with a large effect. With a smaller effect, proportional scaling has power of around 50% for inertia and reciprocity, but power at or close to 100% for activity, popularity, and transitivity. Exponential decay has power between 60% and 90% for all statistics with small effects. With large effects, exponential decay has power around 100% for inertia, reciprocity, and transitivity; however, power drops to 50% or below for activity and popularity.

In terms of model accuracy for various scaling approaches, we find that the REM significantly overestimates proportionally scaled statistics, regardless of effect or effect size. By contrast, an exponential decay or sliding window approach leads to a relatively small underestimation with smaller effects and a larger underestimation for activity, popularity, and transitivity with large effect sizes. Thus, all methods lead to some bias, but the inaccuracy is most severe under proportional scaling. Last, when different scaling approaches are used, we find that the standard errors are largest when applying proportional scaling. Again, this pattern holds across effects and effect sizes. Overall, proportional scaling leads to systematic inaccuracy in the REM, with a tendency to generate large parameters with large standard errors. However, given the relatively high power, we can conclude that the scaling issue could be addressed by standardizing the statistics.

Empirical Example: Corporate Communications

To replicate the simulation analyses of risk set sampling and statistic scaling in a real organization, we obtained data from an IT recruitment company operating in Australia. The data comprise all email communications among all members of the IT department (N = 33 employees) during a full month (October 2012; E = 5,391 email exchanges). We removed all emails sent to or received from email addresses external to the company.

The empirical example supplements the simulation in three key ways. First, the number of events in the dataset is much larger than in the simulation. Because of this large E, the need to conduct sampling of the risk set and scaling of the statistics is more salient. Second, in a real dataset, the specified effects do not exist in isolation, but rather interact with one another simultaneously. Finally, our empirical example provides a tangible context for interpreting REM results. However, a drawback of empirical data relative to simulated data is the lack of a “ground truth.” In other words, there is no way to tell what the real effect sizes are, and whether the statistics actually exert influence on the pattern of interactions.

Analysis Procedure

Our first step was to fit a model using all five sufficient statistics (inertia, reciprocity, activity, popularity, and transitivity) together with no scaling or sampling. The resulting model served as our baseline for evaluating sampling and scaling strategies. We next fit the REM to the same data using all five statistics, but only taking samples from the risk set. The number of samples took the values 1, 5, 10, 20, 30, 50, 100, and 200 to be consistent with the simulations. To ensure we accounted for sampling variability, we repeated each step 20 times. We then computed the average power, accuracy, and precision of the model, relative to the baseline result. For scaling, we fit the REM to the original dataset using the same five statistics, with each of the three scaling strategies implemented. We tested the proportional scaling method, a sliding window of 500 events,⁹ and exponential decay with a half-life of three days. The formula for the weight is $ω_{i j t^{-}} = \sum_{e_{τ} : τ < t} \mathbf{1} \{e_{τ} = (i, j, τ)\} \times \exp (- \frac{(t - τ) \log (2)}{T_{1 / 2}})$, where $T_{1 / 2}$ is the half-life, $\mathbf{1} \{\dots\}$ is the indicator function, and $\log (\cdot)$ refers to the natural logarithm. The results of these models were compared against the original baseline results to assess power, accuracy, and precision.
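The half-life weight can be sketched as a sum over earlier events on the same dyad, each discounted by its age. The function name, the tuple layout, and the use of days as the time unit are assumptions of this example.

```python
import math


def half_life_weight(events, i, j, t, half_life):
    """Sum of exp(-(t - tau) * log(2) / half_life) over past events from i to j."""
    return sum(
        math.exp(-(t - tau) * math.log(2) / half_life)
        for sender, receiver, tau in events
        if sender == i and receiver == j and tau < t
    )


# Hypothetical (sender, receiver, day) events and a three-day half-life.
events = [("a", "b", 0.0), ("a", "b", 3.0), ("b", "a", 4.0)]
weight = half_life_weight(events, "a", "b", t=6.0, half_life=3.0)
print(weight)
```

An event that is one half-life old contributes 0.5 and an event that is two half-lives old contributes 0.25, so the weight in this toy example is 0.75.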

Empirical Results

Our first REM measures the effects of inertia, reciprocity, activity, popularity, and transitivity on corporate communication. The results are presented in Table 5.

Table 5.

Relational Event Results for Organizational Data.

Variable	Coeff. (SE)
Inertia	0.028** (0.002)
Reciprocity	0.013** (0.002)
Activity	0.001* (0.000)
Popularity	0.001** (0.000)
Transitivity	0.019** (0.001)
Events (potential)	5,391 (5,692,896)
Null deviance	75,067
Deviance	70,280

Note: N = 33, E = 5,391.

*p < .01. **p < .001.

We find that all five statistics are positive and significant, and that the model is a significantly better fit to the data than the null model, that is, a random sequence. The positive effect of inertia indicates that members of the organization were more likely to send emails to other individuals whom they have contacted more in the past. Likewise, the positive effect of reciprocity suggests that people are significantly more likely to send emails to recipients from whom they have received many emails in the past. The activity and popularity effects indicate that emails are more likely to originate from active individuals, and are more likely to be targeted toward popular individuals. Finally, the positive transitivity effect can be translated as a tendency for email communications to occur as part of small groups, which may reflect the task structure of the organization (Quintane et al., 2013).

Sampling

We next turn to our analysis of sampling strategies. Given that the risk set is large, with $(33 \times 32 - 1) \times 5{,}391 = 5{,}687{,}505$ potential events over the full sequence, it is costly to compute statistics for the entire sequence at once. Thus, by taking a small number of samples we can significantly reduce this computational burden and therefore handle much larger datasets. In Table 6 we present our findings for power, accuracy, and precision across various numbers of samples. All results are averaged across 20 runs.

Table 6.

Summary of Results for Sampling of Organizational Data.

Power
Samples	1	5	10	20	30	50	100	200
Inertia	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
Reciprocity	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
Activity	0.40	0.95	1.00	1.00	1.00	1.00	1.00	1.00
Popularity	0.25	0.95	1.00	1.00	1.00	1.00	1.00	1.00
Transitivity	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00
Accuracy
Samples	1	5	10	20	30	50	100	200
Inertia	2.342	1.571	1.156	0.757	0.576	0.380	0.197	0.101
Reciprocity	2.185	0.614	0.003	–0.141	–0.193	–0.164	–0.110	–0.055
Activity	0.026	0.423	0.364	0.315	0.209	0.166	0.088	0.057
Popularity	–0.323	0.072	0.073	0.077	0.009	0.017	–0.014	–0.010
Transitivity	–0.247	–0.340	–0.282	–0.217	–0.161	–0.116	–0.061	–0.036
Precision
Samples	1	5	10	20	30	50	100	200
Inertia	0.008	0.004	0.004	0.003	0.003	0.002	0.002	0.002
Reciprocity	0.008	0.004	0.003	0.003	0.003	0.002	0.002	0.002
Activity	0.001	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Popularity	0.001	0.000	0.000	0.000	0.000	0.000	0.000	0.000
Transitivity	0.003	0.002	0.001	0.001	0.001	0.001	0.001	0.001

Note: All values are averaged over 20 repetitions. Power is computed at 95% confidence level. Accuracy is normalized by estimate values from base model for comparison.

We first observe that for inertia, reciprocity, and transitivity we achieve 100% power at the 95% confidence level with any number of samples from the risk set. In other words, in each case these statistics were positive and statistically significant. Activity and popularity achieved poor power with one sample, but with five samples from the risk set they reached 95% power, and with 10 or more samples they reached 100% power.

Turning to accuracy, we find that in general, the model becomes more accurate with a larger number of samples from the risk set. With 10 samples from the risk set, all variables are within 50% of the baseline values except for inertia, which requires 50 samples. Interestingly, inertia and activity are consistently overestimated, while the other variables are closer to the baseline values. At the extreme end of our test, all five variables are within approximately 10% of the original values with only 200 samples from the risk set. This finding indicates that with 200 samples, our estimates are close to the results of the full model. Finally, we consider the precision of the REM under sampling. For all five statistics, standard errors decline as the number of samples increases. Consistent with the baseline model, inertia and reciprocity have the least precision (largest standard errors), followed by transitivity, and then activity and popularity. Overall, we find that with any number of samples, the standard errors of all five statistics are comparable to those of the baseline model.

Scaling

For the last phase of our analysis, we fit the REM to the full sequence, that is, with no sampling, using three alternative scaling strategies. We apply proportional scaling, a sliding window of 500 events, and a half-life decay function with a half-life of three days. Furthermore, we standardized the variables before the estimation of the model by creating z-scores (mean-centering the variables and dividing by their standard deviation), so that all variables have mean zero and standard deviation of one. The results of these REMs are presented in Table 7.
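The standardization step is a conventional z-score transformation, sketched below for a single statistic. Using the population standard deviation and mapping a constant variable to zeros are assumptions of this example, because the text does not specify either choice.

```python
from statistics import mean, pstdev


def standardize(values):
    """Mean-center the values and divide by their standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # Assumption: a constant variable is mapped to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical raw values of one statistic across dyads.
z_scores = standardize([1.0, 2.0, 3.0])
print(z_scores)
```

After this transformation, a coefficient describes the change in the log-rate associated with a one standard deviation increase in the statistic, which puts differently scaled statistics on a common footing.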

Table 7.

Relational Event Results for Organizational Data With Scaling and Standardization.

	No Scaling	Proportional Scaling	Sliding Window	Half-Life Scaling
Variable	Coeff. (SE)	Coeff. (SE)	Coeff. (SE)	Coeff. (SE)
Inertia	0.181** (0.010)	0.332** (0.006)	0.141** (0.008)	0.159** (0.009)
Reciprocity	0.083** (0.011)	0.164** (0.005)	0.156** (0.007)	0.146** (0.009)
Activity	0.064* (0.026)	0.418** (0.012)	0.137** (0.016)	0.100** (0.020)
Popularity	0.081** (0.023)	0.335** (0.016)	0.160** (0.015)	0.104** (0.018)
Transitivity	0.457** (0.027)	0.111** (0.007)	0.235** (0.016)	0.374** (0.019)
Events (potential)	5,391 (5,692,896)
Null deviance	75,067
Deviance	70,280	68,272	68,850	68,468

*p < .01. **p < .001.

We find that the parameter estimates are consistent in sign and significance across scaling methods; all five statistics have a positive and significant effect on predicting email events. Thus, scaling does not seem to negatively impact statistical power. Similarly, in terms of precision, the standard errors of all models with scaled statistics are either similar in magnitude to or smaller than those of the unscaled benchmark, confirming the generally high level of precision of REM. However, the accuracy of REM across different forms of scaling is problematic. Relative to the unscaled baseline, all three scaling strategies overestimate reciprocity, activity, and popularity, while they underestimate transitivity. This issue is particularly pronounced for proportional scaling, for which the parameter estimate for activity is more than six times higher than the unscaled baseline. Further, the parameter estimate for transitivity is about four times lower than the unscaled baseline. For inertia, proportional scaling overestimates the parameter estimate while the sliding window and half-life scaling approaches underestimate it.

Discussion

In this article, we used a series of simulations to examine systematically the power, accuracy, and precision of the REM under varying conditions of sequence length, network size and effect sizes, as well as using different sampling thresholds and scaling strategies. To complement the simulation study, we also analyzed a dataset containing over 5,000 emails exchanged between employees of an Australian IT company. The two sets of analyses provided distinct but consistent and complementary insights. In the simulation we varied the characteristics of the sequence N (number of actors) and E (number of events) in order to identify the lower boundaries of power, precision, and accuracy; that is, the smallest number of actors and number of events that can be used with the REM while providing reliable results. In the simulation we were able to provide results for each statistic individually. By contrast, the empirical example enabled us to examine the performance of scaling and sampling strategies in a real empirical context without isolating the effects of each statistic.

The main result identified by our analyses is that REM has generally good power and good precision. We found that REM requires relatively few events per actor to obtain good power and precision (as per Figure 2).¹⁰ At the same time, REM consistently displayed relatively poor accuracy, especially for large effects. For example, Figure 3 showed that a large effect of transitivity is underestimated by about 80%. Furthermore, we found that scaling strategies accentuate these accuracy issues, especially when using proportional scaling. Finally, sampling requires only a few dyads (5 to 10) taken randomly from the risk set to obtain good power and precision; however, obtaining a satisfactory level of accuracy requires a much larger sample (around 20% of the risk set).

In Table 8, we provide specific and concrete guidance regarding the network size and sequence length needed to recover specific effect sizes, as well as the most common issues affecting power, accuracy, and precision in our simulation.

Table 8.

Summary of Findings.

	Network Size	Sequence Length	Sampling	Scaling	Guidance
Power	Some problems for small networks (5 people)	Some problems for short sequences (100 events)	Requires a minimum of 5 to 10 samples per event	No problem apart from when sequence is generated using proportional scaling	Aim to collect approximately 100 events per actor
Precision	No effect on precision	Improves for longer sequences	Need a longer sequence (5,000 events) when sampling (5/10 samples per event) for good precision	Good precision when using sliding window or half-life; poor precision when using proportional scaling	Precision is best for long sequences, but is impacted by proportional scaling
Accuracy	Slight underestimation (10%) of small effects but important underestimation (80%) of larger effects, regardless of network size or sequence length.		The larger the effect, the more it is likely to underestimate it when we sample	Overestimation when using proportional scaling	Effect sizes are generally underestimated; use caution when interpreting large parameter estimates

Based on our results, we propose the following guidelines for applying REM. First, researchers should consider using at least 100 events per actor to detect weak effects with sufficient power in small networks. For larger networks or stronger effects, fewer events can potentially be collected. Second, when researchers need to sample their risk set because they have a large number of actors, they can draw a minimum of five samples from the risk set, with a minimum sequence length of over 5,000 events to ensure good power and precision, being aware that accuracy may be a problem. Third, and by contrast, we urge researchers to exercise caution when interpreting the magnitude of their parameter estimates, especially when sampling from the risk set and scaling their statistics. To improve accuracy, we recommend sampling at least 20% of the risk set and using either sliding window or exponential decay scaling rather than proportional scaling. In conclusion, the REM functions well as an explanatory model, given its ability to consistently detect significant effects with high precision. Thus, researchers can confidently apply the model for hypothesis testing. However, the REM functions relatively poorly as a predictive model. Essentially, its inability to accurately differentiate between small and large effect sizes would make prediction of future events unreliable. In the following sections, we proceed to discuss our findings in greater depth and justify our recommendations.
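As a rough illustration, the guidelines above can be encoded as a simple design check. This is only a sketch under stated assumptions: the function name `check_design` is hypothetical, the thresholds are copied from the recommendations in this section (and remain rules of thumb), and the risk set for a single event is assumed to contain every ordered pair of distinct actors.

```python
def check_design(n_actors, n_events, samples_per_event=None):
    """Return advisory notes for a planned REM study (hypothetical helper).

    Thresholds mirror the guidelines stated in the text; they are rules of
    thumb, not guarantees of power, precision, or accuracy.
    """
    if n_actors < 2 or n_events < 1:
        raise ValueError("need at least 2 actors and 1 event")

    notes = []
    # Guideline 1: roughly 100 events per actor to detect weak effects.
    if n_events / n_actors < 100:
        notes.append("fewer than 100 events per actor: weak effects may be missed")

    if samples_per_event is not None:
        # Assumption: one event's risk set is every ordered pair of distinct actors.
        risk_set = n_actors * (n_actors - 1)
        # Guideline 2: at least five samples and more than 5,000 events.
        if samples_per_event < 5:
            notes.append("fewer than 5 samples per event: power may suffer")
        if n_events <= 5000:
            notes.append("5,000 events or fewer with sampling: precision may suffer")
        # Guideline 3: about 20% of the risk set for acceptable accuracy.
        if samples_per_event < 0.2 * risk_set:
            notes.append("under 20% of the risk set sampled: interpret magnitudes with caution")
    return notes


for note in check_design(n_actors=100, n_events=1000, samples_per_event=10):
    print(note)
```

An empty list means that none of the three rules of thumb is violated; it does not imply that the resulting estimates are accurate.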

Implications of Actors, Events, and Statistics

An important first step in conducting REM analysis is to decide on an appropriate number of events relative to the number of actors being studied. In our baseline simulations, we restricted our analyses to relatively few actors (up to N = 100 people) and shorter sequences (E = 1,000 events). Our findings suggest that for small networks (5 to 10 people) and short sequences (100 to 200 events), the REM may have difficulties in detecting weak or moderate effect sizes, particularly for dyadic effects like inertia and reciprocity. However, when networks reach 20 to 50 people and sequences are longer than 1,000 events, the model will detect most statistics with confidence. Nevertheless, across all approaches, data sizes, and effects, the REM tends to have a negative bias, that is, underestimation, which is more marked with larger effects and does not improve with more events. By contrast, precision of the REM is relatively strong (small SEs relative to effect size), and is better when measuring actor-level measures like activity and popularity. Further, the precision of the REM increases sharply as the number of events increases, particularly for larger effect sizes.
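The three criteria discussed in this section can be summarized across simulation replicates with a few lines of code. The sketch below is illustrative only: it assumes that power is the share of replicates in which a two-sided Wald test is significant at the 5% level, that accuracy is the relative bias of the mean estimate, and that precision is the mean standard error relative to the true effect. The function name `summarize_replicates` and the toy inputs are hypothetical and are not the simulation code or results of this study.

```python
from statistics import mean


def summarize_replicates(estimates, std_errors, true_effect, z_crit=1.96):
    """Summarize power, accuracy, and precision over replicates (illustrative)."""
    if not estimates or len(estimates) != len(std_errors):
        raise ValueError("estimates and std_errors must be non-empty and equal in length")
    if true_effect == 0 or any(se <= 0 for se in std_errors):
        raise ValueError("true_effect must be non-zero and standard errors positive")

    # Power: share of replicates whose Wald z-statistic exceeds the critical value.
    significant = [1.0 if abs(b / se) > z_crit else 0.0
                   for b, se in zip(estimates, std_errors)]
    # Accuracy: relative bias; negative values indicate underestimation.
    relative_bias = (mean(estimates) - true_effect) / true_effect
    # Precision: average standard error relative to the size of the true effect.
    relative_se = mean(std_errors) / abs(true_effect)
    return {"power": mean(significant),
            "relative_bias": relative_bias,
            "relative_se": relative_se}


# Toy inputs (made up for illustration): estimates far below a true effect of 0.5.
summary = summarize_replicates(
    estimates=[0.10, 0.12, 0.08, 0.10],
    std_errors=[0.02, 0.03, 0.05, 0.04],
    true_effect=0.5,
)
print(summary)
```

The toy inputs are chosen so that an effect is detected in most replicates while the relative bias is strongly negative, which is the combination of good power and poor accuracy described above.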

Our findings suggest that researchers can conduct successful studies with REM with relatively little data. Specifically, for research on small groups (e.g., Schecter et al., 2018), sufficient power can be obtained for moderate effect sizes with only a few hundred events. The challenge remains, however, when the effects in question are relatively weak. In such circumstances, the number of events should be increased to guarantee reliable results, particularly when the sequence includes a larger number of actors. On the other end of the spectrum, REM can be applied to datasets with more actors and events (N > 30, E > 1,000) and detect large effect sizes with over 90% power, albeit with a potential lack of accuracy regarding the true effect size. Finally, it is worth noting that our results demonstrate the need to consider both the number of actors and number of events. Specifically, for dyadic effects such as inertia or reciprocity, the REM does not consistently detect weak effects for E < 1,000 and N = 40, while it does detect these effects for the same E when N = 10. This conclusion diverges from cross-sectional network research where the emphasis is on actors as well as panel regression-style analyses where the emphasis is on the number of observations.

Implications of Sampling

We next consider the implications of sampling the risk set for research using REM. Though the networks we consider in this study are not particularly large (cf. Lerner & Lomi, 2020), even a network of N = 100 nodes with a sequence of E = 1,000 events implies a total risk set of (99 × 100 − 1) × 1,000 = 9,899,000 potential events, which carries a substantial computational burden. Examining the impact of different sampling strategies on networks of this size and sequences of this length is therefore already highly meaningful. As E and N grow, this issue becomes even more salient.
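The arithmetic behind this figure can be reproduced with a short sketch. The function names are hypothetical; following the expression in the text, each event is assumed to contribute N(N − 1) − 1 candidate dyads (all ordered sender-receiver pairs, minus one).

```python
def risk_set_per_event(n_actors: int) -> int:
    """Ordered sender-receiver pairs minus one, as in the expression in the text."""
    return n_actors * (n_actors - 1) - 1


def total_risk_set_size(n_actors: int, n_events: int) -> int:
    """Number of candidate events to evaluate across the whole sequence."""
    return risk_set_per_event(n_actors) * n_events


print(total_risk_set_size(100, 1000))  # 9899000
```

Because the per-event term grows with the square of N, doubling the number of actors roughly quadruples the computational burden for a sequence of the same length.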

For both our simulations and email dataset, we find that only a small number of samples are required to achieve sufficient power in most circumstances. In our simulation study, drawing 5 to 10 samples from the risk set was sufficient to achieve good power. Likewise, in our empirical analysis we found that with at least five samples from the risk set, we were able to achieve high power for all five statistics simultaneously. Precision was also strong in both cases, and improved somewhat with more samples. Further, our simulation findings suggest that precision will improve with longer sequences and larger effects, regardless of sample size. Interestingly, our results for accuracy diverged between the simulation and empirical analysis. The simulation study results suggest that the REM will consistently underestimate the effect size of statistics, regardless of the number of samples. When analyzing our empirical data, we find that some effects are overestimated, while others are underestimated. Accuracy improved with more samples, but the number of samples required to achieve high accuracy was relatively large (over 200 samples or 20% of the risk set for each event). These results could be due to the interactions between the variables which did not exist in the simulations, or to unobservable intricacies in the empirical data.
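To make the sampling step concrete, the sketch below draws a sampled risk set for a single observed event. It assumes that the risk set consists of all ordered pairs of distinct actors, that non-events are drawn uniformly without replacement, and that the observed event is always retained. The function name `sample_risk_set` is hypothetical, and the sampling procedure used in this study may differ in its details.

```python
import random
from itertools import permutations


def sample_risk_set(actors, observed, n_samples, rng):
    """Return the observed dyad followed by n_samples randomly drawn non-events."""
    # All ordered (sender, receiver) pairs except the dyad that actually occurred.
    candidates = [dyad for dyad in permutations(actors, 2) if dyad != observed]
    if n_samples > len(candidates):
        raise ValueError("cannot sample more non-events than the risk set contains")
    return [observed] + rng.sample(candidates, n_samples)


rng = random.Random(42)          # fixed seed so the draw is reproducible
actors = list(range(100))        # N = 100 actors, labeled 0..99
sampled = sample_risk_set(actors, observed=(3, 7), n_samples=10, rng=rng)
print(len(sampled))  # 11
```

With N = 100, each event is then evaluated against 10 sampled alternatives instead of 9,899, which is where the computational savings come from.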

Taken collectively, our results highlight an interesting and important tradeoff when conducting studies with REM on large-scale data. Specifically, much of modern organizational research focuses on large datasets with many actors and numerous events. As a result, the risk sets associated with this data will make direct computation of the REM intractable, and some form of sampling strategy becomes necessary. On one hand, a small number of samples is sufficient to achieve good power. As a result, researchers can apply the REM to very large datasets and have confidence that they will not miss any significant effects. Further, we see little evidence of false positives.¹¹ On the other hand, sampling the risk set consistently produces results that are not accurate with regard to the baseline or ground truth values. Further, our empirical analyses suggest that any loss of accuracy could be an under- or over-estimation.

For example, Table 6 shows that with a sample size of 10 events, inertia, activity, and transitivity are captured with full power and acceptable precision, but their accuracy is deficient. In our organizational example, this would mean overestimating the extent to which individuals keep communicating with the same partners, as well as the importance of central individuals that connect with alters across the network. By contrast, we would be underestimating the effect of transitivity in generating the sequence of events. More concretely, we might attribute the pattern of communication to the persistence of communication and the emergence of central actors, when in fact transitivity plays a more crucial role. This misspecification would be problematic when trying to predict the actors between whom the next exchange will occur (i.e., to identify transfer of knowledge or the spread of ideas).

Implications of Scaling

The simulation and the empirical example provided similar results in terms of scaling. Proportional scaling leads to an overestimation of parameter estimates and to problems of precision for inertia, reciprocity, and transitivity. Furthermore, the simulation signals problems of power with inertia and reciprocity for small effects with proportional scaling. Exponential scaling and sliding window have some isolated issues with accuracy, precision, and power for large effects, but overall they provide comparable estimates.

The empirical example also reveals more subtle differences between the scaling methods, potentially reflecting the theoretical assumptions regarding time that are embedded in each scaling method. Proportional scaling tends to place more weight on long-term patterns of past behavior, since each additional event has a smaller marginal impact on the value of the statistics. By contrast, the sliding window method places greater weight on more recent events (assuming that the researcher chooses a relatively short sliding window as we do here). The overestimation of inertia by proportional scaling, compared to the unscaled baseline, and its underestimation when using the sliding window can be interpreted in light of this. Inertia has a tendency to build up continuously over time, which means that differences in long-term rates of activity between individuals are relatively good predictors of future activity differences (Karsai et al., 2014). Hence, the longer the observation period over which inertia is observed, the stronger the effect of the statistic in predicting future events, which is consistent with how proportional scaling overemphasizes the longer-term patterns compared to a shorter sliding window. Half-life scaling also places more weight on recent events, but to a somewhat lesser extent than the sliding window approach. This result would suggest that identifying a scaling strategy that balances the importance given to longer-term trends compared to shorter-term variations in these trends is critical in capturing how social processes operate in a social setting. Future research should identify conditions under which the REM can detect information from different lengths of past histories (e.g., longer-term vs. shorter-term history of past events).
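The different weighting of past events can be illustrated with a toy inertia statistic. The sketch below is an assumption-laden illustration rather than the definitions used in this study: proportional scaling is taken to divide the dyad's past count by the sender's total past events, the sliding window counts only the most recent events, and half-life scaling halves an event's weight every `half_life` events, with time measured in event order. All function names and the toy sequence are hypothetical.

```python
def inertia_count(history, sender, receiver):
    """Unscaled inertia: number of past sender -> receiver events."""
    return sum(1 for s, r in history if (s, r) == (sender, receiver))


def inertia_proportional(history, sender, receiver):
    """Assumed form: dyad count divided by all past events sent by the sender."""
    sent = sum(1 for s, _ in history if s == sender)
    return inertia_count(history, sender, receiver) / sent if sent else 0.0


def inertia_sliding_window(history, sender, receiver, window):
    """Count only the last `window` events of the sequence."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return inertia_count(history[-window:], sender, receiver)


def inertia_half_life(history, sender, receiver, half_life):
    """Weight each past event by 0.5 ** (age / half_life); age is in events."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    n = len(history)
    return sum(0.5 ** ((n - i) / half_life)
               for i, (s, r) in enumerate(history)
               if (s, r) == (sender, receiver))


# Toy sequence: a -> b dominates early, a -> c dominates recently.
history = [("a", "b"), ("a", "b"), ("a", "b"), ("a", "c"), ("b", "a"), ("a", "c")]

for receiver in ("b", "c"):
    print(receiver,
          inertia_proportional(history, "a", receiver),
          inertia_sliding_window(history, "a", receiver, window=3),
          round(inertia_half_life(history, "a", receiver, half_life=2), 3))
```

In this toy sequence, proportional scaling ranks the older a → b pattern above a → c, whereas the sliding window and half-life versions rank the recent a → c pattern higher, with the half-life version still giving some weight to the older events.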

Limitations and Future Directions

There are a few limitations to this study that suggest avenues for further research. First, in this article we focus on the ordinal version of the REM. REMs can incorporate exact measurements of time (Butts, 2008), rather than ordinal information alone. However, we choose to focus on the ordinal version for two reasons. First, many organizational studies utilizing event data do not use exact timing for substantive reasons, namely, that communication is happening asynchronously and thus exact times are less relevant. For instance, REM has been used to study software development (Brunswicker & Schecter, 2019; Quintane et al., 2014). Second, introducing a temporal variable brings in an additional factor that could influence power, accuracy, and precision. For instance, it is unclear what the effect of long time intervals versus short intervals is on the REM. Further, the distribution of interevent times (e.g., exponential, Weibull, normal) could also have an impact. Because our focus is on the appropriate selection of dataset, sampling strategy, and scaling strategy, we argue that temporal information is beyond the scope of our article. Future work should consider the role of time in determining statistical power.

A second limitation of our study is that it focused on five statistics (inertia, reciprocity, activity, popularity, and transitivity) only. While empirical applications of the REM have used many other statistics such as participation shifts (Leenders et al., 2016), temporal effects (Vu et al., 2015), and actor fixed effects (Butts, 2008), we believe that these statistics are the most common and representative ones. Further, these statistics are most closely related with frequently used static network structures (e.g., Lusher et al., 2013). Finally, the statistics we tested are all time-varying count variables, that is, they are continuously changing over the course of a sequence. Because of this, we anticipate that they will be at least as difficult—if not more—to estimate accurately as other statistics.

Third, there are several related but distinct ways to model event networks. In particular, the REM is tie-oriented in that probabilities are based upon links forming in the network. Actor-oriented methods such as DyNAM (Stadtfeld & Block, 2017; Stadtfeld et al., 2017) operate under somewhat different mathematical and theoretical assumptions. Accordingly, it is not clear how our results would extend to an actor-oriented rather than tie-oriented model. We encourage future work to explore the differences in these two modeling paradigms.

Conclusion

REMs are statistical models specifically suited for sequences of relational events (i.e., time-stamped interactions between social actors, such as emails, phone calls, etc.). These data structures are becoming increasingly popular among organizational network researchers due to the depth and breadth of information that they contain about interactions between social actors. REMs enable researchers to estimate the full sequence of interactions using well-known network concepts as well as individual or dyadic covariates. However, the REM is still relatively new, and there are no systematic studies of its power, accuracy, and precision. In this study, we explored the boundary conditions for achieving sufficient power, precision, and accuracy with REMs. Further, we determined the impact of using scaling and sampling strategies to estimate large sequences of relational events. Our results shed light on the utility of the REM and can serve as a foundation for future research.

Supplemental Material

Supplemental Material, REM_Power_Appendix for The Power, Accuracy, and Precision of the Relational Event Model by Aaron Schecter and Eric Quintane in Organizational Research Methods

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Aral & Van Alstyne (2011). The diversity-bandwidth trade-off. American Journal of Sociology, 117(1), 90–171.

Block, Koskinen, Hollway, Steglich, & Stadtfeld (2018). Change we can believe in: Comparing longitudinal network models on consistency, interpretability and predictive power. Social Networks, 52, 180–191. https://doi.org/10.1016/j.socnet.2017.08.001

Brandenberger (2019). Predicting network events to assess goodness of fit of relational event models. Political Analysis, 27, 556–571. https://doi.org/10.1017/pan.2019.10

Brandes, Lerner, & Snijders, T. A. (2009, July). Networks evolving step by step: Statistical analysis of dyadic event data. In 2009 International Conference on Advances in Social Network Analysis and Mining (pp. 200–205). IEEE.

Brunswicker & Schecter (2019). Coherence or flexibility? The paradox of change for developers’ digital innovation trajectory on open platforms. Research Policy, 48. https://doi.org/10.1016/j.respol.2019.03.016

Butts, C. T. (2008). A relational event framework for social action. Sociological Methodology, 38(1), 155–200.

Butts, C. T., & Marcum, C. S. (2017). A relational event approach to modeling behavioral dynamics. In Pilny & Poole, M. S. (Eds.), Group processes (pp. 51–92). Springer.

Cohen (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Lawrence Erlbaum.

Cohen (1992). A power primer. Psychological Bulletin, 112(1), 155–159.

DuBois, Butts, C. T., & Smyth (2013). Stochastic blockmodeling of relational event dynamics. Proceedings of Machine Learning Research, 31, 238–246.

Efron (1977). The efficiency of Cox’s likelihood function for censored data. Journal of the American Statistical Association, 72(359), 557–565. https://doi.org/10.1080/01621459.1977.10480613

Goldberg, Srivastava, S. B., Manian, V. G., Monroe, & Potts (2015). Fitting in or standing out? The tradeoffs of structural and cultural embeddedness. American Sociological Review, 49. https://doi.org/10.1097/00006199-200611000-00001

Golder, S. A., & Macy, M. W. (2014). Digital footprints: Opportunities and challenges for online social research. Annual Review of Sociology, 40(1), 129–152. https://doi.org/10.1146/annurev-soc-071913-043145

Hsieh, F. Y., Bloch, D. A., & Larsen, M. D. (1998). A simple method of sample size calculation for linear and logistic regression. Statistics in Medicine, 17(14), 1623–1634. https://doi.org/10.1002/(SICI)1097-0258(19980730)17:14<1623::AID-SIM871>3.0.CO;2-S

Kalish (2020). Stochastic actor-oriented models for the co-evolution of networks and behavior: An introduction and tutorial. Organizational Research Methods, 23(3), 511–534.

Karsai, Perra, & Vespignani (2014). Time varying networks and the weakness of strong ties. Scientific Reports, 4, 4001.

Kitts, J. A., Lomi, Mascia, Pallotti, & Quintane (2017). Investigating the temporal dynamics of interorganizational exchange: Patient transfers among Italian hospitals. American Journal of Sociology, 123(3), 850–910. https://doi.org/10.1086/693704

Kleinbaum, A. M. (2012). Organizational misfits and the origins of brokerage in intra-firm networks. Administrative Science Quarterly, 57(3), 407–452. http://ssrn.com/abstract=2004502

Kleinbaum, A. M., Stuart, T. E., & Tushman, M. L. (2011). Discretion within the constraints of opportunity: Gender homophily and structure in a formal organization. In Academy of Management 2011 Annual Meeting - West Meets East: Enlightening. Balancing. Transcending. https://doi.org/10.5464/AMBPP.2011.15.a

Kossinets & Watts, D. J. (2006). Empirical analysis of an evolving social network. Science, 311(5757), 88–90. https://doi.org/10.1126/science.1116869

Lazer, Pentland, Adamic, Aral, Barabasi, A. L., Brewer, Christakis, Contractor, Fowler, Gutmann, Jebara, King, Macy, Roy, & Van Alstyne (2009). Computational social science. Science, 323(5915), 721–723.

Leenders, Contractor, & DeChurch (2016). Once upon a time: Understanding team processes as relational event networks. Organizational Psychology Review, 6, 92–115.

Leonardi & Contractor (2018). Better people analytics. Harvard Business Review, 96(6), 70–81.

Lerner, Bussmann, Snijders, T. A. B., & Brandes (2013). Modeling frequency and type of interaction in event networks. Corvinus Journal of Sociology and Social Policy, 1, 3–32.

Lerner & Lomi (2020). Reliability of relational event model estimates under sampling: How to fit a relational event model to 360 million dyadic events. Network Science, 8(1), 97–135.

Liu, C. C., Srivastava, S. B., & Stuart, T. E. (2016). An intraorganizational ecology of individual attainment. Organization Science, 27, 90–105.

Lusher, Koskinen, & Robins (Eds.). (2013). Exponential random graph models for social networks: Theory, methods, and applications. Cambridge University Press.

Marcum, C. S., & Butts, C. T. (2015). Constructing and modifying sequence statistics for relevent using informR in R. Journal of Statistical Software, 64(5), 1–36.

(2017). On experience and enterprise: Careers, organizations and entrepreneurship. https://dx-doi-org.web.bisu.edu.cn/10.2139/ssrn.3377141

Olivier, May, W. L., & Bell, M. L. (2017). Relative effect sizes for measures of risk. Communications in Statistics - Theory and Methods, 46(14), 6774–6781. https://doi.org/10.1080/03610926.2015.1134575

Onnela, J.-P., Waber, B. N., Pentland, Schnorf, & Lazer (2014). Using sociometers to quantify social interaction patterns. Scientific Reports, 4, 5604.

Pilny, Schecter, Poole, M. S., & Contractor (2016). An illustration of the relational event model to analyze group interaction processes. Group Dynamics: Theory, Research, and Practice, 20, 181–195.

Quintane & Carnabuci (2016). How do brokers broker? Tertius gaudens, tertius iungens, and the temporality of structural holes. Organization Science, 27, 1343–1360.

Quintane, Conaldi, Tonellato, & Lomi (2014). Modeling relational events: A case study on an open source software project. Organizational Research Methods, 17(1), 23–50.

Quintane, Pattison, P. E., Robins, G. L., & Mol, J. M. (2013). Short- and long-term stability in organizational networks: Temporal structures of project teams. Social Networks, 35(4), 528–540.

Schecter, Pilny, Leung, Poole, M. S., & Contractor (2018). Step by step: Capturing the dynamics of work team process through relational event sequences. Journal of Organizational Behavior, 39, 1163–1181. https://doi.org/10.1002/job.2247

Snijders, T. A. (1996). Stochastic actor-oriented models for network change. Journal of Mathematical Sociology, 21(1–2), 149–172.

Stadtfeld & Block (2017). Interactions, actors, and time: Dynamic network actor models for relational events. Sociological Science, 4, 318–352.

Stadtfeld, Hollway, & Block (2017). Dynamic network actor models: Investigating coordination ties through time. Sociological Methodology, 47, 1–40.

Stadtfeld, Snijders, T. A. B., Steglich, & van Duijn (2018). Statistical power in longitudinal network studies. Sociological Methods & Research. https://doi.org/10.1177/0049124118769113

Tay, Malik, Zhang, Chae, Ebert, D. S., Ding, Zhao, & Kern (2018). Big data visualizations in organizational science. Organizational Research Methods, 21(3), 660–688. https://doi.org/10.1177/1094428117720014

Tröster, Parker, Van Knippenberg, & Sahlmüller (2019). The coevolution of social networks and thoughts of quitting. Academy of Management Journal, 62(1), 22–43.

Valeeva, Heemskerk, E. M., & Takes, F. W. (2020). The duality of firms and directors in board interlock networks: A relational event modeling approach. Social Networks, 62, 68–79.

Pattison & Robins (2015). Relational event models for social learning in MOOCs. Social Networks, 43, 121–135.

Wang, Neuman, E. J., & Newman, D. A. (2014). Statistical power of the social network autocorrelation model. Social Networks, 38, 88–99.

Wenzel & Van Quaquebeke (2018). The double-edged sword of big data in organizational and management research: A review of opportunities and risks. Organizational Research Methods, 21(3), 548–591. https://doi.org/10.1177/1094428117718627

Wimmer & Lewis (2010). Beyond and below racial homophily: ERG models of a friendship network documented on Facebook. American Journal of Sociology, 116(2), 583–642.
